Abstract
Thanks to a code contribution from VMware, version 1.8 of the Telegraf metric collector fully supports pulling metrics from vSphere. Since Telegraf is the underlying metric collector for many of the metric sources available to Wavefront, this brings the full set of vSphere metrics to Wavefront.
In this article we will discuss why this is important and how various personas in an organization can use this to quickly troubleshoot and optimize their applications.
What is Wavefront?
Wavefront is a Software-as-a-Service (SaaS) offering from VMware for time series analysis. Thanks to its virtually infinitely scalable architecture, it can consume, analyze and correlate millions of data points per second. Wavefront supports data from virtually any source and is used by organizations for tasks ranging from advanced troubleshooting to application optimization and business process optimization. More information is available here: https://www.wavefront.com/
What is Telegraf?
Telegraf is an open source performance and health metric collector that has become something of an industry standard for metric collection. It owes much of its success to its policy of welcoming contributions from third parties and supports, at the time of writing, over 130 data sources ranging from hardware to application frameworks and middleware. Read more about it here: https://www.influxdata.com/time-series-platform/telegraf/
Doesn’t vRealize Operations already monitor vSphere?
Let’s address this one upfront: There is no doubt vRealize Operations is the gold standard for vSphere monitoring, alerting, capacity planning and cost analysis. Going after those use cases would be insane and is certainly not the reason we are doing this. Instead, it’s about augmenting the monitoring we are already doing in Wavefront with metrics from the virtualization infrastructure. One of the sweet spots of Wavefront is application monitoring. But applications don’t live in a vacuum. They all depend on some kind of infrastructure. And when we are troubleshooting an application, it’s sometimes useful to correlate application behavior against events in the infrastructure. The mantra of Wavefront is “the more data you send to it, the more useful it becomes”. The main strength of Wavefront is the ability to correlate huge sets of time series data to find patterns leading us to a root cause. Having access to virtualization infrastructure data is an important piece of that puzzle.
Overview of the Solution
The basic idea is very simple: We can point Telegraf to a vCenter and collect as many or as few metrics as we like. All we need is an address and login credentials to vCenter and the address of a Wavefront proxy and we’re ready to do some monitoring!
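As a rough sketch of how little configuration that involves, the Python helper below renders a minimal Telegraf configuration with one vSphere input and one Wavefront output. The option names (vcenters, username, password, host, port) reflect the Telegraf vsphere and wavefront plugins as we understand them. The addresses, credentials and the helper function itself are placeholders made up for illustration, and port 2878 is assumed to be where the Wavefront proxy listens.

```python
def render_telegraf_config(vcenter_url, username, password,
                           proxy_host, proxy_port=2878):
    """Return a minimal Telegraf config as text (illustrative helper)."""
    lines = [
        "# Pull metrics from vCenter",
        "[[inputs.vsphere]]",
        f'  vcenters = ["{vcenter_url}"]',
        f'  username = "{username}"',
        f'  password = "{password}"',
        "",
        "# Forward everything to a Wavefront proxy",
        "[[outputs.wavefront]]",
        f'  host = "{proxy_host}"',
        f"  port = {proxy_port}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values; replace with your own vCenter and proxy details.
config_text = render_telegraf_config(
    vcenter_url="https://vcenter.example.com/sdk",
    username="monitoring@vsphere.local",
    password="changeme",
    proxy_host="wavefront-proxy.example.com",
)
print(config_text)
```

The printed text can be saved as a Telegraf configuration file. Check the plugin documentation for your Telegraf version before relying on any option name.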
Metric points are tagged to be mapped to the typical vSphere concepts, such as host, cluster and datacenter. Thanks to this, we can easily write queries that e.g. look at sums and averages across clusters or datacenters. If you are familiar with Wavefront, you know the power it gives you by allowing you to slice the data across different dimensions in seconds. Let’s take it for a spin!
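To make the tagging idea concrete, here is a small Python sketch of what slicing by a tag amounts to. It is not Wavefront's query language, only an illustration of the grouping it performs for you. The tag names (vmname, clustername) and the sample values are assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical metric points, each carrying vSphere-style tags.
points = [
    {"vmname": "vm-a", "clustername": "cluster-1", "value": 120.0},
    {"vmname": "vm-b", "clustername": "cluster-1", "value": 80.0},
    {"vmname": "vm-c", "clustername": "cluster-2", "value": 50.0},
]


def aggregate_by_tag(points, tag):
    """Group point values by one tag and return the sum and average per group."""
    groups = defaultdict(list)
    for point in points:
        groups[point[tag]].append(point["value"])
    return {
        key: {"sum": sum(values), "avg": mean(values)}
        for key, values in groups.items()
    }


per_cluster = aggregate_by_tag(points, "clustername")
for cluster, stats in sorted(per_cluster.items()):
    print(cluster, stats)
```

Swapping "clustername" for another tag, such as a datacenter tag, slices the same data along a different dimension.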
First, let’s just run a simple query showing the CPU run time across all of my VMs. The chart is built from real data.
That’s an awful lot of data! Upon closer inspection, we find that it’s showing us data for every virtual CPU core. Since we don’t need that, we can add a filter to the query that only shows the average produced by vSphere. We do this by adding the filter “cpu=instance_total” to the query.
But what if that’s not what we want? Maybe we’re only interested in the virtual core that works the hardest. No problem, we can use the interactive Query Builder to aggregate the core metrics into a metric reflecting the busiest core.
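The two views described above, the vSphere-provided total and the busiest core, can be sketched in Python as follows. This only illustrates the logic and is not what the Query Builder generates. The cpu tag value "instance_total" is taken from the text above, while the VM names, core numbering and values are invented sample data.

```python
# Hypothetical per-core samples for one point in time.
samples = [
    {"vmname": "freenas-01", "cpu": "0", "value": 300.0},
    {"vmname": "freenas-01", "cpu": "1", "value": 900.0},
    {"vmname": "freenas-01", "cpu": "instance_total", "value": 600.0},
    {"vmname": "vc-01", "cpu": "0", "value": 200.0},
    {"vmname": "vc-01", "cpu": "1", "value": 100.0},
    {"vmname": "vc-01", "cpu": "instance_total", "value": 150.0},
]

TOTAL = "instance_total"  # tag value for the aggregate, as quoted in the article


def total_only(samples):
    """Keep only the aggregate reported by vSphere for each VM."""
    return {s["vmname"]: s["value"] for s in samples if s["cpu"] == TOTAL}


def busiest_core(samples):
    """Return the highest per-core value for each VM, ignoring the aggregate."""
    busiest = {}
    for s in samples:
        if s["cpu"] == TOTAL:
            continue
        current = busiest.get(s["vmname"], float("-inf"))
        busiest[s["vmname"]] = max(current, s["value"])
    return busiest


totals = total_only(samples)
peaks = busiest_core(samples)
print(totals)
print(peaks)
```

Note that the busiest core can sit well above the average, which is why the two views can tell different stories about the same VM.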
Correlating Behavior
Looking at the chart above, it’s clear that something is causing CPU spikes about once an hour. Upon further inspection, it turns out this is isolated to a VM called “freenas-01”, which happens to be running parts of the storage in my lab. Could it be that some behavior of a VM is putting load on freenas-01 once every hour, causing those spikes? The data is very noisy and the correlation is bound to be weak, but let’s give it a try. We are correlating CPU usage on freenas-01 with CPU usage of every VM in the system and picking the top one. In fact, we’re picking the top two, since freenas-01 will of course have perfect correlation with itself.
It turns out we have a fairly weak but significant correlation with vc-01, which happens to be my vCenter. This is probably due to vCenter performing some housekeeping tasks every hour. The correlation is bound to be fairly weak, since there are a lot of workloads using the FreeNAS, but by singling out the top correlation, we’ve found a candidate for further exploration.
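The idea behind this kind of ranking can be sketched in a few lines of Python. This is a simplified illustration that assumes a plain Pearson correlation over aligned samples. It is not necessarily how Wavefront computes correlation, and the series below are made-up numbers, not the lab data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / (spread_x * spread_y)


def top_correlated(target, series, k=2):
    """Rank all series by correlation with the target and keep the top k."""
    ranked = sorted(
        series,
        key=lambda name: pearson(series[target], series[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical CPU usage samples per VM, aligned in time.
cpu_usage = {
    "freenas-01": [10, 50, 12, 48, 11, 52],
    "vc-01": [20, 35, 22, 30, 21, 36],
    "web-01": [30, 29, 31, 30, 32, 29],
}

top_two = top_correlated("freenas-01", cpu_usage, k=2)
print(top_two)
```

The target always ranks first because it correlates perfectly with itself, so the second entry is the candidate worth investigating.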
Using vSphere Metrics to Find Application Issues
The post Telegraf 1.8 brings vSphere Metrics to Wavefront appeared first on VMware Cloud Management.