Netdata, Telegraf, and Prometheus Exporters
For a while now I’ve been using Netdata as my default option for collecting metrics, but I’ve heard a bit about other options and got curious. I couldn’t find a direct comparison, so I decided to do a quick one myself. My primary candidates are Netdata, Prometheus exporters, and Telegraf. My approach is essentially to install each on a separate EC2 instance on Amazon Web Services (AWS), scrape the metrics using Prometheus, and do some exploration in Grafana.
Installation is predictably easy: all three are offered as a single static binary download. I’m starting with the default configuration for each, without plugins. Note that all of them expose metrics in the standard Prometheus format by default. As of writing, Node Exporter (the Prometheus exporter for host metrics) provides ~379 metrics, Telegraf provides ~193, and Netdata provides a whopping ~1833. Suffice it to say, I’m feeling vindicated right out of the gate. Netdata may seem excessive by comparison, but it still doesn’t consume any significant amount of CPU (<1% on average) or RAM (<100 MB) on the t3.micro (2 vCPUs, 1 GB memory) class of EC2 instances, and I would argue that it’s better to have too many metrics than not enough. And, of course, metrics can simply be dropped or their fidelity reduced as appropriate for your use case.
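For anyone who wants to reproduce a count like the ones above, here is a minimal Python sketch of one way to do it. It is an assumption about method, not necessarily how the numbers in this post were produced: it counts distinct metric names in a Prometheus text-format payload, ignoring labels and comment lines. The sample payload and the function name are made up for illustration.

```python
def count_metric_names(exposition_text: str) -> int:
    """Count distinct metric names in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return len(names)


# Illustrative sample only; real agents return far more lines than this.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 67.8
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
"""

if __name__ == "__main__":
    print(count_metric_names(SAMPLE))  # prints 2
```

To run this against a live agent you would fetch its metrics endpoint first (Node Exporter listens on port 9100 at /metrics by default) and pass the response body to the function. Note that this approach counts the _bucket, _sum, and _count series of a histogram as separate names, so totals can differ depending on how you define "a metric".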
Both Telegraf and Netdata use a plugin model to augment monitoring capabilities, and they both have plenty of options to choose from, with over 200 plugins advertised for each. However, Telegraf has input, output, aggregator, and processor plugins while, as far as I’m aware, Netdata plugins are strictly integrations (i.e., they tell Netdata to collect more metrics and that’s about it). The Prometheus project takes a slightly different approach, expecting a separate exporter to be implemented for each case. There are around 150 exporters advertised, and applications may also be instrumented directly to expose metrics in the Prometheus format. The Telegraf style seems much more powerful in isolation, allowing you to delegate data transformations to the agents and to use either a push or a pull model, but without a clear vision for your overall architecture it seems like you could get into trouble by over-complicating your monitoring stack. From this perspective, the options are pretty evenly matched, and any choice would be an excellent one. I personally favor my current approach of exposing metrics with Netdata, supplementing with exporters, and scraping via Prometheus. I would probably avoid getting too fancy with Telegraf unless I had a real reason to.
Netdata still stands out to me by providing so much by default: a ton of metrics, a dashboard with a graph for each one, an attractive presentation, easy plugin installation, auto-updates, and stellar documentation on the project page (seriously, check it out: https://github.com/netdata/netdata). In particular, I find it handy to have a dashboard with every single metric available on each node by default, especially if your monitoring stack, ironically, isn’t terribly reliable or you just don’t have good graphs written for a specific case. It’s a solid fallback to have available, and I’ve found it useful on several occasions. A nice thing about Prometheus-style exporters, though, is that they are light and intended for a specific use case. For example, I have used the JMX exporter alongside Netdata to expose metrics from a Java application running on a server. If you are writing an application, it may also make more sense to instrument it to expose metrics in the Prometheus format directly, rather than write a separate plugin and maintain two separate but coupled codebases. The exporter style is the simpler, lighter solution, so if you just want to get some metrics into Prometheus without any bells and whistles, or package some metrics instrumentation into a container image while keeping it to the bare minimum, you might go with just an exporter. Telegraf is written to integrate very tightly with InfluxDB, so I would give it much more serious consideration if the stack I was working with were closer to the TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor), but as things stand I don’t know that I would.
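To make the "instrument the application directly" option a bit more concrete, here is a rough Python sketch using only the standard library. Everything named in it (the myapp_requests_handled_total metric, the REQUESTS_HANDLED counter, the make_metrics_server helper, port 8000) is hypothetical and chosen for illustration. In practice you would more likely reach for an official client library such as prometheus_client rather than rendering the text format by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application counter; a real app would update this as it works.
REQUESTS_HANDLED = {"value": 0}


def render_metrics() -> str:
    """Render the app's metrics in the Prometheus text exposition format."""
    return (
        "# HELP myapp_requests_handled_total Requests handled by the app.\n"
        "# TYPE myapp_requests_handled_total counter\n"
        f"myapp_requests_handled_total {REQUESTS_HANDLED['value']}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics at /metrics and 404 everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        # Content type used for the Prometheus text format.
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_metrics_server(port: int = 8000) -> HTTPServer:
    """Build (but do not start) an HTTP server exposing /metrics."""
    return HTTPServer(("", port), MetricsHandler)
```

Calling make_metrics_server().serve_forever() would start serving, and Prometheus could then be pointed at that host and port like any other scrape target. The point of the sketch is just that the exposition format is simple enough to sit inside the application itself, which is what makes the single-codebase approach appealing.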