May 04 2009

The Right Way to do Monitoring and Mass Administration

Published at 9:39 am under Technology

Over the weekend I flipped through these slides about Nanite (code), and they got me thinking about system monitoring (again), as well as mass administration tools (Puppet and its younger competitor Chef). The key bit from the talk is the idea of using a proven, off-the-shelf messaging server (RabbitMQ) as the communication bus among a set of processes running on many servers.

I would like very much to see a piece of software that puts these pieces together:

  1. Monitoring features, like those in Zabbix or other similar tools
  2. Mass administration features, like those in Puppet
  3. A messaging bus as the transport for both, rather than a homegrown communication mechanism
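
To make the third item a bit more concrete, here is a rough sketch of one agent per server using a single bus for both monitoring data and administration commands. Everything in it is an assumption made for illustration: ToyBus is an in-process stand-in for a real broker such as RabbitMQ, and the topic names ("metrics", "admin.<host>") and the Agent class are hypothetical, not taken from Nanite, Puppet, or Zabbix.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[str, dict], None]


class ToyBus:
    """In-process stand-in for a message broker (hypothetical, not RabbitMQ's API)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # Deliver immediately to every subscriber of this exact topic.
        for handler in list(self._subscribers[topic]):
            handler(topic, message)


class Agent:
    """Per-server process that reports metrics and accepts admin commands."""

    def __init__(self, host: str, bus: ToyBus) -> None:
        self.host = host
        self.bus = bus
        self.applied: List[str] = []
        # Administration commands arrive on the same bus the metrics leave on.
        bus.subscribe(f"admin.{host}", self._on_command)

    def _on_command(self, topic: str, message: dict) -> None:
        # A real agent would change system state; this sketch only records it.
        self.applied.append(message["action"])

    def report(self, metric: str, value: float) -> None:
        self.bus.publish(
            "metrics", {"host": self.host, "metric": metric, "value": value}
        )


def demo() -> dict:
    bus = ToyBus()
    seen: List[dict] = []
    # The monitoring side is just another subscriber.
    bus.subscribe("metrics", lambda topic, msg: seen.append(msg))
    web1 = Agent("web1", bus)
    web1.report("load", 0.42)
    bus.publish("admin.web1", {"action": "restart nginx"})
    return {"metrics": seen, "applied": web1.applied}


if __name__ == "__main__":
    print(demo())
```

With a real broker, the subscribe and publish calls would go through a client library, and delivery, routing, and reconnection would be the broker's job rather than the tool's.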

Such a system would allow some very nice improvements:

  • The messaging bus could provide real-time “presence” information.
  • Urgent events could be sent immediately, rather than polled.
  • Urgent administration changes could be sent over the same communication channel as normal operations, unlike (for example) the puppetrun mechanism in Puppet.
  • The specification for how a server is configured could be integrated into the specification for how it should be monitored. This would be an enormous improvement over the current state of the art (in open source tools anywhere), where these two concerns are separated into tools that don’t talk to each other.
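
As a sketch of what the last point might look like, the snippet below derives both the configuration steps and the monitoring checks from one description of a server. The Service type, the two functions, and the WEB_SERVER example are all hypothetical; no existing tool's format is implied, and the "steps" and "checks" are plain strings rather than real actions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Service:
    name: str  # package / service name
    port: int  # TCP port the service is expected to listen on


def config_steps(services: List[Service]) -> List[str]:
    """What an administration tool would do to the server."""
    return [
        step
        for s in services
        for step in (f"install {s.name}", f"enable {s.name}")
    ]


def monitoring_checks(services: List[Service]) -> List[str]:
    """What a monitoring tool would watch, derived from the same spec."""
    return [
        check
        for s in services
        for check in (f"process {s.name} running", f"tcp port {s.port} open")
    ]


# One specification, two consumers.
WEB_SERVER = [Service("nginx", 80), Service("sshd", 22)]

if __name__ == "__main__":
    print(config_steps(WEB_SERVER))
    print(monitoring_checks(WEB_SERVER))
```

The point of the design is that adding a service to the spec changes what gets installed and what gets watched in the same edit, so the two cannot drift apart.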

In addition to the feature improvements, I suspect that both kinds of tools (monitoring and administration) would find they can get by with a smaller codebase by outsourcing the communication bus to a messaging server.

One Response to “The Right Way to do Monitoring and Mass Administration”

  1. Karl Katzke says:

    Kyle, have you looked at the OpenAIS stuff? The documentation kinda sucks right now, but it’s the “new” heartbeat for high-availability stuff on Linux. I’m successfully using it with Pacemaker (the cluster resource manager) to implement failover, mirroring, colocation, and migration of all kinds of Linux services in a high-availability cluster.