Network / System Monitoring Smorgasbord

At one of my firms (a Software as a Service provider), we have a Zabbix installation in place to monitor our piles of mostly Linux servers. Recently we took a closer look at it and found ample opportunities to monitor more aspects, of more machines and devices, more thoroughly. The prospect of increased investment in monitoring led me to look around at the various tools available.

The striking thing about network monitoring tools is that there are so many from which to choose. Wikipedia offers a good list, and the comments on a Rich Lafferty blog post include a short introduction from several of the players. (Update – Jane Curry offers a long and detailed analysis of network / system monitoring and some of these tools (PDF).)

For OS level monitoring (CPU load, disk wait time, # of processes waiting for disk, etc.), Linux exposes extensive information with “top”, “vmstat”, “iostat”, etc. I was disappointed not to find any of these tools conveniently presenting / aggregating / graphing the data therein. From my short look, some of the tools offer small subsets of that data; for the rest, they offer the ability for me to go in and figure out myself what data I want and how to get it. Thanks.
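Much of the raw data that “top” and “vmstat” present is just sitting in /proc, so a small script can gather it for feeding into a monitoring system. A minimal sketch (Linux-specific; it parses the usual /proc text formats):

```python
# Minimal sketch: read OS-level metrics straight from /proc, the same
# data source "top" and "vmstat" use (Linux-specific).

def read_loadavg(path="/proc/loadavg"):
    # The first three fields are the 1-, 5-, and 15-minute load averages.
    with open(path) as f:
        fields = f.read().split()
    return {"load1": float(fields[0]),
            "load5": float(fields[1]),
            "load15": float(fields[2])}

def read_meminfo(path="/proc/meminfo"):
    # Lines look like "MemTotal:  2048000 kB"; keep the numeric kB values.
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            value = rest.split()
            if value:
                info[key] = int(value[0])
    return info
```

A cron job (or monitoring agent) could call these every minute and ship the numbers off for graphing; the hard part, as noted above, is that you have to build this plumbing yourself.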

Network monitoring is a strange marketplace; many of the players have a very similar open source business model, something close to this:

  • core app is open source
  • low tier commercial offering with just a few closed source addons, and support
  • high tier commercial offering with more closed source addons, and more support

I wonder if any of them are making any money.

Some of these tools are agent-based, others are agent-less. I have not worked with network monitoring in enough depth to offer an informed opinion on which design is better; however, I have worked with network equipment enough to know that it’s silly not to leverage SNMP.
I spent yesterday looking around at some of the products on the Wikipedia list, in varying levels of depth. Here I offer first impressions and comments; please don’t expect this to be comprehensive, nor in any particular order.

Zabbix

Our old installation is Zabbix 1.4; I test-drove Zabbix 1.6 (advertised on the Zabbix site as “New look, New touch, New features”). The look seemed very similar to 1.4, but the new feature list is nice.

We mostly run Ubuntu 8.04, which offers a package for Zabbix 1.4. Happily, 8.04 packages for Zabbix 1.6 are available at http://oss.travelping.com/trac.

The Zabbix agent is delightfully small and lightweight, easily installing with a Ubuntu package. In its one configuration file, you can tell it how to retrieve additional kinds of data. It also offers a “sender”, a very small executable that transmits a piece of application-provided data to your Zabbix server.
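For instance, teaching the agent to report a new piece of data is a one-line addition to zabbix_agentd.conf, and pushing application data is a one-line invocation of the sender (the key name and script path here are made up for illustration):

```
# In zabbix_agentd.conf: define a custom item the server can poll.
# "myapp.queue_length" and the script path are example names.
UserParameter=myapp.queue_length,/usr/local/bin/queue-length.sh

# From an application or cron job, push a value with the sender:
#   zabbix_sender -z zabbix-server -s myhost -k myapp.queue_length -o 42
```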

I am reasonably happy with Zabbix’s capabilities, but I found the GUI design to be pretty weak, with lots of clicking to get through each bit of configuration. I built far better GUIs in the mid-90s with far inferior tools to what we have today. Don’t take this as an attack on Zabbix in particular, though; I have the same complaint about most of the other tools here.

We run PostgreSQL; Zabbix doesn’t offer any PG monitoring in the box, but I was able to follow the tips at http://www.zabbix.com/wiki/doku.php?id=howto:postgresql and get it running. The monitoring described there is quite high-level and unimpressive, though.

Hyperic

I was favorably impressed by the Hyperic server installation, which got two very important things right:

  1. It included its own PostgreSQL 8.2, in its own directory, which it used in a way that did not interfere with my existing PG on the machine.
  2. It needed a setting changed (shmmax), which can only be adjusted by root. Most companies faced with this need would simply insist the installer run as root. Hyperic instead emitted a short script file to make the change, and asked me to run that script as root. This greatly increased my inclination to trust Hyperic.

Compared to Zabbix, the Hyperic agent is very large: a 50 MB tar file, which expands out to 100 MB and includes a JRE. Hyperic’s web site says “The agent’s implementation is designed to have a compact memory and CPU utilization footprint”, a description so silly that it undoes the trust built up above. It would be more honest and useful of them to describe their agent as very featureful and therefore relatively large, while providing some statistics to (hopefully) show that even its largish footprint is not significant on most modern servers.

Setting all that aside, I found Hyperic effective out of the box, with useful auto-discovery of services (such as specific disk volumes and software packages) worth monitoring; it is far ahead of Zabbix in this regard.

For PostgreSQL, Hyperic shows limited data. It offers table and index level data for PG up through 8.3, though I was unable to get this to work, and had to rely on the documentation instead for evaluation. This is more impressive at first glance than what Zabbix offers, but is still nowhere near sufficiently good for a substantial production database system.

Ganglia

Unlike the other tools here, Ganglia comes from the world of high-performance cluster computing. It is nonetheless apparently quite suitable nowadays for a typical pile of servers. Ganglia aims to efficiently gather extensive, high-rate data from many PCs, using efficient on-the-wire data representation (XDR) and networking (UDP, including multicast). While the other tools typically gather data at increments of once per minute, per 5 minutes, per 10 minutes, Ganglia is comfortable gathering many data points, for many servers, every second.
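The transport idea is easy to sketch: one small UDP datagram per sample, cheap enough to send every second. This toy Python example illustrates that idea only; Ganglia’s real wire format uses XDR encoding (not JSON) and typically multicast rather than unicast, and the port number is simply Ganglia’s default:

```python
import json
import socket

def send_metric(name, value, addr=("127.0.0.1", 8649)):
    # One fire-and-forget datagram per sample; no connection setup,
    # so a node can emit metrics every second with negligible cost.
    # (Ganglia encodes with XDR; JSON just keeps this sketch readable.)
    payload = json.dumps({"name": name, "value": value}).encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, addr)
    finally:
        sock.close()
```

With multicast, every collector on the group address sees every sample, which is how Ganglia gets redundancy without per-node configuration.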

The Ganglia packages available in Ubuntu 8.04 are quite obsolete, but there are useful instructions here to help with a manual install.

Nagios

I used Nagios briefly a long time ago, but I wasn’t involved in the configuration. As I read about all these tools, I see many comments about the complexity of configuring Nagios, and I get the general impression that it is drifting into history. However, I also get the impression that its community is vast, with Nagios-compatible data gathering tools for any imaginable purpose.

Others

Zenoss

Groundwork

Munin

Cacti

How Many Monitoring Systems Does One Company Need?

It is tempting to use more than one monitoring system, to quickly get the logical union of their good features. I don’t recommend this, though; it takes a lot of work and discipline to set up and operate a monitoring system well, and dividing your energy across more than one system will likely lead to poor use of all of them.

On the contrary, there is enormous benefit to integrated, comprehensive monitoring, so much so that it makes sense to me to replace application-specific monitors with data feeds into an integrated system. For example, in our project we might discard some code that populates RRD files with history information and publishes graphs, and instead feed this data into a central monitoring system, using its off-the-shelf features for storage and graphing.

A flip side of the above is that as far as I can tell, none of these systems offers detailed DBA-grade database performance monitoring. For our PostgreSQL systems, something like pgFouine is worth a look.

Conclusion

I plan to keep looking and learning, especially about Zenoss and Ganglia. For the moment though, our existing Zabbix, upgraded to the current version, seems like a reasonable choice.

Comments are welcome, in particular from anyone who can offer comparative information based on substantial experience with more than one of these tools.

High Quality Screen Recordings

At Oasis Digital we’ve found that we can communicate effectively with each other and with customers, across time and space, using screen + audio recording (also called screencasts or screen videos). We use these to demonstrate a new feature, to explain how code works, to describe how a new feature should work, etc. The communication is not as good as a live, in-person meeting/demo, but the advantages often outweigh that factor:

  1. No travel.
  2. No need to synchronize schedules.
  3. The receiving person can view the recording repeatedly, at their convenience.
  4. Customers and developers who join the project team later can look at old recordings to catch up.

It turns out that I am unusually picky about the quality of such recordings; I’ve written up some technical notes on how to get good results, and posted them: HighQualityScreenRecordings.pdf.

A few highlights:

  • A reasonably fast computer can both run an application and record screen video at the same time; but if you will be recording the use of an application that generates a lot of disk activity, you must save the video to a separate hard drive (internal, external, network server, etc.) from the one your OS and applications run on. (For applications that generate little disk activity, a single system hard drive works fine.)
  • Use a headset-style microphone, and record in a quiet place: close the door, turn off the music, etc.
  • Adjust your audio levels well. Please. This is the most common and most annoying problem with screencast and podcast recordings I find.
  • Bytes are cheap; use a sufficiently large window and sufficiently high bitrate.

Many more details are in the PDF linked above.

The J2EE / Rails / TurboGears / etc. Video

Lots of people are linking to this excellent video (presentation and screencast) by Sean Kelly at JPL (380 megabyte, 30+ minutes, worth it):

http://oodt.jpl.nasa.gov/better-web-app.mov

Watch it all the way through. Wow.

It’s a little over-the-top in its J2EE example; but the disparity is still stark… and near to my heart as we have a time tracking web application in development.

But I think he stopped short of a fundamental point: that the benefits of dynamic languages, XML-free convention over configuration, fast turnaround, etc. are not specific to GUIs or web applications; he experienced them there and showed them there, but as far as I can tell the same issues and benefits apply equally to the non-GUI tiers of a system.

Slider Control for Touch-Screen Applications

An improved version of this post is cross-posted on the Oasis Digital blog.

At Oasis Digital we are working on an application that will run on a touch-screen computer, and which will be used to (among other things) control an audio amplification system. There are some design considerations for touch-screen applications which are rather stark once you use the touch-screen for a few minutes:

  • A finger is rather less precise than a mouse pointer – hitting small targets is hard
  • Drag/drop operations (or grab-adjust operations) sometimes don’t start quite where you aimed
  • Your finger blocks the point on the screen where you are pressing
  • The concept of keyboard “focus” is moot on applications with no keyboard

To accommodate the first two of these, I’ve built a prototype/example of a slider control to use for audio adjustment in such an application. It has these key features:

  1. It does not matter if you click on the “handle” or on the rest of the bar – because with a touch screen you won’t be able to reliably do one vs. the other.
  2. The adjustment is not immediate; there is a limit on the speed of the adjustment, to produce smoother audio control. The slider handle will move down very quickly, but will move up slowly. This avoids the possibility of an accidental touch pushing the audio amplification into feedback.
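The asymmetric rate limit in point 2 is simple to express. A language-neutral sketch in Python (the actual control is Delphi, and the rate constants here are invented for illustration):

```python
class RateLimitedSlider:
    """Slider whose applied value chases the touched target position,
    falling quickly but rising slowly to avoid audio feedback."""

    def __init__(self, value=0.0, down_rate=200.0, up_rate=20.0):
        self.value = value          # current (applied) level, 0..100
        self.target = value         # where the user last touched
        self.down_rate = down_rate  # units per second when decreasing
        self.up_rate = up_rate      # units per second when increasing

    def touch(self, position):
        # Touching anywhere on the bar sets the target; no need to
        # hit the handle itself (feature 1 above).
        self.target = max(0.0, min(100.0, position))

    def tick(self, dt):
        # Called periodically (e.g. from a UI timer) with elapsed seconds;
        # the value moves toward the target at the permitted rate.
        if self.target < self.value:
            self.value = max(self.target, self.value - self.down_rate * dt)
        else:
            self.value = min(self.target, self.value + self.up_rate * dt)
        return self.value
```

With these example constants, a touch at the bottom takes effect almost instantly, while a touch at the top ramps the level up over several seconds.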

The prototype/example looks like the screenshot here:

Touchscreen Slider

… but ignore the static image, it doesn’t tell the story. To see it in action, take a look at the screencast (currently in a WMV file, easily viewable only on Windows) or download the example program and play with it. The source is also on github.

This example runs on Delphi 2007 (rather old… but I had it handy on the PC at hand); an enhanced version of it is used in a production application for a touch-screen audio control system. My secondary goal was to use it as a target to create an AJAX control with the same behavior. Anyone want to write that?