Tracking stats through Ganglia, while a fairly essential part of any ongoing application, is not enough. When a database server crashes (and unfortunately, everything does), it's not very useful for us to know about it 24 hours later when we look at a query rate graph and see that we haven't been serving any select queries.
What we need is a tool that will monitor key statistics and alert us when they change to a certain value, drop below a certain value, or go above a certain value. The Nagios system (http://www.nagios.org/) allows you to do just that. Nagios allows you to write monitoring "plug-ins" that test various system services and report back their status. When services move from OK into Warning or Critical states, a configurable chain of actions occurs: emailing and paging engineers, disabling services, and so on.
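The plug-in contract is simple: a plug-in prints one line of status text and signals its state through its exit code (0 for OK, 1 for Warning, 2 for Critical, 3 for Unknown). Here's a minimal sketch of that convention in Python; the `check_value` helper and the specific thresholds are our own illustration, not part of Nagios itself:

```python
import sys

# Nagios plug-in exit codes: these values are part of the plug-in convention
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_value(value, warn, crit, label="value"):
    """High-watermark check: return (exit code, status line) for a reading."""
    if value >= crit:
        return CRITICAL, "CRITICAL - %s is %s (threshold %s)" % (label, value, crit)
    if value >= warn:
        return WARNING, "WARNING - %s is %s (threshold %s)" % (label, value, warn)
    return OK, "OK - %s is %s" % (label, value)

# Example: pretend we just measured 97 percent disk usage
status, message = check_value(97, warn=95, crit=98, label="disk usage")
print(message)
# A real plug-in would finish with: sys.exit(status)
```

Nagios reads the first line of output for display and uses the exit code to decide whether to fire the escalation chain.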
When configuring a Nagios install, you need to design an escalation chain for your system. If you're the only engineer, then you'll need to direct all pages to yourself, but as your team grows, different people can take responsibility for different components. At some point, you can separate engineering from operations and take engineers off the direct paging list. Then when a failure is detected, on-call operations get paged. When they can't deal with the issue directly, they pass the event on to the next person in the escalation chain.
You need to think carefully about how the escalation chain works for each different failure from each different component in your system. Database issues may need to go to a different engineer than web or storage server issues. When the on-call engineer gets paged about an issue and she can't solve it, you need to have a clearly documented escalation procedure to make sure the right person gets notified. For minor problems, this can often mean emailing an engineer so she can fix the issue the next time she's online; not every issue is worth getting someone out of bed for.
The types of checks that you'll want Nagios to perform fall into a few basic categories, and we'll look at each in turn.
10.3.1. Uptime Checks
The most basic form of monitoring check is checking that a service is up. This ranges from pinging a host to verify that it's alive and on the network, up to component-level service checks. A web server needs a check to see if Apache is running, while a database server needs a check to see that MySQL is running.
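A simple way to sketch a component-level uptime check is to try opening a TCP connection to the service's port; this only proves something is listening, not that it's healthy, but it catches outright crashes. The hostnames below are hypothetical:

```python
import socket

def port_is_up(host, port, timeout=3.0):
    """Crude uptime check: can we open a TCP connection to the service?"""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False

# e.g. port_is_up("db1.example.com", 3306) for MySQL,
#      port_is_up("www1.example.com", 80) for Apache
```

Real Nagios installs typically use the bundled check_tcp, check_http, and check_mysql plug-ins for this, which additionally speak enough of each protocol to verify the service is actually responding.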
In addition to checking services as a whole, you might want to check subservices. Within MySQL, you'll want to check that all slaves have active I/O and SQL threads, to show that replication hasn't stopped. All main components and services in your system should be checked, such as memcached, Squid, NFS, and Postfix. Some services are more critical than others; losing 1 web server out of 20 isn't a big deal, while losing your MySQL master server should be paging your DBA immediately.
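The replication subservice check boils down to inspecting two columns of MySQL's SHOW SLAVE STATUS output. A sketch of the decision logic, assuming you've already fetched the status row into a dict (the database driver and fetching code are left out):

```python
def replication_ok(slave_status):
    """Given a SHOW SLAVE STATUS row as a dict of column name to value,
    verify that both replication threads are running. If either the I/O
    thread (pulling events from the master) or the SQL thread (applying
    them) has stopped, replication has effectively halted."""
    return (slave_status.get("Slave_IO_Running") == "Yes"
            and slave_status.get("Slave_SQL_Running") == "Yes")
```

Wired into a Nagios plug-in, a False result here should map to a Critical state, since a stopped slave silently falls further behind until someone restarts it.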
10.3.2. Resource-Level Monitoring
Resources that can gradually be used up, such as disk space or network bandwidth, should be monitored with high watermark checks. When a disk becomes 95 percent full, you might want to set a warning in Nagios, while 98 percent would set off a critical warning.
Much lower numbers can be used for warnings, which will allow you to use Nagios warning alerts as indications that you need to expand capacity by purchasing more disk space or connectivity bandwidth.
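A disk watermark check can be built on a single filesystem stat call. This sketch uses Python's Unix-only os.statvfs; the 95/98 defaults mirror the thresholds above, but you'd tune them (or lower them, as just suggested) per filesystem:

```python
import os

def disk_usage_percent(path):
    """Percentage of the filesystem holding `path` that's in use (Unix only)."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    if total == 0:
        return 0.0
    free = st.f_bavail * st.f_frsize  # space available to non-root users
    return 100.0 * (total - free) / total

def disk_state(path, warn=95.0, crit=98.0):
    """Map current usage onto Nagios-style states using high watermarks."""
    used = disk_usage_percent(path)
    if used >= crit:
        return "CRITICAL"
    if used >= warn:
        return "WARNING"
    return "OK"
```

In practice the stock check_disk plug-in does this job; a home-grown version is mainly useful when you want custom logic, such as rate-of-growth prediction.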
10.3.3. Threshold Checks
Some statistics can't be "used up" like disk space, but still present issues when they start getting above a certain high watermark. For MySQL replication setups, it's good to be paged when replication lag goes over a certain value, as the system will be in a state that needs tending to.
Similarly, you'll want to keep an eye on things like disk I/O and memory page swap rates so that you can be alerted when things start to go wrong. If you keep your server clocks in sync using NTP (or even if you don't), then it can be useful to be alerted when server clocks drift too far apart. This is especially important if your system relies on clocks being roughly in sync. If some data being created gets dated correctly while other data is five minutes behind, then you're going to end up with some oddly sorted results. This can be especially weird in discussion forums, where some new replies will appear at the end of a thread while some will appear halfway up the thread.
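For the replication lag case, the number to threshold against is Seconds_Behind_Master from SHOW SLAVE STATUS. A sketch of the check logic; the 60-second and 300-second thresholds here are illustrative assumptions, not recommendations:

```python
def lag_state(seconds_behind, warn=60, crit=300):
    """Threshold check on MySQL's Seconds_Behind_Master value.
    MySQL reports NULL (None here) when replication isn't running at all,
    which is worse than any finite lag, so treat it as critical."""
    if seconds_behind is None:
        return "CRITICAL"
    if seconds_behind >= crit:
        return "CRITICAL"
    if seconds_behind >= warn:
        return "WARNING"
    return "OK"
```

The same shape of check works for disk I/O rates, swap activity, and clock drift: pick the statistic, pick a warning and a critical watermark, and page when you cross them.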
10.3.4. Low-Watermark Checks
Not everything that can go wrong is covered by the standard checks, so it can often be helpful to be alerted when it starts to look like things are going badly. Low-watermark checks alert you when a particular statistic drops below a defined threshold, indicating that something's starting to fall off, even if it hasn't dropped to zero.
By putting a low-watermark check on MySQL queries or Apache requests, you can be alerted when traffic starts to unexpectedly drop. This could indicate any number of things: a load balancer isn't directing enough traffic to a machine, the machine is clogged and can only service a reduced number of connections, a routing issue has cut off a portion of the Internet, a network is flooded and dropping packets, or even that you just don't have a lot of people using your application. Low-watermark checks are a useful mechanism to alert you to further investigate a situation and respond to a threat before it starts to bring down servers.
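Since counters like MySQL's Questions status variable only ever increase, a low-watermark check on query traffic means sampling the counter twice, computing a rate, and comparing it against a floor rather than a ceiling. A sketch, with the floor value as an assumed example:

```python
def query_rate(prev_count, prev_time, cur_count, cur_time):
    """Queries per second between two samples of a cumulative counter
    (e.g. MySQL's Questions status variable). Times are in seconds."""
    elapsed = cur_time - prev_time
    if elapsed <= 0:
        return 0.0
    return (cur_count - prev_count) / float(elapsed)

def low_watermark_state(rate, floor):
    """Unlike a high-watermark check, we alert when the value drops BELOW
    the threshold -- quiet is suspicious here."""
    return "WARNING" if rate < floor else "OK"

# 600 queries over 60 seconds = 10 qps; warn if we expected at least 50
state = low_watermark_state(query_rate(1000, 0, 1600, 60), floor=50.0)
```

The floor needs to track your real traffic patterns (overnight lows, weekends), or the check will page you every quiet period.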
There are a huge number of statistics to track and monitor, and the initial hurdle is being able to track them at all and be alerted when things go wrong. Once this is in place, deciding which statistics to track and which to ignore can be quite a challenge. A good rule of thumb is to track more than you need to. It's easy to ignore extra statistics, but tough to do without the ones you need. Start with a large set and reduce it down as time goes by and you get a better understanding of what you find useful. And remember, a well-monitored application is a happy application.