Chapter 10. Statistics, Monitoring, and Alerting

Your killer application is going well and is scaling to provide services to the millions of paying customers clamoring to use it. Performance is good, the system scales easily, and you've made the world a better place.

The reality is that as time goes on, things change and components fail. Your challenge in maintaining a large application is to stay on top of what's happening, which limits you're reaching, and where you're over-provisioning.

Performing periodic spot checks to find and optimize bottlenecks is all well and good, but consistent long-term monitoring provides a much better overview of how the system is performing and helps us see weak spots on the horizon. It's more or less impossible to test everything before pushing real traffic onto the system. Even with intensive load testing, we can never be sure exactly what users are going to do and how their behavior will affect the system. By monitoring our system well, we can see what's happening in real time and address any problems as they occur. Of course, being able to monitor a system well is the real challenge.