Even with our skilled team, it helps to have systems in place that inform us of any problems, whether real or potential. The Production Operations team's monitoring system of choice is Nagios.
Nagios is a 10-year-old, battle-hardened system for monitoring... just about anything. At Lithium, we use it to make sure our communities are running, to make sure our servers are healthy, and to give us insight into services that may be due for a preemptive restart or configuration change.
Lithium has extended Nagios in several ways. We've created a custom IRC bot that alerts us to any change in service status. We also work actively with our development teams to build monitoring hooks directly into our application, so that we can monitor the internals of our communities. These hooks give us rich access to health data, such as garbage collection statistics over time, detailed views of memory usage for specific pieces of the application, and complex analyses of internal structures that give us early warning of possible issues.
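To make the idea of a monitoring hook concrete, here is a minimal sketch of a Nagios-style check written in Python. It is not Lithium's actual code. The metric name `GC_PAUSE`, the thresholds, and the `fetch_gc_pause_ms` function are all hypothetical stand-ins for whatever an application hook might expose. The only parts taken from Nagios itself are the plugin conventions: a one-line status message, optional performance data after a `|`, and an exit code of 0, 1, 2, or 3 for OK, WARNING, CRITICAL, or UNKNOWN.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a metric to a Nagios state; higher values are treated as worse."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_status(name, value, warn, crit, unit="ms"):
    """Return (state, status line) in the usual plugin output shape."""
    state = evaluate(value, warn, crit)
    if state == UNKNOWN:
        return state, f"{name} UNKNOWN - no data"
    # Text after '|' is performance data: label=value[unit];warn;crit
    perfdata = f"{name.lower()}={value}{unit};{warn};{crit}"
    return state, f"{name} {LABELS[state]} - {value}{unit} | {perfdata}"


def fetch_gc_pause_ms():
    """Hypothetical stand-in for a call to an application health hook."""
    return 120


def main():
    # Thresholds here are illustrative assumptions, not recommended values.
    state, line = format_status("GC_PAUSE", fetch_gc_pause_ms(), warn=200, crit=500)
    print(line)
    return state
```

To run this as a plugin, the script would end by passing the result of `main()` to `sys.exit`, so that Nagios reads the state from the process exit code and displays the printed line alongside it.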
Nagios is a critical tool for the Production Operations team, but the extensions we have built and the maturity of Lithium's internal tools that surround it are what make the process shine.