Nagios: Watching the Watchmen

by Lithium Guru on 05-26-2009 10:47 PM

Even with our skilled team, it helps to have systems in place informing us of any problems, whether real or potential. The Production Operations team's monitoring system of choice is Nagios.

 

Nagios is a 10-year-old, battle-hardened system for monitoring... just about anything. At Lithium, we use it to make sure our communities are running, to make sure our servers are healthy and to give us insights into services that may be due for a preemptive restart or configuration change.

 

service-detail.png

 

Lithium has extended Nagios in several ways. We've created a custom IRC bot to alert us to any changes in service status. We also work actively with our development teams to incorporate monitoring hooks directly into our application, to monitor the internals of our communities. These hooks give us rich access to health data such as garbage collection statistics over time, detailed views of memory usage in specific pieces of the application and complex analyses of internal structures that give us early warning of possible issues.
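To make the idea of a monitoring hook feeding Nagios a little more concrete, here is a minimal sketch of how a memory reading could be turned into a Nagios-style result. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the `status | perfdata` output line follow the standard Nagios plugin convention. Everything else is an assumption for illustration: the function name `evaluate_heap`, the `heap_pct` label and the 80/90 percent thresholds are hypothetical and are not Lithium's actual hooks.

```python
# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_heap(used_mb, max_mb, warn_pct=80.0, crit_pct=90.0):
    """Map a heap reading to an (exit_code, status_line) pair.

    used_mb and max_mb stand in for values a hypothetical health hook
    would report; the thresholds are illustrative defaults.
    """
    if used_mb is None or max_mb is None or max_mb <= 0:
        return UNKNOWN, "HEAP UNKNOWN - no usable reading"

    pct = 100.0 * used_mb / max_mb
    if pct >= crit_pct:
        code = CRITICAL
    elif pct >= warn_pct:
        code = WARNING
    else:
        code = OK

    # Text before the pipe is the human-readable status;
    # text after it is performance data (label=value;warn;crit).
    line = "HEAP %s - %.1f%% used | heap_pct=%.1f%%;%.0f;%.0f" % (
        LABELS[code], pct, pct, warn_pct, crit_pct)
    return code, line
```

A wrapper script would print the status line and exit with the returned code, which is how Nagios learns the state of the service. Whatever relays alerts onward, an IRC bot for example, only needs to react when that state changes.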

 

Nagios is a critical tool for the Production Operations team, but its extension and the maturity of Lithium's internal tools that surround it are what make the process shine.

World: A Community's Bird's-eye View

by Lithium Guru on 05-04-2009 12:09 PM - last edited on 05-04-2009 04:31 PM

The Production Operations team has a variety of tools at its disposal, both homegrown and third-party. Our first line of defense when managing a community is called World.

 

(Disclaimer: This is just a test community :robothappy:)

 

World - Status.png

 

World is a management console installed on each community. It provides valuable community and system metrics to the PO team that help us diagnose and fix problems as -- or before -- they occur.

 

On the status screen shown above, we see trending across various time slices. One extremely valuable row in particular is "Monthly Req. Rate." This information can show if the community is running along just fine, or if it's being inundated by a spike in traffic. Although the community software will self-adjust to traffic spikes through the use of its process queues and throttling, World offers the last line of defense, giving us control over application variables that can be adjusted to deal with abnormal traffic.
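As a rough illustration of what "a spike in traffic" can mean in practice, the sketch below compares the current request rate with the average of recent samples. This is not how World works internally. The function name `is_traffic_spike`, the multiplier of 3 and the minimum of five samples are all assumptions chosen for the example.

```python
from statistics import mean


def is_traffic_spike(recent_rates, current_rate, factor=3.0, min_samples=5):
    """Return True when current_rate is well above the recent baseline.

    recent_rates is a list of earlier request-rate samples (any unit,
    as long as it matches current_rate). The factor and min_samples
    values are illustrative, not tuned.
    """
    # Too little history makes the baseline meaningless, so stay quiet.
    if len(recent_rates) < min_samples:
        return False

    baseline = mean(recent_rates)
    return baseline > 0 and current_rate > factor * baseline
```

With samples of 100, 110, 90, 105 and 95 requests per second, the baseline is 100, so a reading of 400 counts as a spike and a reading of 150 does not.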

 

World has too many screens to go into and too much functionality to touch on in a single post, but one other extremely helpful screen is the "stats" screen:

 

World - Stats.png

 

The stats screen shows six commonly used statistics about the community, and we can choose from a list of hundreds of lesser-used stats in case we need to track anything in particular. We can view current memory profiles to track garbage collections, investigate CPU spikes and higher-than-normal request rates, and pull up a number of other helpful statistics.

 

World is just one weapon in the PO arsenal, but it's an extremely helpful tool to have at our side.

Lifting the Curtain

by Lithium Guru on 04-17-2009 02:18 PM - last edited on 04-17-2009 02:35 PM

Hello, folks. I'm Adam from Lithium's Production Operations (PO) team.

 

I wanted to get an introductory post out about what the PO team is, and why you might not've heard of us in the past. If you haven't heard of us, we've probably been doing our jobs! Our mission is to keep every production community running, all the time. We work closely with our IT department to manage servers (both internal and colocated), to keep tabs on community traffic, and to keep community uptime as close to 100% as possible.

 

Over the course of this blog, I'd like to touch on the tools we've created to support our tasks, the procedures we follow to ensure security and quick response times, and where we're headed in 2009 and beyond.

 

As Lithium continues to grow, so does Production Operations' challenge of keeping everything running smoothly. To co-opt a Futurama quote, "If you do things right, people won't be sure you've done anything at all."

Message Edited by AdamT on 04-17-2009 02:35 PM
