Cool Tools: Then One Day It Happens…
May 4, 2012
You’ve joined an elite team of engineers and administrators tasked with overseeing your company’s technological needs. As your company’s ambitious marketing teams generate more and more buzz, you find that with each day your job revolves increasingly around growing your business’s capacity. Months go by, filled with unhindered progress on one project after another. You’ve helped double your web traffic, beef up your network, and revamp your monitoring system. Long story short, six months go by, and all is gloriously well. Most importantly, another quarter concludes with 100% uptime.
So it begins one bright morning in October. While you are standing among the familiar faces at a quarterly all-hands meeting, your phone beeps. Like the first harmless drop of a thunderstorm, your ringtone leads the symphony of SMS tones that erupts around you. One by one, your department mates hastily retreat to their desks.
As you un-suspend your laptop, it hits. Someone in a nearby cubicle yells, “The website is down.” The concept of “seeing it for yourself” goes straight out the window, as your company website’s front page has been replaced with a “server not responding” message. Someone from support comes down the hall at a brisk pace, darting around anxiously looking for anyone’s attention. Keyboards click furiously as utilization and capacity on every paging system are checked. Memory usage looks fine, and none of the partitions are full. Is the database responding? Are there any blocks on it? How about the web servers? How many threads are there? Are they load balanced? How’s our web traffic looking? Is there anything at all? Your team verbally divides up lists of hosts to cover more ground. Everything looks fine, so why is nothing working? Moments later, directors and executives begin to arrive. Casting aside all other questions, they have come looking to answer the paramount one: Is this customer-impacting?
Finally, the order comes in to bounce every affected system, potentially dropping active connections. The alarms subside, and the fire is squelched. Verbal RFOs (reasons for outage) begin to get exchanged. Terms like “process anomaly” and “cascading effect” are passed around. Shoulder blades remain tense. Ultimately, the bad news comes in the form of two numbers: “12 minutes. We just lost a nine.” To the rest of the world, your site is back to business as usual, but you know this incident has now changed your work life for at least the next 3-15 months.
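For anyone who wants to check the math behind “we just lost a nine,” here is a rough sketch. It assumes a 365-day year and that the 12 minutes is the only downtime in that year. The function names are my own, made up for illustration, and are not part of any tool.

```python
# Assumption: a 365-day year, i.e. 525,600 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(nines: int) -> float:
    """Downtime per year allowed at a given number of nines.

    For example, 5 nines means 99.999% availability, which leaves
    0.001% of the year (a fraction of 10**-5) for outages.
    """
    return MINUTES_PER_YEAR * 10 ** -nines


def availability_percent(downtime_minutes: float) -> float:
    """Availability over one year, as a percentage."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


def nines_achieved(downtime_minutes: float, max_nines: int = 9) -> int:
    """Largest number of nines whose yearly budget still covers the downtime.

    max_nines caps the count so that zero downtime does not loop forever.
    """
    nines = 0
    while nines < max_nines and downtime_minutes <= downtime_budget_minutes(nines + 1):
        nines += 1
    return nines


if __name__ == "__main__":
    # Yearly downtime budget for three, four, and five nines.
    for n in (3, 4, 5):
        print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes per year")

    outage = 12  # minutes, from the story above
    print(f"{outage} minutes down: {availability_percent(outage):.4f}% "
          f"({nines_achieved(outage)} nines)")
```

Under those assumptions the five-nines budget is a little over 5 minutes a year and the four-nines budget is a little under 53, so a single 12-minute outage lands between the two.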
In a scenario where the driving forces of a product are uptime and availability, a reactive philosophy toward maintaining technology can mean the difference between keeping a nine and losing one, and between meeting your SLAs and missing them. It takes only a little more than 5 minutes of total service interruption in a year to bring 99.999% uptime down to 99.99%. But what if there were another way? What if we could detect a problem before it becomes one? What if all the monitors and logs you need in a crisis were not just centrally available, but groomed?
I have chosen to learn all I can about Splunk because of what it can bring to the data analytics table. More visibility helps with capacity planning, systems monitoring, root cause analysis, and just plain making data make sense. I mean, if we can make something better, then why not? And what better way is there to learn about a product than getting the opportunity to see how other companies use it? As a Function1 consultant, not only will I be able to share with our customers some new and innovative ways to use Splunk, but I’ll also have the chance to encounter more use cases than I could ever fathom. Here’s to the pursuit of knowledge.