With the industry-wide cost of downtime approaching $700 billion a year, no one can afford interruptions in their app or service. (A single outage is commonly estimated to cost $5,600 per minute.)
If you're running your application on Amazon Web Services, we recommend using AWS application monitoring software to stay one step ahead of potential server issues; otherwise you're left to mine your logs after the fact for root causes. There are two main ways of mining your logs for data. One is helpful. The other is the path of pain. Let's look at each one.
Contrary to popular belief, downtime existed in the '90s, even though the internet and online services were still very young. In 1998, AT&T infamously suffered 26 hours of downtime when its frame relay network failed, halting bank transactions and breaking AT&T's established SLA.
A technical paper written a few years later about the incident reported:
The Service Level Agreements (SLAs) that had promised 99.99% availability and four-hour recovery times did not go unnoticed. Approximately 6,600 customers were not charged, resulting in millions of dollars in lost revenue alone.
Source: Agilent Technologies via Keysight
How was log management handled back then? In a moment of crisis, a developer would start grepping /var/log or syslog for keywords or constructing complicated regexes, server by server. Occasionally, larger firms had the resources to build proprietary in-house solutions that only one person knew how to operate or update.
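For the curious, here is a rough sketch of that manual workflow in Python; the error pattern and the assumption that everything interesting lives under /var/log are purely illustrative:

```python
import re
from pathlib import Path

# Hypothetical pattern: the kinds of failures you might hunt for by hand.
PATTERN = re.compile(r"timed? ?out|connection refused", re.IGNORECASE)

def scan_logs(log_dir="/var/log"):
    """Walk every *.log file under log_dir and print lines matching PATTERN."""
    for log_file in Path(log_dir).rglob("*.log"):
        try:
            with log_file.open(errors="replace") as handle:
                for line_number, line in enumerate(handle, start=1):
                    if PATTERN.search(line):
                        print(f"{log_file}:{line_number}: {line.rstrip()}")
        except PermissionError:
            # Without the right permissions, you cannot even look.
            print(f"skipped (no permission): {log_file}")

if __name__ == "__main__":
    scan_logs()
```

Multiply this by every server in your fleet, under the pressure of an active outage, and it's easy to see why the approach doesn't scale.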
This is what logging looks like today for companies stuck in the past:

- You hear from your customer service team about site downtime and reactively scramble to fix the cascading issues, staying late into the evening when you should be at home.
- You lament the current state of your system: too many log files, too many problems to triangulate, and too few people fixing the root causes.
- Your CTO is breathing down your neck, threatening to fire you this time if the problem is not fixed within 15 minutes.
- You search your logs endlessly, but new log files appear faster than you can get ahead of them.
- Life as a site reliability engineer feels unreliable because you're not in control, like trying to identify a problem without the system permissions required to investigate it.
Smart software engineers and entrepreneurs saw the problem and developed solutions. Log collection has traditionally been time-consuming, and log volume can be unwieldy. Centralized logging tools aggregate all your systems' logs into one place (most commonly in the cloud, though on-premises solutions are also available).
Once the logs are ingested into a data store, the software parses and indexes them and presents the data in easily digestible graphs. Alert thresholds can be configured to trigger automatically when downtime occurs, warning the appropriate personnel so they can fix the problem.
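As a loose illustration of the threshold idea (the record format and the threshold value here are invented for the example; real tools parse and index records for you at ingest time), counting errors per minute and alerting past a limit might look like this:

```python
from collections import Counter
from datetime import datetime

# Illustrative parsed log records; a real pipeline produces these at ingest.
records = [
    {"time": "2018-01-02T09:15:01", "level": "ERROR", "service": "checkout"},
    {"time": "2018-01-02T09:15:07", "level": "ERROR", "service": "checkout"},
    {"time": "2018-01-02T09:15:09", "level": "INFO",  "service": "search"},
    {"time": "2018-01-02T09:15:12", "level": "ERROR", "service": "checkout"},
]

ERRORS_PER_MINUTE_THRESHOLD = 3  # assumed alerting threshold

def minute_bucket(timestamp):
    """Truncate an ISO timestamp to its minute, e.g. 2018-01-02T09:15."""
    return datetime.fromisoformat(timestamp).strftime("%Y-%m-%dT%H:%M")

error_counts = Counter(
    minute_bucket(r["time"]) for r in records if r["level"] == "ERROR"
)

for minute, count in sorted(error_counts.items()):
    if count >= ERRORS_PER_MINUTE_THRESHOLD:
        # A real tool would page on-call staff here instead of printing.
        print(f"ALERT: {count} errors in minute {minute}")
```

Real platforms wire the alert to email, SMS, or a paging service instead of printing, but bucketing plus a threshold is the core of the mechanism.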
Centralized log management demonstrates its own value to upper management every time you resolve downtime faster and more efficiently, and it makes log trends easier to access and digest.
This is what logging looks like for companies that have embraced modern logging tools:

- Your centralized logging tool mines your site logs through real-time data streams and uses AI to forecast impending downtime (a toy version of this idea is sketched after this list).
- You receive alerts and can get right to tackling the problem. You confidently fly through root cause analysis and resolve the errors, leaving work on time because you eliminated the problem before the support team was inundated with emails and tweets.
- Your CTO trusts you to keep the site working, and the CEO recognizes that you've just averted multiple disasters.
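Vendors keep their forecasting models proprietary, so treat this as a toy stand-in: a simple moving-average detector that flags a spiking per-minute error count, the kind of signal an AI-driven tool would act on long before a human noticed:

```python
from statistics import mean, stdev

def looks_anomalous(history, latest, sigmas=3.0):
    """Flag `latest` if it sits more than `sigmas` standard deviations
    above the average of the recent `history` of per-minute error counts."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline, spread = mean(history), stdev(history)
    return latest > baseline + sigmas * max(spread, 1.0)

# Per-minute error counts from the log stream (illustrative numbers).
error_counts = [2, 3, 2, 4, 3, 2, 3, 19]

if looks_anomalous(error_counts[:-1], error_counts[-1]):
    print("Warning: error rate is spiking; downtime may be imminent.")
```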
The year 2018 is upon us, and AT&T's infamous frame relay network outage happened 20 years ago. Don't be like AT&T in 1998. Make a New Year's resolution with your engineering team to bring your organization's log management into the present.
Looking for log management software that’s not only modern but also part of the future? Our predictive logging solution, Lumberjack, centralizes all your logs and makes them searchable.
It also goes a step further with artificial intelligence that alerts you before service interruptions occur.