In 1979, at the Three Mile Island nuclear power plant, a stuck-open valve drained coolant from the reactor core, leaving the fuel rods to heat uncontrollably. Within seconds, operators faced hundreds of alerts and were hopelessly confused.
In the end, one engineer ignored all the alerts and simply turned on a water pump. The crisis was averted, but only after radioactive steam had been released into the surrounding community. The moral of the story: the complexity of the system had rendered traditional monitoring unusable.
We all know about “economies of scale.” It is why Amazon beat local bookstores, and we get our electricity from utilities instead of generating it ourselves. However, not all processes benefit from scale.
Take hierarchical management in business, which becomes less effective at scale. Decision-making slows, and executives make worse decisions because they simply cannot know the specifics as well as leaders at smaller, more focused firms.
It turns out that scale can drive both economies and diseconomies, and believe it or not, that is the perfect place to start today’s blog post on how Graphiant made it simple to monitor a massive global network.
Here is how network monitoring works today.
It sounds simple enough, but monitoring today’s massive networks can be hopelessly complex.
First, you cannot monitor everything, so you resort to “sampling”: recording only every Nth measurement. Anything that happens between samples is missed, and the context around each data point is lost.
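To make the sampling problem concrete, here is a minimal sketch (the metric stream, interval, and values are invented for illustration) of how recording every Nth measurement can miss a short-lived spike entirely:

```python
# Sketch: why every-Nth sampling misses issues. All values are hypothetical.

latency_ms = [20, 21, 19, 22, 20, 950, 880, 23, 20, 21, 19, 22]  # brief spike at t=5..6

N = 4
sampled = latency_ms[::N]  # keep every 4th measurement: t = 0, 4, 8

print("full stream max:", max(latency_ms))  # 950 ms -> a real incident
print("sampled max:    ", max(sampled))     # 20 ms  -> the spike is invisible
```

The incident sits entirely between two samples, so no threshold on the sampled data can ever catch it.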
Secondly, the “network effect” compounds as new nodes are brought online: every new node can interact with every existing node, so the number of relationships to monitor grows far faster than the node count itself.
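A back-of-the-envelope sketch (not a model of any particular network) shows how quickly those pairwise relationships pile up:

```python
# Sketch: monitored node-to-node relationships grow quadratically with node count.
for nodes in (10, 100, 1_000, 10_000):
    pairs = nodes * (nodes - 1) // 2  # possible node-to-node pairs
    print(f"{nodes:>6} nodes -> {pairs:>12,} potential interactions")
```

Going from 100 nodes to 10,000 multiplies the node count by 100, but the potential interactions by roughly 10,000.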
These lessons resonate with us at Graphiant, where we run a massive, complex global network, and they drove important changes in how we manage our global footprint.
“Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom.”
- Clifford Stoll and Gary Schubert
At Three Mile Island, the operators were flooded with data. What they needed was the knowledge of what was wrong and the wisdom to know how to fix it.
By contextualizing the metric streams and applying hierarchical, rule-based logic written by domain experts, we have sharply reduced the background noise our operators face every day.
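To illustrate the idea (a minimal sketch with invented alert kinds, fields, and rules, not Graphiant’s actual rule engine), a higher-level rule can recognize that a down link explains the latency symptoms on the same device, and suppress them:

```python
# Sketch: hierarchical, rule-based noise reduction.
# Alert kinds, fields, and the rule itself are hypothetical.

from dataclasses import dataclass

@dataclass
class Alert:
    device: str
    kind: str    # e.g. "link_down", "high_latency"
    detail: str

def contextualize(alerts: list[Alert]) -> list[Alert]:
    """Apply domain-expert rules from most to least significant."""
    # Rule: a down link on a device explains high latency on that device,
    # so keep the root cause and drop the symptom alerts.
    down_devices = {a.device for a in alerts if a.kind == "link_down"}
    return [a for a in alerts
            if not (a.kind == "high_latency" and a.device in down_devices)]

raw = [
    Alert("edge-7", "link_down", "wan0 lost carrier"),
    Alert("edge-7", "high_latency", "rtt 950 ms"),
    Alert("edge-7", "high_latency", "rtt 880 ms"),
]
for alert in contextualize(raw):
    print(alert)  # one actionable alert instead of three
```

Layering rules like this, from most to least significant, is what turns a flood of data into a short list of explained problems.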
The changes we made worked! Most days, operators get real-time notifications of clearly described problems (not just raw data), and we continue to train the software to auto-remediate known problems, so the operator sees only a record of the fix.
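One way to picture that auto-remediation step (a hedged sketch; the problem signatures and fix routines below are invented): known signatures map to fix routines, and unknown problems still page a human:

```python
# Sketch: mapping known problem signatures to automated fixes.
# The signature names and fix functions are hypothetical.

def restart_service(alert):
    # ...call the orchestration layer here...
    return f"restarted service on {alert['device']}"

PLAYBOOKS = {"service_hung": restart_service}

def handle(alert):
    fix = PLAYBOOKS.get(alert["kind"])
    if fix:
        print("FIXED:", fix(alert))      # operator sees only this record
    else:
        print("PAGE OPERATOR:", alert)   # unknown problem -> human in the loop

handle({"kind": "service_hung", "device": "edge-7"})
```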
Recently, Azure had an unplanned outage that knocked out one of our Controllers. Within 60 seconds, our operators were notified of the failure, and they spun up new Controllers within minutes. Problem solved. Crucially, the Controller’s collapse did not produce thousands of alerts from the clients connected to it. Operators received one consolidated alert pointing to the issue with surgical precision.
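In principle, that consolidation can work like this (a sketch assuming a simple topology map; the names and counts are invented): when a shared upstream component fails, alerts from its dependents are folded into a single root-cause notification:

```python
# Sketch: collapsing dependent-client alerts into one root-cause alert.
# Topology, names, and counts are hypothetical.

CONNECTED_TO = {f"client-{i}": "controller-az-1" for i in range(1000)}

def consolidate(client_alerts, failed_controllers):
    explained = {c for c in client_alerts
                 if CONNECTED_TO.get(c) in failed_controllers}
    summary = [f"ROOT CAUSE: {ctrl} is down "
               f"({sum(1 for c in explained if CONNECTED_TO[c] == ctrl)} client alerts suppressed)"
               for ctrl in failed_controllers]
    return summary + [f"UNEXPLAINED: {c}" for c in client_alerts if c not in explained]

alerts = list(CONNECTED_TO)  # all 1,000 clients lost their Controller at once
print(consolidate(alerts, {"controller-az-1"}))  # one alert, not a thousand
```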
Best Practices for Wisdom-Based Monitoring
We’ve established a set of best practices for using this new approach.
In the end, replacing raw data with knowledge and wisdom has delivered two key benefits. First, we encounter far fewer negative events (downtime, brownouts, and the like). Second, we respond to the events that do occur more promptly.
Better yet, we manage our network with fewer operators.
The lesson here? Sometimes scale is your friend. When it is not, you must find a way to manage it effectively.
Resources

Follow Vinay on LinkedIn.

View his demo of Graphiant’s Observability Features on YouTube.