In 1979, at the Three Mile Island nuclear power plant, a stuck valve drained coolant from the reactor, causing the fuel rods to heat uncontrollably. Within seconds, hundreds of alarms fired, hopelessly confusing the operators.

In the end, one engineer ignored all the alerts and simply turned on a water pump. The crisis was averted, but only after radioactive steam had been released into the surrounding community. The moral of the story: the system's complexity had rendered traditional monitoring unusable.


Take hierarchical management in business, which becomes less effective at scale. Decision-making is slower, and executives make worse decisions because they simply cannot be as knowledgeable about specific matters as leaders at smaller, more focused firms.

It turns out that scale can drive both economies and diseconomies, and believe it or not, that is the perfect place to start today's blog post on how Graphiant made it simple to monitor a massive global network.

Why Flooding People with Alarms is Broken
Here is how network monitoring works today:

• Collect data on important network metrics (status, throughput, latency, etc.).

• When something exceeds acceptable limits, send an alert to the operator.
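The two steps above can be sketched in a few lines. The metric names and limits here are illustrative assumptions, not any vendor's actual schema:

```python
# Minimal sketch of classic threshold-based monitoring.
# Metric names and limits are illustrative, not a real product's schema.

THRESHOLDS = {
    "latency_ms": 200,   # alert if latency exceeds 200 ms
    "loss_pct": 1.0,     # alert if packet loss exceeds 1%
}

def check(sample: dict) -> list[str]:
    """Return an alert string for every metric over its limit."""
    alerts = []
    for metric, limit in THRESHOLDS.items():
        value = sample.get(metric)
        if value is not None and value > limit:
            alerts.append(f"ALERT: {metric}={value} exceeds limit {limit}")
    return alerts

print(check({"latency_ms": 350, "loss_pct": 0.2}))
```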

It sounds simple enough, but monitoring today's massive networks can be hopelessly complex.

For instance, you cannot monitor everything, so you resort to "sampling": monitoring only every Nth reading. This leads to missed issues and loss of context.
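A toy example makes the risk concrete. With the made-up readings below, sampling every fourth value skips right past a short-lived spike:

```python
# Sketch: sampling every Nth reading can miss a short-lived spike entirely.
# The readings are invented for illustration.
stream = [10, 11, 10, 950, 12, 10, 11, 10]   # one latency spike at index 3

N = 4
sampled = stream[::N]                         # keep every 4th reading
print(sampled)                                # [10, 12]: the spike is gone

assert max(sampled) < 100                     # sampled view looks healthy
assert max(stream) > 900                      # full stream shows the problem
```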

Secondly, the "network effect" compounds as new nodes come online: in a full mesh, n nodes create n(n-1)/2 potential paths, so what you must monitor grows quadratically with every node you add.
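The growth is easy to see with the standard full-mesh link count, a rough proxy for how monitoring scope explodes as nodes are added:

```python
# Pairwise connections in a full mesh grow as n*(n-1)/2,
# a rough proxy for how monitoring scope explodes with node count.
def mesh_links(n: int) -> int:
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(n, mesh_links(n))
# 10 nodes -> 45 links, 100 -> 4950, 1000 -> 499500
```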

A Better Way
These lessons resonate with us at Graphiant, where we run a massive and complex global network. This led to important changes in how we manage our global footprint:

• No sampling: We decided early on to stream 100% of all available metrics. We would not risk missing important information or context.
• No alert floods: We stopped flooding our operators with low-level alerts. Indulge me while I explore how one of my favorite quotes is helpful here:

"Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom."
(Clifford Stoll)

At Three Mile Island, the operators were flooded with data. What they needed was the knowledge of what was wrong, and the wisdom to know how to fix it.

By contextualizing the metric streams and applying hierarchical rule-based logic written by domain experts, we have reduced the background noise that the operator faces daily.
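One way hierarchical logic cuts noise is topology-aware suppression: when a parent device is down, its children's alerts are folded into one root-cause alert. This is a minimal sketch under an assumed parent/child device model, not Graphiant's actual rule engine:

```python
# Sketch of hierarchical alert suppression: if a parent device is down,
# alerts from its children are folded into one root-cause alert.
# The topology and alert shapes are illustrative assumptions.

PARENT = {
    "client-1": "controller-a",
    "client-2": "controller-a",
    "client-3": "controller-b",
}

def consolidate(down_devices: set[str]) -> list[str]:
    alerts = []
    for device in sorted(down_devices):
        parent = PARENT.get(device)
        if parent in down_devices:
            continue                      # suppressed: the parent explains it
        alerts.append(f"ALERT: {device} is down")
    return alerts

# controller-a and both its clients are down -> one consolidated alert
print(consolidate({"controller-a", "client-1", "client-2"}))
# ['ALERT: controller-a is down']
```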

The changes we made worked. Most days, operators get real-time notifications of clearly described problems, not just raw data, and we continue to train the software to auto-remediate; when it succeeds, the operator sees only a record of the fix.

Recently, Azure had an unplanned outage that knocked out one of our Controllers. Within 60 seconds, our operators were notified of the failed Controller, and they spun up new Controllers within minutes: problem solved. The Controller's collapse did not trigger thousands of alerts from the clients connected to it; operators received one consolidated alert pointing at the issue with surgical precision.

Best Practices for Wisdom-Based Monitoring
We've established best practices for using this new approach:

• We treat our rule engine like an open-source solution. Our goal is to allow anybody the flexibility to write and publish their own set of custom rules that suit their needs. This allows our domain experts to write precision rules, which are faster, less noisy, and more effective.
• We can add business logic to canary a new ruleset on a chosen set of devices and to suppress alerting on devices going through planned maintenance windows. This level of control lets us focus on the things that matter most.
• We add remediation advice to our alerts wherever possible. This makes it easier and more effective for our operators to solve problems.
• When possible, we have the software "auto-remediate" (i.e., implement the fix without human intervention). This saves time and reduces operator effort.
In the end, replacing data with knowledge and wisdom has driven three key benefits for us. We encounter far fewer negative events (downtime, brownouts, etc.). We respond to the events that remain more promptly. Better yet, we manage our network with fewer operators.
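The maintenance-window suppression mentioned in the best practices above might look like this in miniature. The window bookkeeping and device names are illustrative assumptions:

```python
# Sketch: suppress alerts for devices inside a planned maintenance window.
# Device names and window storage are illustrative assumptions.
from datetime import datetime, timezone

MAINTENANCE = {
    # device -> (window_start, window_end), both UTC
    "edge-7": (datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc),
               datetime(2024, 5, 1, 4, 0, tzinfo=timezone.utc)),
}

def should_alert(device: str, now: datetime) -> bool:
    window = MAINTENANCE.get(device)
    if window and window[0] <= now <= window[1]:
        return False          # planned work: stay quiet
    return True

now = datetime(2024, 5, 1, 3, 0, tzinfo=timezone.utc)
print(should_alert("edge-7", now))   # False: inside its window
print(should_alert("edge-9", now))   # True: no window, alert normally
```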

The lesson here? Sometimes, scale is your friend. However, when it is not, you must devise a method to manage the scale effectively.
