Scene:
You are an enterprise administrator, and it’s business hours.
You get frantic calls from multiple folks!

     “Users are experiencing bad video quality on their Zoom sessions!”

    Time is ticking, and you need to first isolate the origin of the problem.

    It’s the Texas Site — 004 (15 minutes have passed since the call.)

    The alerts aren’t stopping, and you need to find the root cause of the issue.

    Is it an underlay issue? Or an overlay connectivity issue? Or a control plane issue? Or even a system plane issue?

      Maybe logical drilling down is the only option. So should you:

      • Sieve through thousands of “alarms”?
      • Use medieval sources like Syslog, IPFIX flow records, and SNMP MIBs to retrace events?

      Time has most certainly run out.

      What started as a triage is now a post-mortem!

      End Scene

      Issues like this are seldom isolated; each is just one of the many an admin typically receives (cue alert fatigue).

      Troubleshooting an issue in any enterprise network is like finding a weed in a meadow of grass, typically very reactive and laborious.

      Our Approach

      In the scenario above, anything could have gone wrong on any or all observation planes. That is why, at Graphiant, we came up with a bottom-up approach to this problem, so that narrowing down issues proactively (preferably before they happen) is just a few simple clicks away. For example, a bad carrier MOS (Mean Opinion Score) should be bubbled up as a data-plane metric and raised as an ALERT when it starts deteriorating or has historically been poor.
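      As a rough illustration of the MOS example, the sketch below flags a carrier whose score is either trending down or has been poor over its whole history. The thresholds, the window size, and the `mos_alert` name are assumptions made for this example, not values or APIs from Graphiant's product.

```python
from statistics import mean

# Hypothetical thresholds chosen for illustration only.
POOR_MOS = 3.5      # history averaging below this counts as "historically poor"
DROP_MARGIN = 0.3   # recent average this far below the earlier average counts as "deteriorating"


def mos_alert(samples, recent=3):
    """Return an alert reason for a carrier's MOS samples (oldest first), or None."""
    if len(samples) <= recent:
        return None  # not enough history to compare recent against earlier samples
    earlier, latest = samples[:-recent], samples[-recent:]
    if mean(samples) < POOR_MOS:
        return "historically_poor"
    if mean(earlier) - mean(latest) >= DROP_MARGIN:
        return "deteriorating"
    return None


healthy = mos_alert([4.2, 4.3, 4.2, 4.3, 4.2, 4.3])
sliding = mos_alert([4.3, 4.2, 4.3, 4.1, 3.8, 3.6])
chronic = mos_alert([3.2, 3.1, 3.3, 3.2, 3.1, 3.2])
```

      Here `sliding` is raised while the carrier is still usable, which is the "before it happens" case; a real system would likely tune both thresholds per carrier.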

      What needs to be monitored? The answer varies depending on who you are talking to. For example, the admin of a NaaS provider like Graphiant would want to monitor the management plane, the control plane, and the entire backbone. On the other hand, the enterprise admin would be interested in the health of their branch site(s), DC site(s), WAN circuits, etc.

      So, measure, measure, and measure everywhere…

      Our high-level approach to solving this problem is to bubble up all metrics from all planes, consolidate them, and present them in a consumable fashion.

      Periodic and Real Time

      How aggressively should we measure? We took the approach of measuring the metrics along all the pillars both periodically (via a scraper) and in real time (via updates from the agents).

      Scrape and Real Time

      To know the state of a metric, periodic measurement alone is NOT enough. As the time events above indicate, a metric could be in the desired state at every periodic sample yet have deteriorated and recovered between two samples. Hence, we went with a hybrid approach: the periodic scrape solves the initial-state problem, and the real-time updates solve the flap-measurement problem.
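      To make the gap concrete, here is a minimal sketch in which a link goes down and recovers between two scrapes. The 60-second interval, the event format, and the function names are assumptions for illustration only.

```python
from bisect import bisect_right


def state_at(events, t, initial="up"):
    """State at time t, given time-sorted (timestamp, state) change events.

    `initial` is the baseline state, i.e. what the first scrape established.
    """
    times = [ts for ts, _ in events]
    i = bisect_right(times, t)
    return events[i - 1][1] if i else initial


def scrape(events, start, end, interval):
    """What a periodic scraper sees: one sample every `interval` seconds."""
    return [state_at(events, t) for t in range(start, end + 1, interval)]


def flaps(events):
    """Number of transitions into "down" reported by the real-time agent."""
    return sum(1 for _, state in events if state == "down")


# The link drops at t=70s and recovers at t=95s, between the 60s and 120s scrapes.
events = [(70, "down"), (95, "up")]
samples = scrape(events, start=0, end=120, interval=60)
flap_count = flaps(events)
```

      The scraper reports `up` at 0, 60, and 120 seconds, while the event stream records one flap. Neither source alone gives both the baseline and the transitions.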

      Scale

      What we are proposing sounds good in theory, but will it scale in practice? Realistically, an enterprise could have anywhere from 5,000 to 20,000 devices spread throughout the world. The following questions naturally arise:

      • How do we handle this volume of metrics?
      • How do we consolidate it all in real-time?
      • How do we present it in a consumable fashion to the troubleshooting admin?
      • How do we show historical behavior?
      • And finally, how does it scale?

      Here is our proposed architecture …

      The trick is to build a pipeline that can handle this scale. 

      (1) The ingester does a raw insert as quickly as possible so that no metrics from the devices are lost

      (2) Enrichment runs asynchronously off the raw Kafka topic

      (3) A stream processor correlates the enriched events and pre-aggregates runtime metrics (for speedy time-series metrics)

      (4) The result is fed into a columnar database (real-time OLAP DB)
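
      The four stages above can be sketched in a few lines. This is a toy, single-process model: plain in-memory queues and a list stand in for the Kafka topics and the OLAP database, and the record fields, the `DEVICE_SITE` inventory, and the 60-second window are assumptions for illustration rather than Graphiant's actual schema.

```python
from collections import defaultdict, deque

raw_topic = deque()       # stands in for the raw Kafka topic
enriched_topic = deque()  # stands in for a topic of enriched events
olap_rows = []            # stands in for the columnar real-time OLAP store

# Hypothetical device inventory used for enrichment.
DEVICE_SITE = {"edge-1": "texas-004", "edge-2": "texas-004"}


def ingest(record):
    """(1) Raw insert: append as-is, with no parsing or lookups on the hot path."""
    raw_topic.append(record)


def enrich():
    """(2) Drain the raw topic separately from ingestion and attach the site."""
    while raw_topic:
        rec = raw_topic.popleft()
        site = DEVICE_SITE.get(rec["device"], "unknown")
        enriched_topic.append({**rec, "site": site})


def aggregate(window=60):
    """(3) Pre-aggregate per (site, metric, time window), then (4) store the rows."""
    buckets = defaultdict(list)
    while enriched_topic:
        rec = enriched_topic.popleft()
        key = (rec["site"], rec["metric"], rec["ts"] // window * window)
        buckets[key].append(rec["value"])
    for (site, metric, start), values in sorted(buckets.items()):
        olap_rows.append({"site": site, "metric": metric, "window_start": start,
                          "avg": sum(values) / len(values), "count": len(values)})


ingest({"device": "edge-1", "metric": "latency_ms", "ts": 5, "value": 10.0})
ingest({"device": "edge-2", "metric": "latency_ms", "ts": 30, "value": 30.0})
ingest({"device": "edge-1", "metric": "latency_ms", "ts": 65, "value": 50.0})
enrich()
aggregate()
```

      Keeping `ingest` free of lookups is the point of stage (1): anything slow is deferred to the later stages so the write path can keep up with the devices.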

       

      Presentation and Drill-down

      The last and most important pillar is how to present this to the admin. We chose a heat map with category filters to narrow down the issue as it happens with very few clicks.

      This provides a bird’s-eye view of what is wrong and what is not. The heat map above consolidates all the plane states for a given site into a single cell. A cell is lit red if any of its planes (data, control, management, or system) goes red.
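
      The roll-up rule is simple enough to sketch directly. The site names, the two-color scheme, and the function names below are assumptions for illustration; the `plane` argument plays the role of a category filter.

```python
PLANES = ("data", "control", "management", "system")


def cell_color(plane_states):
    """A site's cell is red if any of its planes is red, otherwise green."""
    return "red" if any(plane_states.get(p) == "red" for p in PLANES) else "green"


def drill_down(sites, plane=None):
    """List red sites, optionally only those where the given plane is red."""
    hits = []
    for name, states in sorted(sites.items()):
        if plane is None:
            if cell_color(states) == "red":
                hits.append(name)
        elif states.get(plane) == "red":
            hits.append(name)
    return hits


# Hypothetical snapshot of two sites.
sites = {
    "texas-004": {"data": "red", "control": "green", "management": "green", "system": "green"},
    "austin-001": {"data": "green", "control": "green", "management": "green", "system": "green"},
}
overview = {name: cell_color(states) for name, states in sites.items()}
data_plane_issues = drill_down(sites, plane="data")
```

      In this snapshot, one filter click on the data plane is enough to isolate `texas-004`, and the control-plane filter returns nothing, which rules that plane out.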

      If you want to learn why we chose this path for our technology, refer to this blog post: What Three Mile Island Taught Me About Managing a Massive Global Network.

      For more info on Graphiant, visit www.graphiant.com

      Read our Technical White Paper: https://graphiant.com/resource/graphiant-technical-white-paper/