ONUG’s Working Groups are having a great impact on the tech industry in numerous ways. The AiOPS and Observability Working Group is composed of core members Tim Van Herck of VMware, Russ Currie of Netscout, and Ted Turner of Kentik. ONUG recently produced a webinar to review the progress each of the working groups has made; the following summarizes some of the main activities of this working group.
Van Herck reviewed the group’s main objective, which is to leverage ML/AI to help businesses answer questions that require data distributed across different cloud and equipment vendors. “Our goal is to correlate all this data together, so you have a single insight into what’s going on,” Van Herck explained. “The biggest difference is that we want to leave the data in place, avoiding the use of large data lakes and warehouses, which are expensive to maintain.”
This approach should ultimately shorten troubleshooting time. If someone calls in with a support question, support must rapidly get to the components within that user’s workflow, redirect the attention of the engineer, and avoid a recurrence of the problem moving forward. Instead of time being spent only on that one problem, the incident becomes a learning experience that accelerates future troubleshooting or avoids the same problem altogether.
Van Herck used a common example to illustrate. “What’s wrong with my Zoom call?” This is a question tech support frequently gets. Low network visibility makes it a challenge to identify the problem. He explained that if you are at a branch location, you have more control. You can answer questions such as “Is there an overlay network issue with the VPN, SD-WAN or cloud security tunnel?” or “Is the router running out of resources?” However, as you get closer to the user, visibility becomes even harder to obtain. It’s hard to know what the Wi-Fi signal looks like, and it’s difficult to identify CPU, storage, or memory issues on endpoint devices.
“As a support engineer, you need a lot more visibility. When a call comes in, you need to get your bearings.” Support must be able to identify the issue within the network quickly. The questions support needs to answer span a variety of domains and formats, which makes visibility even more complex.
“Extract meaningful correlations to accelerate troubleshooting.” That’s how Van Herck explained the purpose of the framework developed by the working group. The framework segments the workflow so that it is easier to troubleshoot. The three main layers are:
This framework makes it possible to quickly answer questions such as “Why is my Zoom not working?” Support can trace the problem all the way from the user’s laptop to an access point and through the routing and switching environment, and can find issues within the internet connection or the application itself.
The AIOPS group is specifically focused on the more complex failure scenarios, including suboptimal signal strength, interface errors on a switch, or minor packet loss on the network. In isolation, none of these would cause a low-quality Zoom call. Combined, however, they can significantly impair application delivery. That’s what the group is really focused on: correlating all these parts of the network to provide a clear view, enabling support to diagnose and fix problems quickly.
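The idea that several minor impairments add up to one user-visible problem can be sketched in a few lines of Python. This is only an illustration, not the working group's method: the `Signal` class, the `assess` function, the signal names, and every threshold below are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One measurement along the path (all names and thresholds are hypothetical)."""
    name: str
    value: float     # observed measurement, higher means worse
    warn: float      # level at which the signal starts to matter
    critical: float  # level at which the signal alone would explain the problem

    def severity(self) -> float:
        """Scale the value to 0.0 (healthy) .. 1.0 (critical on its own)."""
        span = self.critical - self.warn
        return min(max((self.value - self.warn) / span, 0.0), 1.0)


def assess(signals, combined_threshold=1.0):
    """Classify a set of signals as healthy, single-cause, or combined impairment."""
    severities = {s.name: s.severity() for s in signals}
    if any(v >= 1.0 for v in severities.values()):
        verdict = "single-cause"
    elif sum(severities.values()) >= combined_threshold:
        verdict = "combined-impairment"
    else:
        verdict = "healthy"
    return verdict, severities


# Illustrative readings: each one is degraded, none is critical by itself.
signals = [
    # Wi-Fi signal expressed as the absolute value of RSSI in dBm (72 means -72 dBm).
    Signal("wifi_signal_weakness_db", value=72, warn=65, critical=80),
    Signal("switch_interface_errors_per_min", value=6, warn=2, critical=12),
    Signal("packet_loss_pct", value=0.9, warn=0.5, critical=2.0),
]

verdict, severities = assess(signals)
print(verdict, severities)
```

With these made-up numbers no single signal reaches its critical level, yet the summed severity crosses the combined threshold, which is the kind of case a per-device dashboard would miss. A simple sum is the crudest possible combination rule; a real system would presumably learn the weighting from data.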
The group did an early proof of concept (PoC) of their framework in 2019. Their test case involved a Mist access point connected to an SD-WAN edge, with an end-user experiencing an issue with the Zoom application. Both resource orchestrators were streaming data into an AWS bucket. Next, the data went to the visualization layer and was read by AWS Redshift. This PoC follows these steps:
It’s critical that the group has a good understanding of where all data is coming from. Especially as they focus on correlation, they must be confident in the reliability of the data. Van Herck detailed four main sources of data.
The group will also focus on data security and privacy, including the following areas.
The next area of focus is correlation and baselining. “We need an entity relation that includes all the various data sources we ingest,” explained Van Herck. “We think this is a job best done by the enterprise administrator.” At times, this would also involve some impedance matching. Lastly, the group will focus on root cause analysis. That includes data expansion, root cause, predictive analytics, and self-healing.
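As a rough sketch of what an administrator-maintained entity relation could look like, the Python below maps the identifier each data source uses onto one common entity, then groups records by that entity. Everything here is hypothetical: the source names (`wifi`, `sdwan`, `app`), the field names, the sample values, and the `build_index` and `correlate` helpers are assumptions for illustration, not part of the working group's framework.

```python
from collections import defaultdict

# Hypothetical entity relation kept by the enterprise administrator:
# one row per user, listing the key each data source uses for that user.
ENTITY_RELATION = [
    {
        "user": "alice",
        "wifi.client_mac": "aa:bb:cc:00:11:22",
        "sdwan.site": "branch-12",
        "app.email": "alice@example.com",
    },
]


def build_index(relation):
    """Map (source, field, value) to the common entity name."""
    index = {}
    for row in relation:
        for key, value in row.items():
            if key == "user":
                continue
            source, field = key.split(".", 1)
            index[(source, field, value)] = row["user"]
    return index


def correlate(records, index):
    """Group records from different sources under the entity they belong to.

    Records whose identifiers are not in the entity relation are skipped.
    Field values are assumed to be scalars (strings or numbers).
    """
    by_user = defaultdict(list)
    for rec in records:
        for field, value in rec["data"].items():
            user = index.get((rec["source"], field, value))
            if user is not None:
                by_user[user].append(rec)
                break
    return dict(by_user)


# Illustrative records, each shaped the way its own source might report it.
records = [
    {"source": "wifi", "data": {"client_mac": "aa:bb:cc:00:11:22", "rssi_dbm": -72}},
    {"source": "sdwan", "data": {"site": "branch-12", "packet_loss_pct": 0.9}},
    {"source": "app", "data": {"email": "alice@example.com", "call_quality": "poor"}},
    {"source": "wifi", "data": {"client_mac": "ff:ff:ff:00:00:00", "rssi_dbm": -50}},
]

grouped = correlate(records, build_index(ENTITY_RELATION))
print(grouped)
```

The three records tied to the same user end up together even though each source identifies that user differently, while the record with an unknown MAC address is left out. Keeping the mapping as explicit data reflects the point that the administrator, who knows the environment, is best placed to maintain it.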
Van Herck summarized the next steps in these five bullets:
“Calling all vendors, IT executives and leadership,” was Van Herck’s concluding comment.