Retooling IT Operations with Machine Learning and Artificial Intelligence

The ONUG Community is in the vanguard of migrating enterprise IT applications to hybrid multi-cloud environments. While the business benefits are real, this effort is not without its challenges. Hybrid multi-cloud infrastructure is cloud-scale, confronting IT managers with application performance management challenges beyond the capabilities of the tools and techniques built for traditional enterprise IT.

The good news is that in this new era, IT managers can leverage the same cloud-scale technologies that enable this migration to ensure application performance in what are proving to be complex, highly dynamic environments.

Cloud-scale infrastructure is software-defined and virtualized, facilitating full stack instrumentation for collecting many types of telemetry data across multiple layers and domains, including performance metrics, packet data, flow metadata and log data. New streaming protocols and high speed Big Data pipelines allow monitoring and analytics platforms to ingest this flood of data, which could easily amount to hundreds of terabytes per day for a typical hybrid multi-cloud environment.

Modern Big Data technologies were born in the cloud to address the extreme scalability demands of cloud-based applications and services for hundreds of millions of users.

Forward thinking IT managers are already using Big Data monitoring platforms for aggregating, analyzing and storing the massive volume, high velocity and wide variety of telemetry data generated by cloud-scale infrastructure. Big Data platforms enable ITOps, NetOps, DevOps and SecOps teams to harness this data for a multitude of operational scenarios.

While Big Data enables powerful multi-dimensional analytics for extracting actionable insights, most platforms still require a high degree of operator expertise and knowledge to perform the proper sequence of queries needed to identify performance problems and determine their root cause. Big Data is necessary but by itself is not sufficient for hybrid multi-cloud application performance management.

Operations teams need smarter tools that can help streamline and automate workflows. They need tools that constantly analyze the wide variety of time series data acquired by monitoring the full stack of cloud-scale infrastructure. Smarter tools can automatically detect anomalies or unusual conditions and then take remedial action without operator intervention. This may seem like a tall order today, but advances in the application of machine learning and artificial intelligence (AI) should one day make this vision a reality.

The human brain is an incredible computation engine that excels at pattern matching and correlation, which we sometimes credit to “intuition”. Experienced operators are regularly able to monitor multiple dashboard indicators, view lists of alerts and selectively interrogate telemetry data to rapidly determine the root cause of performance problems. However, emerging hybrid multi-cloud environments are so distributed and vast in scale that no single operator can stay on top of everything. Even teams of operators working in concert can find themselves struggling to resolve problems.

Operators need to turn to machines, which fortunately are far superior to the human brain at numerical computation and much better equipped to make sense of the flood of cloud-scale telemetry data. Machine learning is destined to play a key role in retooling operations because machines are not only capable of rapidly analyzing streams of time series data that would overwhelm human operators, but can also detect subtle fluctuations and patterns that the human brain might overlook in a deluge of monitoring data.

Machine learning algorithms are capable of extracting information and meaning from massive data sets. In the hybrid multi-cloud monitoring domain, several types of machine learning algorithms well-suited for analyzing time series data sets can be applied to performance metrics. Clustering and anomaly detection algorithms can be used to detect and flag statistical outliers that are likely indicators of a problem. Statistical analysis algorithms can be used to discover performance trends and predict future behavior.
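As a rough illustration of the two ideas above, the sketch below flags statistical outliers in a performance metric with a trailing-window z-score and fits a least-squares line to estimate a trend. It is a minimal example under stated assumptions, not a description of any particular monitoring platform: the function names, the window size and the threshold of three standard deviations are all hypothetical choices made for illustration.

```python
from statistics import mean, stdev


def zscore_anomalies(series, window=10, threshold=3.0):
    """Return indices of samples that deviate from the trailing-window mean
    by more than `threshold` standard deviations (hypothetical defaults)."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is unusual.
            if series[i] != mu:
                anomalies.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def linear_trend(series):
    """Least-squares slope and intercept of the series against sample index."""
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = mean(series)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def forecast(series, steps_ahead):
    """Extrapolate the fitted line `steps_ahead` samples past the last one."""
    slope, intercept = linear_trend(series)
    return intercept + slope * (len(series) - 1 + steps_ahead)


if __name__ == "__main__":
    # Illustrative latency samples in milliseconds (made-up data).
    latency_ms = [20, 21, 19, 20, 22, 21, 20, 19, 21, 20, 20, 95, 21, 20]
    print("anomalous sample indices:", zscore_anomalies(latency_ms))
```

Production systems would typically use more robust techniques (seasonal baselines, clustering across many metrics), but the shape of the workflow is the same: establish what normal looks like, then flag departures from it.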

In addition to Big Data and machine learning, hybrid multi-cloud operations will also benefit from the application of AI technology to remove human operators from the critical path and fully automate many routine workflows. Rules-based “expert systems” will be capable of emulating the best practices of skilled human operators to automate workflows for detecting problems, performing root cause analysis and taking remedial action.
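To make the idea of a rules-based expert system concrete, here is a minimal sketch in which each rule pairs a condition on current telemetry with a diagnosis and a remedial action. Everything in it is an assumption made for illustration: the metric names, the thresholds and the action labels are hypothetical and would in practice be captured from the runbooks of skilled operators.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Rule:
    name: str
    condition: Callable[[Metrics], bool]  # when does this rule fire?
    diagnosis: str                        # probable root cause
    action: str                           # remedial step to trigger


# Hypothetical rules; thresholds are illustrative, not recommendations.
RULES: List[Rule] = [
    Rule(
        name="cpu_saturation",
        condition=lambda m: m.get("cpu_pct", 0) > 90 and m.get("latency_ms", 0) > 500,
        diagnosis="compute tier saturated",
        action="scale_out",
    ),
    Rule(
        name="packet_loss",
        condition=lambda m: m.get("packet_loss_pct", 0) > 2,
        diagnosis="network path degraded",
        action="reroute_traffic",
    ),
    Rule(
        name="error_spike",
        condition=lambda m: m.get("error_rate_pct", 0) > 5,
        diagnosis="application fault",
        action="rollback_release",
    ),
]


def evaluate(metrics: Metrics, rules: List[Rule] = RULES) -> List[Rule]:
    """Return every rule whose condition matches the current metrics."""
    return [rule for rule in rules if rule.condition(metrics)]


if __name__ == "__main__":
    sample = {"cpu_pct": 95, "latency_ms": 800, "packet_loss_pct": 0.1}
    for rule in evaluate(sample):
        print(f"{rule.name}: {rule.diagnosis} -> {rule.action}")
```

Keeping rules as data rather than hard-coded branches means operators can add or retire a practice without touching the evaluation logic.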

However, in complex, highly dynamic operational environments, there may be no single best solution to a given problem. When confronted with this situation, human operators are usually forced to make choices that may result in a sub-optimal outcome, but they make a decision, move on and deal with the outcome later. Reinforcement learning is the branch of machine learning concerned with this kind of decision-making: learning which actions to take from the observed outcomes of earlier actions.

Ultimately, we need AI that can choose an action from a set of options, any of which may produce a sub-optimal outcome, then monitor that outcome and use what it learns so that future actions yield better results. This type of sophisticated machine learning coupled with artificial intelligence would really help alleviate the significant operational burden of managing hybrid multi-cloud application performance.
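One of the simplest forms of this act, observe and adjust loop is an epsilon-greedy agent, sketched below. It usually picks the remedial action with the best observed track record, occasionally tries another one, and updates a running average of each action's results. This is only a toy stand-in for full reinforcement learning, and the action names and success probabilities in the simulation are invented for illustration.

```python
import random


class EpsilonGreedyAgent:
    """Chooses among actions and learns their average reward from outcomes."""

    def __init__(self, actions, epsilon=0.1, seed=None):
        self.actions = list(actions)
        self.epsilon = epsilon            # probability of exploring
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.actions}
        self.values = {a: 0.0 for a in self.actions}

    def choose(self):
        # Explore occasionally; otherwise exploit the best-known action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[a])

    def learn(self, action, reward):
        # Incremental update of the running average reward for this action.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


def simulate(success_prob, episodes=2000, seed=0):
    """Run the agent against hypothetical per-action success probabilities."""
    agent = EpsilonGreedyAgent(success_prob, epsilon=0.1, seed=seed)
    env = random.Random(seed + 1)
    for _ in range(episodes):
        action = agent.choose()
        reward = 1.0 if env.random() < success_prob[action] else 0.0
        agent.learn(action, reward)
    return agent


if __name__ == "__main__":
    # Made-up chance that each remedial action resolves an incident.
    agent = simulate({"restart_service": 0.2, "reroute_traffic": 0.5, "scale_out": 0.9})
    for action, value in agent.values.items():
        print(f"{action}: estimated success rate {value:.2f}")
```

The agent starts out making poor choices, but because it records how each one turned out, its later decisions tend to favor the actions that have worked best.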

Author's Bio

Guest Author