
Observability and Analytics: How to Manage Multi-Cloud Infrastructure

Hybrid and multi-cloud infrastructure models are the new norm, and more and more companies are choosing a multi-cloud model as their primary strategy. RightScale's "State of the Cloud" report shows that 81% of enterprises preferred multi-cloud models and 51% used hybrid models in 2018.

In some cases, the popularity of multi-cloud infrastructure comes down to a company's desire to reduce dependence on a single vendor while leveraging the model to deliver more value to the enterprise. In others, it is a way to cut expenses and make the infrastructure more cost-efficient.

In all cases, however, multi-cloud models have increased the complexity of infrastructure, and few enterprises have been able to control it successfully. That complexity has also led to many other problems, not the least of which is detecting anomalies. What enterprises fail to understand is that these problems can be resolved with observability and analytics, that is, by knowing how to manage a multi-cloud infrastructure.

Monitoring Cloud Performance for Better Services

Service monitoring allows you to observe your cloud services at runtime to understand how they behave against implemented policies. It includes creating performance reports that help determine how the service is being used in terms of performance, reliability, security, cost, and other metrics.

Detailed analysis also allows enterprises to determine how the cloud behaves under management, which helps improve service governance, orchestration, and integration with security. Monitoring analytics gives administrators the chance to build forecast models of future behavior based on current performance.
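As a minimal sketch of that last idea (illustrative only; the metric, sample values, and capacity threshold are assumptions, not from any particular monitoring product), a simple linear trend fitted to recent utilization samples can project near-term demand:

```python
# Illustrative sketch: fit a least-squares linear trend to recent CPU
# utilization samples and project the next few intervals.
# The samples and the 80% capacity threshold are hypothetical.
from statistics import mean

def forecast_linear(samples, steps_ahead):
    """Least-squares linear fit over equally spaced samples,
    returning projected values for the next steps_ahead intervals."""
    n = len(samples)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(samples)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples)) / \
            sum((x - x_bar) ** 2 for x in xs)
    intercept = y_bar - slope * x_bar
    return [intercept + slope * (n - 1 + k) for k in range(1, steps_ahead + 1)]

cpu_pct = [41, 44, 48, 47, 52, 55, 58, 61]   # hypothetical hourly samples
projection = forecast_linear(cpu_pct, steps_ahead=3)
if max(projection) > 80:                      # assumed capacity threshold
    print("forecast breaches capacity threshold; plan to scale")
```

Real monitoring analytics use far richer models (seasonality, multiple metrics), but the principle is the same: project current behavior forward and compare it against capacity before the anomaly becomes an outage.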

An enterprise can use tools to observe hybrid multi-cloud operations as they run. In most cases, companies use cloud service brokers (CSBs) and cloud management platforms (CMPs) to monitor their clouds. Although these tools add a layer of complexity, they enable the user to manage the cloud and carry out essential operations successfully.

An enterprise can also observe its application stacks and call chains to provide users with robust, secure, high-quality services. This helps you determine how various problems in the cloud, such as degraded telecommunication lines, affect customer experience. If a cloud infrastructure is built on Amazon Web Services, for example, traffic is automatically directed to regional data centers around the globe to ensure there is no significant effect on user experience.

Detecting and Controlling Anomalies in the Cloud

Monitoring and analytics have to be combined with management. Reliable monitoring allows administrators to detect anomalies in real time. Once an abnormality has been identified and its level of risk evaluated, users can choose to control it, mitigate it, or ignore it (when it is out of their control):

1. Anomalies that can be controlled, because administrators choose when and where to administer updates:

  • Performance-based anomalies that occur after code updates;
  • Performance-based anomalies that occur when vendors release new software (OS updates, security updates, hardware updates);
  • Performance-based anomalies that occur when fiber-seeking backhoes (or automobile accidents) damage telecommunication lines

2. Anomalies that can be controlled by customers, and successfully mitigated by administrators:

  • When peak load demand variations overflow data centers (including SaaS or PaaS resources) beyond where their solution can provide reasonable service;
  • When marketing and sales campaigns direct unexpected volumes of traffic;
  • When slow growth turns into a wild surge of traffic after word-of-mouth promotion of a good service

3. Anomalies that are out of administrators' control, which can only be mitigated, not prevented:

  • DDoS attacks by script kiddies
  • DDoS attacks by nation-states
  • DDoS attacks by hacktivists
  • Spear-phishing attacks launched through various mechanisms for political gain, financial gain, or botnet control
  • Bitcoin-mining operations run by botnets
  • Ransomware attacks that encrypt data or destroy it

In each situation, administrators observe normal traffic flow to detect disruptions and scan user activity to evaluate risk and verify the source of the anomaly:

  • Anonymous IP address activity
  • Failed login attempts
  • Unusual administrative actions
  • Activity from inactive accounts
  • Activity from infrequent locations
  • Impossible travel
  • User and device agents
  • A high rate of movement
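One hedged way to turn signals like these into a risk evaluation (the signal names and weights below are illustrative assumptions, not a standard scoring scheme) is a simple weighted score per session, where higher-confidence indicators such as impossible travel contribute more:

```python
# Illustrative sketch: combine boolean anomaly signals observed for a
# session into one risk score. Signal names and weights are assumptions
# chosen for illustration, not taken from any real product.
SIGNAL_WEIGHTS = {
    "anonymous_ip": 3,
    "failed_logins": 2,
    "unusual_admin_action": 4,
    "inactive_account_activity": 3,
    "infrequent_location": 2,
    "impossible_travel": 5,
    "suspicious_user_agent": 1,
    "high_rate_of_movement": 4,
}

def risk_score(observed_signals):
    """Sum the weights of every signal observed for a session."""
    return sum(SIGNAL_WEIGHTS.get(s, 0) for s in observed_signals)

def triage(observed_signals, threshold=6):
    """Label a session by comparing its score to an assumed threshold."""
    return "investigate" if risk_score(observed_signals) >= threshold else "monitor"

session = ["impossible_travel", "failed_logins"]  # score 5 + 2 = 7
print(triage(session))  # prints "investigate"
```

In practice the weights and threshold would be tuned against historical incidents, and a single strong signal (a nation-state DDoS signature, say) would escalate immediately rather than wait on a score.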

When anomalies are found and categorized, an enterprise can determine whether it is cheaper to fix the problem or simply scale around it. Consistently adding capacity to compensate for performance anomalies can push cloud operating expenses beyond reasonable levels, leaving the business spending more on the cloud than the returns the cloud generates by contributing to customer productivity.
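The fix-versus-scale decision reduces to comparing a one-time engineering cost against the ongoing cost of extra capacity over a planning horizon. A minimal sketch, with entirely hypothetical figures:

```python
# Illustrative sketch of the fix-vs-scale decision: compare a one-time
# engineering fix against recurring extra-capacity spend over a planning
# horizon. All dollar figures and the horizon are hypothetical.
def cheaper_option(fix_cost, extra_capacity_per_month, horizon_months):
    """Return "fix" if the one-time fix is cheaper than paying for
    extra capacity every month over the horizon, else "scale"."""
    scale_total = extra_capacity_per_month * horizon_months
    return "fix" if fix_cost < scale_total else "scale"

# A $20k fix vs. $3k/month of extra capacity over a 12-month horizon:
print(cheaper_option(fix_cost=20_000,
                     extra_capacity_per_month=3_000,
                     horizon_months=12))  # prints "fix"
```

The comparison is sensitive to the horizon: over a short horizon, scaling around the anomaly often wins; over a long one, the recurring capacity bill compounds and the fix pays for itself.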

It’s a Work in Progress

Managing a hybrid multi-cloud infrastructure is a challenge because there are no clear blueprints or defined standards of operation. IT organizations have to approach the problem with an open mind, learn from others' experience, evaluate and analyze their cloud environment, and make regular upgrades as the technology evolves.

The ONUG Community knows how innovations in cloud infrastructure management must be applied to transform the digital enterprise and allow it to scale and thrive in the digital market. That's why we have gathered cloud professionals at this year's ONUG Fall Event in New York City. There, you will have the chance to learn how major corporations manage their systems and find out about new best practices that are already a hot topic in the Hybrid Multi-Cloud Working Group.

Author's Bio

ONUG Board Member