The principle of disaggregation is fundamental to both software-defined networking and software-defined application infrastructure. Yet disaggregation presents serious challenges for network operations, IT operations and DevOps teams.
At the network layer, SDN enables separation of the control plane from the data plane and the network operating system (NOS) from the underlying switching hardware.
Programmable merchant silicon enables disaggregation at the chip level. The P4 language can be used to generate different firmware loads that implement specialized networking functions.
Monolithic applications are now disaggregated into microservices, each running in its own container and possibly distributed across multiple clouds. The container runtime and orchestration frameworks – Docker and Kubernetes – are also disaggregated from the underlying server operating system.
The benefits of disaggregation are flexibility and customization. Network and IT architects can mix and match software, firmware and hardware components to tailor disaggregated network and application infrastructure so that it satisfies their organization’s specific business needs.
Low-cost, off-the-shelf switches can be sourced from multiple vendors and deployed with the preferred NOS. Operators can take a do-it-yourself approach to disaggregated software and assemble the separate components themselves, or they can purchase a turnkey package from a single provider that integrates multiple open source components.
A key requirement is that the full stack of disaggregated application and network infrastructure is programmable at each level in the hierarchy using well-defined APIs, allowing for a high degree of customization.
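To make the idea of full-stack programmability concrete, here is a minimal sketch in which each layer exposes a small, well-defined API and a single orchestration call customizes both the network and application layers. All class and method names here are illustrative assumptions, not a real product API.

```python
# Hypothetical sketch of full-stack programmability: each layer exposes a
# small, well-defined API, and higher layers drive lower ones through it.

class SwitchOS:
    """Disaggregated NOS exposing a configuration API (illustrative)."""
    def __init__(self):
        self.vlans = set()

    def create_vlan(self, vlan_id: int) -> None:
        self.vlans.add(vlan_id)

class ContainerPlatform:
    """Container orchestration layer exposing a deployment API (illustrative)."""
    def __init__(self, network: SwitchOS):
        self.network = network
        self.services = {}

    def deploy_service(self, name: str, replicas: int, vlan_id: int) -> None:
        # Drive the network layer through its API before placing workloads.
        self.network.create_vlan(vlan_id)
        self.services[name] = replicas

# One orchestration call customizes both layers through their APIs.
nos = SwitchOS()
platform = ContainerPlatform(nos)
platform.deploy_service("checkout", replicas=3, vlan_id=42)
print(platform.services)  # {'checkout': 3}
print(nos.vlans)          # {42}
```

The point of the sketch is the layering: because each component publishes an API rather than a closed interface, architects can swap implementations at any level without rewriting the layers above.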
Impressive, for sure, but also really complicated. With so many different moving parts, how do operators ensure that the entire stack is working properly and rapidly detect when something is wrong? How do operators quickly respond to changing conditions and know what actions to take?
Hybrid multi-cloud environments are growing too large, complex and dynamic for human operators to remain in the loop for critical operational intelligence functions.
We need a new model for managing the full stack of disaggregated multi-cloud infrastructure that is vast in scale, virtualized at multiple layers and highly dynamic, featuring constantly shifting application workloads and traffic flows. We need what I am calling “cognitive multi-cloud infrastructure”.
Cognitive implies that the infrastructure is sufficiently intelligent to automatically monitor full-stack operations and take any necessary remedial or preventive actions without operator intervention.
Highly distributed and disaggregated multi-cloud infrastructure can be very costly to manage because of the large number of operators required, yet even large teams often cannot react quickly enough to faults, anomalies or rapid changes in network conditions.
Machine learning and AI are fundamental to cognitive infrastructure. Smart machines can sift through the flood of streaming telemetry and generate actionable insights that are not only presented to operators via dashboards but also drive closed-loop automation mechanisms. Advanced machine learning and AI algorithms are needed to perform the complex correlations typically required to detect problems in disaggregated infrastructure, which may also span multiple domains across both public and private clouds.
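As a minimal sketch of what closed-loop automation means in practice, the following anomaly detector watches a single telemetry metric using running statistics and fires a remediation callback instead of merely raising a dashboard alert. Real deployments would use far richer models and multi-metric correlation; the metric name, threshold and remediation action here are all illustrative assumptions.

```python
# Minimal closed-loop sketch: a streaming z-score detector (Welford's
# algorithm for running mean/variance) triggers automated remediation.
import math

class AnomalyDetector:
    def __init__(self, threshold: float = 3.0):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # sum of squared deviations (Welford)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous versus the history so far."""
        anomalous = False
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(value - self.mean) / std > self.threshold:
                anomalous = True
        # Update running statistics with the new sample.
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
        return anomalous

def remediate(metric: str, value: float, actions: list) -> None:
    # Placeholder for a closed-loop action, e.g. rerouting or restarting.
    actions.append((metric, value))

actions = []
detector = AnomalyDetector()
link_utilization = [40, 41, 39, 42, 40, 41, 38, 40, 95]  # sudden spike
for sample in link_utilization:
    if detector.observe(sample):
        remediate("link_utilization", sample, actions)
# Only the final spike (95) triggers remediation.
```

The design choice worth noting is that detection and remediation are decoupled: the same insight can be routed to a dashboard, a ticketing system or an automated control loop.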
Cognitive infrastructure starts with embedding instrumentation in every software and hardware component at each layer of the full stack. Streaming telemetry protocols deliver monitoring information derived from this instrumentation to powerful analytics engines capable of processing telemetry data in real time.
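The telemetry pipeline just described can be sketched as follows, with an in-memory queue standing in for a streaming protocol such as gNMI or Kafka. The component and metric names are illustrative assumptions, and a real pipeline streams continuously rather than processing a fixed batch.

```python
# Sketch of a telemetry pipeline: instrumented components emit samples
# onto a streaming bus; an analytics engine consumes them in real time.
from collections import deque

telemetry_bus = deque()  # stand-in for a streaming telemetry transport

def emit(component: str, metric: str, value: float) -> None:
    """Instrumented components push samples onto the streaming bus."""
    telemetry_bus.append({"component": component,
                          "metric": metric,
                          "value": value})

def analytics_engine(bus) -> dict:
    """Consume samples and keep per-component rolling maxima."""
    peaks = {}
    while bus:
        sample = bus.popleft()
        key = (sample["component"], sample["metric"])
        peaks[key] = max(peaks.get(key, float("-inf")), sample["value"])
    return peaks

# Every layer of the stack emits telemetry...
emit("leaf-switch-1", "queue_depth", 12)
emit("leaf-switch-1", "queue_depth", 87)
emit("checkout-pod-3", "cpu_pct", 64)

# ...and the analytics engine turns the raw stream into operational state.
peaks = analytics_engine(telemetry_bus)
```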
Figure 1 depicts a conceptual model for cognitive multi-cloud infrastructure. There are independent AI-driven automation loops operating at each layer which also feed operational intelligence to the layer above.
Figure 1 – Machine Learning and AI Are Critical for Automating Disaggregated Multi-Cloud Infrastructure
In addition, machine learning and AI feed insights into service, infrastructure and network orchestration software that spans multiple layers of the full stack. This orchestration manages more complex automation functions, such as provisioning, configuring and activating additional compute and networking resources in response to increased demand.
We are just starting on the journey to realize this vision of highly automated, cognitive multi-cloud infrastructure. Today, machine learning and AI mainly assist operators by deriving actionable intelligence, with remedial actions still initiated by humans. Yet the pressing economic imperative for increased automation and the streamlining of tedious, labor-intensive workflows is driving a wave of technology innovation that is starting to bear fruit.
We’ll know we have reached our journey’s final destination when cognitive infrastructure is capable of passing the Turing Test – when we reach the point where it is not possible to discern if disaggregated infrastructure is being managed by teams of human operators or machines.