Autonomous Networks, Not Network Automation

In today’s fast-paced cloud environments, automation is required to effectively manage systems of any size. The goal is often to do more with less, so operators can look beyond what it takes to keep the lights on; the same holds for managing network and system infrastructure. Network operators face pain points similar to those seen in compute environments, but they appear to have an inherent advantage: fewer pieces of equipment to manage. However, with the trend toward densely utilized CLOS fabrics, every network node has 10-20x the number of servers under it. So while there are fewer network devices than servers to manage, this multiplier means every network node has the potential for far larger impact. We cannot accept the adage that fewer devices justify a weaker push to automate, because the potential to cause widespread outages is so high. These systems should be the first to be highly automated, putting them at the forefront of becoming truly autonomous and de-risking daily operations.
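To make the multiplier concrete, a rough blast-radius sketch (the port counts and uplink figures below are illustrative assumptions, not numbers from this article):

```python
# Hedged sketch: rough blast-radius arithmetic for a CLOS leaf switch.
# All figures (ports, uplinks) are illustrative assumptions.

def servers_impacted(ports_per_leaf: int, uplinks: int) -> int:
    """Servers affected when a single leaf switch fails,
    assuming one server per remaining (non-uplink) port."""
    return ports_per_leaf - uplinks

# A 48-port leaf with 8 fabric uplinks leaves 40 server-facing ports,
# so one misconfigured or failed switch takes dozens of servers with it.
print(servers_impacted(ports_per_leaf=48, uplinks=8))  # 40
```

A single server failure affects one workload host; a single leaf failure affects every server beneath it, which is why the automation bar for the network should be higher, not lower.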

So why hasn’t this happened?

Over the past 10 years, the move to mobile has required applications to have the flexibility to move resources between sites and locations on demand. This approach, pioneered by hyperscale operators, drives application monoliths to be split into smaller microservices and pushes virtualization and containerization for maximum portability. As applications become more distributed, east-west bandwidth requirements increase dramatically, forcing the network to increase throughput and reliability and making the choice to adopt CLOS networks an easy one. Hyperscale operators use off-the-shelf components, leveraging cheap and reliable merchant silicon at the core. However, this effectively increases the device count from 2-4 distribution/core devices to 48+ CLOS nodes, putting pressure on legacy network management methods. Hyperscalers addressed this management complexity with homegrown Network Operating Systems (NOS), built with network automation baked in, making them easier to manage at scale.

As smaller organizations attempt to emulate the hyperscale operators, they quickly find themselves hindered by legacy NOS architectures. We are still stuck in a world of manually driven networks, writing scripts on top of the Command Line Interface (CLI). In the best case, tools like Ansible or Puppet are leveraged to bring automation and minimal orchestration to the network. In the worst case, network managers just end up hiring more engineers. The result is clear: there is a significant automation gap between network and compute infrastructure.

To show the gap between where we are as an industry today and where we need to be, this well-circulated image elegantly illustrates the levels of service automation maturity:

Today most networks operate in the zone between Levels 0 and 2. Operators are just keeping their heads above water with a mix of Zero Touch Provisioning, screen-scraping scripts, and more people. They are reactive to changes that occur, and have no easy way of verifying that automation actually worked beyond logging in and manually reviewing the result. This does not scale! It is error prone and leads to massive inconsistencies between design and implementation.

Where should network operators look for the answer to this problem?

Cloud Native solutions offer a vastly more mature autonomous experience. They have become exceedingly good at having software manage other software, operating at Automation Maturity Level 3 and above. We can learn from and adopt this for the network, but it has to start with modernizing the basic building block: the Network OS.

Microservices, containerization, and orchestration (in the form of Kubernetes) are the foundation of Cloud Native architectures. With these building blocks, Cloud Native teams have met the automation challenge head on. We already know what it looks like when network and application services operate at the same level of automation maturity: Service Fabric Meshes in the public cloud. If we leverage Kubernetes and apply modern containerized microservices architectures to the NOS, we unite NetOps and DevOps. At its core, the network is nothing more than a complex distributed application, and it can be automated in the same way as other such applications. Autonomous networking is the future, and a modern NOS architecture is required to build networks that match the automation maturity we see in Cloud Native applications.
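The Kubernetes pattern this points toward is the declarative reconciliation loop: the operator states desired network intent, and a controller continuously converges observed device state toward it, touching only what has drifted. A minimal sketch (the device model and config keys are hypothetical, for illustration only):

```python
from dataclasses import dataclass, field

@dataclass
class Device:
    """Hypothetical network device exposing a programmable config API."""
    name: str
    config: dict = field(default_factory=dict)

    def apply(self, key: str, value: str) -> None:
        self.config[key] = value

def reconcile(desired: dict, device: Device) -> list:
    """One pass of a Kubernetes-style control loop: diff desired
    state against observed state and apply only the drifted keys."""
    changes = []
    for key, value in desired.items():
        if device.config.get(key) != value:
            device.apply(key, value)
            changes.append(key)
    return changes

intent = {"bgp_asn": "65001", "mtu": "9214"}
leaf = Device("leaf-01", config={"mtu": "1500"})
print(reconcile(intent, leaf))  # ['bgp_asn', 'mtu'] -- both keys drifted
print(reconcile(intent, leaf))  # [] -- steady state, nothing to do
```

The second pass returning an empty change set is the point: convergence is verified by re-running the loop, not by a human logging in to eyeball the result.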


Author's Bio

Adam Casella

CTO and Co-Founder at SnapRoute

Adam Casella co-founded SnapRoute in August 2015, where he is responsible for setting technical direction and ensuring the operator’s perspective is always represented. Adam’s background supporting networking products as a vendor, paired with his operational experience running large-scale data center networks, gives him a unique perspective on how to build reliable, resilient, and easy-to-use products. Adam is an authority on both disaggregated and traditional networking technologies, especially their use in hyperscale spine/leaf CLOS designs and topologies.

Prior to founding SnapRoute, Adam was responsible for designing and building hyperscale data center networks at Apple. This included the full spectrum of operations, from product evaluation and proof-of-concept labs to topology design, configuration schemes, device deployment, and maintenance.

Before Apple, Adam was a lead engineer in Cisco’s TAC on the LAN and Data Center Switching teams, giving him deep insight into silicon pipelines and hardware and software architecture, and a strong background in debugging and troubleshooting complex, multi-faceted technical issues.

Adam has a BS in Network and Systems Administration from RIT (Rochester Institute of Technology).