Ultra-High-Speed Lossless Networks Open the Door for AI

by George Zhao

December 4, 2018

AI Demands New Network Capabilities

Artificial Intelligence (AI) is driving the need for high-performance distributed computing and high-speed distributed storage using Solid-State Drives (SSDs). The traffic model for efficient AI training requires very high network throughput, and storage networks based on Non-Volatile Memory Express (NVMe) or SSDs add a low-latency requirement for the network.

Thus, new applications and IT infrastructure are placing new demands on networks. Fortunately, solutions are available to meet these demands with reliable ultra-high throughput. Innovative technologies such as Remote Direct Memory Access (RDMA), real-time scheduling, and dynamic congestion control deliver low latency and zero packet loss on the network — exactly the performance required by AI applications.

AI uses techniques such as Machine Learning (ML), in which cloud-computing infrastructures are fed huge amounts of data to find patterns, create models, solve problems, and answer questions. The resulting AI systems have widespread application in agriculture, financial markets, manufacturing, healthcare, retail, and other industries. For example, AI can be used to answer natural-language questions or implement image-recognition systems in self-driving cars. These systems have high throughput requirements that the network must support. The same is true for technologies such as Virtual Reality (VR) and Augmented Reality (AR), which typically interact with data in the cloud and require support from high-performance backend systems.

For high-performance applications such as AI, the key measures of network performance are throughput, latency, and congestion. Throughput is the total capacity of the network to quickly transfer large volumes of data. Latency is the sum of delays in the network. Congestion occurs when traffic overwhelms parts of the network. One factor that can impact throughput dramatically is packet loss. The impact of losing data in transit ranges from simple performance degradation, for example when higher-level applications must retransmit lost data before they can continue, to complete failure or data corruption, which is possible when the higher-level application is not written to tolerate loss and can’t recover. To mitigate this, the network should ideally offer completely lossless transport: it should signal and mitigate congestion while managing queueing, so that data is never lost.
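To give a rough sense of how strongly packet loss can limit throughput, the sketch below uses the well-known Mathis approximation for loss-based TCP congestion control. This model is not taken from the article and does not describe RDMA transports; it is only an illustration, and the segment size, round-trip time, and loss rates are hypothetical values chosen for a data center setting.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate ceiling on steady-state TCP throughput in bits per second.

    Uses the Mathis et al. approximation:
        throughput ~= (MSS / RTT) * (C / sqrt(p)),  with C = sqrt(3/2)
    It is a simplified model, used here only to show the trend.
    """
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in the interval (0, 1]")
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    c = math.sqrt(3 / 2)
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Hypothetical values: 1460-byte segments, 100 microsecond round trip.
    for p in (1e-6, 1e-4, 1e-2):
        gbps = mathis_throughput_bps(1460, 100e-6, p) / 1e9
        print(f"loss rate {p:.0e}: at most ~{gbps:.2f} Gbit/s per flow")
```

Under this model, throughput falls with the square root of the loss rate, so a hundredfold increase in loss cuts the achievable per-flow rate by a factor of ten.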

Network Requirements for AI

Many experts believe that data requirements for networks and computing systems will continue to skyrocket, fueled by the proliferation of data generated by sensors and other Internet of Things (IoT) devices, supplied through Application Programming Interfaces (APIs), and moved in and out of high-performance distributed storage. All of this data can fuel AI programs for many purposes, but only if the networking and computing infrastructure can handle this large amount of data.

IT managers already see system slowdowns due to data overloads, and the problem will only increase as AI and other new applications impose greater demands. AI systems in the cloud can process much of this data and transaction load. The real challenge is connecting computing resources inside the cloud. These resources and distributed AI applications interact over networks within a data center, and the amount of data exchanged between data center servers is extremely high. This traffic between systems inside the data center, known as “East-West” network traffic, can be as much as 50 times the traffic exchanged outside the data center.

Most network solutions have focused largely on reducing static latency, the latency incurred by forwarding within the network device, when in fact it is dynamic latency that has been shown to impact applications more seriously. In addition, most congestion management systems in existing network platforms tend to be relatively static. Unfortunately, the highly varied types of traffic on today’s networks make congestion a complex, dynamic process that static methods fail to handle adequately. Legacy networking technologies such as InfiniBand and conventional converged Ethernet are no longer the only solutions to support applications such as AI and are, in fact, no longer optimal choices. Emerging technologies such as RDMA over Converged Ethernet (RoCE) and dynamic congestion control, combined with intelligent scheduling and real-time monitoring, offer much more cost-effective ways to build large, low-latency, lossless Ethernet data centers that provide full performance monitoring. Integrating these technologies yields a practical solution that enterprises can deploy today to dramatically improve performance while minimizing costs.
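The difference between static and dynamic latency can be made concrete with a small calculation. The sketch below assumes, as a simplification not stated in the article, that dynamic latency at a single hop is just the time needed to drain the bytes already queued ahead of a packet. The function names and the numbers (1 microsecond forwarding latency, a 100 Gbit/s link, up to 1 MB of queued data) are hypothetical.

```python
def queuing_delay_s(queued_bytes: int, link_bps: float) -> float:
    """Time to drain the bytes queued ahead of a packet on one egress link."""
    if link_bps <= 0:
        raise ValueError("link_bps must be positive")
    return queued_bytes * 8 / link_bps


def hop_latency_s(static_latency_s: float, queued_bytes: int, link_bps: float) -> float:
    """Per-hop latency modeled as static forwarding latency plus queuing delay."""
    return static_latency_s + queuing_delay_s(queued_bytes, link_bps)


if __name__ == "__main__":
    static_s = 1e-6      # hypothetical device forwarding latency
    link_bps = 100e9     # hypothetical 100 Gbit/s link
    for queued in (0, 100_000, 1_000_000):
        total_us = hop_latency_s(static_s, queued, link_bps) * 1e6
        print(f"{queued:>9} bytes queued: {total_us:.1f} microseconds per hop")
```

With these assumed values, a congested queue adds tens of microseconds per hop, which dwarfs the fixed forwarding latency and suggests why controlling queue build-up matters more than shaving the static figure.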

Author's Bio

George Zhao

Director, OSS & Ecosystem at Huawei Technologies Co., Ltd.