Networking for Nerds

How AI Changes Your Network Infrastructure Requirements

Traditional networking must evolve to run AI workloads and transmit data across distributed GPU clusters efficiently and reliably

Ted Kawka

As companies progress through the various stages of AI maturity, they continuously discover new infrastructure requirements. One such requirement is transforming their networking infrastructure to run AI workloads on GPUs. Given their significant investment in acquiring and managing GPUs, companies need to ensure these servers are constantly running—without connectivity interruptions, latency challenges or bandwidth issues.

Traditionally, Ethernet has been the go-to choice for CPU networking. However, the high-performance computing demands of processing AI workloads across large, distributed GPU clusters have raised the bar for performance, scalability and efficiency. Applications such as natural language processing, computer vision, advanced driver-assistance systems (ADAS), virtual assistants and medical diagnostics all require low-latency, high-bandwidth networks that can efficiently handle complex workloads. If the network cannot supply data to the GPUs fast enough, they sit idle, and the hardware will not deliver the expected value for its cost.
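To make the cost of a slow network concrete, here is a back-of-envelope sketch in Python. All numbers are purely illustrative, and the model assumes the simplest case: each training step's communication is fully exposed, with no compute/communication overlap.

```python
# Back-of-envelope estimate of GPU utilization when the network,
# not the GPU, is the bottleneck. All numbers are illustrative.

def gpu_utilization(compute_time_s: float, bytes_to_exchange: float,
                    network_gbps: float) -> float:
    """Fraction of wall-clock time the GPU spends computing, assuming
    communication is fully exposed (no compute/communication overlap)."""
    transfer_time_s = bytes_to_exchange * 8 / (network_gbps * 1e9)
    return compute_time_s / (compute_time_s + transfer_time_s)

# Example: each training step does 100 ms of compute and must exchange
# 1 GB of gradients with its peers.
for gbps in (10, 100, 400):
    util = gpu_utilization(0.100, 1e9, gbps)
    print(f"{gbps:>3} Gb/s network -> {util:.0%} GPU utilization")
```

Under these assumptions, a 10 Gb/s link leaves the GPUs computing only about 11% of the time, while 400 Gb/s raises that to roughly 83%, which is why network bandwidth directly determines how much of the GPU investment is realized.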

Mature technologies, including InfiniBand™ and RDMA over Converged Ethernet (RoCE), are emerging as top choices for networking infrastructure in AI-ready data centers. Another contributor to evolving network technologies is the Ultra Ethernet Consortium (UEC), a neutral body developing high-speed networking protocols and specifications based on Ethernet technologies, which will be of significant interest to companies in the future. Many leading companies that develop AI hardware or software participate across the organization's various membership levels.

Network infrastructure technologies for AI will continue to evolve and play a significant role in enabling AI workloads running in high-performance data centers.

Exploring options for AI networking technologies

Choosing the right AI networking technologies depends on the AI workload types companies are running, the volume of data they’re processing and the number of GPU clusters that need to connect with each other.

In addition to solving AI networking challenges related to low latency and high bandwidth, these technologies can enable a lossless network environment to help overcome network performance bottlenecks that naturally occur in large-scale distributed systems. Data sent across the network needs to reach its destination without being lost or corrupted. Lossless networks eliminate or significantly reduce packet loss, ensuring data integrity and reliability.

InfiniBand[1] is a high-bandwidth, low-latency technology that's been around for more than twenty years but, until recently, was relatively unknown outside high-performance computing. It enables high throughput and ultra-low end-to-end latency for tremendous amounts of data moving over short distances (typically within a data center). This makes InfiniBand an ideal solution for running AI workloads across GPU clusters.

InfiniBand uses its own adapters and switches to facilitate data transfers, which makes it a premium solution. Credit-based flow control is built into the protocol to achieve a lossless network, avoiding the retransmissions and pauses of a typical Ethernet network: the receiver advertises how much buffer space it has free, which controls the amount of data released onto the network and prevents buffer overflow and packet loss. Companies rely on InfiniBand for speed and reliability in high-performance computing environments.
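The credit mechanism can be illustrated with a minimal simulation. The class names, buffer size and single-link model below are illustrative assumptions for this sketch, not taken from the InfiniBand specification:

```python
# Minimal sketch of credit-based flow control, the mechanism InfiniBand
# uses to keep a link lossless. Names and sizes are illustrative.

from collections import deque

class Receiver:
    def __init__(self, buffer_slots: int):
        self.buffer = deque()
        self.credits = buffer_slots      # advertised free buffer slots

    def accept(self, packet) -> None:
        assert self.credits > 0, "sender violated flow control"
        self.buffer.append(packet)
        self.credits -= 1

    def drain(self) -> None:
        """Consume one packet and return a credit to the sender."""
        if self.buffer:
            self.buffer.popleft()
            self.credits += 1

class Sender:
    def __init__(self, receiver: Receiver):
        self.rx = receiver

    def send(self, packet) -> bool:
        # Transmit only when the receiver has advertised a free slot,
        # so the receiver's buffer can never overflow.
        if self.rx.credits > 0:
            self.rx.accept(packet)
            return True
        return False                     # hold the packet; never drop it

rx = Receiver(buffer_slots=2)
tx = Sender(rx)
sent = [tx.send(i) for i in range(4)]   # only 2 fit before credits run out
rx.drain()                               # frees one slot -> one credit back
sent.append(tx.send(99))
print(sent)  # [True, True, False, False, True]
```

The key property is that packets are held at the sender rather than dropped at the receiver, which is what distinguishes a lossless fabric from best-effort Ethernet.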

RoCE[2] is an Ethernet-based technology that provides high-performance RDMA networking for AI workloads. It's more flexible and less expensive than InfiniBand and is a good fit for companies whose AI workloads do not demand the absolute lowest latency. RoCE is also a more familiar networking technology than InfiniBand: the current version, RoCEv2, encapsulates RDMA traffic in UDP/IP, so it can be routed, allowing longer data transfers and connectivity to other networks. And because more manufacturers support it, companies have greater choice in equipment.
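The routability follows directly from the encapsulation. The sketch below is simplified and not wire-accurate (the helper function is hypothetical); it shows an RDMA payload wrapped in an ordinary UDP header addressed to the IANA-assigned RoCEv2 port, which is why standard layer 3 routers can forward the traffic:

```python
# Conceptual sketch of why RoCEv2 traffic is routable: the RDMA payload
# (InfiniBand transport headers plus data) rides inside an ordinary
# UDP/IP datagram. Field layout here is simplified and illustrative.

import struct

ROCEV2_UDP_PORT = 4791  # IANA-assigned destination port for RoCEv2

def encapsulate(rdma_payload: bytes, src_port: int) -> bytes:
    """Wrap an RDMA payload in a minimal UDP header (checksum omitted)."""
    length = 8 + len(rdma_payload)       # UDP header is 8 bytes
    udp_header = struct.pack("!HHHH", src_port, ROCEV2_UDP_PORT, length, 0)
    return udp_header + rdma_payload

datagram = encapsulate(b"BTH+payload", src_port=49152)
_, dst, _, _ = struct.unpack("!HHHH", datagram[:8])
print(dst)  # 4791 -- any IP router forwards this like normal UDP traffic
```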

The Ultra Ethernet Consortium[3] is leading the development of Ultra Ethernet Transport (UET), an Ethernet-based communication stack architecture designed to meet the demands of AI and HPC with robust, scalable, standards-based solutions. While still early in development, UET will likely have an impact soon, given how many leading companies are involved.

UEC plans to drive the development of new software and hardware to increase processing speeds and remove other AI networking barriers. For instance, the number of nodes that can be interconnected concurrently is currently limited by a combination of network type, protocols, hardware capabilities and configuration; UEC plans to introduce solutions that raise those limits.

Industries with extreme networking requirements for AI workloads

While all industries require high-performance networking for processing AI workloads, there are some where the performance requirements are especially high. Certain industry-specific use cases demand the fastest possible networking speed and reliable data transfers to complete time-sensitive model training.

For instance, life sciences companies can train AI models to identify compounds that physicians can use to treat diseases in new and more effective ways. Imagine the improved patient outcomes they can drive by introducing groundbreaking treatments that use these newly identified compounds.

Consider the sheer volume of data that credit card companies accumulate from their customers. They can train their AI models to extract insights and help them identify new products and services to pair with existing offerings, driving additional revenue while improving user satisfaction.

ADAS companies must ingest and migrate massive volumes of testing data from the field for analysis and processing. They can train their AI models to develop and support advanced driver-assistance and autonomous driving solutions.

Positioning AI-specific network infrastructure in the right places

When it comes to deploying network infrastructure for AI, location matters. Equinix AI-ready data centers are strategically located in the world’s most connected markets and provide a scalable infrastructure foundation that enterprises can use to advance their AI capabilities.

AI processing technologies have much higher power and cooling requirements than traditional technologies. AI-ready data centers can provide the reliable power capacity and high-density cooling technologies required to support the next generation of power-intensive AI workloads.

Our globally distributed and highly interconnected facilities put enterprises close to cloud and network service providers and industry-specific ecosystems, including AI hardware and software providers. With a network of data centers in 70+ key markets in 34 countries, we enable global reach, compliance and fast, low-latency connections for superior network performance.

In addition to providing traditional colocation services, we offer the flexibility of digital services, private connectivity solutions and access to 220+ cloud on-ramps to all the major providers.

Visit our website to discover how Equinix AI-ready data centers support demanding AI, compute and storage applications today and help you scale for tomorrow.


[1] InfiniBand

[2] RoCE Initiative

[3] Ultra Ethernet Consortium
