When we say the word “latency,” most people have a specific definition in mind: the delay that occurs while data travels across a network. That definition is technically correct, but it’s incomplete. There are multiple types of latency, and it accounts for only one of them: network latency.
As digital infrastructure has expanded to circle the globe and applications have become more distributed, it’s understandable that many people would be so focused on solving for network latency that they’d fail to consider any other kind. We believe it’s time for a more comprehensive definition of latency. In addition to network latency, it should include compute latency: the time from the moment a computer—usually a server—receives a request to the time it returns a response. Under this broader definition, latency refers to anything that causes delay within a data architecture, regardless of the source.
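To make the distinction concrete, here’s a minimal Python sketch of how end-to-end latency could be split into its two components. It assumes a hypothetical convention in which the server reports its own processing time with each response; the `send_request` callable and `server_processing_ms` field are illustrative, not a real API.

```python
import time

def measure_latency(send_request):
    """Split end-to-end latency into network and compute components.

    Assumes the server reports how long it spent processing the request
    (a hypothetical 'server_processing_ms' field); everything else in the
    round trip is attributed to the network.
    """
    start = time.perf_counter()
    response = send_request()  # full round trip: network + compute
    total_ms = (time.perf_counter() - start) * 1000

    compute_ms = response["server_processing_ms"]  # reported by the server
    network_ms = total_ms - compute_ms             # transit time in both directions

    return {"total_ms": total_ms, "compute_ms": compute_ms, "network_ms": network_ms}
```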
Why change the way we talk about latency?
Several factors make it both important and timely to adopt a broader view of latency.
First, today’s customers expect an exceptional user experience, which means they don’t want to wait for an application to respond. Rather than considering only how latency impacts systems, we should also think about how it impacts humans. That means optimizing for lower latency in all its forms. When an application provides a poor user experience, customers don’t care about the source of the problem; they just want it fixed.
Also, enterprises are adopting many different advanced applications, which vary widely in how sensitive they are to different forms of latency. To support these different applications effectively, enterprises need to consider many factors, including:
- How much data needs to be processed to support the application, the total capacity of the computing architecture, and the complexity of the algorithm (all of which determine the compute latency)
- Where data originates and how far it needs to go for processing (which determines the network latency)
- Whether the application is human-facing or system-facing, and which use cases it’s intended to support (which could determine how latency impacts the user experience)
Let’s consider some examples. Inference workloads for generative AI applications typically aren’t that sensitive to network latency; given the massive volume of data they process to generate a response, compute latency is the much bigger issue. Also, these applications are human-facing, which means a few extra milliseconds of delay wouldn’t be enough to meaningfully degrade the user experience. Generally speaking, the networks that support these applications already provide acceptable performance. If you want to improve the user experience, you’d be better off increasing compute capacity to speed up processing.
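As a rough back-of-the-envelope illustration (the figures below are assumptions, not measurements), the network’s share of a generative AI response time is often tiny compared to the generation time itself:

```python
# Illustrative figures only: one generative AI chat request
network_ms = 40      # assumed round-trip time to a reasonably close region
compute_ms = 2500    # assumed time to generate the full response

share = network_ms / (network_ms + compute_ms)
print(f"Network share of end-to-end latency: {share:.1%}")  # ~1.6%
# Halving the network delay saves ~20 ms; halving compute saves ~1.25 s
```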
In contrast, self-driving cars must operate at machine speed; a couple of extra milliseconds of delay could literally be a matter of life and death. Consider a 5G vehicle-to-everything (V2X) use case, as shown below. The purpose of V2X communications is to help self-driving cars detect and avoid vulnerable road users (VRUs) like pedestrians and cyclists. To achieve this, the car needs near real-time visibility into its surrounding environment, which it can only get by processing a constant stream of data from various connected devices. Since traffic conditions are always changing (and VRUs are always moving), this processing must happen in a matter of milliseconds.
Therefore, widespread adoption of self-driving cars will require an ecosystem of partners working together to ensure data gets where it needs to be as quickly as possible and gets processed as quickly as possible once it arrives. In this case, the goal is to eliminate every nanosecond of unnecessary latency, and achieving that will require optimizing both the compute and the network infrastructure. For example, deploying a multi-access edge computing (MEC) architecture and integrating it with mobile networks—as shown in the diagram above—can help deliver the extremely low latency that V2X communications demand.
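One simple way to reason about this is as a combined latency budget, as in the sketch below; the 20 ms figure is a hypothetical placeholder, since actual V2X latency targets depend on the specific service and standard.

```python
def within_budget(network_ms, compute_ms, budget_ms=20):
    """Return True if the end-to-end delay fits a machine-speed budget.

    The 20 ms default is a hypothetical figure for illustration only.
    """
    return network_ms + compute_ms <= budget_ms

# Optimizing only one side isn't enough: both legs have to fit the budget
print(within_budget(network_ms=8, compute_ms=10))   # True
print(within_budget(network_ms=8, compute_ms=30))   # False: processing too slow
print(within_budget(network_ms=35, compute_ms=10))  # False: data traveled too far
```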
Different types of latency impact different AI workloads in different ways
The growth of private AI for enterprises further highlights the importance of addressing all forms of latency. AI infrastructure is inherently distributed because different AI workloads have different compute and networking requirements. This also means that latency impacts different AI workloads in different ways.
In particular, AI training requires a lot of data processing capacity, which shines a light on the growing importance of compute latency. However, it would be wrong to suggest that training workloads aren’t also impacted by network latency.
Solving for parallel computing in AI training
To get the required compute capacity, AI training workloads often rely on parallel processing, where a job is spread across multiple GPUs and multiple nodes. As the diagram below shows, there are three distinct steps to completing a parallel training job: process, notify and synchronize. In the process stage, each GPU completes its segment of the job. In the notify and synchronize stages, the different GPUs connect with one another and compile the results for the complete job.
Thus, compute latency and network latency are tightly interdependent: the job completion time is determined both by how quickly the GPUs can turn out results and by how quickly the network can synchronize those results. Therefore, if your goal is to train private AI models faster, you must work to minimize both forms of latency.
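Here’s a minimal sketch of that interdependency, assuming a simple synchronous data-parallel step; all timings are illustrative.

```python
def step_completion_time(gpu_compute_times_s, sync_time_s):
    """Model one synchronous data-parallel training step.

    The step finishes only after the slowest GPU has processed its shard
    (compute latency) and the results have been synchronized across the
    network (network latency).
    """
    return max(gpu_compute_times_s) + sync_time_s

# Example: 8 GPUs at ~120 ms per shard, plus 30 ms to notify and synchronize
print(f"{step_completion_time([0.12] * 8, 0.03):.3f} s per step")  # 0.150 s
```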
Given the supply chain issues facing GPU manufacturers, there’s valid concern that limited compute capacity could become a bottleneck that prevents enterprises from succeeding with private AI. However, you may not even need to acquire new GPUs. Instead, optimizing your network infrastructure could help you unlock the under-utilized capacity of the GPUs you already have.
That’s because GPUs are sensitive to network latency. When they can’t synchronize results quickly, their utilization rate decreases. Improving network performance helps bring this rate back up. Therefore, even small incremental investments in your network could drive millions of dollars in savings and enable AI applications that would otherwise be out of reach.
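A rough way to see the effect, using the same illustrative model as the training-step sketch above:

```python
def gpu_utilization(compute_time_s, sync_time_s):
    """Fraction of each step a GPU spends computing rather than waiting
    on the network to synchronize results. Illustrative model only."""
    return compute_time_s / (compute_time_s + sync_time_s)

# Cutting synchronization time from 60 ms to 15 ms per 120 ms of compute
print(f"{gpu_utilization(0.12, 0.060):.0%}")  # ~67% utilization
print(f"{gpu_utilization(0.12, 0.015):.0%}")  # ~89% utilization
```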
Prioritizing for different types of inference workloads
In contrast to training workloads, people tend to think of AI inference workloads as highly sensitive to network latency, yet less sensitive to compute latency. Therefore, they assume that optimizing inference workloads requires deploying infrastructure at the digital edge to ensure proximity to data sources, thereby limiting the distance data must travel.
Of course, there is some truth to this, but as we saw with training workloads, we sometimes need to think beyond the generalities. The exact AI use case you’re pursuing will determine what kind of inference you need. As mentioned earlier, there are applications that run at machine speed and those that run at human speed. Machine-speed applications will require real-time or near real-time inference; for those that run at human speed, asynchronous or batch inference will likely be sufficient.
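To make the contrast concrete, here’s a minimal sketch of an asynchronous batch-inference loop, where requests are deliberately held for a short window so they can be processed together. The `model` callable, batch size and wait time are all assumptions for illustration.

```python
import queue
import time

def batch_inference_worker(requests, model, max_batch=32, max_wait_s=0.5):
    """Collect incoming requests into batches before running the model.

    Individual requests tolerate extra delay (they aren't machine-speed),
    so per-request latency is traded for better compute efficiency.
    """
    while True:
        batch = [requests.get()]  # block until at least one request arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        results = model([r["input"] for r in batch])  # one compute-bound call
        for req, result in zip(batch, results):
            req["callback"](result)  # deliver each result asynchronously
```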
For asynchronous and batch inference, the impact of network latency will be negligible; therefore, optimizing network infrastructure will provide limited value. Unlike machine-speed applications that require real-time inference, these workloads don’t need to be optimized for proximity. Instead, think about where to deploy to get the compute capacity you need to keep processing time low.
This may feel counterintuitive; it’s commonly accepted that since inference workloads handle significantly less data than training workloads, they aren’t as heavily impacted by compute latency. Again, this idea isn’t wrong, but it may be an oversimplification. You must think in terms of the volume of requests the application will have to answer.
For instance, a single inference request for a generative AI application may not be that compute-intensive on its own. However, when you consider the total number of users and requests the application is servicing over time, the total amount of processing required starts to add up. This fact illustrates why compute latency can be a problem for inference workloads as well as training workloads.
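A back-of-the-envelope sizing sketch makes the point; the figures are assumptions for illustration.

```python
def compute_units_needed(requests_per_second, compute_seconds_per_request):
    """Estimate how many fully utilized compute units (e.g., GPUs) are
    needed to keep up with aggregate demand. Illustrative model only."""
    return requests_per_second * compute_seconds_per_request

# A single request needing 0.5 s of GPU time looks cheap on its own...
# ...but 200 requests/s adds up to ~100 GPUs' worth of continuous work
print(compute_units_needed(200, 0.5))
```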
Deploying infrastructure in the AI era requires nuance. Training and inference workloads can both be impacted by network and compute latency in different ways; therefore, you must take all forms of latency into account for all your workloads as you plan your AI-ready data architecture.
Why solve for latency on Platform Equinix?
At Equinix, we’re well positioned to help our customers optimize their AI workloads for different forms of latency.
Reducing network latency requires both optimizing for proximity and using high-performance networking technology. Platform Equinix® can provide everything you need to deploy at the digital edge and ensure proximity for your workloads. This includes our global footprint of Equinix IBX® colocation data centers and our on-demand digital infrastructure services. In addition, Equinix Fabric® provides software-defined interconnection services for network performance that’s proven to be better than that of the public internet.
To address compute latency, we can help you deploy the infrastructure you need, where you need it. Depending on your AI models, you may not need high-performance GPUs for all your training workloads. You may be able to use CPUs, which you can deploy on demand in 30+ metros using Equinix Metal®, our dedicated bare metal service.
Finally, our joint offering with NVIDIA can help you access the GPU capacity you need inside an Equinix IBX data center. You’ll also get all the integrated capabilities you need to keep those GPUs running well, including Equinix managed services and interconnection solutions. To learn more about Equinix Private AI with NVIDIA DGX, read the solution brief.
Also, for more expert insights on all the different factors you should consider when deploying digital infrastructure for private AI, access the Equinix Indicator.