The Infrastructure Behind AI

AI Is Only as Strong as Your Hardware; Here’s How to Get It Right

Selecting the right AI hardware starts with asking the right questions

Benjamin Jenkins

TL;DR

  • AI hardware selection requires understanding specific workloads first; choosing GPUs without clear use cases wastes millions through poor investments and rapid depreciation cycles.
  • AI Centers of Excellence prevent shadow AI by centralizing governance and aligning datasets and technical expertise across teams for optimal hardware decisions.
  • AI-ready data centers with liquid cooling and interconnection enable proper GPU coordination, as hardware performance depends entirely on the deployment environment.

In the race to adopt enterprise AI, many companies are focused on building high-quality models. But models aren’t the only thing that matters. In fact, the quality of your models and the value you get from them are directly tied to the hardware you use to implement them. You need powerful compute hardware to train or fine-tune your models, but also hardware at the edge for low-latency inference workloads.

Until recently, implementing distributed AI hardware was something most businesses didn’t need to worry about. Only the very largest organizations had the budgets to pursue advanced machine learning. Now that AI hardware is becoming more widely available, many organizations find themselves having to decide which chips to acquire for the first time. They recognize how valuable their data is, and they know they need the right compute infrastructure to capitalize on that value. But they may not know how to choose the right hardware for their needs.

What makes choosing AI hardware so difficult?

Today, many organizations don’t have IT hardware expertise in house. With the widespread availability of cloud infrastructure, managing physical servers has become “someone else’s problem.” In fact, some IT professionals have never even seen a server before. For instance, today’s application developers only have to worry about building applications that work. They’re completely cut off from what happens under the hood to make that possible. When IT team members have only experienced hardware as an abstract concept, how can they make informed decisions about which equipment to acquire?

The speed at which AI hardware is developing also makes decision-making difficult. Hardware acquisition traditionally happened in 60-month cycles. After you decided, you didn’t have to worry about it again for another five years. With GPUs, it’s more like 18 months at best. When you order GPUs, every second you’re not optimizing them is value you’re missing out on, and one second closer to having to replace them. From the very moment they’re shipped to your facility, it’s almost like money is falling off the back of the truck.
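To make the refresh-cycle math concrete, here is a minimal straight-line depreciation sketch. The prices are hypothetical placeholders chosen purely for illustration; real hardware costs and useful lifetimes vary widely by vendor and workload:

```python
def monthly_depreciation(price: float, lifetime_months: int) -> float:
    """Straight-line depreciation: cost spread evenly over the useful life."""
    return price / lifetime_months

# Hypothetical prices, for illustration only.
server = monthly_depreciation(10_000, 60)  # traditional five-year server cycle
gpu = monthly_depreciation(30_000, 18)     # modern 18-month GPU refresh cycle

print(f"Server: ${server:,.0f}/month")  # $167/month
print(f"GPU:    ${gpu:,.0f}/month")     # $1,667/month
```

The shorter cycle means every idle month burns roughly ten times more capital, which is why idle GPUs are so much costlier than idle traditional servers.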

Even if you make the right decision about which hardware to acquire, you’ll still need to make sure it’s deployed in the right environment. AI deployments often require optimized network infrastructure and liquid cooling solutions to enable higher power density. These systems will be extremely difficult to implement unless you’re a data center expert who works on them every day.

What does a good AI hardware strategy look like?

While the pressure to launch AI quickly is real, you can’t let that pressure push you into making rash decisions. You shouldn’t drop millions to acquire a bunch of GPUs just because that’s what you think you’re supposed to do. Also, you shouldn’t build a top-of-the-line AI hardware environment and then figure out what you want to do with it. That’s like buying a hammer just so you can walk around looking for nails to hit.

Instead, start by figuring out exactly what you want to achieve with AI, and then formulate your hardware strategy accordingly. This is often a cultural challenge just as much as a technical one. As organizations grow, they need to be careful to avoid shadow AI, where different departments within the organization develop their own separate AI strategies and pursue them as they see fit. This is the same problem we saw during the early days of cloud adoption, and it’s important to learn from those mistakes so you don’t repeat them.

To avoid this issue, organizations can implement an AI Center of Excellence. This involves creating a centralized governance model for AI. It requires a holistic approach to AI across the organization, including:

  • Understanding the problems that different teams are trying to solve with AI
  • Understanding what technical expertise those teams already have
  • Looking at different internal datasets to determine how they align

By doing this, organizations can make hardware decisions that provide the best possible benefits to as many different teams as possible.

It’s also important to gear your choice of AI hardware to the specific workloads you’re trying to support and think about how to optimize them for success. GPUs get a lot of attention, but you need to be intentional about when, why and how you use them. In fact, one could argue that the line between GPUs and CPUs is starting to blur. You can achieve some amazing things with the latest generation of CPUs, and they may be the right fit for certain AI workloads.

The software ecosystem is also rapidly advancing, along with new hardware like field-programmable gate arrays (FPGAs), language processing units (LPUs) and tensor processing units (TPUs). With many different vendors pushing innovative and clever solutions for AI and machine learning, the hardware landscape is getting more complex to navigate. The hardware choices that once centered on a small handful of vendors are slowly expanding, as enterprises begin to consider specialty solutions for specific problems.
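One way to make the workload-first mindset concrete is to write down selection criteria before talking to vendors. The function below is a toy heuristic with made-up thresholds and categories, purely to illustrate the exercise; it is not real sizing guidance:

```python
def suggest_hardware(batch_size: int, latency_ms: float, model_params_m: int) -> str:
    """Toy heuristic mapping workload traits to a hardware class.

    All thresholds are illustrative placeholders, not vendor guidance.
    """
    if model_params_m > 1_000:
        # Very large models need massive parallelism and fast interconnects.
        return "GPU cluster with low-latency interconnect"
    if latency_ms < 10 and batch_size == 1:
        # Single-request, tight-latency inference often fits CPUs or edge chips.
        return "modern CPU or edge accelerator"
    if batch_size > 256:
        # Throughput-oriented batch inference favors accelerators.
        return "GPU or specialty accelerator (TPU/LPU)"
    return "modern CPU"

print(suggest_hardware(batch_size=1, latency_ms=5, model_params_m=100))
```

The point is not the specific thresholds but the discipline: if you can’t fill in the arguments for your own workloads, you aren’t ready to buy hardware yet.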

If processing AI workloads is like building a brick wall, then deploying GPUs is like hiring thousands of bricklayers at once. They could each lay a single brick, and the entire wall would be finished in seconds. But that only works if the different bricklayers are able to coordinate with one another, so that each one knows exactly when and where to lay their brick.

It’s the same with GPUs: There needs to be low-latency connectivity and proper orchestration to ensure the different chips work together as part of a well-oiled machine. Otherwise, it’s just going to create chaos. It’s also possible that you don’t even need to build a wall that quickly in the first place. Maybe laying bricks one at a time—essentially, the CPU approach—is better suited to your needs. If you know that going in, you can decide accordingly.
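The bricklayer analogy maps onto a well-known scaling model. The sketch below uses an Amdahl’s-law-style speedup formula with an added per-worker coordination cost; the overhead value is an illustrative assumption, not a measured figure, but it shows why thousands of parallel workers only pay off when coordination is cheap:

```python
def speedup(n_workers: int, parallel_fraction: float, sync_overhead: float = 0.0) -> float:
    """Amdahl's-law-style speedup with a per-worker coordination cost.

    Simplified model: the serial fraction can't be parallelized, and each
    additional worker adds a small synchronization overhead.
    """
    serial = 1.0 - parallel_fraction
    total_time = serial + parallel_fraction / n_workers + sync_overhead * n_workers
    return 1.0 / total_time

# With 95% parallelizable work and free coordination, 1,000 workers help a lot:
print(speedup(1000, 0.95))        # ≈ 19.6x
# But even a tiny per-worker overhead erodes the gain at scale:
print(speedup(1000, 0.95, 1e-4))  # ≈ 6.6x
```

In other words, the wall only goes up fast if the bricklayers barely need to talk to each other, which is exactly what low-latency interconnects are for.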

How do you find the right AI hardware partners?

It goes without saying that most organizations will need assistance to properly execute their AI hardware strategies, especially if they don’t have the needed expertise in house. If this sounds like your organization, then you shouldn’t be afraid to ask for help. When it comes down to it, there’s only one question that really matters for your AI hardware strategy: Who should you call for help?

It’s important to work with partners that you can trust to help fill in the gaps in your organization’s technical skill set. But it’s most likely not going to be just one partner. Of course, you’ll need to work with hardware manufacturers, but you’ll also need networking providers to make sure your hardware is properly connected. You may also choose cloud or GPU as a Service providers alongside traditional manufacturers. Essentially, you’ll need an entire AI ecosystem.

Also, while it’s true that your AI strategy is only as strong as your hardware, your hardware is only as strong as the data center in which you deploy it. You’ll need AI-ready data centers that offer features such as liquid cooling and interconnection solutions. In fact, many of the partners you’ll need in your AI ecosystem can be found in the same place you can access AI-ready digital infrastructure: inside an Equinix IBX® colocation data center.

In addition to partnering with leading manufacturers to offer AI factory solutions that are quick and easy to deploy, Equinix has a partner ecosystem that includes thousands of enterprises and service providers, including many of the brands that are accelerating AI adoption throughout the world.

Learn more about how you can make the most of your AI hardware: Read the white paper The engine of AI powering innovation at scale.

 
