If you want to understand why Artificial Intelligence (AI) and density go hand in hand, you need to understand the implications of one distance. To get there, we’ll need some background.

First, here’s the distance you need to keep in mind: 100 meters.

We’ll come back to it in a moment. But first, the background (if you already know what infrastructure generative AI training requires, skip ahead).

Let’s explore what the infrastructure of a generative AI training cluster looks like.

You probably know all about cloud architectures, where most servers run at around 20% utilization, handling thousands of relatively small asynchronous jobs (in virtual machines or containers) and connected by an Ethernet network.

An AI training cluster is different. It runs essentially one distributed synchronous job. One workload. And it’s a massively intense workload. Every node in the cluster is passing large amounts of data to and from other nodes, and each node is crunching data as quickly as it possibly can.

That’s why AI server nodes often include specialized silicon such as DPUs, IPUs, GPUs, and TPUs. These chips take on the heavily parallel math and data movement, so each node gets through its share of the job faster.

The other key component of cluster performance is inter-node communication. Because these clusters need both speed and inter-node data sharing at scale, they are connected with a massive backend network. Most AI training clusters use InfiniBand for their backend networks because of its very high throughput (up to 400 Gbps) and low latency.

For InfiniBand copper cables, the maximum distance is 30 meters. For fiber optic cables, the maximum distance is 100 meters for all link rates.

And there’s our distance: 100 meters. That’s the maximum theoretical connection length from one AI training node to another, using fiber. To make matters worse, in real-world conditions, the actual connection distance might be shorter, especially if fiber is ruled out by cost or complexity and you have to fall back on copper.

If you’re using InfiniBand, your physical cluster size is going to be constrained by cabling, and for now, there’s no way around it.
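
To get a feel for how quickly that limit bites, here is a rough Python sketch that estimates the longest cable run in a rectangular hall of racks and compares it with the copper and fiber limits quoted above. The floor-plan numbers (rack width, row pitch, vertical slack) and the function names are illustrative assumptions, not figures from any vendor or standard. The model also assumes a single switch location at the center of the hall, with cables routed along rows and aisles.

```python
MAX_FIBER_M = 100.0   # fiber limit quoted in this article
MAX_COPPER_M = 30.0   # copper limit quoted in this article


def worst_case_run_m(rows, racks_per_row,
                     rack_width_m=0.6,       # assumed rack width
                     row_pitch_m=3.0,        # assumed row-to-row spacing
                     vertical_slack_m=6.0):  # assumed up/down tray allowance
    """Longest cable run from a corner rack to a switch at the hall center.

    Cables are assumed to follow rows and aisles (Manhattan routing).
    """
    along_row = (racks_per_row - 1) * rack_width_m / 2
    across_rows = (rows - 1) * row_pitch_m / 2
    return along_row + across_rows + vertical_slack_m


def fits(run_m, limit_m):
    """True if a cable run is within the given length limit."""
    return run_m <= limit_m


if __name__ == "__main__":
    run = worst_case_run_m(rows=40, racks_per_row=50)
    print(f"worst-case run: {run:.1f} m")
    print("fits copper:", fits(run, MAX_COPPER_M))
    print("fits fiber: ", fits(run, MAX_FIBER_M))
```

With these assumed dimensions, a hall of 40 rows by 50 racks still fits within the fiber limit but is far beyond copper reach. Stretch the hall much further and fiber runs out too.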

Who cares, right? With a few dozen servers in a couple of racks, that distance isn’t a problem, is it?

Let’s talk about cluster size for a moment. GPT-3 was trained on a cluster with 285,000 processor cores and 10,000 GPUs. The GPT-4 cluster is reportedly an order of magnitude larger.

Those are huge clusters, of course. But even if you’re designing a cluster for an enterprise or a university, you’re going to be forced to cram hundreds or thousands of servers together in a relatively small space – to cope with the connection length limits.

Fine. You cram a thousand servers into racks, fully populated, cable them together, and you’re ready to start crunching data. Right?

Wrong.

There’s another physical constraint that limits how close together these servers can be. It’s heat.

AI servers are insanely hot, because they’re running at 90% utilization and are crammed with specialized silicon. The latest NVIDIA AI server, fully populated with processors and GPUs, can generate over 10 kW of heat per server.

Okay…

To build an AI training cluster at scale, you’d need to cram thousands of servers together in a relatively small space while each server produces the heat of eight or nine space heaters. That would make the racks so hot that all the servers would shut down.

How can you work around this heat constraint in a dense environment?

That’s a great question, because most data centers CAN’T cope with that much heat in a dense deployment. The latest Uptime Institute industry survey told us that most rack densities are below 6 kW per rack. That’s less than one NVIDIA AI server per rack. There’s a reason why Meta paused all data center construction when they realized that they’d need massive amounts of AI. They recognized the need to completely redesign their data centers to cope with the heat and density of AI infrastructure.
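
The arithmetic behind that mismatch is simple enough to sketch in a few lines of Python. The 10 kW per server and 6 kW per rack figures are the ones quoted above, treated here as round values even though the text says "over 10 kW" and "below 6 kW". The function names are made up for illustration.

```python
import math

SERVER_KW = 10.0       # per-server heat, rounded from the "over 10 kW" figure
TYPICAL_RACK_KW = 6.0  # rack density, rounded from the "below 6 kW" figure


def servers_per_rack(rack_budget_kw, server_kw=SERVER_KW):
    """Whole servers that fit within a rack's power and cooling budget."""
    return math.floor(rack_budget_kw / server_kw)


def cluster_heat_kw(n_servers, server_kw=SERVER_KW):
    """Total heat for a cluster, assuming every server runs at full load."""
    return n_servers * server_kw


if __name__ == "__main__":
    print(servers_per_rack(TYPICAL_RACK_KW))  # whole servers in a typical rack
    print(cluster_heat_kw(1000))              # kW for a thousand-server cluster
```

Under these assumptions, a typical rack budget rounds down to zero of these servers, and a thousand-server cluster would have to shed about 10 MW of heat.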

How is Meta supporting AI? Immersion cooling is the key.

Immersion cooling supports much higher densities. For instance, the Spanish company Submer, a specialist in immersion cooling infrastructures for data centers and one of Hypertec’s partners, supports up to 100 kW of server heat per pod. That’s a megawatt of server capacity in ten pods. Suddenly it becomes possible to cool the servers in a dense deployment.
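
As a back-of-the-envelope check on those pod numbers, this Python sketch estimates how many 100 kW pods a cluster would need. It assumes a flat 10 kW per server (the lower bound quoted earlier) and ignores networking, storage, and any headroom. `pods_needed` is a hypothetical helper for illustration, not part of any Submer or Hypertec tool.

```python
import math

POD_KW = 100.0     # per-pod heat capacity quoted in this article
SERVER_KW = 10.0   # assumed per-server heat (lower bound from the article)


def pods_needed(n_servers, server_kw=SERVER_KW, pod_kw=POD_KW):
    """Pods required to absorb the heat of n_servers at full load."""
    return math.ceil(n_servers * server_kw / pod_kw)


if __name__ == "__main__":
    for n in (100, 1000):
        print(f"{n} servers -> {pods_needed(n)} pods")
```

On these numbers, ten pods cover a hundred such servers (the megawatt mentioned above), and a thousand-server cluster would need about a hundred pods.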

Nearly any organization looking to deploy AI at scale is looking at immersion cooling. They’re also looking at specialized vendors like Hypertec. We understand the intersection of AI and immersion cooling: we design and build GPU-accelerated servers for AI, and we’re one of the pioneering companies in the immersion cooling space.

Our expertise, derived from decades of experience designing demanding infrastructures for high-performance computing (HPC) and high-frequency trading, lets us assess concretely how dense a data center deployment your emerging AI infrastructure will require, and give you the means to deploy it.

To learn more about how our immersion cooling solutions, combined with those of our partners, can help your company progress towards the future of AI, visit: https://hypertec.com/immersion-cooling/
