GCP – New year, new updates to AI Hypercomputer
The last few weeks of 2024 were exhilarating as we worked to bring you multiple advancements in AI infrastructure, including the general availability of Trillium, our sixth-generation TPU; A3 Ultra VMs powered by NVIDIA H200 GPUs; support for up to 65,000 nodes in Google Kubernetes Engine (GKE); and Parallelstore, our distributed file system service that offers the low-latency, high-throughput storage essential for HPC and AI workloads. We’re excited to see what you build with these new capabilities.
These innovations come together in AI Hypercomputer, a systems-level approach that draws from our years of experience serving AI experiences for billions of users, and combines performance-optimized hardware, open software and frameworks, and flexible consumption models. This means when you build your AI solution on Google Cloud, you can choose from a set of purpose-built infrastructure components that are designed to work well together. This freedom to choose the appropriate solution for the needs of your specific workload is fundamental to our approach.
Here are some key updates to AI Hypercomputer from the last quarter, covering new infrastructure components and the specific AI use cases they enable.
Running distributed (multi-node) workloads
The performance of multi-node (multi-host) applications such as large-scale AI training and HPC workloads can be highly sensitive to network connectivity, requiring precise setup and proactive monitoring. To make it easier for customers to run large multi-node workloads on GPUs, we launched A3 Ultra VMs and Hypercompute Cluster, our new highly scalable clustering system; both offerings became generally available to close out 2024.
A3 Ultra, powered by NVIDIA H200 GPUs, is a new addition to the A3 family of NVIDIA Hopper GPU-accelerated VMs, offering twice the GPU-to-GPU network bandwidth and twice the high-bandwidth memory (HBM) of A3 Mega with NVIDIA H100 GPUs, and the best performance in the A3 family. A3 Ultra VMs are built with our new Titanium ML network adapter and incorporate NVIDIA ConnectX-7 network interface cards (NICs) to deliver a secure, high-performance cloud experience for AI workloads. Combined with our datacenter-wide 4-way rail-aligned network, A3 Ultra VMs deliver up to 3.2 Tbps of non-blocking GPU-to-GPU communication with RDMA over Converged Ethernet (RoCE).
A3 Ultra VMs are also available through GKE, which provides an open, portable, extensible, and highly scalable platform for training and serving AI workloads. To try out A3 Ultra VMs, you can easily create a cluster with GKE or try this pretraining GPU recipe.
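If you prefer to script that setup rather than follow the recipe, here is a minimal sketch (not from this post) that adds an A3 Ultra node pool to an existing GKE cluster with the google-cloud-container client. The project, cluster path, machine type (a3-ultragpu-8g), and accelerator name are assumptions you should adjust to what's available in your region.

```python
# Hedged sketch: add an A3 Ultra node pool to an existing GKE cluster.
# Project, cluster path, machine type, and accelerator name below are
# illustrative assumptions; check the GKE docs for the exact values.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()

node_pool = container_v1.NodePool(
    name="a3-ultra-pool",
    initial_node_count=2,
    config=container_v1.NodeConfig(
        machine_type="a3-ultragpu-8g",  # assumed A3 Ultra machine type (8x H200 per VM)
        accelerators=[
            container_v1.AcceleratorConfig(
                accelerator_count=8,
                accelerator_type="nvidia-h200-141gb",  # assumed GKE accelerator name
            )
        ],
    ),
)

request = container_v1.CreateNodePoolRequest(
    parent="projects/my-project/locations/us-central1/clusters/my-training-cluster",
    node_pool=node_pool,
)
operation = client.create_node_pool(request=request)
print("Node pool creation started:", operation.name)
```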
Hypercompute Cluster, meanwhile, is a supercomputing services platform built on AI Hypercomputer that lets you deploy and manage a large number of accelerators as a single unit. We built Hypercompute Cluster to help you achieve your throughput and resilience goals, with features such as dense co-location of resources with ultra-low-latency networking, targeted workload placement, advanced maintenance controls to minimize workload disruption, and topology-aware scheduling integrated into popular schedulers like Slurm and GKE. You can deploy a cluster with a single API call using pre-configured and validated templates for reliable, repeatable deployments, and cluster-level observability, health monitoring, and diagnostic tooling help you run your most demanding workloads easily on Google Cloud. Hypercompute Cluster is now available with A3 Ultra VMs.
LG AI Research is an active user of Google Cloud infrastructure, which they used to train their large language model, Exaone 3.0. They are also an early adopter of A3 Ultra VMs and Hypercompute Cluster, which they are using to power their next set of innovations.
“From the moment we started using Google Cloud’s A3 Ultra with Hypercompute Cluster, powered by NVIDIA H200 GPUs, we were immediately struck by its remarkable performance gains and seamless scalability for our AI workloads. Even more impressive, we had our cluster up and running with our code in under a day — an enormous improvement from the 10 days it used to take us. We look forward to further exploring the potential of this advanced infrastructure for our AI initiatives.” – Jiyeon Jung, AI Infra Sr Engineer, LG AI Research
Making inference on TPUs easier
To enable the next generation of AI agents capable of complex, multi-step reasoning, you need accelerators designed to handle the demanding computational requirements of these advanced models. Trillium TPUs provide significant advancements for inference workloads, delivering up to a 3x improvement in inference throughput compared to the prior-generation TPU v5e.
There are multiple ways to leverage Google Cloud TPUs for AI inference, depending on your specific needs. You can use Vertex AI, our fully managed, unified AI development platform for building and using generative AI, which is powered by the AI Hypercomputer architecture under the hood. If you need greater control, we offer options lower in the stack that are designed for optimal serving on Cloud TPUs: JetStream is a memory- and throughput-optimized serving engine for LLMs, MaxDiffusion offers a launching point for diffusion models, and for the Hugging Face community, we worked closely with Hugging Face to launch Optimum TPU and TGI support to make serving on Cloud TPUs easier.
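As one illustration of the managed path, here is a hedged sketch of deploying a serving container to a TPU-backed Vertex AI endpoint with the Python SDK. The container image URI and the ct5lp-hightpu-4t machine type (a TPU v5e serving shape) are assumptions for illustration, not details from this post; check the current Vertex AI prediction options for your region.

```python
# Hedged sketch: deploy a serving container to a TPU-backed Vertex AI endpoint.
# The image URI and machine type are placeholders / assumptions.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="my-llm",
    serving_container_image_uri="us-docker.pkg.dev/my-project/serving/my-llm:latest",
)

endpoint = model.deploy(
    machine_type="ct5lp-hightpu-4t",  # assumed TPU v5e serving shape
    min_replica_count=1,
    max_replica_count=1,
)
print("Deployed endpoint:", endpoint.resource_name)
```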
Most recently, we announced experimental support for vLLM on TPU with PyTorch/XLA 2.5. Motivated by the strong response to this popular serving option, we’ve been running a preview with a small set of customers to bring the performance (and price-performance) benefits of Cloud TPUs to vLLM.
Our goal is to make it easy for you to try out Cloud TPUs with your existing vLLM setup: just make a few configuration changes to see performance and efficiency benefits in Compute Engine, GKE, Vertex AI, and Dataflow. You can take vLLM for a spin on Trillium TPUs with this tutorial. All this innovation is happening in the open, and we welcome your contributions.
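To make the "few configuration changes" concrete, here is a minimal, hedged sketch of offline inference with vLLM's standard Python API. The model name and generation settings are placeholders, and it assumes a TPU VM with the experimental vLLM TPU backend (PyTorch/XLA) already installed; the tutorial above is the authoritative walkthrough.

```python
# Minimal sketch of offline inference with vLLM; the model name is a placeholder.
# On a TPU VM with the experimental TPU backend installed, vLLM targets the TPU
# devices, and the core API below is the same one used on GPUs.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-2b-it", max_model_len=2048)  # placeholder model and length

sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Explain what a TPU is in one sentence."], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```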
And in case you missed it, Google Colab now supports Cloud TPUs (TPU v5e) if you want to try TPUs for your project.
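If you do try the Colab route, a quick way to confirm your runtime is TPU-backed is to list the devices JAX sees. This small sketch assumes you have already selected a TPU runtime in Colab.

```python
# Quick check that a Colab runtime is backed by a TPU (e.g. TPU v5e):
# on a TPU runtime, JAX should list TPU devices.
import jax

print(jax.devices())       # e.g. a list of TpuDevice objects on a TPU runtime
print(jax.device_count())  # number of TPU cores visible to this runtime
```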
Pushing the boundaries of AI infrastructure
As we start a new year, we’re excited to continue pushing the boundaries of AI infrastructure with AI Hypercomputer. These updates represent our ongoing commitment to providing you with the performance, efficiency, and ease of use you need to accelerate your AI journey. We look forward to seeing what you achieve with these new capabilities.