2024 11 15

GCP – What’s new with HPC and AI infrastructure at Google Cloud

At Google Cloud, we’re rapidly advancing our high-performance computing (HPC) capabilities, providing researchers and engineers with powerful tools and infrastructure to tackle the most demanding computational challenges. Here’s a look at some of the key developments driving HPC innovation on Google Cloud, as well as our presence at Supercomputing 2024.

You can also stay apprised of our HPC and AI advances by joining the new Google Cloud Advanced Computing Community (details below).

Next-generation HPC VMs

We began our H series with H3 VMs, specifically designed to meet the needs of demanding HPC workloads. Now, we’re excited to share some key features of the next generation of the H family, bringing even more innovation and performance to the table. The upcoming VMs will feature:

Improved workload scalability via RDMA-enabled 200 Gbps networking
Native support to directly provision full, tightly-coupled HPC clusters on demand
Dynamic Workload Scheduler to provision fixed-lifetime clusters now or in the future
Titanium technology that delivers superior performance, reliability, and security

We provide system blueprints for setting up turnkey, pre-configured HPC clusters on our H series VMs.

The next generation of H series is coming in early 2025.


Parallelstore: World’s first fully-managed DAOS offering

Parallelstore is a fully managed, scalable, high-performance storage solution based on next-generation DAOS technology, designed for demanding HPC and AI workloads. It is now generally available and provides:

Up to 6x greater read throughput performance compared to competitive Lustre scratch offerings
Low latency (<0.5 ms at p50) and high throughput (>1 GiB/s per TiB) to access data with minimal delays, even at massive scale
High IOPS (30K IOPS per TiB) for metadata operations
Simplified management that reduces operational overhead with a fully managed service
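
As a rough illustration of how the per-TiB figures above translate to a whole instance, the sketch below multiplies them by provisioned capacity. It assumes performance scales linearly with capacity, which the per-TiB framing suggests but the text does not state outright. The function name and the 100 TiB example are hypothetical, and the throughput figure is a floor because the source quotes it as ">1 GiB/s per TiB".

```python
# Per-TiB figures quoted in the list above.
THROUGHPUT_GIB_S_PER_TIB = 1.0   # quoted as ">1 GiB/s per TiB", so treat as a floor
IOPS_PER_TIB = 30_000            # quoted as "30K IOPS per TiB"


def estimate_parallelstore_performance(capacity_tib: float) -> dict:
    """Estimate aggregate performance, ASSUMING linear scaling with capacity."""
    if capacity_tib <= 0:
        raise ValueError("capacity_tib must be positive")
    return {
        "throughput_gib_s": capacity_tib * THROUGHPUT_GIB_S_PER_TIB,
        "iops": capacity_tib * IOPS_PER_TIB,
    }


if __name__ == "__main__":
    # Hypothetical 100 TiB instance, chosen only for illustration.
    est = estimate_parallelstore_performance(100)
    print(f"at least ~{est['throughput_gib_s']:.0f} GiB/s, ~{est['iops']:,.0f} IOPS")
```

Actual limits depend on the instance configuration and client setup, so treat the result as a planning estimate rather than a guarantee.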

Parallelstore is great for applications requiring fast access to large datasets, such as:

Analyzing massive genomic datasets for personalized medicine
Training large language models (LLMs) and other AI applications efficiently
Running complex HPC simulations with rapid data access

A3 Ultra VMs with NVIDIA H200 Tensor Core GPUs

For GPU-based HPC workloads, we recently announced A3 Ultra VMs, which feature NVIDIA H200 Tensor Core GPUs. A3 Ultra VMs offer a significant leap in performance over previous generations. They are built on servers with our new Titanium ML network adapter, optimized to deliver a secure, high-performance cloud experience for AI workloads, and powered by NVIDIA ConnectX-7 networking. Combined with our datacenter-wide 4-way rail-aligned network, A3 Ultra VMs deliver non-blocking 3.2 Tbps of GPU-to-GPU traffic with RDMA over Converged Ethernet (RoCE).

Compared with A3 Mega, A3 Ultra offers:

2x the GPU-to-GPU networking bandwidth, powered by Google Cloud’s Titanium ML network adapter and backed by our Jupiter data center network
Up to 2x higher LLM inferencing performance with nearly double the memory capacity and 1.4x more memory bandwidth
Ability to scale to tens of thousands of GPUs in a dense, performance-optimized cluster for large AI and HPC workloads
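
To put the bandwidth figures in concrete terms, the sketch below converts the 3.2 Tbps quoted above into gigabytes per second and estimates an ideal transfer time. The A3 Mega value is not stated in the text; it is derived here from the "2x" comparison. The conversion assumes decimal units (1 Tbps = 10^12 bits per second) and ignores protocol overhead and congestion, and the 100 GB payload is a hypothetical example.

```python
A3_ULTRA_TBPS = 3.2                  # GPU-to-GPU bandwidth quoted in the text
A3_MEGA_TBPS = A3_ULTRA_TBPS / 2     # derived from the "2x" claim, not quoted directly


def tbps_to_gb_per_s(tbps: float) -> float:
    """Convert terabits per second to gigabytes per second (decimal units)."""
    return tbps * 1000 / 8


def ideal_transfer_seconds(payload_gb: float, tbps: float) -> float:
    """Lower-bound transfer time, ignoring protocol overhead and congestion."""
    if tbps <= 0:
        raise ValueError("tbps must be positive")
    return payload_gb / tbps_to_gb_per_s(tbps)


if __name__ == "__main__":
    payload_gb = 100.0  # hypothetical payload size, for illustration only
    for name, tbps in (("A3 Ultra", A3_ULTRA_TBPS), ("A3 Mega", A3_MEGA_TBPS)):
        seconds = ideal_transfer_seconds(payload_gb, tbps)
        print(f"{name}: {tbps_to_gb_per_s(tbps):.0f} GB/s, {seconds:.2f} s")
```

Real collective operations rarely reach line rate, so these numbers are best read as an upper bound on what the network can deliver.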

With system blueprints, available through Cluster Toolkit, customers can quickly and easily create turnkey, pre-configured HPC clusters with Slurm support on A3 VMs.

A3 Ultra VMs will also be available through Google Kubernetes Engine (GKE), which provides an open, portable, extensible, and highly-scalable platform for large-scale training and serving of AI workloads.

Trillium: Ushering in a new era of TPU performance for AI

Tensor Processing Units, or TPUs, power our most advanced AI models such as Gemini, popular Google services like Search, Photos, and Maps, as well as scientific breakthroughs like AlphaFold 2 — which led to a Nobel Prize this year!

We recently announced that Trillium, our sixth-generation TPU, is available to Google Cloud customers in preview.

Compared with TPU v5e, Trillium delivers:

Over 4x improvement in training performance
Up to 3x increase in inference throughput
67% increase in energy efficiency
4.7x increase in peak compute performance per chip
Double the high bandwidth memory capacity
Double the interchip interconnect bandwidth

Cluster Toolkit: Streamlining HPC deployments

We continue to improve Cluster Toolkit, providing open-source tools for deploying and managing HPC environments on Google Cloud. Recent updates include:

Slurm-gcp V6 is now generally available, providing faster deployments and robust reconfiguration among other benefits.
Google Cloud Customer Care is now available for Toolkit, with support requests handled through the Cloud Customer Care console.
HPC VM Image Rocky Linux 8 is now generally available, making it easy to build an HPC-ready VM instance that incorporates our best practices for running HPC on Google Cloud.

GKE: Container orchestration with scale and performance

GKE continues to lead the way for containerized workloads with the support of the largest Kubernetes clusters in the industry. With support for up to 65,000 nodes, we believe GKE offers more than 10X larger scale than the other two largest public cloud providers.

At the same time, we continue to invest in automating and simplifying the building of HPC and AI platforms, with:

Secondary boot disk, which provides faster workload startups through container image caching
Fully-managed DCGM metrics for improved accelerator monitoring
Custom compute classes, offering greater control over compute resource allocation and scaling
Extensive innovations in Kueue.sh, which is becoming the de facto standard for job queueing on Kubernetes with topology-aware scheduling, priority and fairness in queueing, multi-cluster support (see demo by Google and CERN engineers), and more
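
For readers new to Kueue, the sketch below shows roughly what submitting a queued job looks like. It builds a standard Kubernetes `batch/v1` Job that is created suspended and carries the `kueue.x-k8s.io/queue-name` label, which is how Kueue associates a Job with a LocalQueue. It assumes Kueue is already installed in the cluster and that a LocalQueue exists in the target namespace. The function name, job name, queue name, image, and resource requests are all hypothetical placeholders.

```python
import json


def build_kueue_job(name: str, queue_name: str, image: str, parallelism: int) -> dict:
    """Build a batch/v1 Job manifest for Kueue to admit from a LocalQueue."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Label that points the Job at a LocalQueue managed by Kueue.
            "labels": {"kueue.x-k8s.io/queue-name": queue_name},
        },
        "spec": {
            # Created suspended; Kueue unsuspends the Job once quota is available.
            "suspend": True,
            "parallelism": parallelism,
            "completions": parallelism,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        # Placeholder requests; Kueue uses requests for quota accounting.
                        "resources": {"requests": {"cpu": "4", "memory": "8Gi"}},
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    # All argument values here are hypothetical.
    job = build_kueue_job("sim-batch", "hpc-queue", "example.com/sim:latest", 8)
    print(json.dumps(job, indent=2))
```

Because `kubectl` accepts JSON manifests, the printed output can be piped to `kubectl apply -f -`. Features mentioned above, such as topology-aware scheduling and multi-cluster support, need additional Kueue configuration that this sketch does not cover.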

Customer success stories: Atommap and beyond

Atommap, a company specializing in atomic-scale materials design, is using Google Cloud HPC to accelerate its research and development efforts. With H3 VMs and Parallelstore, Atommap has achieved:

Significant speedup in simulations: Reduced time-to-results by more than half, enabling faster innovation
Improved scalability: Easily scaled resources from thousands to tens of thousands of molecular simulations to meet growing computational demands
Better cost-effectiveness: Optimized infrastructure costs, with savings of up to 80%, while achieving high performance

Atommap’s success story highlights the transformative potential of Google Cloud HPC for organizations pushing the boundaries of scientific discovery and technological advancement.

Looking ahead

Google Cloud is committed to continuous innovation for HPC. Expect further enhancements to HPC VMs, Parallelstore, Cluster Toolkit, Slurm-gcp, and other HPC products and solutions. With a focus on performance, scalability, compatibility, and ease of use, we’re empowering researchers and engineers to tackle the world’s most complex computational challenges.

Google Cloud Advanced Computing Community

We’re excited to announce the launch of the Google Cloud Advanced Computing Community, a new kind of community of practice for sharing and growing HPC, AI, and quantum computing expertise, innovation, and impact.

This community of practice will bring together thought leaders and experts from Google, its partners, and HPC, AI, and quantum computing organizations around the world for engaging presentations and panels on innovative technologies and their applications. The Community will also leverage Google’s powerful, comprehensive, and cloud-native tools to create an interactive, dynamic, and engaging forum for discussion and collaboration.

The Community launches now, with meetings starting in December 2024 and a full rollout of learning and collaboration resources in early 2025. To learn more, register here.

Google Cloud at Supercomputing 2024

The annual Supercomputing Conference series brings together the global HPC community to showcase the latest advancements in HPC, networking, storage and data analysis. Google Cloud is excited to return to Supercomputing 2024 in Atlanta with our largest presence ever.

Visit Google Cloud at booth #1730 to jump in and learn about our HPC, AI infrastructure, and quantum solutions. The booth will feature a Trillium TPU board, NVIDIA H200 GPU and ConnectX-7 NIC, hands-on labs, a full schedule of talks, a comfortable lounge space, and plenty of great swag!

The booth theater will include talks from ARM, Altair, Ansys, Intel, NAG, SchedMD, Siemens, Sycomp, Weka, and more. In the booth labs, you can deploy Slurm clusters to fine-tune the Llama2 model or run GROMACS, use Cloud Batch to run microbenchmarks or quantum simulations, and more.

We’re also involved in several parts of SC24’s technical program, including BoFs, User Groups, and Workshops. Googlers will participate in the following technical sessions:

Converged HPC and Cloud Computing in the Era of Generative AI (Bill Magro speaking)
HPC & Cloud Convergence: drivers, triggers, and constraints (Felix Schürmann speaking)
DAOS User Group (DUG) ‘24 (Dean Hildebrand speaking)
DAOS BoF (Dean Hildebrand speaking)
9th International Parallel Data Systems Workshop (PDSW) (Dean Hildebrand speaking)
IO500: The High-Performance Storage Community BoF (Dean Hildebrand speaking)
High-Performance Object Storage: I/O for the Exascale Era Tutorial (Dean Hildebrand speaking)
Women in HPC Workshop

Google is also hosting or sponsoring several events during SC24. We’re looking forward to seeing you there!

Finally, we’ll be holding private meetings and roadmap briefings with our HPC leadership throughout the conference. To schedule a meeting, please contact hpc-sales@google.com.