GCP – AI Hypercomputer developer experience enhancements from Q1 25: build faster, scale bigger
Building cutting-edge AI models is exciting, whether you’re iterating in your notebook or orchestrating large clusters. However, scaling up training can present significant challenges, including navigating complex infrastructure, configuring software and dependencies across numerous instances, and pinpointing performance bottlenecks.
At Google Cloud, we’re focused on making AI training easier, whatever your scale. We’re continuously evolving our AI Hypercomputer system, not just with powerful hardware like TPUs and GPUs, but with a suite of tools and features designed to make you, the developer, more productive. Let’s dive into some recent enhancements that can help streamline your workflows, from interactive development to optimized training and easier deployment.
Scale from your notebook with Pathways on Cloud
You love the rapid iteration that Jupyter notebooks provide, but scaling to thousands of accelerators means leaving that familiar environment behind. At the same time, having to learn different tools for running workloads at scale isn’t practical; nor is tying up large clusters of accelerators for weeks for iterative experiments that might run only for a short time.
You shouldn’t have to choose between ease-of-use and massive scale. With JAX, it’s easy to write code for one accelerator and scale it up to thousands of accelerators. Pathways on Cloud, an orchestration system for creating large-scale, multi-task, and sparsely activated machine learning systems, takes this concept further, making interactive supercomputing a reality. Pathways dynamically manages pools of accelerators for you, orchestrating data movement and computation across potentially thousands of devices. The result? You can launch an experiment on just one accelerator directly from your Jupyter notebook, refine it, and then scale it to thousands of accelerators within the same interactive session. Now you can quickly iterate on research and development without sacrificing scale.
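To make that concrete, here's a minimal sketch of the pattern in plain JAX: the same jitted step runs unchanged on a single chip, and adding a sharding annotation spreads the batch across however many devices are available. The mesh axis name, array shapes, and toy loss below are illustrative, not Pathways-specific APIs.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

@jax.jit
def train_step(params, batch):
    # Toy "model": one matmul followed by a mean-squared loss.
    preds = batch @ params
    return jnp.mean(preds ** 2)

params = jnp.ones((1024, 1024))
batch = jnp.ones((4096, 1024))

# Runs as-is on a single accelerator.
loss = train_step(params, batch)

# Scale out: shard the batch across every visible device (the batch
# dimension should divide evenly by the device count) and let the
# compiler handle data movement. The step function is unchanged.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
sharded_batch = jax.device_put(batch, NamedSharding(mesh, P("data")))
loss = train_step(params, sharded_batch)
```

Pathways extends this same single-program model across pools of accelerators, so the session in your notebook can target anything from one chip to thousands.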
With Pathways on Cloud, you can finally stop rewriting code for different scales. Stop over-provisioning hardware for weeks when your experiments only need a few hours. Stay focused on your science, iterate faster, and leverage supercomputing power on demand. Watch this video to see how Pathways on Cloud delivers true interactive scaling — far beyond just running JupyterHub on a Google Kubernetes Engine (GKE) cluster.
Debug faster, optimize smarter with Xprofiler
When scaling up a job, simply knowing that your accelerators are being used isn’t enough. You need to understand how they’re being used and why things might be slow or crashing. How else would you find that pesky out-of-memory error that takes down your entire run?
Meet the Xprofiler library, your tool for deep performance analysis on Google Cloud accelerators. It lets you profile and trace your code execution, giving you critical insights, especially into the high-level operations (HLO) generated by the XLA compiler. Getting actionable insights using Xprofiler is easy. Simply launch an Xprofiler instance from the command line to capture detailed profile and trace logs during your run. Then, use TensorBoard to quickly analyze this data. You can visualize performance bottlenecks, understand hardware limits with roofline analysis (is your workload compute- or memory-bound?), and quickly pinpoint the root cause of errors. Xprofiler helps you optimize your code for peak performance, so you can get the most out of your AI infrastructure.
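For a feel of the workflow, here's a minimal sketch of capturing a trace from JAX code using the standard jax.profiler API; the log directory is illustrative, and the Xprofiler CLI and TensorBoard setup follow the library's own documentation rather than anything shown here.

```python
import jax
import jax.numpy as jnp

x = jnp.ones((8192, 8192))

# Wrap the region you want to inspect; the trace (including XLA/HLO
# metadata) is written to the log directory for TensorBoard to read.
with jax.profiler.trace("/tmp/profile-logs"):
    y = (x @ x).block_until_ready()

# Then point TensorBoard (with the profile plugin) at /tmp/profile-logs
# to explore the trace viewer, memory profile, and roofline analysis.
```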
Skip the setup hassle with container images
You have the choice of many powerful AI frameworks and libraries, but configuring them correctly — with the right drivers and dependencies — can be complex and time-consuming. Getting it wrong, especially when scaling to hundreds or thousands of instances, can lead to costly errors and delays. To help you bypass these headaches, we provide pre-built, optimized container images designed for common AI development needs.
For PyTorch on GPUs, our GPU-accelerated instance container images offer a ready-to-run environment. We partnered closely with NVIDIA to include tested versions of essential software like the NVIDIA CUDA Toolkit, NCCL, and frameworks such as NVIDIA NeMo. Thanks to Canonical, these run on optimized Ubuntu LTS. Now you can get started quickly with a stable environment that’s tuned for performance, avoiding compatibility challenges and saving significant setup time.
And if you’re working with JAX (on either TPUs or GPUs), our curated container images and recipes for JAX for AI on Google Cloud streamline getting started. Avoid the hassle of manual dependency tracking and configuration with these tested and ready-to-use JAX environments.
Boost GPU training efficiency with proven recipes
Beyond setup, maximizing useful compute time (“ML Goodput”) during training is crucial, especially at scale. Wasted cycles due to job failures can significantly inflate costs and delay results. To help, we provide techniques and ready-to-use recipes to tackle these challenges.
Techniques like asynchronous and multi-tier checkpointing increase checkpoint frequency without slowing down training and speed up save/restore operations. AI Hypercomputer can automatically handle interruptions, choosing intelligently between resets, hot-swaps, or scaling actions. Our ML Goodput recipe, created in partnership with NVIDIA, bundles these techniques, integrating NVIDIA NeMo and the NVIDIA Resiliency Extension (NVRx) for a comprehensive solution to boost the efficiency and reliability of your PyTorch training on Google Cloud.
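As a rough illustration of the asynchronous-checkpointing idea, here's a minimal JAX sketch using Orbax, which persists the checkpoint in the background so training keeps running. The actual ML Goodput recipe wires the equivalent up for PyTorch with NeMo and NVRx; the path, toy state, and exact Orbax API (which varies by version) are illustrative.

```python
import jax.numpy as jnp
import orbax.checkpoint as ocp

# Toy training state; in practice this is the full model/optimizer pytree.
state = {"step": jnp.asarray(0), "params": {"w": jnp.ones((1024, 1024))}}

# AsyncCheckpointer writes the checkpoint in a background thread, so the
# training loop can continue while the save completes.
ckptr = ocp.AsyncCheckpointer(ocp.StandardCheckpointHandler())
ckptr.save("/tmp/ckpt/step_0000", args=ocp.args.StandardSave(state))

# ... training continues while the write happens ...

ckptr.wait_until_finished()  # ensure the save landed before the next one
```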
We also added optimized recipes (complete with checkpointing) so you can benchmark training performance across different storage options such as Google Cloud Storage and Parallelstore. Lastly, we added recipes for our A4 instances (built on NVIDIA Blackwell); the training recipes cover sparse and dense model training on up to 512 Blackwell GPUs with PyTorch and JAX.
Cutting-edge JAX LLM development with MaxText
For developers who use JAX for LLMs on Google Cloud, MaxText provides advanced training, tuning, and serving on both TPUs and GPUs. Recently, we added support for key fine-tuning techniques like Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO), alongside resilient training capabilities such as suspend-resume and elastic training. MaxText leverages JAX optimizations and pipeline parallelism techniques that we developed in collaboration with NVIDIA to improve training efficiency across tens of thousands of NVIDIA GPUs. And we also added support and recipes for the latest open models: Gemma 3, Llama 4 training and inference (Scout and Maverick), and DeepSeek v3 training and inference.
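For reference on what the DPO objective actually optimizes, here's a minimal, framework-level sketch of the standard DPO loss in JAX; this is the textbook formulation, not MaxText's internal implementation, and the log-probability inputs and beta value are placeholders.

```python
import jax
import jax.numpy as jnp

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: prefer the chosen response over the rejected
    one, measured relative to a frozen reference model and scaled by beta."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -jnp.mean(jax.nn.log_sigmoid(logits))
```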
To help you get the best performance with Trillium TPU, we added microbenchmarking recipes including matrix multiplication, collective compute, and high-bandwidth memory (HBM) tests scaling up to multiple slices with hundreds of accelerators. These metrics are particularly useful for performance optimization. For production workloads on GKE, be sure to take a look at automatic application monitoring.
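To show the kind of measurement these recipes automate, here's a bare-bones matmul microbenchmark in JAX; the real recipes sweep shapes, dtypes, and multi-slice collectives, so the sizes, dtype, and iteration count below are arbitrary.

```python
import time
import jax
import jax.numpy as jnp

N, ITERS = 8192, 10
a = jnp.ones((N, N), dtype=jnp.bfloat16)
b = jnp.ones((N, N), dtype=jnp.bfloat16)

matmul = jax.jit(lambda x, y: x @ y)
matmul(a, b).block_until_ready()          # warm up / compile once

start = time.perf_counter()
for _ in range(ITERS):
    out = matmul(a, b)
out.block_until_ready()
elapsed = (time.perf_counter() - start) / ITERS

tflops = 2 * N**3 / elapsed / 1e12        # 2*N^3 FLOPs per square matmul
print(f"avg {elapsed*1e3:.1f} ms per matmul, ~{tflops:.1f} TFLOP/s")
```

Comparing the measured throughput against the chip's peak is the same roofline-style reasoning Xprofiler surfaces for full workloads.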
Harness PyTorch on TPU with PyTorch/XLA 2.7 and torchprime
We’re committed to providing an integrated, high-performance experience for PyTorch users on TPUs. To that end, the recently released PyTorch/XLA 2.7 includes notable performance improvements, particularly benefiting users working with vLLM on TPU for inference. This version also adds an important new flexibility and interoperability capability: you can now call JAX functions directly from within your PyTorch/XLA code.
Then, to help you harness the power of PyTorch/XLA on TPUs, we introduced torchprime, a reference implementation for training PyTorch models on TPUs. Torchprime is designed to showcase best practices for large-scale, high-performance model training, making it a great starting point for your PyTorch/XLA development journey.
Build cutting-edge recommenders with RecML
While generative AI often captures the spotlight, highly effective recommender systems remain a cornerstone of many applications, and TPUs offer unique advantages for training them at scale. Deep-learning recommender models frequently rely on massive embedding tables to represent users, items, and their features, and processing these embeddings efficiently is crucial. This is where TPUs shine, particularly with SparseCore, a specialized integrated dataflow processor. SparseCore is purpose-built to accelerate the lookup and processing of the vast, sparse embeddings that are typical in recommenders, dramatically speeding up training compared to alternatives.
To help you leverage this power, we now offer RecML: an easy-to-use, high-performance, large-scale deep-learning recommender system library optimized for TPUs. It provides reference implementations for training state-of-the-art recommender models such as BERT4Rec, Mamba4Rec, SASRec, and HSTU. RecML uses SparseCore to maximize performance, making it easy for you to efficiently utilize the TPU hardware for faster training and scaling of your recommender models.
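To make the workload concrete: the hot path in these models is essentially a very large gather, looking up a few embedding rows per example from enormous tables and pooling them. Here's a toy JAX version of that pattern; the table size, batch shape, and sum-pooling are illustrative, and RecML with SparseCore handles the real sharded, hardware-accelerated equivalent.

```python
import jax
import jax.numpy as jnp

VOCAB, DIM, BATCH, IDS_PER_EXAMPLE = 100_000, 128, 4096, 32

# A (small) embedding table; production tables are orders of magnitude larger.
table = jax.random.normal(jax.random.PRNGKey(0), (VOCAB, DIM))

# Each example references a few sparse feature ids (e.g. items a user clicked).
ids = jax.random.randint(jax.random.PRNGKey(1), (BATCH, IDS_PER_EXAMPLE), 0, VOCAB)

@jax.jit
def embed(table, ids):
    # Gather the referenced rows and sum-pool them per example; this sparse
    # lookup pattern is what SparseCore accelerates in hardware.
    return jnp.take(table, ids, axis=0).sum(axis=1)   # (BATCH, DIM)

pooled = embed(table, ids)
```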
Build with us!
Improving the AI developer experience on Google Cloud is an ongoing mission. From scaling your interactive experiments with Pathways, to pinpointing bottlenecks with Xprofiler, to getting started faster with optimized containers and framework recipes, these AI Hypercomputer improvements remove friction so you can innovate faster, building on the other innovations we announced at Google Cloud Next 25.
Explore these new features, spin up the container images, try the JAX and PyTorch recipes, and contribute back to open-source projects like MaxText, torchprime, and RecML. Your feedback shapes the future of AI development on Google Cloud. Let’s build it together.