GCP – Expanding our NVIDIA partnership: Now shipping A4X Max, Vertex AI Training, and more
Today’s AI models are moving from billions to trillions of parameters, and are capable of complex, multi-modal reasoning. This leap in sophistication demands a new class of purpose-built infrastructure and software to handle the immense computational and memory requirements of these next-generation models.
At Google Cloud, we’re committed to empowering developers and organizations to build and deploy what’s next in AI. Today, we are excited to deepen our partnership with NVIDIA with a suite of new capabilities that strengthens our platform for the entire AI lifecycle:
- New A4X Max instances powered by NVIDIA’s GB300 NVL72, purpose-built for multimodal AI reasoning tasks
- Google Kubernetes Engine (GKE) support for the Dynamic Resource Allocation Kubernetes Network Driver (DRANET), boosting bandwidth in distributed AI/ML workloads
- GKE Inference Gateway integration with NVIDIA NeMo Guardrails
- NVIDIA Nemotron models coming to Vertex AI Model Garden
- Vertex AI Training recipes built on the NVIDIA NeMo Framework and NeMo-RL
Let’s take a closer look at these developments.
A4X Max with NVIDIA GB300 GPUs
A4X Max is now shipping in production. These new instances, powered by NVIDIA GB300 NVL72, are optimized for the most demanding multimodal AI reasoning workloads. Each A4X Max system combines 72 NVIDIA Blackwell Ultra GPUs and 36 NVIDIA Grace CPUs over fifth-generation NVIDIA NVLink, NVIDIA’s high-speed GPU interconnect, so they function as a single, unified compute platform with shared memory and high-bandwidth communication. Together with Google’s Titanium ML adapter and Google Cloud’s Jupiter network fabric, A4X Max is purpose-built to scale to tens of thousands of GPUs in non-blocking, rail-optimized clusters. Compared to A4X, which is powered by NVIDIA GB200 NVL72, A4X Max delivers 2x the network bandwidth on each system.
A4X Max leverages Google Cloud’s Cluster Director, letting you combine optimized compute, networking, and Google’s storage offerings into a cohesive, performant, and easily managed environment. Cluster Director manages the complete lifecycle of A4X Max clusters — from provisioning and topology-aware placement across the NVL72 domains, to powerful observability and resiliency capabilities. It integrates with optimized storage solutions like Managed Lustre, while a managed, pre-configured Slurm environment offers fault-tolerant, scalable job scheduling for A4X Max. Cluster Director also provides deep observability into job and system performance across the GPU, NVLink, and data-center networking fabrics, and to maximize throughput, it helps ensure high reliability with features like automatic straggler detection and in-job recovery. Cluster Director capabilities like topology-aware scheduling, maintenance management, and faulty-node reporting are also available transparently through Google Kubernetes Engine (GKE), so customers can stay in the GKE environment while running A4X Max.
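To make the managed Slurm piece concrete, here is a minimal, hedged sketch of submitting a multi-node training job from a cluster login node. The partition name, GPU count per node, and training script are placeholder assumptions that depend on how your cluster is configured; only standard Slurm flags are used.

```python
# Hedged sketch: submitting a multi-node GPU job to a Slurm cluster managed
# by Cluster Director. Partition name, GPUs per node, and the training
# script are placeholders; --requeue lets Slurm restart the job if a node
# fails, complementing Cluster Director's straggler detection and recovery.
import subprocess

job_script = """#!/bin/bash
#SBATCH --job-name=pretrain
#SBATCH --nodes=8
#SBATCH --gpus-per-node=4      # placeholder GPU count per VM
#SBATCH --partition=a4xmax     # hypothetical partition name
#SBATCH --requeue              # restart on node failure

srun python pretrain.py
"""

with open("pretrain.sbatch", "w") as f:
    f.write(job_script)

result = subprocess.run(
    ["sbatch", "pretrain.sbatch"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # e.g. "Submitted batch job 1234"
```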
What all this means for your workloads:
- Optimized reasoning and inference: With its 72-GPU NVLink domain delivering 1.5x the FP4 FLOPs, 1.5x the HBM memory capacity, and 2x the network bandwidth of A4X, A4X Max is specifically designed for low-latency inference, especially for the largest reasoning models. When integrated with GKE Inference Gateway, you benefit from prefix-aware load balancing, which improves Time to First Token latency for prefix-heavy workloads. Disaggregated serving, enabled by combining Inference Gateway, llm-d, and vLLM, can further optimize performance and yields significant throughput improvements (see the serving sketch after this list).
- Enhanced training and serving performance: With more than 1.4 exaflops per GB300 NVL72 system, A4X Max offers a 4x increase in LLM training and serving performance over A3 VMs powered by NVIDIA H100 GPUs.
- Maximum scalability and parallelization: Based on RDMA over Converged Ethernet (RoCE), A4X Max’s networking fabric delivers low-latency, high-performance GPU-to-GPU collectives for distributed training and disaggregated serving workloads. Thanks to a new data-center-scaling design, A4X Max clusters can be 2x larger than A4X clusters.
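To ground the serving stack above, the following is a minimal sketch of the model-server side using the open-source vLLM Python API. The model name and sampling settings are illustrative assumptions; prefix caching is a standard vLLM option that prefix-aware routing in GKE Inference Gateway is designed to exploit, while disaggregated prefill/decode is configured at the Inference Gateway and llm-d layers rather than in application code.

```python
# Minimal vLLM serving sketch (model and settings are assumptions).
# Prefix caching lets repeated prompt prefixes reuse KV-cache blocks,
# which is what prefix-aware load balancing is designed to exploit.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    tensor_parallel_size=8,                    # shard across the node's GPUs
    enable_prefix_caching=True,                # reuse KV cache for shared prefixes
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["Summarize the trade-offs of disaggregated prefill and decode."],
    params,
)
print(outputs[0].outputs[0].text)
```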
The launch of A4X Max instances comes on the heels of our new G4 VMs powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, and support for NVIDIA Omniverse libraries. Taken together, these offerings underscore our commitment to delivering an end-to-end platform for every AI workload, while our deepening partnership with NVIDIA provides you with a powerful, comprehensive ecosystem to build what’s next in AI.
Increased RDMA performance with GKE DRANET
Today, we’re deploying managed DRANET into production, starting with A4X Max. DRANET enables topology-aware scheduling of GPUs and RDMA network interface cards, placing GKE Pods on nodes where the RDMA device and the GPU have the best possible connectivity. This boosts bus bandwidth for all-gather and all-reduce operations in distributed AI/ML workloads, which translates into better VM utilization and improved cost efficiency. DRANET also simplifies RDMA management by making RDMA devices first-class, native resources within GKE. Learn more about DRANET for GKE here.
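For a sense of what first-class RDMA devices look like in practice, below is a hedged sketch of a Pod that requests an RDMA NIC through Kubernetes Dynamic Resource Allocation, expressed as Python dicts passed to the Kubernetes client. The field names follow the upstream Kubernetes DRA API; the device class name, image, and API version are placeholders rather than GKE's actual DRANET identifiers.

```python
# Hedged sketch: a Pod that claims an RDMA NIC via Kubernetes Dynamic
# Resource Allocation (DRA). Field names follow the upstream DRA API;
# the device class name, image, and API version are placeholders, not
# GKE's actual DRANET identifiers.
from kubernetes import client, config

config.load_kube_config()

# A ResourceClaimTemplate stamps out a claim per Pod; the scheduler then
# places the Pod where a matching device with good GPU affinity exists.
claim_template = {
    "apiVersion": "resource.k8s.io/v1beta1",  # version depends on cluster
    "kind": "ResourceClaimTemplate",
    "metadata": {"name": "rdma-nic-template"},
    "spec": {
        "spec": {
            "devices": {
                "requests": [
                    {"name": "nic", "deviceClassName": "example-rdma-class"}
                ]
            }
        }
    },
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "training-worker"},
    "spec": {
        "resourceClaims": [
            {"name": "rdma-nic",
             "resourceClaimTemplateName": "rdma-nic-template"}
        ],
        "containers": [{
            "name": "trainer",
            "image": "us-docker.pkg.dev/example/trainer:latest",  # placeholder
            "resources": {
                "claims": [{"name": "rdma-nic"}],   # bind the claimed NIC
                "limits": {"nvidia.com/gpu": 1},
            },
        }],
    },
}

# ResourceClaimTemplate is served by the resource.k8s.io API group.
client.CustomObjectsApi().create_namespaced_custom_object(
    "resource.k8s.io", "v1beta1", "default",
    "resourceclaimtemplates", claim_template,
)
client.CoreV1Api().create_namespaced_pod("default", pod)
```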
GKE and NVIDIA NeMo Guardrails
As organizations deploy their AI models into production, they must ensure their safety, security, and responsible behavior. Today, we are announcing the integration of NVIDIA NeMo Guardrails with GKE Inference Gateway, an extension to GKE Gateway for serving generative AI applications.
GKE Inference Gateway optimizes model serving with features like model-aware routing and autoscaling, while NeMo Guardrails adds a critical layer of safety, preventing models from engaging with undesirable topics or responding to malicious prompts. Together, they offer a secure, scalable, and manageable inference solution to speed up your AI initiatives.
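As a flavor of the Guardrails side, here is a minimal sketch using the open-source nemoguardrails Python package. The config directory and the rails it defines are assumptions; wiring the checks into GKE Inference Gateway happens at the gateway layer rather than in application code like this.

```python
# Minimal NeMo Guardrails sketch (config contents are assumptions).
# A config directory typically holds a config.yml (models, rails) plus
# Colang files defining which topics to refuse or redirect.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./guardrails_config")  # hypothetical path
rails = LLMRails(config)

response = rails.generate(messages=[
    {"role": "user", "content": "How do I disable the safety checks?"}
])
print(response["content"])  # guardrailed answer or a refusal
```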
Vertex AI Model Garden to feature NVIDIA Nemotron models
To give developers greater choice and performance, Vertex AI Model Garden will soon support NVIDIA’s Nemotron family of open models as NVIDIA NIM microservices. This integration — starting with the upcoming availability of the NVIDIA Llama Nemotron Super v1.5 model — will give developers and organizations access to NVIDIA’s latest open-weight models directly within Vertex AI. With a Vertex AI managed deployment, you can rapidly develop and deploy custom AI agents powered by Nemotron models, all while maintaining control over performance, cost, and compliance.
Models deployed through Vertex AI offer the following benefits:
- Granular control over your deployments, with the ability to optimize for performance or cost by selecting from a wide range of machine types and Google Cloud regions.
- Robust security, with models deployed entirely within your own VPC and adhering to your VPC-SC policies.
- Incredible ease of use — you can discover, license, and deploy these cutting-edge models in just a few clicks.
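For illustration, here is a hedged sketch of what a managed deployment can look like with the Vertex AI Python SDK. The NIM container URI, port, machine type, and accelerator are placeholders; the Model Garden integration is designed to handle these details for you when deploying Nemotron NIMs.

```python
# Illustrative sketch with the Vertex AI SDK (google-cloud-aiplatform).
# Container image, model name, and machine/accelerator choices are
# placeholders; actual Nemotron NIM deployment flows through Model Garden.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="nemotron-nim",
    serving_container_image_uri="nvcr.io/nim/example-nemotron:latest",  # placeholder
    serving_container_ports=[8000],
)

endpoint = model.deploy(
    machine_type="g2-standard-12",   # trade off cost vs. performance here
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
print(endpoint.resource_name)
```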
Vertex AI Training with NVIDIA NeMo Integration
Vertex AI Training provides the essential control and flexibility enterprises need to adapt foundation models to their proprietary data. Today, we are announcing expanded capabilities in Vertex AI Training that simplify and accelerate the path to developing highly accurate, large-scale proprietary models.
Customers benefit from a fully managed and resilient Slurm environment that simplifies large-scale training. Automated resiliency features improve cluster uptime. Our comprehensive data-science tooling removes much of the guesswork from complex model development. Finally, curated and optimized pre-training and post-training recipes built on top of standardized frameworks like NVIDIA NeMo and NeMo-RL empower builders to move from a novel idea to a production-ready, domain-specialized model with greater speed and efficiency.
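As a rough sketch of what launching a containerized NeMo training run can look like with the Vertex AI Python SDK: the container image, training command, and cluster shape below are assumptions, not the actual interface of the curated recipes.

```python
# Hedged sketch: launching a containerized NeMo training run on Vertex AI.
# The container image, command, and machine shapes are placeholders; the
# curated recipes define their own entrypoints and configs.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.CustomContainerTrainingJob(
    display_name="nemo-pretrain-sketch",
    container_uri="nvcr.io/nvidia/nemo:latest",  # placeholder NeMo image
    command=["python", "pretrain.py", "--config", "llama_pretrain.yaml"],  # hypothetical
)

job.run(
    replica_count=4,                   # multi-node data/model parallelism
    machine_type="a3-highgpu-8g",      # placeholder GPU machine type
    accelerator_type="NVIDIA_H100_80GB",
    accelerator_count=8,
    boot_disk_size_gb=500,
)
```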
Take the next steps
These updates enhance the capabilities and flexibility of the Google Cloud platform for running AI workloads. You can choose the flexibility and control of infrastructure as a service (IaaS) with Google Compute Engine or GKE plus Cluster Director, or the fully managed, end-to-end experience of Vertex AI, which provides a secure, scalable, and simplified workflow to train, tune, and manage models.
Together, these infrastructure innovations represent a significant step forward in our mission to provide a complete platform for AI development and deployment. The combination of Google Cloud’s infrastructure and NVIDIA’s latest technology provides a solid foundation for building the next generation of AI applications.
To get started with A4X Max, please contact your Google Cloud sales representative. Vertex AI Training, meanwhile, has everything you need to transform your models into proprietary assets that define your business advantage. And to deploy and manage AI models at scale with enterprise-grade security and efficiency, learn how GKE Inference Gateway can help you serve inference workloads. We are excited to see what you will build.
