Unlock the AI performance you need: Introducing managed DRANET for A4X Max on GKE
As AI/ML models grow, their infrastructure demands are pushing traditional networking to its limit, creating critical performance bottlenecks. This is especially true for models running on Kubernetes and Google Kubernetes Engine (GKE).
At Google, we’ve been working in the open-source community to make Kubernetes aware of specialized hardware capabilities. For example, we’ve been active in developing the Kubernetes Dynamic Resource Allocation (DRA) framework, a generic API for requesting specialized hardware. Building on DRA, we proposed Dynamic Resource Allocation for Networking (DRANET), which extends the DRA API to manage network interfaces as first-class, schedulable resources, with a focus on performance.
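To make that idea concrete, here is a minimal sketch of a DRA DeviceClass that groups the network interfaces a DRANET driver publishes. It assumes the upstream resource.k8s.io/v1beta1 DRA API and the "dra.net" driver name used by the open-source DRANET project; the class name is hypothetical, and the API version and naming in the managed GKE preview may differ.

```yaml
# Illustrative only: a DRA DeviceClass that selects every device published
# by a DRANET driver ("dra.net" is the open-source project's driver name).
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: rdma-nics               # hypothetical class name
spec:
  selectors:
  - cel:
      expression: 'device.driver == "dra.net"'
```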
Today, we are proud to announce a preview of managed DRANET in Google Kubernetes Engine (GKE), launching first with our brand-new A4X Max instances. Managed DRANET offers an enterprise-grade, integrated solution that intelligently allocates high-performance network interfaces alongside accelerators on Kubernetes, addressing the core challenges of network performance and operational complexity for demanding AI workloads.
Hidden performance bottlenecks in AI networking
DRANET on GKE is specifically designed for AI workloads that run across multiple GPUs. Modern accelerator instances like the new A4X Max use multiple high-throughput RDMA network interfaces to feed those powerful GPUs. However, the traditional Kubernetes networking model has limitations that make it hard to take full advantage of these capabilities:
- Topology blindness: Peak performance requires network interface alignment. To reduce latency, the GPU and its network interface must be physically “close,” ideally on the same non-uniform memory access (NUMA) node. The default Kubernetes scheduler is unaware of this hardware topology, which can lead to sub-optimal pairings and severely degraded performance (see the sketch after this list).
- Poor operational efficiency: The inability to co-schedule NICs and GPUs also leads to sub-optimal resource utilization. This hurts overall cluster performance and efficiency, because the scheduler cannot match available accelerators with the specific network interfaces they require.
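To illustrate how this topology information can become visible to the scheduler, here is a trimmed, hypothetical example of the ResourceSlice a DRANET driver might publish for one node, again using the upstream resource.k8s.io/v1beta1 DRA API. The attribute names are purely illustrative, not the managed product's exact schema.

```yaml
# Hypothetical sketch of a DRA ResourceSlice advertising an RDMA NIC with
# topology attributes (attribute names are illustrative, not authoritative).
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: example-node-dra.net-0
spec:
  driver: dra.net
  nodeName: example-node
  pool:
    name: example-node
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: rdma0
    basic:
      attributes:
        dra.net/numaNode:        # illustrative: NUMA node hosting this NIC
          int: 0
        dra.net/pcieRoot:        # illustrative: PCIe root shared with a GPU
          string: pci0000:00
```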
How GKE with DRANET unlocks performance
When powered by our managed DRANET integration, GKE’s control plane delivers higher performance through:
- Intelligent alignment for higher throughput: This is the core performance win. GKE can now allocate network interfaces that are NUMA-aligned with the assigned GPUs, resulting in lower latency and higher throughput. NUMA alignment can be critical: as detailed in our DRANET research paper, we measured bus-bandwidth improvements of up to 59.6% in a set of internal tests.
- A dynamic resource specification: DRANET lets you express your workload’s networking needs directly in your pod specification. You can ask for a specific number of high-performance network interfaces right alongside your GPU request, as sketched below. GKE then ensures your pod is scheduled only on a node that has both the required GPUs and the specific network interfaces available.
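As an illustration of what such a request can look like, here is a minimal sketch built on the upstream resource.k8s.io/v1beta1 DRA API and the hypothetical rdma-nics DeviceClass from earlier; the image name is a placeholder, and the exact manifests for the managed GKE preview may differ.

```yaml
# Illustrative only: a ResourceClaimTemplate asking for two NICs from the
# hypothetical "rdma-nics" DeviceClass, and a pod that requests GPUs
# alongside that claim so both are considered together at scheduling time.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: rdma-nic-pair
spec:
  spec:
    devices:
      requests:
      - name: nics
        deviceClassName: rdma-nics
        allocationMode: ExactCount
        count: 2                      # ask for a specific number of interfaces
---
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
spec:
  containers:
  - name: worker
    image: example.com/trainer:latest # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 4             # GPU request alongside the NIC claim
      claims:
      - name: rdma
  resourceClaims:
  - name: rdma
    resourceClaimTemplateName: rdma-nic-pair
```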
These are sophisticated, complex processes, but with managed DRANET on GKE, the complexity is abstracted away. You get the performance of a topology-aware cluster with the flexibility and simplicity of a mature, enterprise-grade container orchestration platform.
DRANET and the new A4X Max: a perfect match
Managed DRANET for GKE arrives just in time for the Google Cloud A4X Max instance, our new flagship AI platform based on the NVIDIA GB300 NVL72 rack-scale system. These instances are built for extreme-scale AI and feature multiple RDMA interfaces.
Managed DRANET on GKE unlocks the full performance of this hardware, ensuring every GPU has the dedicated, aligned, low-latency network path it needs. For a deeper dive into the A4X Max instance itself, please read our full launch blog [add-link-here].
The future of AI networking on GKE
The launch of managed DRANET on GKE is a milestone, shifting Kubernetes from topology-agnostic to topology-aware resource management. That’s the power of Google Cloud: pioneering a powerful open-source concept and delivering it as a simple, scalable, managed solution.
To learn more about DRANET and get started:
- Read the A4X Max launch blog
- Get started with DRANET on GKE
- Explore the open source project
- Learn more in the DRANET open source blog
- Go under the covers in the DRANET research paper
