Kubernetes, your AI superpower: How Google Kubernetes Engine powers AI innovation
The age of AI is now. In fact, the global AI infrastructure market is on track to increase to more than $200 billion by 2028.
However, working with massive data, intricate models, and relentless iterations isn’t easy, and adapting to this new era can be daunting. Platform engineering and infrastructure teams that have invested in Kubernetes may wonder: “After years of building expertise in container orchestration to operate production workloads at scale, how do I enable this next generation of AI workloads?”
The good news? You don’t need to start from scratch. You’re well on your way — your Kubernetes skills and investments aren’t just relevant, they’re your AI superpower.
Today, at Google Cloud Next, we’re announcing significant improvements to Google Kubernetes Engine (GKE) to help platform teams succeed with AI:
- Cluster Director for GKE, now generally available, lets you deploy and manage large clusters of accelerated VMs with compute, storage, and networking — all operating as a single unit.
- GKE Inference Quickstart, now in public preview, simplifies the selection of infrastructure and deployment of AI models, while delivering benchmarked performance characteristics.
- GKE Inference Gateway, now in public preview, provides intelligent routing and load balancing for AI inference on GKE.
- A new container-optimized compute platform is rolling out on GKE Autopilot today, and in Q3, Autopilot’s compute platform will be made available to standard GKE clusters.
- Gemini Cloud Assist Investigations, now in private preview, helps with GKE troubleshooting, decreasing the time it takes to understand the root cause and resolve issues.
- In partnership with Anyscale, RayTurbo on GKE will launch later this year to deliver superior GPU/TPU performance, rapid cluster startup, and robust autoscaling.
Read on for more details about these announcements.
Scale your AI workloads with Cluster Director for GKE
As AI models grow in size and demand more machines for compute processing, platform teams need to deliver new architectures to deploy models across multiple hosts and operate massive clusters of GPUs and TPUs as a single unit. Without these capabilities, customers often struggle to complete large training jobs and to deliver the inter-machine performance they need for AI.
To handle these scaling challenges, our supercomputing service, Cluster Director for GKE (formerly Hypercompute Cluster), is now generally available. With Cluster Director for GKE, you can deploy and manage large clusters of accelerated VMs with compute, storage, and networking — all operating as a single unit. It delivers exceptionally high performance and resilience for large distributed workloads by automatically repairing faulty nodes based on cluster health checks.
One of the best things about Cluster Director for GKE is that you can orchestrate all of this through standard Kubernetes APIs and ecosystem tooling. There are no new platforms — just new capabilities on the platform you already know and love. You can use GKE node labels to:
- Schedule pods based on network topology to maximize efficiency and minimize network hops (an illustrative manifest follows this list).
- Report and replace faulty nodes by gracefully evicting workloads from the node and automatically replacing them with spare capacity within your co-located zone.
- Manage host maintenance so you can manually start host maintenance from GKE or use maintenance information while scheduling your workloads.
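To make topology-aware scheduling concrete, here is a minimal sketch of a pod that asks the scheduler to keep training replicas on the same network block. The `cloud.google.com/gce-topology-block` label key, image, and GPU count are illustrative assumptions for this sketch; check the labels your Cluster Director for GKE nodes actually expose.

```yaml
# Illustrative only: co-locate training replicas within one network block
# by using a topology node label as the pod-affinity domain.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
  labels:
    app: trainer
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: trainer
          # Assumed topology label; replicas that share this label value
          # sit in the same network block, minimizing cross-block hops.
          topologyKey: cloud.google.com/gce-topology-block
  containers:
  - name: trainer
    image: us-docker.pkg.dev/my-project/train/worker:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 8
```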
To get started with Cluster Director for GKE, use configurable blueprints in Cluster Toolkit, or use the Accelerated Processing Kit (XPK), a command-line tool that requires no prior Kubernetes knowledge.
Make your apps smarter with inferencing on GKE
We are seeing a clear trend in the age of AI: amazing innovation is happening where traditional compute interacts with neural networks — otherwise known as “inference.” Companies operating at the cutting edge of Kubernetes and AI, like LiveX and Moloco, run AI inference on GKE.
Customers and platform teams deploying AI inference on Kubernetes tell us they face two key challenges:
- Balancing performance and cost: Tuning accelerators to meet the right performance targets without overprovisioning requires extensive knowledge of Kubernetes, AI models, GPU/TPU accelerators, and specific inferencing metrics like Time To First Token (TTFT).
- Model-aware load balancing: With AI models, response length is often highly variable from one request to another, so response latency varies widely. This means traditional load balancing techniques like round-robin can break down, exacerbating latency and underutilizing accelerator resources.
To address these challenges, we’re introducing new AI inference capabilities in GKE:
- A new GKE Inference Quickstart, now in public preview, lets you pick an AI model and then provides a set of benchmarked profiles to choose from. The profiles include infrastructure configuration, GPU/TPU accelerator configuration, and the Kubernetes resources needed to match a set of AI performance characteristics like TTFT.
- GKE Inference Gateway, now in public preview, reduces serving costs by up to 30% and tail latency by up to 60%, and increases throughput by up to 40%. Customers get a model-aware gateway optimized for intelligent routing and load balancing, including advanced features for routing to different model versions (a routing sketch follows this list).
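To give a rough sense of how model-aware routing is expressed, the sketch below pairs an InferencePool from the Gateway API inference extension with a standard HTTPRoute. The apiVersion, kinds, and field names are assumptions based on the upstream extension and may differ from the GKE Inference Gateway preview; the pod selector and port are placeholders.

```yaml
# Illustrative sketch: route LLM traffic through a model-aware pool.
apiVersion: inference.networking.x-k8s.io/v1alpha2   # assumed version
kind: InferencePool
metadata:
  name: llm-pool
spec:
  selector:
    app: vllm-llama3          # the model-server pods to load-balance across
  targetPortNumber: 8000      # port the model server listens on
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway   # an existing Gateway resource
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: llm-pool
```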
The best solutions to complex problems meet you where you are and point you to what’s next. The combination of the GKE Inference Quickstart and GKE Inference Gateway does just that.
Optimize workloads with GKE Autopilot
Optimizing cloud usage and cost savings continues to be a top priority for both Google Cloud and cloud users, with 71% of cloud users naming it as their top initiative for the year. If you’re running web and API servers, queue processors, CI/CD agents, or other common workloads, you’re likely over-provisioning some resources to make your apps more responsive. GKE customers often request more compute resources than they use, leading to underutilization and unnecessary costs.
In 2021, we launched GKE Autopilot to combat overprovisioning. Autopilot dramatically simplifies Kubernetes cluster operations and enhances resource efficiency. More and more customers, including Toyota and Contextual AI, are turning to Autopilot for critical workloads. In fact, 30% of active GKE clusters created in 2024 were created in Autopilot mode.
Today, we are announcing new performance improvements to GKE Autopilot, including faster pod scheduling, scaling reaction time, and capacity right-sizing — all made possible by unique hardware capabilities only available on Google Cloud. With Autopilot, your cluster capacity is always right-sized, allowing you to serve more traffic with the same resources or existing traffic with fewer resources.
Currently, Autopilot consists of a best-practice cluster configuration and a container-optimized compute platform that automatically right-sizes capacity to match your workloads. Many customers have told us that they want to right-size capacity on their existing clusters without having to use a specific cluster configuration. To help, starting in Q3, Autopilot’s container-optimized compute platform will also be available to standard GKE clusters, without requiring a specific cluster configuration.
Save time with Gemini Cloud Assist
Nothing slows down the pace of innovation like having to diagnose and debug a problem in your application. Gemini Cloud Assist provides AI-powered assistance across the application lifecycle, and we’re unveiling the private preview of Gemini Cloud Assist Investigations, which helps you understand root cause and resolve issues faster.
The best part? It’s all available right from the GKE console, so you can spend less time troubleshooting and more time innovating. Sign up for the private preview to be able to:
- Diagnose pod and cluster issues from the GKE console — even when the root cause lies in related Google Cloud resources such as nodes, IAM, or load balancers.
- See observations from logs and errors across multiple GKE services, controllers, pods, and underlying nodes.
Kubernetes is the open infrastructure platform for AI
For organizations looking for a comprehensive machine learning platform, we recommend Vertex AI, a unified AI development platform for building and using generative AI on Google Cloud. With access to Vertex AI Studio, Agent Builder, and over 200 models in Model Garden — plus the ability to call it from GKE — it’s a great option if you’re looking for an easy-to-use solution.
Over the last decade, Kubernetes has earned its spot as the de facto standard for hosting cloud-native applications and microservices. Today, organizations that need deep control over their infrastructure are once again turning to Kubernetes to build their AI training and inference platforms. In fact, IBM, Meta, NVIDIA, and Spotify all use Kubernetes for their AI/ML workloads.
To make Kubernetes an even better platform for AI, we’re proud to lead and contribute alongside these companies, and many others in the Cloud Native Computing Foundation, to create exciting open source innovations:
- Built with Intel, NVIDIA, and more, Dynamic Resource Allocation simplifies and automates hardware allocation and scheduling to pods and workloads.
- Developed in conjunction with Apple, Red Hat, and others, Kueue and JobSet provide powerful AI training orchestration, streamline job management, and optimize accelerator utilization (a minimal Kueue example follows this list).
- Partnering with DaoCloud, we built LeaderWorkerSet, which enables deployment and management of large, multi-host AI inference models via a Kubernetes-native API.
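For a sense of how Kueue fits into day-to-day training workflows, here is a minimal sketch of a Job submitted through a Kueue queue; the queue name, image, and resource sizes are placeholders, and a LocalQueue named team-a-queue is assumed to exist in the namespace.

```yaml
# Illustrative sketch: a training Job that Kueue admits only when
# quota is available in the backing ClusterQueue.
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-job
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue   # assumed LocalQueue name
spec:
  parallelism: 4
  completions: 4
  suspend: true          # created suspended; Kueue unsuspends it on admission
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: us-docker.pkg.dev/my-project/train/finetune:latest  # placeholder
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
```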
Empower data scientists and AI/ML engineers with Ray on GKE
Platform teams have historically relied on Kubernetes and GKE to serve the needs of software engineers building microservices and related cloud-native applications. With increased AI usage, these same platform teams now need to serve a new user base: data scientists and AI/ML engineers. However, most data scientists and AI/ML engineers aren’t familiar with Kubernetes and need a simpler, more approachable way of interacting with distributed infrastructure.
To flatten that steep learning curve, many organizations turn to Ray, an open-source framework that provides an easy way for AI/ML engineers to develop Python code on their laptops and then scale that same code elastically across a Kubernetes cluster.
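If you already run open-source Ray on Kubernetes, the KubeRay operator expresses a cluster as a RayCluster resource, roughly as sketched below. Image tags and sizes are placeholders, and nothing here is specific to RayTurbo.

```yaml
# Illustrative sketch: a small Ray cluster managed by the KubeRay operator.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-demo
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0          # placeholder image tag
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
  workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 2
    minReplicas: 0
    maxReplicas: 8        # lets the worker group scale elastically with load
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0-gpu      # placeholder image tag
          resources:
            limits:
              nvidia.com/gpu: 1
```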
We’re committed to making Kubernetes the best platform for using Ray, and have been working closely with Anyscale, the creators of Ray, to optimize open-source Ray for Kubernetes. Today, in partnership with Anyscale, we’re announcing RayTurbo on GKE, an optimized version of open-source Ray. RayTurbo delivers 4.5x faster data processing and requires 50% fewer nodes for serving. On GKE, RayTurbo will take advantage of faster GKE startup times, high-performance storage for model weights, TPUs, and better resource efficiency with dynamic nodes and superior pod scalability. RayTurbo on GKE will launch later this year.
Better together: GKE and AI
AI introduces new challenges for platform teams, but we’re confident that the technology and products that you already use — Kubernetes and GKE — can help you tackle them. With the right foundation, platform teams can expand their scope to support data scientists and AI/ML engineers in addition to software engineers.
This confidence is grounded in experience. At Google, we use GKE to power our leading AI services — including Vertex AI — at scale, relying on the same technologies and best practices that we’re sharing with you today.
Ready to start using Kubernetes for the next generation of AI workloads? We look forward to watching you succeed on GKE.