GCP – New GKE inference capabilities reduce costs and tail latency, and increase throughput
Inference is where today’s generative AI models solve real-world business problems, and Google Kubernetes Engine (GKE) is seeing increasing adoption for gen AI inference. For example, customers like HubX run inference on image-based models, serving over 250k images/day to power gen AI experiences, and Snap runs AI inference on GKE for its ad ranking system.
However, deploying gen AI inference comes with challenges. First, during the evaluation phase, you have to assess your accelerator options and choose the right one for your use case. While many customers are interested in using Tensor Processing Units (TPUs), they want compatibility with popular model servers. Then, once you’re in production, you need to load-balance traffic, manage price-performance with real traffic at scale, monitor performance, and debug any issues that arise.
To help, this week at Google Cloud Next, we introduced new gen AI inference capabilities for GKE:
- GKE Inference Quickstart, to help you set up inference environments according to best practices
- GKE TPU serving stack, to help you easily benefit from the price-performance of TPUs
- GKE Inference Gateway, which introduces gen-AI-aware scaling and load-balancing techniques
Together, these capabilities can reduce serving costs by over 30% and tail latency by 60%, and increase throughput by up to 40%, compared to other managed and open-source Kubernetes offerings.
GKE Inference Quickstart
GKE Inference Quickstart helps you select and optimize the best accelerator, model server, and scaling configuration for your AI/ML inference applications. It includes information about instance types, their model compatibility across GPUs and TPUs, and benchmarks showing how a given accelerator can help you meet your performance goals. Then, once your accelerators are configured, GKE Inference Quickstart can help you with Kubernetes scaling and provides new inference-specific metrics. In future releases, GKE Inference Quickstart will be available as a Gemini Cloud Assist experience.
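For a sense of what a Quickstart-style scaling configuration can look like, here is a minimal, hypothetical sketch of a HorizontalPodAutoscaler that scales a model-server Deployment on an inference-oriented queue-depth signal rather than CPU. The Deployment name and the metric name are illustrative assumptions, not Quickstart output, and exposing a per-pod metric like this requires a custom metrics adapter in the cluster.

```yaml
# Hypothetical sketch: scale a model server on queue depth instead of CPU.
# The Deployment name and metric name below are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference              # placeholder model-server Deployment
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting   # placeholder per-pod queue-depth metric
      target:
        type: AverageValue
        averageValue: "10"            # scale out when >10 requests are queued per replica
```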
GKE TPU serving stack
With support for TPUs in vLLM, one of the leading open-source model servers, you get seamless portability across GPUs and TPUs. This means you can take any open model, select the vLLM TPU container image, and deploy on GKE without any TPU-specific changes. GKE Inference Quickstart also recommends TPU best practices, so you can run on TPUs without switching costs. For customers who want to run state-of-the-art models, Pathways, used internally at Google for large models like Gemini, lets you run multi-host and disaggregated serving.
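As an illustration of how little TPU-specific configuration this involves, here is a minimal, hypothetical Deployment sketch for serving an open model with vLLM on a GKE TPU slice. The container image path, model name, accelerator type, topology, and chip count are placeholder assumptions rather than a recommended configuration.

```yaml
# Hypothetical sketch: serving an open model with vLLM on a GKE TPU node pool.
# The TPU accelerator type, topology, image path, and model are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-tpu
  template:
    metadata:
      labels:
        app: vllm-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice  # example TPU type
        cloud.google.com/gke-tpu-topology: 2x4                      # example slice topology
      containers:
      - name: vllm
        image: us-docker.pkg.dev/your-project/vllm/vllm-tpu:latest  # placeholder image path
        args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]       # placeholder open model
        resources:
          limits:
            google.com/tpu: "8"       # chips requested for this example topology
        ports:
        - containerPort: 8000
```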
GKE Inference Gateway
GKE Gateway is an abstraction, backed by a load balancer, that routes incoming requests to your Kubernetes applications. Traditionally it has been tuned for web-serving applications, whose requests follow very predictable patterns, using load-balancing techniques such as round-robin. LLM requests, however, are highly variable, which can lead to high tail latencies and uneven compute utilization; that degrades the end-user experience and unnecessarily increases inference costs. In addition, the traditional Gateway does not provide routing support for popular Parameter-Efficient Fine-Tuning (PEFT) techniques like Low-Rank Adaptation (LoRA), which can increase GPU efficiency by reusing a base model across adapters during inference.
For scale-out scenarios, the new GKE Inference Gateway provides gen-AI-aware load balancing for optimal routing. With GKE Inference Gateway, you can define routing rules for safe rollouts, cross-regional preferences, and performance goals such as priority. Finally, GKE Inference Gateway supports LoRA, which lets you map multiple fine-tuned models onto the same underlying service for better efficiency.
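To make this concrete, here is a minimal, hypothetical sketch using resources from the open-source Gateway API inference extension that GKE Inference Gateway builds on: an InferencePool that groups the model-server Pods, and an InferenceModel that maps a served model name (backed by a LoRA adapter) onto that pool with a priority hint. The API group, version, field names, and all resource and adapter names shown are assumptions and may differ from the shipped API.

```yaml
# Hypothetical sketch of gen-AI-aware routing: an InferencePool groups the
# model-server Pods behind the gateway, and an InferenceModel maps a served
# model name (including a LoRA adapter) onto that pool with a criticality hint.
# API group/version and field names are assumptions and may differ.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-inference             # placeholder label on the model-server Pods
  extensionRef:
    name: vllm-endpoint-picker      # placeholder endpoint-picker service
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: product-catalog-assistant
spec:
  modelName: product-catalog-assistant     # model name clients request
  criticality: Critical                     # priority hint for load balancing
  poolRef:
    name: vllm-pool
  targetModels:
  - name: llama-3.1-8b-lora-product-v1      # placeholder LoRA adapter served by the pool
    weight: 100
```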
To summarize, the visual below shows the needs of the customers during the different stages of the AI inference journey, and how GKE Inference Quickstart, GKE TPU serving stack and GKE Inference Gateway help simplify the evaluation, onboarding and production phases.
What our customers are saying
“Using TPUs on GKE, especially the newer Trillium for inference, particularly for image generation, has reduced latency by up to 66%, leading to a better user experience and increased conversion rates. Users get responses in under 10 seconds instead of waiting up to 30 seconds. This is crucial for user engagement and retention.” – Cem Ortabas, Co-founder, HubX
“Optimizing price-performance for generative AI inference is key for our customers. We are excited to see GKE Inference Gateway with its optimized load balancing and extensibility in open source. The new GKE Inference Gateway capabilities could help us further improve performance for our customers’ inference workloads.” – Chaoyu Yang, CEO & Founder, BentoML
GKE’s new inference capabilities give you a powerful set of tools to take the next step with AI. To learn more, join our GKE gen AI inference breakout session at Next 25, and hear how Snap re-architected its inference platform.