Google, ByteDance, and Red Hat make Kubernetes generative AI inference aware
Over the past ten years, Kubernetes has become the leading platform for deploying cloud-native applications and microservices, backed by an extensive community and boasting a comprehensive feature set for managing distributed systems. Today, we are excited to share that Kubernetes is now unlocking new possibilities for generative AI inference.
In partnership with Red Hat and ByteDance, we are introducing new capabilities that optimize load balancing, scaling, and model server performance on Kubernetes clusters running large language model (LLM) inference. These capabilities build on the success of LeaderWorkerSet (LWS), which enables multi-host inference for state-of-the-art models (including ones with 671B parameters), and push the envelope on what’s possible for gen AI inference on Kubernetes.
First, the new Gateway API Inference Extension now supports LLM-aware routing, rather than traditional round-robin load balancing. This makes it more cost-effective to operationalize popular Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA) at scale, by serving a shared base model and dynamically loading fine-tuned models (‘adapters’) based on user need. To support PEFT natively, we also introduced new APIs, namely InferencePool and InferenceModel.
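To make the two new APIs concrete, here is a minimal sketch of how an InferencePool and an InferenceModel might be declared together. This is a hedged illustration, not a definitive reference: the API version and field names follow early alpha releases of the Gateway API Inference Extension and may have changed, and the pool name, selector labels, model name, and adapter name are all hypothetical.

```yaml
# InferencePool: groups the model-server pods that an inference
# gateway can route to (selector labels are illustrative).
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama
  extensionRef:
    name: vllm-llama-endpoint-picker   # endpoint-picker extension (name illustrative)
---
# InferenceModel: maps a user-facing model name onto the pool,
# optionally targeting a dynamically loaded LoRA adapter.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review          # name clients request
  criticality: Standard
  poolRef:
    name: vllm-llama-pool
  targetModels:
  - name: food-review-lora        # hypothetical LoRA adapter served by the pool
    weight: 100
```

The key design idea is the split of concerns: the InferencePool describes *where* requests can go (a fleet of model servers), while each InferenceModel describes *what* is being served (a base model or adapter and its criticality), which is what lets the gateway make LLM-aware rather than round-robin decisions. Consult the project documentation for the current schema.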
Second, a new inference performance project provides a benchmarking standard that yields detailed model performance insights on accelerators, along with Horizontal Pod Autoscaler (HPA) scaling metrics and thresholds. With the growth of gen AI inference on Kubernetes, it’s important to be able to measure the performance of serving workloads alongside the performance of model servers, accelerators, and Kubernetes orchestration.
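One way benchmark-derived thresholds feed back into Kubernetes is through an HPA that scales a model-server Deployment on a serving metric rather than CPU. The sketch below assumes a vLLM Deployment named `vllm-server` and a queue-depth metric surfaced through a custom-metrics adapter; the metric name and the target value of 10 waiting requests are illustrative stand-ins for whatever thresholds benchmarking establishes for your accelerator.

```yaml
# Hypothetical HPA scaling a vLLM deployment on request queue depth.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server          # illustrative Deployment name
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting   # illustrative metric name from a custom-metrics adapter
      target:
        type: AverageValue
        averageValue: "10"                # illustrative threshold from benchmarking
```

Scaling on queue depth rather than CPU reflects how LLM serving actually saturates: accelerator-bound servers can show modest CPU usage while requests pile up, so a serving-level metric is usually the more faithful scaling signal.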
Third, Dynamic Resource Allocation, developed with Intel and others, simplifies and automates how Kubernetes allocates and schedules GPUs, TPUs, and other devices to pods and workloads. When used along with the vLLM inference and serving engine, the community benefits from scheduling efficiency and portability across accelerators.
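With DRA, a pod requests devices through a named claim rather than an opaque resource count, which is what enables the scheduling efficiency and cross-accelerator portability described above. The following is a minimal sketch assuming the `resource.k8s.io` DRA API as of its beta and a DRA driver that publishes a DeviceClass named `gpu.example.com` (the class name, image, and object names are illustrative):

```yaml
# A claim template the scheduler resolves against available devices.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com   # illustrative DeviceClass from a DRA driver
---
# A pod that consumes one device via the claim template.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-pod
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    resources:
      claims:
      - name: gpu            # binds the container to the claim below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```

Because the pod names a DeviceClass instead of a vendor-specific resource like `nvidia.com/gpu`, the same manifest shape can target different accelerators by swapping the class, which is the portability benefit DRA brings to engines like vLLM.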
“Large-scale inference with scalability and flexibility remains a challenge on Kubernetes. We are excited to collaborate with Google and the community on the Gateway API Inference Extension project to extract common infrastructure layers, creating a more unified and efficient routing system for AI serving — enhancing both AIBrix and the broader AI ecosystem.” – Jiaxin Shan, Staff Engineer at ByteDance, and Founder at AIBrix
“We’ve been collaborating with Google on various initiatives in the Kubernetes Serving working group, including a shared benchmarking tool for gen AI inference workloads. Working with Google, we hope to contribute to a common standard for developers to compare single-node inference performance and scale out to the multi-node architectures that Kubernetes brings to the table.” – Yuan Tang, Senior Principal Software Engineer, Red Hat
“We are partnering with Google to improve vLLM for operationalizing deployments of open-source LLMs for enterprise, including capabilities like LoRA support and Prometheus metrics that enable customers to benefit across the full stack, right from vLLM to Kubernetes primitives such as Gateway. This deep partnership across the stack ensures customers get production-ready architectures to deploy at scale.” – Robert Shaw, vLLM Core Committer and Senior Director of Engineering, Neural Magic (acquired by Red Hat)
Together, these projects allow customers to qualify and benchmark accelerators with the inference performance project, operationalize scale-out architectures with LLM-aware routing via the Gateway API Inference Extension, and gain scheduling efficiency and fungibility across a wide range of accelerators with DRA and vLLM. To try out these new capabilities for running gen AI inference on Kubernetes, visit Gateway API Inference Extension, the inference performance project, or Dynamic Resource Allocation. Also, be sure to visit us at KubeCon in London this week, where we’ll be participating in the keynote as well as many other sessions. Stop by Booth S100 to say hi!