At Google Cloud, we’re continuously working on Google Kubernetes Engine (GKE) scalability so it can run increasingly demanding workloads. Recently, we announced that GKE can support a massive 65,000-node cluster, up from 15,000 nodes. This signals a new era of possibilities, especially for AI workloads and their ever-increasing demand for large-scale infrastructure.
This groundbreaking achievement was built upon Google’s prior experience training large language models (LLMs) on a 50,000+ chip TPU cluster. By leveraging technologies like TPU multislice training, GKE can now handle a massive number of virtual machines while addressing challenges in resource management, scheduling, and inter-node communication.
In addition to demonstrating GKE’s capability to handle extreme scale, this breakthrough also offers valuable insights for optimizing large-scale AI training on Google Cloud.
Running large AI workloads on Kubernetes means running both resource-intensive training and dynamic inference tasks. You need a huge amount of computational resources to train a massive, interconnected model. Simultaneously, inference workloads need to scale efficiently in response to changing customer demand. Mixing training and inference — two workloads with different characteristics — on the same cluster presents a number of complexities that need to be addressed.
In this blog post, we explore a benchmark that simulates these massive AI workloads on a 65,000-node GKE cluster. As we look to develop and deploy even larger LLMs on GKE, we regularly run this benchmark against our infrastructure as a continuous integration (CI) test. We look at its results in detail, as well as the challenges we faced and ways to mitigate them.
Benchmark design
As with any benchmark, the devil is in the details. Here are some of the requirements we set for our test environment:
- CPU-only: For the purpose of benchmarking the Kubernetes control plane, we opted to use CPU-only machines, which is a much more cost-effective way to measure the performance of the cluster on a large scale compared to GPUs or TPUs.
- Cluster size: At the start of the benchmark, we created a 65,000-node cluster. We assumed the cluster would not need to autoscale at the node level, but that workloads would dynamically change in size and could be stopped and restarted.
- Real-life scenarios: We wanted to show the GKE cluster’s ability to accommodate scaling, ease of use, and workload fungibility between training and inference based on real-life scenarios and use cases. As such, the benchmark focused on scenario-related metrics like scheduler throughput. Specifically, we prioritized a usage pattern that combines a very large training job (50K+ nodes) with a scalable inference workload.
Cluster setup
We created the 65,000-node cluster using a publicly available Terraform configuration, with variables to set the cluster name and project. To achieve this scale, we followed best practices from the GKE documentation on planning large GKE clusters.
kube-scheduler
We also used a customized kube-scheduler configuration for our simulated workload. At 500 bindings per second, we were able to schedule the large-scale workload quickly while keeping the cluster's resources highly utilized.
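GKE manages the scheduler for you, but as an illustration of the kind of tuning this implies, here is a minimal sketch using the open-source KubeSchedulerConfiguration API. The values are assumptions chosen to match the throughput described above, not GKE's internal settings.

```yaml
# Illustrative open-source KubeSchedulerConfiguration tuned for high
# scheduling/binding throughput; values are assumptions, not GKE internals.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  qps: 500                      # sustained API requests/s, enough for ~500 bindings/s
  burst: 1000                   # allow short bursts above the sustained rate
percentageOfNodesToScore: 5     # score only a sample of feasible Nodes at 65,000-Node scale
parallelism: 32                 # number of scheduling algorithm worker threads
```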
Simulating the AI workload
In our experiment, we used a StatefulSet with Pods running sleep containers (minimal containers that run a sleep command for the lifetime of the Pod) to simulate the behavior of a large-scale AI workload. This allowed us to closely examine resource allocation and scheduling within the Kubernetes cluster without having to run distributed AI workloads on CPU-based VMs. When designing the workload, we made the following design decisions:
- Choosing the right Kubernetes workload: For our test setup, we focused on the StatefulSet API, which is commonly used in generative AI workloads. We used a headless Service for the StatefulSet to mimic communication between Pods within the distributed training workload.
- Ensuring a single user-workload Pod per node (in addition to DaemonSets): We configured the StatefulSet so that only one Pod was scheduled per Node, which reflects how most users currently run their AI workloads. We did this by specifying a hostPort within the StatefulSet's Pod spec template (see the sketch after this list).
- Simulating "all-or-nothing" preemption: To accurately reflect the dynamics of AI workloads, especially the "all-or-nothing" nature of many distributed training jobs, we implemented a manual scale-down mechanism. This means we trigger scale-down of the training workload right after the inference workload scales up.
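Below is a minimal sketch of what such a simulated workload can look like. The names, image, and replica count are illustrative assumptions rather than the exact objects used in the benchmark; the key elements are the headless Service, the sleep container, and the hostPort that limits the workload to one Pod per Node.

```yaml
# Hypothetical headless Service + StatefulSet approximating the simulated
# training workload (names, image, and sizes are illustrative).
apiVersion: v1
kind: Service
metadata:
  name: training-workload
spec:
  clusterIP: None                  # headless Service, mimics pod-to-pod training traffic
  selector:
    app: training-workload
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: training-workload
spec:
  serviceName: training-workload
  replicas: 50000                  # varied per phase (50,000-65,000 in our scenarios)
  podManagementPolicy: Parallel    # create Pods in parallel instead of one by one
  selector:
    matchLabels:
      app: training-workload
  template:
    metadata:
      labels:
        app: training-workload
    spec:
      containers:
      - name: sleep
        image: busybox:1.36        # minimal container; just sleeps for the Pod's lifetime
        command: ["sleep", "864000"]
        ports:
        - containerPort: 8080
          hostPort: 8080           # a fixed hostPort allows at most one workload Pod per Node
```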
By employing these techniques, we were able to create a realistic simulation of an AI workload within our Kubernetes cluster. This environment enabled us to thoroughly test the scalability, performance, and resilience of the cluster under a variety of conditions.
Tooling
To develop our benchmark, we used several tools, including ClusterLoader2 to build the performance test, Prow to run the test as part of our continuous integration pipeline, Prometheus to collect metrics, and Grafana to visualize them.
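To give a sense of how such a test is expressed, here is a heavily trimmed ClusterLoader2 config sketch; the names, template path, and counts are illustrative rather than the exact configuration we ran.

```yaml
# Trimmed ClusterLoader2 test sketch (illustrative names, paths, and counts).
name: large-training-and-inference
namespace:
  number: 1
tuningSets:
- name: FastQPS
  qpsLoad:
    qps: 500                       # object creation rate used by the phases below
steps:
- name: Create training workload
  phases:
  - namespaceRange:
      min: 1
      max: 1
    replicasPerNamespace: 1        # one StatefulSet object per namespace
    tuningSet: FastQPS
    objectBundle:
    - basename: training-workload
      objectTemplatePath: statefulset.yaml   # hypothetical template file
      templateFillMap:
        Replicas: 50000
```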
Performance test
Our simulated scenario mimics a mix of AI workloads: AI training and AI inference. There are five phases, with each simulating a different real-life scenario that occurs over the course of the LLM development and deployment lifecycle.
Phase #1: Single workload — creating a large training workload (StatefulSet) from scratch
In the first phase, we run a large training workload, represented by a StatefulSet with 65,000 Pods. This represents a large-scale distributed training job that spans 65,000 VMs. Each Pod maps to a single Node, utilizing all of the resources accessible within the cluster.
The phase is complete when the training job terminates, ending in an empty cluster.
Phase #2: Mixed workload — training and inference workloads (StatefulSets)
In the second phase, we run a mixed workload environment within a single cluster, highlighting the capability of running different types of workloads and sharing resources. This involves concurrently running one training workload with 50,000 Pods and another inference workload with 15,000 Pods. Each Pod is assigned to a single Node, helping to ensure full utilization of the cluster’s resources. Notably, the inference workload is given higher priority than the training workload.
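This priority split uses the standard Kubernetes PriorityClass mechanism. A minimal sketch follows; the names and values are assumptions, not the exact objects used in the test.

```yaml
# Hypothetical PriorityClass objects implementing the inference-over-training
# priority described above; workloads reference them via spec.priorityClassName.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-high
value: 1000000                          # higher value wins during preemption
preemptionPolicy: PreemptLowerPriority  # inference may evict lower-priority Pods
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-low
value: 1000
preemptionPolicy: PreemptLowerPriority
```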
Phase #3: Scale up of inference workload (StatefulSet), training workload disruption
In Phase #3, we scale up the inference workload, which in real life is typically due to increased traffic or demand on the services. Since the inference workload has higher priority, it preempts the training workload. Once the inference workload is scheduled and running, we recreate the training workload. Given that the training workload has lower priority, it remains in the Pending state as long as the inference workload is running at full capacity.
Phase #4: Scale down inference workload, training workload recovery
Here, we simulate a decrease in traffic on the inference workload, triggering the scale-down of the inference workload from 65,000 Pods back to 15,000 Pods. This enables the pending training workload to be scheduled and run again.
Phase #5: Training workload finishes
Finally, we come to the end of the training, indicated by the termination and deletion of the training workload, which frees up resources in the cluster.
Note: In our tests we used StatefulSets, as this is what large AI model producers use. However, with the latest advancements, the Kubernetes Job and JobSet APIs are now the recommended way to run ML training workloads. Those abstractions were also tested at scale, but in dedicated tests.
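For comparison, a similar training workload expressed with the JobSet API (jobset.x-k8s.io/v1alpha2) might look like the sketch below; the names and sizes are illustrative, and the JobSet controller must be installed in the cluster.

```yaml
# Illustrative JobSet for a distributed training workload (names and sizes
# are assumptions).
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training
spec:
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      spec:
        parallelism: 50000          # one worker Pod per Node
        completions: 50000
        backoffLimit: 0             # "all-or-nothing": fail the Job if any worker fails
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: busybox:1.36   # placeholder training container
              command: ["sleep", "864000"]
```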
Metrics
For our test, we collected relevant metrics using ClusterLoader2's built-in measurements, metrics from Prow logs, and internal GKE metrics.
Key metrics measured by ClusterLoader2 include (see the measurement configuration sketch after this list):
- Pod state transition duration: How long it takes a workload's Pods to change state (e.g., to reach the Running state or to be deleted); this also tracks a workload's in-progress status (i.e., how many Pods are created, running, pending scheduling, or terminated).
- Pod startup latency: The time it takes for a Pod to go from being created to being marked as running.
- Scheduling throughput: The rate at which Pods are successfully assigned to Nodes.
In addition to the ClusterLoader2 measurements, we also measured:
- Cluster creation/deletion time
- Various cluster metrics that are exported to Prometheus (e.g., API server latency metrics)
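For reference, the sketch below shows how measurements like these are declared in a ClusterLoader2 test. The measurement identifiers are ClusterLoader2 built-ins, while the label selector and threshold are illustrative assumptions rather than our exact configuration.

```yaml
# Illustrative ClusterLoader2 measurement steps (not the exact test config).
- name: Start measurements
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: app = training-workload   # hypothetical label
      threshold: 60s                            # illustrative latency threshold
  - Identifier: SchedulingThroughput
    Method: SchedulingThroughput
    Params:
      action: start
      labelSelector: app = training-workload
# ... steps that create, scale, and delete the workloads go here ...
- name: Gather measurements
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather
  - Identifier: SchedulingThroughput
    Method: SchedulingThroughput
    Params:
      action: gather
```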
Benchmark results
The results we present in this document are based on a simulation run at a specific point in time. To provide context, here's a timeline with an explanation of when each phase took place.
Observing workloads
Based on data from ClusterLoader2, we generated the chart below, which summarizes all the phases and how the training and inference workloads interact with one another throughout the performance test.
In Phase #1, we see a smooth workload creation process in which Pods are created quickly and scheduled with only minor delay. The process takes ~2.5 minutes to create, schedule, and run 65,000 Pods on an empty cluster (with caveats; see the previous section).
In phase #2, we observe a similar smooth creation process for the training workload, with 50,000 Pods created in under 2 min from an empty cluster. Moreover, we observe the creation of 15,000 Pods for the inference workload in under a minute from a nearly full cluster, demonstrating the fast scheduling even when the cluster is not empty.
During Phase #3, we observe the scale-up of the inference workload to 65,000 Pods and the disruption and termination of the training workload. Scheduling the inference Pods suffers some delay compared to Phase #2, since they must wait for the training Pods to be evicted from the Nodes. Nonetheless, the entire startup process for the inference workload takes less than four minutes.
After terminating and recreating the training workload, we observe its Pods in the Pending state (as seen between 7:20 and 7:25 in the graph, where the dotted blue line shows created training Pods at 50,000 and the dotted orange line shows running training Pods at 0) while the higher-priority inference workload occupies all 65,000 Nodes.
Cluster performance
We use the metrics collected by Prometheus for information about control-plane performance across the experiment's timeline. For example, you can see the P99 API call latency across various resources, where all API calls, including write calls, stay under 400 ms of latency, well within the 1s threshold; this satisfies the open-source SLO for resource-scoped API calls.
While API call latency provides a general indication of cluster health, particularly for the API server (as demonstrated by the consistently low response times shown previously), Pod creation and binding rates provide a more holistic perspective on overall cluster performance, validating the performance of the various components involved in the Pod startup process. Our benchmark reveals that a standard GKE cluster (without advanced scheduling features) can achieve a Pod creation rate of 500 Pods per second (see graph below).
Metrics results
The table below summarizes the metrics collected through the different phases of the performance test. Please note that these metrics are the result of experiments done at a specific point in time and shouldn't be taken as SLOs or guarantees of performance in all scenarios. Performance may vary across GKE versions.
Final remarks
In this experiment, we showcase the GKE cluster's ability to manage substantial and dynamic workloads. While you can find specific metrics in the table above, here are a few general observations about running large AI workloads on GKE, and the potential implications for your own workloads.
- Scaling efficiency: Our experiment involved rapid scaling of massive workloads, both up and down. Even at this size, scaling was quite efficient: creating a 65,000-Pod StatefulSet and getting all of its Pods running on an empty cluster took only 2 minutes and 24 seconds. Scaling up and down during Phases 3 and 4 was also fast, with the inference workload taking ~4 minutes to scale up from 15,000 to 65,000 Pods (including waiting for the training workload to be preempted) and ~3 minutes to scale back down to 15,000 Pods.
- Image pulling and Pod startup latency: During Phase 1, we experienced some degradation in Pod startup latency, with P100 around 20.4s compared to 5.6s and 5.0s in later phases. This was due to image pull time from Artifact Registry; it wasn't relevant in later phases because Pods used the images already cached on the Nodes. Moreover, in this benchmark we used a small sleep container for the StatefulSet's Pods, a workload that we knew wouldn't cause additional delays that might impact performance. In a real-life scenario with larger images, expect slower initial Pod startup times, since a typical image for an ML workload is often gigabytes in size.
- Workload diversity and its effect on scheduling throughput: The introduction of mixed workloads (training and inference) in Phase #2 and the later scaling and preemption in Phase #3 add a layer of complexity. This affected scheduling throughput, bringing the median/average down from 496/417 Pods/s to 222/208 Pods/s, respectively.
- Performance bottlenecks: Examining detailed metrics can help identify potential bottlenecks. For instance, high Pod startup latency could indicate issues with resource provisioning or image pulling. We observed such issues and were able to bring the initial StatefulSet creation time in Phase 1 down from 12 minutes to 2 minutes 30 seconds by tweaking the setup. This included using Artifact Registry instead of Container Registry, as well as disabling the auto-mounting of service account credentials into the StatefulSet's Pods (using automountServiceAccountToken: false), as sketched below.
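As a concrete sketch of that second tweak, the relevant field lives in the StatefulSet's Pod template; the image path below is a placeholder, not the image we used.

```yaml
# Hypothetical excerpt of the StatefulSet Pod template showing the tweaks
# described above (placeholder Artifact Registry image path).
spec:
  template:
    spec:
      automountServiceAccountToken: false   # skip mounting service account credentials into each Pod
      containers:
      - name: sleep
        image: us-docker.pkg.dev/PROJECT_ID/REPO/sleep:latest   # pull from Artifact Registry
```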
Overall, the experiment’s focus on large-scale workloads makes our results particularly relevant for organizations deploying machine learning or data-processing applications on Kubernetes. The experiments, focused on Kubernetes Control Plane (KCP) performance, are part of our regular CI tests. We are continuously expanding these tests to validate the growing demands of running AI workloads on these massive clusters. Stay tuned for future blog posts exploring more sophisticated scenarios on a 65,000-node cluster, including the use of accelerators and the evaluation of diverse AI workloads on GKE.