GCP – GKE workload scheduling: Strategies for when resources get tight
As a customer of Google Kubernetes Engine (GKE), you’ve selected a managed container platform with a high degree of managed operations, encompassing everything from automatic upgrades to effortless node management. This inherent efficiency allows you to focus more on your applications, and less on the underlying infrastructure. In an ideal world, this streamlined experience, coupled with GKE’s robust autoscaling capabilities, ensures perfect workload scheduling all the time. Your applications seamlessly scale up and down, always finding the resources they need, precisely when they need them.
Unfortunately, the real world presents a few more challenges that need to be addressed. GKE offers powerful four-way autoscaling (Horizontal Pod Autoscaler, Vertical Pod Autoscaler, Cluster Autoscaler, and Node Auto Provisioning) that provides the building blocks to address the scalability needs of workloads and infrastructure. However, running an efficient platform for today’s dynamic workloads involves more than just ensuring scalability. Factors like cost optimization, capacity availability, the speed at which resources can scale, overall performance, and the flexibility of your infrastructure all profoundly affect and constrain how workload scheduling can be effectively planned on GKE. Honestly, it can get a bit cloudy as to which strategy is best and what the trade-offs are between these parameters.
In this blog we will focus specifically on the GKE scheduler and the factors that can influence its workload placement decisions when capacity constraints exist. We will explore how to plan and design for these scenarios using various GKE features and workload configurations.
It’s (mostly) a constraint optimization problem
At its core, effective workload scheduling in GKE is not about finding the single best solution, but rather navigating a multi-dimensional constraint optimization problem. For many use cases it is less about overcoming strict limitations and more about finding trade-offs between competing priorities:
- Cost: You want to minimize overall infrastructure spend, optimize utilization, avoid over-provisioning, and leverage cost-effective solutions.
- Performance: You want to ensure the workloads that run on the platform can meet their SLOs according to their relative importance to the business.
- Flexibility and agility: You want to be able to react to changes in demand of your workloads by providing the necessary capacity when needed.
Understanding your individual preferences and tolerances across these dimensions is critical to understanding how to navigate this constraint space and how to design and configure your GKE environment.
Core building blocks
While not the only factor, autoscaling and its configuration play a key role in workload scheduling. The configuration of scaling is particular to each environment, and some best practices have been documented. GKE supports autoscaling across four dimensions:
- Horizontal Pod Autoscaler (HPA) – adjusts the number of pod replicas (see the example after this list)
- Vertical Pod Autoscaler (VPA) – adjusts pod resource requests based on actual usage
- Cluster Autoscaler (CA) – automatically adjusts the number of nodes
- Node Auto Provisioning (NAP) – automatically creates and removes node pools based on workload demands
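For illustration, here is a minimal HorizontalPodAutoscaler manifest that scales a Deployment between 2 and 20 replicas based on average CPU utilization. The Deployment name web-frontend and the thresholds are hypothetical; tune them to your own workload.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend        # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add replicas when average CPU usage exceeds 70% of requests
```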
When capacity is a concern, it’s crucial to understand how much resource your workload requests and consumes. The GKE scheduler relies on each pod’s resources.requests values to make optimal scheduling decisions. If requests are not set, pods can be placed incorrectly (e.g., on nodes with not enough capacity), leading to workload instability due to preemption. The importance of setting requests is discussed in more detail here, and a minimal example is shown below.
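As a sketch (the names, image path, and resource values are hypothetical), this Deployment sets explicit requests so the scheduler knows how much CPU and memory each replica needs:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
      - name: app
        image: us-docker.pkg.dev/my-project/images/web-frontend:v1   # hypothetical image
        resources:
          requests:
            cpu: "250m"       # used by the scheduler to find a node with spare capacity
            memory: "512Mi"
          limits:
            memory: "512Mi"   # memory limit guards against node memory pressure
```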
Workload scheduling constraint scenarios
What are good options for running an efficient and performant platform when capacity is constrained?
Let’s take some examples of common scenarios and discuss our options for getting the best result for our workload in terms of cost, performance, and flexibility.
Capacity is fixed or limited – but some high-priority workloads need guaranteed capacity
In this scenario the number of available nodes is considered to be static but the workloads still scale with demand. This creates the need to guarantee resources for critical workloads and explicitly define priority orders.
Solution: Workload priority classes and taints/tolerations
- Priority classes implement a hierarchy of workload importance, where higher-priority workloads take precedence over lower-priority ones during scheduling decisions. As shown in the diagram above, under capacity constraints the scheduler evicts lower-priority (blue) workloads to successfully schedule those with a higher priority (red).
- Taints and tolerations allow capacity targeting by ensuring workloads are not scheduled onto inappropriate nodes. They make sure all the capacity on certain (tainted) nodes (e.g., those with GPUs or SSDs) is only available to specific workloads that tolerate the taint. Even a workload with a higher priority class will not be scheduled on a tainted node unless it carries a matching toleration. A combined sketch follows this list.
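As a minimal sketch (the class name, taint key, and image are hypothetical, and the node pool is assumed to be created with a matching node taint), a PriorityClass marks critical workloads and a toleration lets them land on dedicated GPU nodes:

```yaml
# Hypothetical PriorityClass for business-critical workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-priority
value: 1000000               # higher values are scheduled (and preempt) first
globalDefault: false
description: "Guaranteed capacity for critical workloads"
---
# Pod that uses the priority class and tolerates a hypothetical dedicated=gpu taint
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  priorityClassName: critical-priority
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"     # only pods with this toleration can use the tainted nodes
  containers:
  - name: app
    image: us-docker.pkg.dev/my-project/images/inference:v1   # hypothetical image
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
```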
Applications experience sudden spikes in demand, and workloads need to be scaled quickly without performance degradation / errors
In this scenario we need to quickly schedule workloads on a horizontally scalable cluster. Even though GKE has features like container-optimized compute and image streaming that can drastically reduce provisioning time on new nodes, scheduling pods is still much quicker than scaling nodes. This can lead to resource bottlenecks and a degraded SLO.
Solution: Placeholder pods and scaling profiles
- Placeholder pods, or “balloon” pods, have the effect of holding or reserving spare running capacity in the cluster. When there is a sudden spike and new pods need to be scheduled, the balloon pods are evicted, releasing capacity and allowing the new pods to be scheduled rapidly in their place. The cluster autoscaler then provisions new nodes to accommodate the evicted placeholder pods and to provide more capacity if needed (a minimal sketch follows at the end of this section).
- Autoscaling profiles configure node scale-down behavior based on either cost or performance. There are two cluster-level profiles in GKE: balanced and optimize-utilization. The balanced profile scales down nodes less aggressively than the optimize-utilization profile, meaning nodes are available for longer, so further spikes in demand are not delayed by new node provisioning times.
Workload-specific node scale-down is also available through the use of compute classes (described in more detail below). These allow for node consolidation triggers such as utilization and time delays to customize node lifetimes for different conditions.
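Here is a sketch of the balloon-pod pattern; the class name, replica count, and resource sizes are hypothetical and should be sized to the headroom you want to reserve. A negative-priority PriorityClass combined with a Deployment of pause containers holds capacity that real workloads can reclaim immediately:

```yaml
# Hypothetical negative-priority class so balloon pods are always evicted first
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: balloon-priority
value: -10
preemptionPolicy: Never      # balloon pods never preempt real workloads
globalDefault: false
description: "Placeholder pods that reserve spare capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: balloon-pods
spec:
  replicas: 5                # amount of headroom to hold in the cluster
  selector:
    matchLabels:
      app: balloon
  template:
    metadata:
      labels:
        app: balloon
    spec:
      priorityClassName: balloon-priority
      terminationGracePeriodSeconds: 0
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # does nothing; just holds the requested resources
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
```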
I need to provision additional nodes in my cluster but my preferred node type is not available
In this scenario, we address the need to scale out a cluster without knowing if our required or preferred node type, such as a certain hardware accelerator or spot instance, will be available.
Solution: GKE custom compute classes
- These allow you to specify the preferred and fallback nodes that can be used to scale out your clusters. Priorities can be defined for specific node properties like CPU and accelerators, node characteristics (VM family, minimum CPU/memory, Spot), or specific instance types (e.g., n4-standard-16). Compute classes also support active migration to higher priorities, meaning that workloads will be reconciled to the highest-priority option (e.g., a Spot instance) when it becomes available, if it was not available at deployment time. For users of resource-based committed use discounts (CUDs), compute classes can be configured to prefer committed resources before moving to other resources. To allow for full flexibility between machine families, regions, and even compute platforms, you should also consider moving to flexible CUDs in the future.
For example:

```yaml
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: my-class
spec:
  priorities:
  - machineFamily: n4
    minCores: 16
  - machineType: e2-standard-16
  - nodepools: [pool1, pool2]
  autoscalingPolicy:
    consolidationDelayMinutes: 20
  nodePoolAutoCreation:
    enabled: true
```
My preferred resources are not available in a given region
In this scenario, the capacity required by the workload is very specific and in high demand. There might even be a possibility that the resources cannot be obtained in a region even through compute classes. This is especially important for AI-based workloads that require high-performance infrastructure and GPU or TPU accelerators.
Solution: Multi-Cluster Orchestrator and Multi-Cluster Gateway
- Multi-Cluster Orchestrator is an open-source project whose primary goal is “simplifying multi-cluster deployments, optimizing resource utilization and costs, and enhancing workload reliability, scalability, and performance.” Using this technology in GKE, platform engineers can in effect “capacity chase” across Google Cloud regions: a workload’s capacity requirements are matched with the regions where that capacity is available, and Multi-Cluster Orchestrator then initiates cluster provisioning in that region to run the workload.
- Multi-Cluster Gateway is a networking solution for GKE that leverages the Kubernetes Gateway API to manage application traffic across multiple GKE clusters, potentially spanning different regions. It simplifies the complex task of exposing services and balancing workloads in geographically distributed GKE environments. A minimal sketch follows this list.
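As a rough sketch of the multi-cluster Gateway pattern (the namespace, hostname, service name, and port are hypothetical, and it assumes the gke-l7-global-external-managed-mc GatewayClass with multi-cluster Services enabled on the fleet), a Gateway in the config cluster routes traffic to a ServiceImport backed by clusters in multiple regions:

```yaml
# Gateway deployed in the fleet's config cluster using a multi-cluster GatewayClass
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: external-http
  namespace: store
spec:
  gatewayClassName: gke-l7-global-external-managed-mc
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
# HTTPRoute that sends traffic to a multi-cluster ServiceImport backend
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: store-route
  namespace: store
spec:
  parentRefs:
  - name: external-http
  hostnames:
  - "store.example.com"
  rules:
  - backendRefs:
    - group: net.gke.io
      kind: ServiceImport   # multi-cluster Service exported from member clusters
      name: store
      port: 8080
```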
Conclusion and next steps
GKE offers platform engineers a robust set of tools to optimize resource allocation, even when they face capacity constraints. Effective and holistic capacity planning depends on a clear understanding of the workloads, including their criticality, usage profiles, and capacity requirements. Managing constrained capacity can be a strategic way to control costs, making it crucial to optimize performance under these conditions.
To further enhance your capacity planning, consider the following resources:
- Understand GKE cluster and workload signals such as utilization and rightsizing, and how they are important in capacity planning.
- Monitor scaling events, such as failed pod scheduling events and changes in node and pod counts, in the Unschedulable Pods dashboard template.
- Take a look at the recently released feature — GKE Horizontal Pod Autoscaling Observability Events — which provides the ability to view HPA autoscaler decision events in logs. This can help with the tracking and understanding of scaling event decisions which may influence platform design.
Read More for the details.