2025 11 04

GCP – Upgrading Kubernetes versions just got safer with minor version rollback

Upgrading a Kubernetes cluster has always been a one-way street: you move forward, and if the control plane has an issue, your only option is to roll forward with a fix. This adds significant risk to routine maintenance, a problem made worse as organizations upgrade more frequently for new AI features while demanding maximum reliability. Today, in partnership with the Kubernetes community, we are introducing a new capability in Kubernetes 1.33 that solves this: Kubernetes control-plane minor-version rollback. For the first time, you have a reliable path to revert a control-plane upgrade, fundamentally changing cluster lifecycle management. This feature is available in open-source Kubernetes, and will soon be integrated and generally available in Google Kubernetes Engine (GKE), starting with GKE 1.33.

The challenge: Why were rollbacks so hard?

Kubernetes’ control plane components, especially kube-apiserver and etcd, are stateful and highly sensitive to API version changes. When you upgrade, many new APIs and features are introduced in the new binary. Some data might be migrated to new formats and API versions. Downgrading was unsupported because there was no mechanism to safely revert changes, risking data corruption and complete cluster failure.

As a simple example, consider adding a new field to an existing resource. Until now, both the storage and API progressed in a single step, allowing clients to write data to that new field immediately. If a regression was detected, rolling back removed access to that field, but the data written to it would not be garbage-collected. Instead, it would persist silently in etcd. This left the administrator in an impossible situation. Worse, upon a future re-upgrade to that minor version, this stale “garbage” data could suddenly become “alive” again, introducing potentially problematic and nondeterministic behavior.
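The stale-field scenario above can be illustrated with a toy model. This is only a sketch: the `ToyCluster` class, the `KNOWN_FIELDS` table, and the `new_field` name are hypothetical, and the dictionary merely stands in for etcd. It does not reflect how kube-apiserver or etcd are implemented.

```python
# Toy model of the stale-field problem (illustrative only).
# Hypothetical schemas: the fields each minor version's API serves.
KNOWN_FIELDS = {
    "1.32": {"name", "replicas"},
    "1.33": {"name", "replicas", "new_field"},
}


class ToyCluster:
    def __init__(self, version):
        self.version = version
        self.storage = {}  # stands in for etcd: key -> stored object

    def write(self, key, obj):
        # The API rejects fields the running version does not know.
        unknown = set(obj) - KNOWN_FIELDS[self.version]
        if unknown:
            raise ValueError(f"unknown fields for {self.version}: {sorted(unknown)}")
        self.storage.setdefault(key, {}).update(obj)

    def read(self, key):
        # The API hides fields the running version does not know,
        # but the stored object itself is left untouched.
        stored = self.storage[key]
        return {k: v for k, v in stored.items() if k in KNOWN_FIELDS[self.version]}


cluster = ToyCluster("1.32")
cluster.write("deploy/web", {"name": "web", "replicas": 3})

# One-step upgrade: API and storage move forward together.
cluster.version = "1.33"
cluster.write("deploy/web", {"new_field": "x"})

# Rollback: the field disappears from the API but stays in storage.
cluster.version = "1.32"
after_rollback = cluster.read("deploy/web")
still_stored = "new_field" in cluster.storage["deploy/web"]

# Re-upgrade: the stale value becomes visible again.
cluster.version = "1.33"
after_reupgrade = cluster.read("deploy/web")
```

In this model nothing ever cleans up `new_field` after the rollback, which is the gap the emulated-version approach described next is meant to close.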

The solution: Emulated versions

The Kubernetes Enhancement Proposal (KEP), KEP-4330: Compatibility Versions, introduces the concept of an “emulated version” for the control plane. Contributed by Googlers, this creates a new two-step upgrade process:

Step 1: Upgrade binaries. You upgrade the control plane binary, but the “emulated version” stays the same as the pre-upgrade version. At this stage, all APIs, features, and storage data formats remain unchanged. This makes it safe to roll back your control plane to the previously stable version if you find a problem.

Validate health and check for regressions. The first step creates a safe validation window during which you can verify that it’s safe to proceed. For example, you can make sure your own components and workloads are running healthily under the new binaries and check for performance regressions before committing to the new API versions.

Step 2: Finalize upgrade. After you complete your testing, you “bump” the emulated version to the new version. This enables all the new APIs and features of the latest Kubernetes release and completes the upgrade.

This two-step process gives you granular control, more observability, and a safe window for rollbacks. If an upgrade has an unexpected issue, you no longer need to scramble to roll forward. You now have a reliable way to revert to a known-good state, stabilize your cluster, and plan your next move calmly. This is all backed by comprehensive testing for the two-step upgrade in both open-source Kubernetes and GKE.
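As a rough mental model, the two-step process can be sketched as a small state machine that tracks the binary version and the emulated version separately. The `ControlPlane` class and its method names are hypothetical and exist only for illustration; they are not a Kubernetes or GKE API.

```python
from dataclasses import dataclass


@dataclass
class ControlPlane:
    binary_version: str    # version of the installed control plane binaries
    emulated_version: str  # version whose APIs, features, and storage formats are active

    def upgrade_binaries(self, new_version):
        # Step 1: swap the binaries only; the emulated version stays unchanged.
        self.binary_version = new_version

    def can_roll_back(self):
        # Rollback is possible only inside the validation window, i.e. while
        # the new binaries still emulate the previous version.
        return self.binary_version != self.emulated_version

    def roll_back(self):
        if not self.can_roll_back():
            raise RuntimeError("no pending upgrade to roll back")
        self.binary_version = self.emulated_version

    def finalize(self):
        # Step 2: bump the emulated version, enabling the new APIs and features.
        self.emulated_version = self.binary_version


cp = ControlPlane(binary_version="1.32", emulated_version="1.32")

cp.upgrade_binaries("1.33")
in_window = cp.can_roll_back()      # True: still emulating 1.32
cp.roll_back()                      # regression found, revert to 1.32
reverted = cp.binary_version

cp.upgrade_binaries("1.33")         # retry once the issue is understood
cp.finalize()                       # commit to 1.33
after_finalize = cp.can_roll_back() # False: the upgrade is complete
```

In open-source Kubernetes, KEP-4330 exposes the emulated version through the `--emulated-version` flag on control plane components; how GKE surfaces the two steps is not shown here.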

Enabling this was a major effort, and we want to thank all the Kubernetes contributors and feature owners whose collective work to test, comply, and adapt their features made this advanced capability a reality.

This feature, coming soon to GKE 1.33, gives you a new tool to de-risk upgrades and dramatically shorten recovery time from unforeseen complications.

A better upgrade experience in OSS Kubernetes

This rollback capability is just one part of our broader, long-term investment in improving the Kubernetes upgrade experience for the entire community. At Google, we’ve been working upstream on several other critical enhancements to make cluster operations smoother, safer, and more automated. Here are just a few examples:

Support for skip-version upgrades: Our work on KEP-4330 also makes it possible to enable “skip-level” upgrades for Kubernetes. This means that instead of having to upgrade sequentially through every minor version (e.g., v1.33 to v1.34 to v1.35), you will be able to upgrade directly from an older version to a newer one, potentially skipping one or more intermediate releases (e.g., v1.33 to v1.35). This aims to reduce the complexity and downtime associated with upgrades that span multiple minor versions, making the process more efficient and less disruptive for cluster operators.
Coordinated Leader Election (KEP-4355): This effort ensures that different control plane components (like kube-controller-manager and kube-scheduler) can gracefully handle leadership changes during an upgrade, so that the Kubernetes version skew policy is not violated.
Graceful Leader Transition (KEP-5366): Building on the above, this allows a leader to cleanly hand off its position before shutting down for an upgrade, enabling zero-downtime transitions for control plane components.
Mixed Version Proxy (KEP-4020): This feature improves API server reliability in mixed-version clusters (like during an upgrade). It prevents false “NotFound” errors by intelligently routing resource requests to a server that recognizes the resource. It also ensures discovery provides a complete list of all resources from all servers in a mixed-version cluster.
Component Health SLIs for Upgrades (KEP-3466): To upgrade safely, you need to know if the cluster is healthy. This KEP defines standardized Service Level Indicators (SLIs) for core Kubernetes components. This provides a clear, data-driven signal that can be used for automated upgrade canary analysis, stopping a bad rollout before it impacts the entire cluster.
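To make one item from the list above concrete, the routing idea behind the Mixed Version Proxy can be sketched as follows. This is a simplified illustration, not the KEP-4020 implementation: the server names, the `newresources` resource, and the `route` and `merged_discovery` helpers are all hypothetical.

```python
# Hypothetical API servers mid-upgrade and the resources each one serves.
SERVERS = {
    "apiserver-old": {"pods", "deployments"},
    "apiserver-new": {"pods", "deployments", "newresources"},
}


def route(resource, local_server):
    """Return the server that should handle a request for `resource`."""
    if resource in SERVERS[local_server]:
        return local_server  # serve locally when the resource is known
    for peer, served in sorted(SERVERS.items()):
        if resource in served:
            return peer      # proxy to a peer that recognizes the resource
    return None              # no server knows it: a genuine NotFound


def merged_discovery():
    """Return the union of resources served by all API servers."""
    return sorted(set().union(*SERVERS.values()))
```

With this model, a request for `newresources` that lands on `apiserver-old` is routed to `apiserver-new` instead of failing, and `merged_discovery()` lists resources from both servers.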

Together, these features represent a major step forward in the maturity of Kubernetes cluster lifecycle management. We are incredibly proud to contribute this work to the open-source community and to bring these powerful capabilities to our GKE customers.

Learn more at KubeCon

Want to learn more about the open-source feature and how it’s changing upgrades? Come say hi to our team at KubeCon! You can find us at booths #200 and #1100 and at a variety of sessions, including:

Accelerating Innovation: The Evolution of Kubernetes and the Road Ahead with Jago Macleod (Google)
Upgrade Nightmare To Uptime Dream: The Cloud Provider’s Playbook for Critical Kubernetes Work with Yuchen Zhou (Google) & Uttam Kumar (Salesforce)
Navigating the Multi-Version Kubernetes Universe: How Emulation Version Shapes Your Contributions with Siyuan Zhang (Google) at the Maintainer Summit
GKE Upgrade: A New Era of Safety and Control with Wenjia Zhang (Google) at booth #200

Get started

This is what it looks like when open-source innovation and managed-service excellence come together. This new, safer upgrade feature is coming soon in GKE 1.33. To learn more about managing your clusters, check out the GKE documentation.
