GCP – Introducing BigQuery soft failover: Greater control for disaster recovery testing
Most businesses with mission-critical workloads have a two-fold disaster recovery solution in place that 1) replicates data to a secondary location, and 2) enables failover to that location in the event of an outage. For BigQuery, that solution takes the shape of BigQuery Managed Disaster Recovery. But the risk of data loss while testing a disaster recovery event remains a primary concern. Like traditional “hard failover” solutions, a hard failover in BigQuery Managed Disaster Recovery forces a difficult choice: promote the secondary immediately and risk losing any data within the Recovery Point Objective (RPO), or delay recovery while you wait for a primary region that may never come back online.
Today, we’re addressing this directly with the introduction of soft failover in BigQuery Managed Disaster Recovery. Soft failover logic promotes the secondary region’s compute and datasets only after replication has been confirmed to be complete, providing you with full control over disaster recovery transitions, and minimizing the risk of data loss during a planned failover.
Figure 1: Comparing hard vs. soft failover
Summary of differences between hard failover and soft failover
| | Hard failover | Soft failover |
| --- | --- | --- |
| Use case | Unplanned outages, region down | Failover testing, requires primary and secondary to both be available |
| Failover timing | As soon as possible ignoring any pending replication between primary and secondary; data loss possible | Subject to primary and secondary acquiescing, minimizing potential for data loss |
| RPO/RTO | 15 minutes / 5 minutes* | N/A |
*Supported objective depending on configuration
BigQuery soft failover in action
Imagine a large financial services company, “SecureBank,” which uses BigQuery for its mission-critical analytics and reporting. SecureBank requires a reliable Recovery Time Objective (RTO) and a 15-minute Recovery Point Objective (RPO) for its primary BigQuery datasets, as robust disaster recovery is a top priority. They regularly conduct DR drills with BigQuery Managed DR to ensure compliance and readiness for unforeseen outages.
Before the introduction of soft failover in BigQuery Managed DR, SecureBank faced a dilemma over how to perform its DR drills. While BigQuery Managed DR handled the failover of compute and associated datasets, conducting a full “hard failover” drill meant accepting the risk of up to 15 minutes of data loss if replication wasn’t complete when the failover was initiated, or significant operational disruption if they first manually verified data synchronization across regions. This often led to less realistic or more complex drills, consuming valuable engineering time and causing anxiety.
New solution:
With soft failover in BigQuery Managed DR, administrators have several options for failover procedures. Unlike hard failover for unplanned outages, soft failover initiates failover only after all data is replicated to the secondary region, to help guarantee data integrity.
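To make the contrast concrete, here is a minimal toy model of the two modes. It is a sketch only and does not call any BigQuery API: `ReplicaPair`, `hard_failover`, `soft_failover`, `batch_size`, and `max_rounds` are hypothetical names invented for illustration, and replication is simulated by draining a counter of pending writes.

```python
from dataclasses import dataclass


@dataclass
class ReplicaPair:
    """Toy stand-in for a primary/secondary region pair (not a BigQuery type)."""
    primary_available: bool
    secondary_available: bool
    pending_writes: int  # writes on the primary not yet replicated


def hard_failover(pair: ReplicaPair) -> int:
    """Promote the secondary immediately; return the number of writes lost."""
    if not pair.secondary_available:
        raise RuntimeError("secondary region is unavailable")
    lost = pair.pending_writes  # anything still pending is dropped
    pair.pending_writes = 0
    return lost


def soft_failover(pair: ReplicaPair, batch_size: int = 100,
                  max_rounds: int = 1000) -> int:
    """Promote only after pending writes are drained; return writes lost (0)."""
    if not (pair.primary_available and pair.secondary_available):
        raise RuntimeError("soft failover requires both regions to be available")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    rounds = 0
    while pair.pending_writes > 0:
        if rounds >= max_rounds:
            # Refuse to promote rather than risk losing data.
            raise TimeoutError("replication did not complete; not promoting")
        # Simulate one replication round copying up to batch_size writes.
        pair.pending_writes -= min(batch_size, pair.pending_writes)
        rounds += 1
    return 0  # promotion happens with nothing left to replicate
```

In this model, calling `hard_failover` on a pair with 250 pending writes reports 250 lost writes, while `soft_failover` on the same pair drains them first and reports none. It also refuses to run when the primary is down, mirroring the requirement that both regions be available for a soft failover.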
Figure 2: Soft Failover Mode Selection
Figure 3: Disaster recovery reservations
Figure 4: Replication status / Failover details
The BigQuery soft failover feature is available today via the BigQuery UI, DDL, and CLI, providing enterprise-grade control for disaster recovery, confident simulations, and compliance, without risking data loss during testing. Get started today to maintain uptime, prevent data loss, and test scenarios safely.