GCP – Sharing details on a recent incident impacting one of our customers
A Google Cloud incident earlier this month impacted our customer, UniSuper, in Australia. Our first priority was to work with the customer to get them fully operational, but soon after the incident started we publicly acknowledged it in a joint statement with the customer.
With our customer’s systems fully up and running, we have completed our internal review. We are sharing information publicly to clarify the nature of the incident and ensure there is an accurate account in the interest of transparency. Google Cloud has taken steps to ensure this particular and isolated incident cannot happen again. The impact was very disappointing and we deeply regret the inconvenience caused to our customer.
Scope of the impact
The impacted technologies and services listed below cover only Google-managed services.
This incident impacted:
One customer in one cloud region.
That customer’s use of one Google Cloud service – Google Cloud VMware Engine (GCVE).
One of the customer’s multiple GCVE Private Clouds (across two zones).
This incident did not impact:
Any other Google Cloud service.
Any other customer using GCVE or any other Google Cloud service.
The customer’s other GCVE Private Clouds, Google Account, Orgs, Folders, or Projects.
The customer’s data backups stored in Google Cloud Storage (GCS) in the same region.
What happened?
TL;DR
During the initial deployment of a Google Cloud VMware Engine (GCVE) Private Cloud for the customer, Google operators using an internal tool inadvertently misconfigured the GCVE service by leaving a parameter blank. This had the unintended, and at the time unknown, consequence of placing the customer’s GCVE Private Cloud on a fixed term, with automatic deletion at the end of that period. The incident trigger and the downstream system behavior have both been corrected to ensure that this cannot happen again.
This incident did not impact any Google Cloud service other than this customer’s one GCVE Private Cloud. Other customers were not impacted by this incident.
Diving Deeper:
Deployment using an exception process
In early 2023, Google operators used an internal tool to deploy one of the customer’s GCVE Private Clouds to meet specific capacity placement needs. This internal capacity-management tool was deprecated in Q4 2023, when the process was fully automated, and is therefore no longer required (i.e. no human intervention is needed).
Blank input parameter led to unintended behavior
Google operators followed internal control protocols. However, one input parameter was left blank when using an internal tool to provision the customer’s Private Cloud. As a result of the blank parameter, the system assigned this parameter a default value, a fixed one-year term, that was unknown to the operators at the time.
After the end of the system-assigned one-year period, the customer’s GCVE Private Cloud was deleted. No customer notification was sent because the deletion was triggered by the parameter left blank by Google operators using the internal tool, not by a customer deletion request. Any customer-initiated deletion would have been preceded by a notification to the customer.
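To make this failure mode concrete, below is a minimal, purely illustrative Python sketch. The class and function names (PrivateCloudRequest, provision_legacy, provision_fixed) and the specifics of the corrected behavior are assumptions for illustration, not Google’s internal tooling: a blank term parameter silently falls through to a fixed default that schedules deletion, while the corrected path refuses to infer a deletion date from an empty field.

```python
# Illustrative sketch only: hypothetical names, not Google's internal tooling.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

DEFAULT_TERM_DAYS = 365  # hypothetical system default applied when the field is blank


@dataclass
class PrivateCloudRequest:
    name: str
    term_days: Optional[int] = None  # operator left this blank in the incident


def provision_legacy(request: PrivateCloudRequest) -> datetime:
    """Old behavior (sketch): a blank term silently becomes a fixed one-year term."""
    term = request.term_days if request.term_days is not None else DEFAULT_TERM_DAYS
    # Deletion is scheduled with no customer notification, because the system
    # treats it as an operator-configured term, not a customer deletion request.
    return datetime.utcnow() + timedelta(days=term)


def provision_fixed(request: PrivateCloudRequest) -> None:
    """Corrected behavior (assumed for the example): never infer a deletion date."""
    if request.term_days is not None:
        raise ValueError("fixed-term deployments are not supported via this path")
    # No expiry is recorded; the private cloud persists until the customer
    # explicitly requests deletion, which would trigger advance notification.


req = PrivateCloudRequest(name="example-private-cloud")  # term left blank
print(provision_legacy(req))  # silently scheduled for deletion ~1 year out
provision_fixed(req)          # corrected path records no deletion date
```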
Recovery
The customer and Google teams worked 24×7 over several days to recover the customer’s GCVE Private Cloud, restore the network and security configurations, restore its applications, and recover data to restore full operations.
This was assisted by the customer’s robust and resilient architectural approach to managing risk of outage or failure.
Data backups stored in Google Cloud Storage in the same region were not impacted by the deletion and, along with third-party backup software, were instrumental in the rapid restoration.
Remediation
Google Cloud has since taken several actions to ensure that an incident like this cannot occur again, including:
We deprecated the internal tool that triggered this sequence of events. This aspect is now fully automated and controlled by customers via the user interface, even when specific capacity management is required.
We scrubbed the system database and manually reviewed all GCVE Private Clouds to ensure that no other GCVE deployments are at risk.
We corrected the system behavior that scheduled GCVE Private Clouds for deletion in such deployment workflows.
Conclusions
There has not been an incident of this nature within Google Cloud prior to this instance. It is not a systemic issue.
Google Cloud services have strong safeguards in place with a combination of soft delete, advance notification, and human-in-the-loop, as appropriate.
We have confirmed these safeguards continue to be in place.
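As a generic illustration of how these safeguards combine, here is a minimal Python sketch of a soft-delete pattern with advance notification and a human approval gate before any hard deletion. The names, the 30-day retention window, and the workflow details are assumptions for the example, not Google Cloud’s actual implementation.

```python
# Illustrative sketch only: a generic soft-delete pattern, not Google's implementation.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Resource:
    name: str
    deleted_at: datetime | None = None   # soft-delete marker; data still recoverable
    notified: bool = False
    hard_delete_approved: bool = False   # human-in-the-loop sign-off


RETENTION = timedelta(days=30)  # assumed recovery window for the example


def soft_delete(resource: Resource, now: datetime) -> None:
    """Mark for deletion and notify; data remains recoverable for RETENTION."""
    resource.deleted_at = now
    resource.notified = True  # advance notification goes out before any hard delete
    print(f"{resource.name}: eligible for hard delete after {now + RETENTION}")


def hard_delete(resource: Resource, now: datetime) -> bool:
    """Destroy data only after notification, the retention window, and human approval."""
    if resource.deleted_at is None or not resource.notified:
        return False
    if now < resource.deleted_at + RETENTION:
        return False
    return resource.hard_delete_approved


r = Resource("example-private-cloud")
soft_delete(r, datetime(2024, 5, 1))
print(hard_delete(r, datetime(2024, 5, 10)))  # False: still inside the recovery window
```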
Closely partnering with customers is essential to rapid recovery. The customer’s CIO and technical teams deserve praise for the speed and precision with which they executed the 24×7 recovery, working closely with Google Cloud teams.
Resilient and robust risk management with fail-safes is essential to rapid recovery in case of unexpected incidents.
Google Cloud continues to have the most resilient and stable cloud infrastructure in the world. Despite this one-time incident, our uptime and resiliency are independently validated as the best among leading clouds.