GCP – How chaos testing adds extra reliability to Spanner’s fault-tolerant design
One of the secrets behind Spanner’s reliability is the team’s extensive use of chaos testing, the process of deliberately injecting faults into production-like instances of the database. Although engineers focus on testing the “happy path,” most software bugs occur when things go wrong. Given Spanner’s complex architecture and constantly evolving codebase, it is inevitable that bugs will be introduced. Here, we give an overview of the types of chaos testing we employ and the kinds of bugs it finds.
A fault-tolerant design foundation
Spanner is built from “mostly reliable” components including machines, disks, and networking hardware that have a low rate of failure. Even so, bad things happen: bad memory and disks may lead to data corruption; file accesses may yield transient or permanent errors or corruption; or network connectivity within or between data centers may be throttled or lost altogether. Worst of all, software bugs sometimes produce correlated failures in all servers running the same version of the code.
Since both correctness and availability are critical, Spanner uses principles of fault-tolerant design to mask failures of these components and achieve high reliability for the service. For example, checksums are used to detect data corruption at many levels. Spanner tablets, which store a fragment of the database, are replicated across three or (usually) more data centers and the reads and writes use Paxos to achieve consensus and consistency of the distributed state. Checksums are also used to detect corruption of a tablet replica. The data for these tablets is stored in files, and the file system keeps multiple copies of the data blocks within the data center, using checksums to detect corrupted blocks. Finally, we proceed cautiously when rolling out new software versions, alerting on any anomalies that may be caused by a new bug.
Upping reliability with chaos testing
We run over a thousand system tests per week to validate that Spanner’s design and implementation actually mask faults and provide a highly reliable service. Each test creates a production-like instance of Spanner comprising hundreds of processes running on the same computing platform and using the same dependent systems (e.g., file system, lock service) as production Spanner. Most tests run for between one and 24 hours and execute tens or hundreds of thousands of transactions.
Actual faults in production occur at a very low rate. To cover Spanner’s error-handling and fault-tolerance mechanisms, we inject faults (e.g., file and network errors) at a much higher rate in these system tests.
If these faults uncover bugs, the test fails in one of several ways:
A read or query on the database does not return the expected result. Being able to compute the expected result of a randomly generated read/query on a database populated with randomly generated data is a challenging problem. Spanner’s strong consistency model is the key to validate read/query results efficiently: each transaction records a log summarizing its effects, and subsequent transactions can replay these logs to compute the state they should observe. We describe this in further detail in an earlier article.
A Spanner API call returns an unexpected error.
A Spanner server crashes in an unexpected way. Some of the faults we inject will cause a server to crash, but we filter these and fail the test only if some new unexpected crash occurs.
One of Spanner’s internal consistency checkers reports a problem. Checkers verify that:
Files are not leaked (like Unix fsck, but on the distributed file system)
Secondary indexes are consistent with the tables they index
Declared checks and foreign key constraints are satisfied
All replicas of a tablet are equal
Let’s take a look at the kinds of faults that we inject when chaos testing Spanner.
1. Server crashes
One of the most basic faults we inject is to force a server to crash abruptly (e.g., via a SIGABRT Unix signal). This simple fault causes lots of complex failure recovery logic to be executed:
Servers use a disk-based log to protect against the loss of their in-memory state, thus crashing exercises the logic that recovers the state of all the tablets that were on the crashed server from their logs.
All distributed transactions being coordinated by the crashed server must abort and be restarted since the locks are kept in memory.
Clients that were pulling data from the crashed server via reads and/or queries are forced to fail over to another replica. The client must resume the operation without starting again at the beginning, and without losing or duplicating any results.
The restart logic is quite complex and we even trigger restarts without server crashes to exercise it at various points in the streaming of the results.
2. File faults
Spanner servers store their persistent data in the Colossus file system. In system tests, we intercept all calls to this file system and randomly inject various types of faults:
Error codes: Force file system calls (e.g. Open, Close, Read, Write) to return an error code to the server. Some codes indicate transient errors that may be retried, while others represent permanent errors.
Corrupt content: Corrupt the content read/written by a Read/Write call. Checksums should detect these.
Blackhole the request: Sometimes the file system in a data center is disabled for maintenance. This will cause file system calls to hang (not return) until the file system is re-enabled. Sometimes the file system is made read-only, in which case only writes will hang. Spanner must detect this and fail over to use an alternate replica in a different data center.
For example, through chaos testing, we found a bug involving the interaction of Spanner tablet compaction and the Colossus storage layer. Spanner tablets are stored using log-structured merge trees, with an in-memory table plus a set of Colossus files. To bound the number of files, tablets are periodically compacted by merging several files into one. The test randomly injected a rare Colossus error code and discovered that the compaction code treated it as meaning end-of-file, resulting in random data loss. Tests like these ensure that subtle data corruption issues that arise from complex system interactions are very unlikely to reach production.
3. RPC faults
Google’s Remote Procedure Call (RPC) system allows outgoing RPCs to be intercepted and manipulated in various ways. This mechanism can be used to inject a wide variety of faults:
Delays can be inserted, triggering timeouts.
Errors, either transient or permanent, can be inserted, exercising the error handling code of a variety of services/APIs.
RPCs to specific (simulated) data centers can be blocked, simulating a network partition.
RPCs to specific dependent systems can be blocked or made to return errors, simulating what would happen if that system went down. For example, Spanner’s data files are encrypted, and the keys for these files are fetched via RPC from a separate key service that runs in each data center. An outage in that service will effectively bring down the replicas in the data center, though it should not prevent clients from failing over to other healthy replicas.
RPCs having a specific network priority can be dropped to simulate the effect of network throttling. When cross data center links become saturated, lower priority packets are dropped. Some RPCs are more critical than others, and this test ensures that critical RPCs have the proper network priority.
4. Memory/quota faults
When servers run low on memory, they enter a state called pushback, which should cause clients to redirect load to other (less busy) replicas. We force the servers into this state to test this behavior, and we ensure the server does not get stuck in this state. We also occasionally force a server to fail by leaking enough memory to cause the container to kill it, making sure this messy situation is handled cleanly.
Spanner enforces quotas on disk space, memory, and flash storage per user. When these limits are reached, operations fail with a special “quota exceeded” error. We inject these to ensure they are handled properly throughout the stack.
5. Cloud faults
Access to Spanner from the Google Cloud Platform is mediated by Spanner API Front End Servers, which proxy requests coming into Google Cloud through Google front ends to a Spanner database. External clients open sessions with the Spanner database and execute transactions on these sessions. For Spanner, we crash the Spanner API frontend servers, which forces sessions to migrate to other Spanner API frontend servers. This should not be visible to the client (besides some additional latency).
6. Regional outages
The largest faults we simulate in system tests are outages of an entire region, forcing Spanner to serve data from a quorum of other regions. The majority of our system tests simulate several kinds of regional outages, triggered either by file system or network outages, and we verify Spanner continues to serve. This resilience is a property of the Paxos algorithm, which guarantees progress as long as a quorum (2 of 3, or 3 of 5) of replicas remain healthy.
Spanner earns its reputation for reliability
Spanner is fault tolerant by design. We continuously validate Spanner’s reliability by running many large-scale randomized system tests that employ chaos testing.
You can learn more about what makes Spanner unique and how it’s being used today. Or try it yourself for free for 90-days or for as little as $65 USD/month for a production-ready instance that grows with your business without downtime or disruptive re-architecture.
Read More for the details.