GCP – Testing Cloud SQL failover: Where to begin
One of the key considerations of a high availability setup for Cloud SQL is its ability to failover from a primary instance to a standby instance if the primary instance becomes unresponsive. It might be a paradigm shift to think about purposely initiating a failover and watching what happens, but it can be tremendously helpful to do this sort of testing for your database to ensure that your application is resilient in the event of a zonal failover, whether for general performance testing or specific preparations for a peak event. Additionally, Regional Persistent Disks (RePD) used with your Cloud SQL instance will synchronously replicate data at the block level between two zones in a region, which means that all writes are automatically made to the standby instance in addition to the primary instance. This allows for reduced downtime and higher availability, but also brings an additional need to test failover recovery times and resiliency to ensure that your application connects to the backup instance as smoothly as possible. Let’s examine some of the key metrics to monitor when testing failover to optimize your application’s performance, including the number of database connections, queries per second, CPU and memory utilization of the instance, read/write IOPS, and peak replication lag.
How do I actually failover test my Cloud SQL instance?
Step 1: Select an instance for testing
-
Take an inventory of your Cloud SQL instances and start with an instance that reflects your average workload. Don’t start out by testing your largest instance. Instead, choose an instance that is indicative of your environment’s most typical machine size and specifications, including number of tables and records.
-
Use an instance that will be able to simulate production loads so that your testing results are meaningful.
-
Make note of the instance’s region location as well — it is a good idea to repeat the failover testing process in all of your key regions.
Step 2: Decide how you will run your test
-
It’s important to understand the specific data inputs and scenarios you want to test during failover. For example, you will probably want to test failover with different load sizes to understand any variations in behavior.
-
It may help to think of it in terms of T-shirt sizing — consider your average workload, and test failover with a small, medium and large load comparative to that. While failover testing is different from load testing, it is important to make sure that you observe failover behavior under a range of different loads.
-
Examples of variations to failover test:
-
Non-peak load
-
Peak load
-
Instances in different regions
-
A range of machine sizes
-
Different workloads, e.g., varying reads, writes, inserts, or updates
Step 3: Conduct the failover test and get ready to observe
-
Set aside enough time to conduct the failover test from start to finish and observe the results. It would be best to allow for several hours to fully capture important behaviors and resulting metrics leading up to, during, and post-failover tests.
-
Capture baselines for your key metrics before testing failover so that you know what to compare.
-
You can initiate manual failover from the GCP console or from the command line.
-
Example metrics to capture:
-
General:
-
Number of database connections – you can control the number of connections, such as by automatically shutting down certain processes before initiating failover (e.g., database/network/connections)
-
Queries per second (QPS)
-
CPU utilization of the instance (e.g., database/cpu/utilization)
-
Memory utilization of the instance (e.g., database/memory/utilization)
-
Read/write IOPS (e.g., database/disk/read_ops_count, database/disk/write_ops_count)
-
Peak replication lag (e.g., MySQL Seconds_Behind_Master metric or Postgres replica_byte_lag)
-
MySQL:
-
MySQL global statistics
-
MySQL undo space
-
InnoDB dirty pages (e.g., database/mysql/innodb_buffer_pool_pages_dirty)
-
InnoDB free pages (e.g., database/mysql/innodb_buffer_pool_pages_free)
-
MySQL slow query log counts
-
Postgres:
-
Partitioned tables, tablespaces
-
Shared blocks cache access (e.g., database/postgresql/insights/aggregate/shared_blk_access_coun)
-
Effective cache size
-
VACUUM operations
-
Overall results:
-
Time duration of failover
-
Time duration of recovery
-
Overall database/application performance (e.g., for both failover and failback)
Step 4: Consider your lessons learned
-
What logs or metrics were useful to have? Are there any additional logs or metrics that could be set up in Google Cloud’s operations suite that would enhance your understanding of your database instance or application’s behavior?
-
How did your read replica handle the failover traffic overflow? Could you create additional read replicas to handle load from the primary instance?
-
Do you observe any differences in your application’s behavior for automatic failover compared to manual failover?
-
Zonal failover in a high-availability instance from the primary to a standby instance is different from cross-regional disaster recovery. Consider testing regional failover as part of your DR strategy as well.
Additional general performance tips
-
Internal database caching can be crucial for read performance. Whether using Cloud SQL for MySQL or Postgres, be sure you understand how your instance is caching data and reading from that cached data. For example, the way MySQL’s innodb engine caches the buffer pool.
-
Generally, choose an instance size that allows for plenty of breathing room for CPU, RAM, and disk. The larger the disk size, the higher the IOPS capability.
-
Similarly, if you are running a lot of CPU-intensive queries on your instance, such as sorting, regexes, etc, you will need to utilize a larger machine size.
-
Understand how your application responds to lost connections. You can also test this by restarting your instance.
Try this out on Cloud SQL in the console or read the documentation for the Cloud SQL insights that can help you troubleshoot performance issues in such scenarios.
Read More for the details.