GCP – Accelerate PostgreSQL migrations to Google Cloud: move terabytes of data with Database Migration Service
For anyone bringing new workloads to Google Cloud, Database Migration Service (DMS) offers a comprehensive way to migrate your data, simplifying the process of transferring it to various Google Cloud databases. Some of the major offerings of DMS include continuous migrations from MySQL, PostgreSQL, and SQL Server into Cloud SQL, and from PostgreSQL to AlloyDB for PostgreSQL. Additionally, DMS helps modernize Oracle workloads by migrating them to Cloud SQL for PostgreSQL and AlloyDB. By leveraging DMS, you can streamline your data migration process, helping to ensure a smooth transition to Google Cloud databases.
In this blog post we will look at various methods to improve the overall speed of migrations for Cloud SQL for PostgreSQL / AlloyDB workloads. Let’s dive in.
The challenges of large database migrations
The primary goal of Database Migration Service is to migrate databases seamlessly with minimal downtime. When dealing with large-scale production workloads, speed of migration is a key factor that determines the overall migration experience. For PostgreSQL databases specifically, slower migration speeds can have side effects like:
A large replication lag that the destination needs to catch up on after replication is set up
Transaction ID wraparound on the source, as vacuum is held back by the long-running copy operations
Increased WAL size, causing high disk usage on the source
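If you want to keep an eye on these side effects while a migration runs, a couple of standard catalog queries on the source give a rough picture. This is a minimal sketch, assuming PostgreSQL 10 or later and that the migration's replication slot already exists:

-- Transaction ID age per database: values creeping toward ~2 billion
-- signal wraparound risk while vacuum cannot make progress.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

-- WAL retained on disk for each replication slot; this backlog is what
-- drives the disk-usage growth mentioned above.
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;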
Speed up migrations
To avoid the above issues, you can fine-tune a few settings to achieve faster migrations. Note that the following settings apply to both Cloud SQL and AlloyDB as the destination. The settings that can be adjusted fall into three categories:
Parallel initial load and change data capture (CDC) using DMS
Configure PostgreSQL database parameters in the source and target
Optimize machine and network configurations
Let us look into these in some detail.
1. Parallel initial load and CDC using DMS
We recently introduced a new DMS feature for faster migrations that uses multiple PostgreSQL subscriptions: it migrates data over parallel connections by setting up multiple pglogical subscriptions between the source and destination databases. With this feature, data is migrated in parallel streams both during the initial data load and during the subsequent CDC phase.
When you use Database Migration Service, either through its UI or with the Cloud SQL APIs, the default option for migrating the data is OPTIMAL, which offers balanced performance with an optimal load on the source database. If you want to further improve the speed of migration, you can choose the MAXIMUM setting, which provides the highest dump speeds for the migration.
Based on the setting you select,
DMS calculates the optimum number of subscriptions (subscription is the receiving side of the pglogical replication setup) per database depending on the database and instance-size information.
Tables are distributed to different replication sets (which provide a mechanism to control which tables in the database will be replicated) based on sizes to balance the replication set sizes across subscriptions.
Data is copied in parallel, with an individual connection per subscription, resulting in a parallel initial copy and parallel CDC.
In our experience, choosing the MAXIMUM option increases migration speed severalfold compared to the MINIMAL or OPTIMAL modes.
Note that while the MAXIMUM setting provides the best speeds, if the source is already under high load, this can affect the performance of your applications. So make sure to observe the source resource usage before choosing this option.
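If you want to confirm how DMS parallelized your migration, you can inspect the subscriptions it created on the destination database. This is a sketch that assumes you can connect to the destination during the migration and that the pglogical extension has been set up there as part of the job:

-- One row per pglogical subscription; each subscription copies its own
-- replication set over a separate connection.
SELECT subscription_name, status, replication_sets
FROM pglogical.show_subscription_status();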
2. Configure PostgreSQL database parameters in the source and target
The following database parameters help optimize the initial load and CDC. Note that the suggestions are given as ranges of values; test and monitor them against your workload and set them accordingly.
Configurations to fine-tune on the target instance
Here is the set of configurations that can be fine-tuned on the destination database.
max_wal_size: Set this in the range of 20GB–50GB
The system parameter max_wal_size determines the maximum size the WAL can grow during automatic checkpoints. A higher value reduces checkpoint frequency, which in turn leaves more resources for the migration processes. The default max_wal_size can cause checkpoints to occur every few seconds during the DMS load. To avoid this, set max_wal_size between 20GB and 50GB depending on the machine tier. Setting it higher helps overall migration speed, especially during the initial load. For AlloyDB this parameter does not need to be changed, as AlloyDB manages checkpoints automatically. Once the migration is complete, make sure to change the value back to meet the requirements of your production workload.
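To check whether checkpoints really are firing too often during the load, you can watch the checkpoint counters on the destination. A small sketch, assuming PostgreSQL 16 or earlier, where these counters live in pg_stat_bgwriter:

-- A fast-growing checkpoints_req count during the DMS load usually means
-- max_wal_size is being reached and checkpoints are being forced.
SELECT checkpoints_timed, checkpoints_req, stats_reset
FROM pg_stat_bgwriter;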
pglogical.synchronous_commit: Set this to off
Setting pglogical.synchronous_commit to off means the commit acknowledgment can be returned before the WAL records are flushed to disk. The WAL flush does not happen immediately; instead it relies on the wal_writer_delay setting. This is generally called an asynchronous commit, which makes commits faster during CDC when DML changes are applied, but it comes at the cost of durability: if the PostgreSQL instance crashes, the last few asynchronous commits might be lost.
wal_buffers: Set 32–64MB on 4 vCPU machines, 64–128MB on 8–16 vCPU machines
wal_buffers sets the amount of shared memory used for WAL data that has not yet been written to disk. The goal is to reduce the frequency of WAL writes during the initial load: smaller wal_buffers force more frequent flushes, so increasing the value helps the initial load. It can be set to 256MB for targets with higher vCPU counts.
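Taken together, here is a minimal sketch of how the three settings above could be applied on a self-managed destination. The values are illustrative; on Cloud SQL and AlloyDB you would set the equivalent database flags through the console or gcloud instead, and pglogical must already be loaded for its parameter to be recognized:

-- Illustrative values only; tune per machine tier and workload.
ALTER SYSTEM SET max_wal_size = '32GB';
ALTER SYSTEM SET pglogical.synchronous_commit = off;
ALTER SYSTEM SET wal_buffers = '64MB';   -- only takes effect after a restart
SELECT pg_reload_conf();                 -- picks up the reloadable parameters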
maintenance_work_mem: Suggested value of 1GB, or the size of the biggest index if possible
PostgreSQL maintenance operations like VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY use a dedicated memory area governed by maintenance_work_mem. These operations are performed sequentially by the database. DMS migrates the initial load data first and then rebuilds indexes and constraints on the destination before the CDC phase starts, so maintenance_work_mem helps allocate adequate memory for building these constraints. Set this value much higher than the default of 64MB; in past experiments, setting it to 1GB gave good results. If possible, the ideal value is close to the size of the largest index that needs to be recreated on the destination. Once the migration is complete, you can reset this parameter back to its default value so that application query processing is not impacted.
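To pick a value close to the largest index, you can check index sizes on the source ahead of the migration. A sketch using the standard statistics views:

-- Largest user indexes on the source; the top entry is a reasonable guide
-- for sizing maintenance_work_mem on the destination.
SELECT indexrelid::regclass AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;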
max_parallel_maintenance_workers: Proportional to CPU count
DMS migrates data first and then recreates secondary indexes on the destination using pg_restore. DMS chooses an optimal parallel configuration for the --jobs parameter based on the destination's machine configuration. To further speed up CREATE INDEX calls, the max_parallel_maintenance_workers parameter can be set on the destination for parallel index creation. The default setting is 2, and it can be increased based on the CPU count and memory of the destination instance. Once the migration is complete, you can reset this parameter back to its default value so that application query processing is not impacted.
max_parallel_workers: Set this proportional to max_worker_processes
You can increase the maximum number of workers the system can support for parallel operations using the max_parallel_workers flag. The default value is 8. Setting this higher than max_worker_processes has no effect, since parallel workers are taken from the pool of worker processes established by that setting. Make sure max_parallel_workers is equal to or greater than max_parallel_maintenance_workers.
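As a rough sketch, the three worker settings could look like this on a 16 vCPU destination. The values are illustrative; on Cloud SQL these are set as database flags, and max_worker_processes changes need a restart:

-- Parallel workers come out of the max_worker_processes pool, so keep
-- max_parallel_maintenance_workers <= max_parallel_workers <= max_worker_processes.
ALTER SYSTEM SET max_worker_processes = 16;   -- restart required
ALTER SYSTEM SET max_parallel_workers = 16;
ALTER SYSTEM SET max_parallel_maintenance_workers = 8;
SELECT pg_reload_conf();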
autovacuum: Set it to 'off'
If there is a lot of data to catch up on the destination during the CDC phase, temporarily turn off autovacuum in the destination until there is minimal replication lag. You can do a one-time manual vacuum before the promotion of an instance with max_parallel_maintenance_workers=4 (set it to the number of vCPUs of the Cloud SQL instance) and maintenance_work_mem=10GB or higher to speed up the vacuum. Note that manual vacuum takes memory from maintenance_work_mem. Make sure to turn on autovacuum post-migration.
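A sketch of that one-time manual vacuum before promotion, run against the destination once the replication lag is minimal (values are illustrative and follow the suggestions above):

-- Session-level settings so the manual vacuum gets more memory and workers.
SET maintenance_work_mem = '10GB';
SET max_parallel_maintenance_workers = 4;   -- roughly the instance's vCPU count

-- One-time vacuum (with fresh statistics) before promoting the instance.
VACUUM (ANALYZE, VERBOSE);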
Configurations to fine-tune on the source instance
Finally, consider the following configurations to fine tune on the source instance:
shared_buffers: Set to a maximum of 60% of RAM
The shared_buffers parameter governs the amount of memory allocated by the database server for shared memory buffers. To enhance initial load performance and allow more memory for SELECTs to be buffered in memory, it’s recommended to increase shared_buffers to a maximum of 60% of the available RAM in the source PostgreSQL database.
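A sketch for a self-managed source with, say, 64GB of RAM (shared_buffers only takes effect after a restart; on a managed source you would set the corresponding flag instead):

-- Roughly 60% of 64GB of RAM; requires a restart to take effect.
ALTER SYSTEM SET shared_buffers = '38GB';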
3. Optimize machine and network configurations
Another important aspect that plays a key role in faster migrations is the machine and network configuration. Larger destination and source configurations (more RAM, CPU, and disk IO) help migrations complete faster.
Here are some ways to achieve this:
When migrating with DMS, consider using a larger machine tier for the destination instance. Once the migration is done and before promoting the instance, you can downgrade the machine to a smaller tier. Note that you'll need to restart the machine in this case, but since this happens before the instance is promoted, in most cases it won't impact source downtime.
Network throughput is limited by the number of vCPUs: every VM has a network egress cap on write throughput that depends on its machine type. The maximum disk throughput (0.48MBps per GB) is limited by the network egress throughput of the VM, and disk IOPS scale at 30 IOPS/GB. For example, a 1,000GB disk can deliver roughly 480MBps of throughput and 30,000 IOPS, subject to the VM's egress cap. Try to select target Cloud SQL instances with larger vCPU counts, and allocate enough disk to get the throughput and IOPS you need.
In our experiments, migrations over private IP were about 20% faster than migrations over public IP.
Don’t size the initial storage based on the source database size only; take into account the migration workload’s throughput and IOPS requirements.
The number of parallel threads for the index rebuild in the target database is determined by the number of vCPUs of the target Cloud SQL instance. (DMS defers the creation of secondary indexes and constraints until after the initial load, but before CDC begins.)
Limitations and closing thoughts
If the source database has a single large table that holds the majority of the data being migrated, you may not see significant speed improvements with DMS. This is because the current parallelism is at the table level, a limitation of pglogical, so data within a single table cannot be copied in parallel. This will be addressed in upcoming releases.
Don’t enable automated backups during migration. In addition, don’t perform DDLs on the source if possible; DDLs are not supported for replication. Refer to the documentation for more details.
Optimizing DMS migrations involves fine-tuning configurations on both the source and destination instances, leveraging optimal machine and network configurations, and monitoring the workflow at various stages. By following recommended best practices and addressing potential limitations, you can achieve faster and more efficient DMS migrations. To get started, refer to the DMS quickstart guides for migrations to AlloyDB and Cloud SQL.