Blazing-fast Cloud Storage uploads and downloads with client libraries
Data-intensive applications such as analytics and AI/ML are some of the fastest-growing workloads on Google Cloud Storage, but ensuring high throughput for these workloads can be a challenge. The Cloud Storage client library transfer manager helps maximize throughput for your workloads by adding functionality to key client libraries that parallelizes uploads and downloads.
The new transfer manager module uses multiple workers, in threads or processes, to maximize throughput. While command-line interfaces to Cloud Storage (e.g., gcloud storage) automatically parallelize both uploads and downloads where appropriate, fully managed parallelism was not available in Cloud Storage client libraries until recently.
The transfer manager module is generally available for Java, Node.js, and Python, and is in preview for Go. Support for additional languages is in development. In this post, we’ll share some examples of how Cloud Storage client library transfer manager features can improve performance over a sequential model for media operations, in many cases quite dramatically.
How the client library performs parallel operations
The Cloud Storage client library’s transfer manager can run concurrent operations on multiple files at once, instead of looping over the files one by one. In addition to methods that accept file-blob pairs, the transfer manager also includes methods that let you conveniently upload or download entire folders at once. See the documentation in the “Getting started” section below for more details.
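To make the contrast with a sequential loop concrete, here is a minimal sketch that uses only the Python standard library. It is not the transfer manager's implementation: `transfer_one` is a hypothetical stand-in for a single upload or download, and the file and blob names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer_one(pair):
    """Hypothetical stand-in for a single upload or download.

    A real implementation would call the storage client here; this one
    only reports what it would have moved.
    """
    filename, blob_name = pair
    return f"{filename} -> {blob_name}"


def transfer_sequential(pairs):
    """Baseline: handle the file-blob pairs one by one."""
    return [transfer_one(pair) for pair in pairs]


def transfer_concurrent(pairs, max_workers=8):
    """Hand the same pairs to a pool of worker threads.

    executor.map returns results in input order, even though the
    individual transfers may finish in any order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(transfer_one, pairs))


# Illustrative file-blob pairs (assumed names, not real objects).
pairs = [(f"local/file{i}.txt", f"remote/file{i}.txt") for i in range(4)]
results = transfer_concurrent(pairs, max_workers=2)
```

Both functions return the same results; the concurrent version simply overlaps the waiting time of the individual transfers, which is where the gains come from when each transfer is dominated by network latency.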
For workloads that involve large files, the transfer manager provides a “divide-and-conquer” strategy that shards the data in a file and concurrently transfers all the shards. Sharded downloading is implemented with ranged reads. Depending on which client library you use, upload operations utilize either the XML multipart upload API or the gRPC compose API.
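The sharding idea behind ranged reads can be sketched in a few lines. This is an illustration under stated assumptions rather than the library's code: an in-memory `bytes` object stands in for the remote object, and `read_range` is a hypothetical placeholder for a ranged read over the network.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_ranges(total_size, chunk_size):
    """Split [0, total_size) into (start, end) byte ranges, end exclusive."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_size))
        for start in range(0, total_size, chunk_size)
    ]


def read_range(source, start, end):
    """Hypothetical ranged read; a real client would request these bytes
    from the object over the network."""
    return source[start:end]


def sharded_download(source, chunk_size, max_workers=4):
    """Fetch every shard concurrently and reassemble them in order."""
    ranges = shard_ranges(len(source), chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map preserves the order of ranges, so joining restores the object.
        shards = executor.map(lambda r: read_range(source, *r), ranges)
        return b"".join(shards)


data = bytes(range(256)) * 10  # 2,560 bytes standing in for a large object
copy = sharded_download(data, chunk_size=1000)
```

The last shard is usually shorter than the others, which is why each range is clamped to the object size.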
Performance benefits
You can configure parallelism to accommodate different operating environments and workloads, and the performance impact of the transfer manager varies accordingly. Whether the parallelism uses threads, processes, or coroutines depends on the programming language.
Switching from ordinary transfers to transfers performed by the transfer manager will have the greatest impact in your application when a lot of data needs to be moved at once. The more there is to transfer in terms of the number of objects, the size of objects, or both, the more your application will benefit.
For example, when downloading a large number of files smaller than 16 KB using 64 workers on a c3-highcpu-8 Compute Engine instance, the Python library's transfer manager module achieved a 50x throughput improvement over a single-worker solution! Testing showed that large numbers of workers are most effective for very small files. While this example uses a fairly extreme number of workers for a relatively small instance, a smaller number of workers still delivers a significant performance improvement.
On the same instance, when moving larger files of 64 MB using only 8 workers, the Cloud Storage client library transfer manager increased throughput by 4.5x from a much higher initial baseline. Sharded uploads and downloads with chunk sizes in the 32 to 64 MB range performed similarly.
The optimal configuration for throughput improvements on a given workload varies depending on a number of factors including networking latency, CPU type, and memory. For example, Compute Engine instances have different networking configurations, as well as different CPU and memory resources. Likewise, accessing Cloud Storage from outside of Compute Engine imposes radically different constraints on network throughput and round-trip time.
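Because the best setting depends on the environment, it can help to measure a few worker counts before settling on one. The sketch below is one possible way to do that with the standard library. `simulated_transfer` and its 10 ms latency are assumptions that mimic a network round trip; swap in a real upload or download to get meaningful numbers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_transfer(_item, latency_s=0.01):
    """Hypothetical stand-in for one transfer: sleeps to mimic a network
    round trip. Replace with a real upload or download when measuring."""
    time.sleep(latency_s)
    return 1


def measure(items, worker_counts, transfer=simulated_transfer):
    """Return {workers: transfers per second} for each worker count."""
    rates = {}
    for workers in worker_counts:
        started = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as executor:
            done = sum(executor.map(transfer, items))
        elapsed = time.perf_counter() - started
        rates[workers] = done / elapsed
    return rates


rates = measure(items=range(16), worker_counts=[1, 4, 8])
for workers, rate in rates.items():
    print(f"{workers} workers: {rate:.0f} transfers/s")
```

The simulated numbers say nothing about Cloud Storage itself; the point is the shape of the experiment, which should be repeated on the actual instance type, network path, and object sizes of the workload.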
Getting started
To get started with the Cloud Storage client library transfer manager, refer to our code samples for some major use cases, and the API reference documentation for each client library:
Sample code for downloading multiple files
Sample code for uploading multiple files
Sample code for downloading large files in chunks concurrently
Sample code for uploading large files in chunks concurrently (for Python and Node.js)
Sample code for uploading large files in chunks concurrently (for Java)
Documentation for the Python client library
Documentation for the Node.js client library
Documentation for the Java client library
Documentation for the Go client library (in preview)
Whether our new client library features solve problems for you or could use improvement, we are eager to hear your feedback. Please reach out via the “Send Feedback” button on the Cloud Storage Client Libraries documentation page, or by filing a GitHub issue on any client library repo.