GCP – Five tips and tricks to improve your AI workloads
Recently, we announced Gemini Code Assist for individuals, a free version of our AI coding assistant. Technology that was previously available only to the biggest enterprises is now within reach for startups and individual developers. The same applies to AI/ML-related infrastructure like powerful GPUs, specialized TPUs, and highly efficient storage solutions.
But even as these resources become more accessible, they can still be quite expensive, so it's always worth looking for ways to optimize large AI workloads and reduce costs. In this article, we'll share five tips for optimizing your workloads on Google Cloud Platform.
Disclaimer: Not every idea fits every use case. These are not official recommendations.
1. Research different platforms to run your training/inference
A few years ago, if you wanted to train, tune, or use any AI model, you had to manually set up a cluster of GPU- or TPU-powered machines, orchestrate the whole training pipeline, and carefully manage all the resources consumed by this process. You could make it a bit easier by using Kubernetes, but apart from that, there were no widely available services to make your AI work easier. This generated additional costs: not only did you have to pay for the hardware you used, but also for the time it took to manage all this infrastructure.
Luckily, this is no longer the case. Google Cloud offers a wide range of solutions that can help you run your jobs, from fully managed to fully customizable – just take your pick. Here's a quick summary:
- Vertex AI is a fully managed, unified AI development platform. Whether it's training, tuning, or inference, all of it can be done through a simple web interface. You can save a lot of hours and many headaches by letting Google manage all the infrastructure your workloads require. Additionally, you only pay for what you use – no more costs generated by idle GPUs waiting for the next task. (See the sketch after this list for a minimal example.)
- Cloud Run now offers an option to run your containers on GPU-equipped machines. This is a great way to set up a scalable, fully managed inference service without having to learn a new platform.
- Cloud Batch also offers access to GPUs – a great option for long-running tasks like training or tuning your AI models. Cloud Batch takes care of provisioning all required infrastructure, retrying jobs that hit an error, and releasing the resources once the job is done. The auto-retry feature combined with Spot VMs can significantly reduce the cost of your workloads.
- Google Kubernetes Engine (GKE) lets you keep control over your infrastructure while handling the provisioning, setup, and management for you. It's a great solution for organizations that already use Kubernetes and have the experience required to fully benefit from the control it provides.
- Google Compute Engine (GCE) sits at the opposite end of the "hands-off" scale from Vertex AI. With direct access to GPU- or TPU-equipped virtual machines, you gain full control over every aspect of the workflow.
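To make the Vertex AI option more concrete, here is a minimal sketch of submitting a custom training job with the Vertex AI Python SDK. The project, region, staging bucket, container image, and accelerator choice are placeholders to adapt to your own setup.

```python
# pip install google-cloud-aiplatform
from google.cloud import aiplatform

# Placeholders: replace with your own project, region, bucket, and image.
aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

# A custom training job that runs your training container on managed,
# GPU-equipped infrastructure that Vertex AI provisions and tears down for you.
job = aiplatform.CustomContainerTrainingJob(
    display_name="example-training-job",
    container_uri="us-docker.pkg.dev/my-project/my-repo/trainer:latest",
)

job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
```

Because the job runs on managed infrastructure, you are only billed while it is running, which is exactly the "no idle GPUs" benefit described above.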
2. Improve the startup time of your inference containers
When working with GKE or Cloud Run, don't store your models directly in your containers. Keep your containers lightweight and store your models externally using options like Cloud Storage (with FUSE), Filestore, or a shared read-only persistent disk.
Why? Because bulky containers with embedded models take a long time to scale up. Nodes have to download these massive images before they can even start running. Plus, it puts a strain on node storage, which isn't optimized for high throughput. By mounting models externally, you separate them from the container, making everything faster and smoother. Your containers start up quicker, scaling is a breeze, and you avoid bottlenecks on your nodes.
Remember, containers are meant to be temporary and nimble, holding just the essentials. Models are big and need long-term storage. So keep them separate and use external storage to build a fast, efficient, autoscaling deployment.
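As an illustration, a serving container built this way only needs to read the model from the mounted path at startup. Here is a minimal sketch assuming a TorchScript model file exposed at /models through a Cloud Storage FUSE (or Filestore / read-only persistent disk) volume mount; the paths and endpoint names are illustrative, not a prescribed layout.

```python
# Minimal serving sketch: the model lives on an externally mounted volume
# (e.g. a Cloud Storage bucket mounted with Cloud Storage FUSE at /models),
# so the container image stays small and starts quickly.
import os

import torch
from fastapi import FastAPI

# Illustrative path: points at the volume mount, not at a file baked into the image.
MODEL_PATH = os.environ.get("MODEL_PATH", "/models/model.pt")

app = FastAPI()
model = torch.jit.load(MODEL_PATH, map_location="cpu")  # loaded once at startup
model.eval()

@app.post("/predict")
def predict(features: list[float]):
    # Run a single example through the model and return the raw prediction.
    with torch.no_grad():
        output = model(torch.tensor(features).unsqueeze(0))
    return {"prediction": output.squeeze(0).tolist()}
```

Swapping the model version then becomes a matter of changing what is in the bucket or mount, without rebuilding or re-pulling the container image.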
If you aren't able to modify the containers you are working with, GKE offers the option to use a secondary boot disk to speed up the startup process for new nodes. Nodes with a secondary boot disk start up with an additional disk that already has some of your container images preloaded. And since we're talking about startup times, I suggest having a look at GKE's Image streaming feature, which can make every workload start faster, not just AI-related ones.
3. Storage is not as simple as it seems
Machine learning usually needs a lot of data (hundreds of terabytes to petabytes) to produce valuable results, especially when it deals with non-text data and multimodal models. Maximizing GPU/TPU utilization (goodput) during training, checkpointing, and serving is critical and often not a trivial task. For smaller AI workloads, with a handful of nodes and terabytes of data, Filestore is a good, "simple" NFS solution.
For companies that have written their AI workloads to consume object storage, Cloud Storage is a fully managed object storage service suitable for AI and ML workloads of any scale. However, many companies’ AI workloads require a file system interface. Cloud Storage FUSE lets you mount Cloud Storage buckets as a file system. Cloud Storage FUSE is not fully Portable Operating System Interface (POSIX)-compliant; therefore, understanding its limitations and differences from traditional file systems is crucial. With Cloud Storage FUSE, you can access your training data, models, and checkpoints with the scale, affordability, and performance of Cloud Storage.
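To illustrate, training code barely changes when checkpointing through a Cloud Storage FUSE mount: it just writes to the mounted path. Below is a hedged sketch assuming the bucket is mounted at /mnt/gcs; the mount point and checkpoint layout are placeholders, and the write-locally-then-copy pattern is one way to stay within Cloud Storage FUSE's POSIX limitations.

```python
import os
import shutil
import tempfile

import torch

# Cloud Storage bucket mounted with Cloud Storage FUSE, for example:
#   gcsfuse my-training-bucket /mnt/gcs
# The mount point and directory layout here are illustrative placeholders.
CHECKPOINT_DIR = "/mnt/gcs/checkpoints"
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def save_checkpoint(model, optimizer, step: int) -> None:
    # Cloud Storage FUSE is not fully POSIX-compliant (e.g. no atomic renames,
    # limited random-write support), so write the checkpoint to local disk first
    # and then copy it to the mount as one sequential upload.
    with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as tmp:
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            tmp.name,
        )
    shutil.copy(tmp.name, os.path.join(CHECKPOINT_DIR, f"step-{step:08d}.pt"))
    os.remove(tmp.name)

def load_latest_checkpoint(model, optimizer) -> int:
    # Checkpoint files sort lexicographically thanks to the zero-padded step number.
    checkpoints = sorted(os.listdir(CHECKPOINT_DIR))
    if not checkpoints:
        return 0
    state = torch.load(os.path.join(CHECKPOINT_DIR, checkpoints[-1]), map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```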
For workloads that require lower latency and work with smaller files, Parallelstore is a fully managed, scratch parallel file system in Google Cloud, ideal for AI and ML workloads that need low-latency (sub-millisecond) access with high throughput and high input/output operations per second (IOPS). The Parallelstore reference architecture is here.
Depending on your serving needs, Hyperdisk ML is a high-performance storage solution that is particularly well-suited for serving tasks, providing exceptionally high aggregate throughput (~1 TB/s) to up to 2,500 virtual machines concurrently.
4. Use DWS and/or future reservations to get your resources
Problems with acquiring hardware for your big jobs cost money. You line up all the data and are ready to start your work, and it turns out there are not enough GPUs available. This surprise might not generate costs in any obvious way, but the fact that you now need to adjust your schedule and "hunt" for the required resources slows you down, and time is money. Using Dynamic Workload Scheduler and Future Reservations is a way to mitigate such problems.
Future Reservations do exactly what the name suggests: they allow you to reserve Cloud resources you plan to use in the future. Once a reservation is accepted by the system, you can stop worrying about the availability of the GPUs you need. When the time comes, the system delivers the requested hardware reservations to your projects. Once that happens, you can use those reservations however you want – the resources are available only to you for as long as the reservation exists. Keep in mind that you pay for those resources whether or not they are utilized.
Dynamic Workload Scheduler (DWS) is a backend platform used by multiple Google Cloud products to make acquiring popular hardware easier. With its Flex Start and Calendar modes, you can make sure you won't waste time or money trying to grab GPUs one by one as they are released by other Cloud customers. You can learn more about DWS and how it's utilized by various products from this video.
5. Use custom disk images to make setup faster
Running AI workloads on virtual machines requires a lot of setup: an up-to-date operating system, installed GPU drivers, and installed and configured AI frameworks like JAX, PyTorch, or TensorFlow. The full setup of such a system, if started from a clean operating system image, can take up to an hour, depending on the freshness of the OS image and your software choices. It would make sense to do all this setup only once, right?
This is where the power of custom disk images lies. You configure your VM once, install all the necessary software, shut it down, and create a disk image from the configured VM's disk; from then on, it takes only seconds to start new, fully configured workers whenever you need them. To make your life even easier, you can use image families and managed instance groups to have Google Cloud automatically handle rolling updates to your setups.
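For illustration, the same flow can be automated with the Compute Engine Python client: configure a VM once, stop it, create an image from its boot disk into an image family, and boot new workers from the family's latest image. The project, zone, disk, and family names below are placeholders.

```python
# pip install google-cloud-compute
from google.cloud import compute_v1

PROJECT = "my-project"           # placeholder project ID
ZONE = "europe-west4-b"          # placeholder zone
SOURCE_DISK = "ai-worker-disk"   # boot disk of the VM you configured once (VM stopped)
IMAGE_FAMILY = "ai-worker"       # family lets new VMs always pick up the latest image

images_client = compute_v1.ImagesClient()

# Create a custom image from the configured VM's boot disk.
image = compute_v1.Image(
    name="ai-worker-v1",
    family=IMAGE_FAMILY,
    source_disk=f"projects/{PROJECT}/zones/{ZONE}/disks/{SOURCE_DISK}",
)
operation = images_client.insert(project=PROJECT, image_resource=image)
operation.result()  # wait for the image to be created

# Later, resolve the newest image in the family when starting new workers.
latest = images_client.get_from_family(project=PROJECT, family=IMAGE_FAMILY)
print(f"Boot new VMs from: {latest.self_link}")
```

Pointing a managed instance group's template at the image family means that publishing a new image version (e.g. ai-worker-v2) is enough for a rolling update to pick it up.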
Get started
To keep up to date with everything that’s happening in Google Cloud, consider the following:
- Subscribe to our YouTube channel (or the less technically focused channel),
- Subscribe to our Newsletter,
- Join the Google Cloud Innovators program.