GCP – Accelerate model downloads on GKE with NVIDIA Run:ai Model Streamer
As large language models (LLMs) continue to grow in size and complexity, the time it takes to load them from storage to accelerator memory for inference can become a significant bottleneck. This “cold start” problem isn’t just a minor delay — it’s a critical barrier to building resilient, scalable, and cost-effective AI services. Every minute spent loading a model is a minute a GPU is sitting idle, a minute your service is delayed from scaling to meet demand, and a minute a user request is waiting.
Google Cloud and NVIDIA are committed to removing these barriers. We’re excited to highlight a powerful, open-source collaboration that helps AI developers do just that: the NVIDIA Run:ai Model Streamer now comes with native Google Cloud Storage support, supercharging vLLM inference workloads on Google Kubernetes Engine (GKE). Accessing data for AI/ML from Cloud Storage on GKE has never been faster!

The chart above shows how quickly the model streamer can fetch the 141 GB Llama 3.3 70B model from Cloud Storage compared with the default vLLM model loader (lower is better).
Boost resilience and scalability with fewer cold starts
For an inference server running on Kubernetes, a “cold start” involves several steps: pulling the container image, starting the process, and — most time-consuming of all — loading the model weights into GPU memory. For large models, this loading phase can take many minutes, with painful consequences such as slow auto-scaling and idling GPUs as they wait for the workload to start up.
By streaming the model into GPU memory, the model streamer slashes what is often the most time-consuming part of the startup process. Instead of waiting for the entire model to download before loading it, the streamer fetches model tensors directly from object storage and streams them concurrently into GPU memory. This dramatically reduces model loading times from minutes to seconds.
For workloads that rely on model parallelism — where a single model is partitioned and executed across multiple GPUs — the model streamer goes a step further. Its distributed streaming capability is optimized to take full advantage of NVIDIA NVLink, using high-bandwidth GPU-to-GPU communication to coordinate loading across multiple processes. Reading the weights from storage is divided efficiently and evenly across all participating processes, with each one fetching a portion of the model weights from storage and then sharing its segment with the others over NVLink. This allows even multi-GPU deployments to benefit from faster startups and fewer cold-start bottlenecks.
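As a concrete sketch, a tensor-parallel vLLM launch that enables the streamer's distributed loading could look like the command below (the gs:// path and parallelism degree are placeholders; the same distributed option appears in the GKE manifest later in this post):

# Serve a model from Cloud Storage across four GPUs, with the Run:ai Model
# Streamer's distributed mode splitting the weight reads across processes
# (the bucket path and --tensor-parallel-size value are placeholders).
vllm serve gs://your-gcs-bucket/path/to/your/model \
  --tensor-parallel-size=4 \
  --load-format=runai_streamer \
  --model-loader-extra-config='{"distributed": true}'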
Performance and simplicity
The latest updates to the Model Streamer introduce first-class support for Cloud Storage, creating an integrated and high-performance experience for Google Cloud users. This integration is designed to be simple, fast, and secure, especially for workloads running on GKE.
For users of popular inference servers like vLLM, enabling the streamer is as simple as adding a single flag to your vLLM command line:
--load-format=runai_streamer
Here’s how easy it is to launch a model stored in a Cloud Storage bucket with vLLM:
vllm serve gs://your-gcs-bucket/path/to/your/model \
  --load-format=runai_streamer
The NVIDIA Run:ai Model Streamer is a key component for Vertex AI Model Garden’s large model deployments. With container image streaming and model weight streaming, we have been able to significantly improve the first deployment and autoscaling experience for our users, and the efficiency of NVIDIA GPUs.
When running on GKE, the Model Streamer can automatically use the cluster's Workload Identity. This means you no longer need to manually manage and mount service account keys, simplifying your deployment manifests and enhancing your security posture. The following deployment manifest shows how to launch a container serving Llama 3 70B on GKE. We have added the model loader's distributed option to accelerate loading when model parallelism > 1:
apiVersion: apps/v1
kind: Deployment
...
    spec:
      serviceAccountName: gcs-access
      containers:
      - args:
        - --model=gs://your-gcs-bucket/path/to/your/model
        - --load-format=runai_streamer
        - --model-loader-extra-config={"distributed":true}
        ...
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        image: vllm/vllm-openai:latest
        ...
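The serviceAccountName: gcs-access referenced above is assumed to be a Kubernetes service account with read access to the model bucket through Workload Identity Federation for GKE. A minimal sketch of that binding, using placeholder project, namespace, and bucket names, might look like this:

# Create the Kubernetes service account referenced by the Deployment.
kubectl create serviceaccount gcs-access --namespace default

# Grant it read access to the model bucket via Workload Identity Federation
# for GKE (PROJECT_NUMBER, PROJECT_ID, and the bucket name are placeholders).
gcloud storage buckets add-iam-policy-binding gs://your-gcs-bucket \
  --member="principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/default/sa/gcs-access" \
  --role="roles/storage.objectViewer"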
That’s it! The streamer handles the rest, auto-tuning streaming concurrency to match your VM’s performance. For more details, see the documentation on optimizing vLLM model loading on GKE.
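If the defaults don't suit your environment, the streamer's concurrency can also be set explicitly through the same --model-loader-extra-config flag. The sketch below assumes the concurrency option described in the vLLM Run:ai Model Streamer documentation; the value 16 is only an example:

# Request 16 concurrent streaming threads instead of relying on auto-tuning
# (the "concurrency" key is assumed from the vLLM docs for the streamer).
vllm serve gs://your-gcs-bucket/path/to/your/model \
  --load-format=runai_streamer \
  --model-loader-extra-config='{"concurrency": 16}'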
Combining NVIDIA Run:ai Model Streamer with Cloud Storage Anywhere Cache
Anywhere Cache provides zonally co-located SSD-backed caching for data stored in a regional or multi-regional Cloud Storage bucket. Reducing latency by up to 70% and providing up to 2.5 TB/s of read throughput, Anywhere Cache is a great solution for scale-out inference workloads where the same model is downloaded multiple times across a series of nodes. Together, Anywhere Cache server-side acceleration, along with the NVIDIA Run:ai Model Streamer’s client-side acceleration, create an easy-to-manage, extremely performant model-loading system.
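To try this combination, you can create an Anywhere Cache for the model bucket in the zone that hosts your GKE node pool. Here is a minimal sketch with gcloud, assuming the anywhere-caches command group from the Cloud Storage CLI and using placeholder bucket and zone names:

# Create an SSD-backed Anywhere Cache for the model bucket in the zone where
# the GKE nodes run (bucket name and zone are placeholders).
gcloud storage buckets anywhere-caches create gs://your-gcs-bucket us-central1-a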
Get started today
The NVIDIA Run:ai Model Streamer is evolving into a critical piece of the AI infrastructure puzzle, enabling teams to build faster, more resilient, and more flexible MLOps pipelines on GKE.
- To learn more about how to use the model streamer on GKE, see our GKE NVIDIA Run:ai Guide.
- For detailed instructions on using the streamer with vLLM, see the official vLLM documentation.
- To learn more and contribute to the model streamer's ongoing development, check out the NVIDIA Run:ai Model Streamer project on GitHub.
