GCP – How to deploy the Llama 3.2-1B-Instruct model with Google Cloud Run GPU
As open-source large language models (LLMs) become increasingly popular, developers are looking for better ways to access new models and deploy them on Cloud Run GPU. That’s why Cloud Run now offers fully managed NVIDIA GPUs, which removes the complexity of driver installations and library configurations. This means you’ll benefit from the same on-demand availability and effortless scalability that you love with Cloud Run’s CPU and memory, with the added power of NVIDIA GPUs. When your application is idle, your GPU-equipped instances automatically scale down to zero, optimizing your costs.
In this blog post, we’ll guide you through deploying the Meta Llama 3.2 1B Instruct model on Cloud Run. We’ll also share best practices to streamline your development process using local model testing with the Text Generation Inference (TGI) Docker image, making troubleshooting easy and boosting your productivity.
Why Cloud Run with GPU?
There are four critical reasons developers benefit from deploying open models on Cloud Run with GPU:
- Fully managed: No need to worry about drivers, libraries, or infrastructure.
- On-demand scaling: Scale up or down automatically based on demand.
- Cost effective: Only pay for what you use, with automatic scaling down to zero when idle.
- Performance: NVIDIA GPU-optimized for Meta Llama 3.2.
Initial Setup
- First, create a Hugging Face token.
- Second, check that your Hugging Face token has permission to access and download the gated Llama 3.2 model weights on Hugging Face. Keep your token handy for the next step.
- Third, use Google Cloud’s Secret Manager to store your Hugging Face token securely. In this example, we use Google user credentials. You may need to authenticate with the gcloud CLI, set a default project ID, enable the necessary APIs, and grant access to Secret Manager and Cloud Storage.
```
# Authenticate CLI
gcloud auth login

# Set default project
gcloud config set project <your_project_id>

# Create new secret key, remember to update <your_huggingface_token>
gcloud secrets create HF_TOKEN --replication-policy="automatic"
echo -n <your_huggingface_token> | gcloud secrets versions add HF_TOKEN --data-file=-

# Retrieve the key
HF_TOKEN=$(gcloud secrets versions access latest --secret="HF_TOKEN")
```
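If the required APIs are not yet enabled, or the Cloud Run runtime service account cannot read the secret, the commands below are a minimal sketch of those two steps. They assume you deploy with the Compute Engine default service account; swap in your own service account if you use a different one.

```
# Enable the services used in this walkthrough (some may already be enabled)
gcloud services enable run.googleapis.com secretmanager.googleapis.com storage.googleapis.com

# Grant the (assumed) Compute Engine default service account read access to the secret
PROJECT_NUMBER=$(gcloud projects describe $(gcloud config get-value project) --format='value(projectNumber)')
gcloud secrets add-iam-policy-binding HF_TOKEN \
  --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
  --role="roles/secretmanager.secretAccessor"
```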
Local debugging
- Install the huggingface_hub Python package (which provides the huggingface-cli command) in your virtual environment.
- Run huggingface-cli login to set up your Hugging Face credentials.
- Use the TGI Docker image to test your model locally. This allows you to iterate and debug your model locally before deploying it to Cloud Run.
```
export LOCAL_MODEL_DIR=~/.cache/huggingface/hub
export CONTAINER_MODEL_DIR=/root/.cache/huggingface/hub
export LOCAL_PORT=3002

docker run --gpus all -ti --shm-size 1g -p $LOCAL_PORT:8080 \
  -e MODEL_ID=meta-llama/Llama-3.2-1B-Instruct \
  -e NUM_SHARD=1 \
  -e HF_TOKEN=$(gcloud secrets versions access latest --secret="HF_TOKEN") \
  -e MAX_INPUT_LENGTH=500 \
  -e MAX_TOTAL_TOKENS=1000 \
  -e HUGGINGFACE_HUB_CACHE=$CONTAINER_MODEL_DIR \
  -v $LOCAL_MODEL_DIR:$CONTAINER_MODEL_DIR \
  us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310
```
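Once the container is running, you can send it a quick request before touching Cloud Run. This is a sketch that assumes the TGI image exposes the OpenAI-compatible /v1/chat/completions route on the port you mapped above:

```
# Local smoke test against the TGI container
curl http://localhost:$LOCAL_PORT/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [
      {"role": "user", "content": "Say hello in one sentence."}
    ],
    "max_tokens": 64
  }'
```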
Deployment to Cloud Run
- Deploy the model to Cloud Run with an NVIDIA L4 GPU (remember to update SERVICE_NAME).
```
export LOCATION=us-central1
export CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310
export SERVICE_NAME=<your-cloudrun-service-name>

gcloud beta run deploy $SERVICE_NAME \
  --image=$CONTAINER_URI \
  --args="--model-id=meta-llama/Llama-3.2-1B-Instruct,--max-concurrent-requests=1" \
  --port=8080 \
  --cpu=8 \
  --memory=32Gi \
  --no-cpu-throttling \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --max-instances=3 \
  --concurrency=64 \
  --region=$LOCATION \
  --no-allow-unauthenticated \
  --set-secrets=HF_TOKEN=HF_TOKEN:latest
```
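Once the deploy command completes, you can print the service URL you’ll need for the next section. The format string below is one way to extract just the URL:

```
# Print the HTTPS endpoint of the new service
gcloud run services describe $SERVICE_NAME \
  --region=$LOCATION \
  --format='value(status.url)'
```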
Endpoint testing
- Test your deployed model using curl. This sends a chat completion request to your Cloud Run service, demonstrating how to interact with the deployed model.
```
URL=https://your-url.us-central1.run.app

curl $URL/v1/chat/completions \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is Cloud Run?"
      }
    ],
    "max_tokens": 128
  }'
```
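If you prefer tokens to arrive incrementally, TGI’s OpenAI-compatible endpoint also accepts a stream flag. This variant is a sketch that reuses the same URL and authentication as above:

```
# Streamed chat completion (server-sent events); -N disables curl's output buffering
curl -N $URL/v1/chat/completions \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "What is Cloud Run?"}],
    "max_tokens": 128,
    "stream": true
  }'
```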
Cold start improvements with Cloud Storage FUSE
You’ll notice that it takes more than a minute during a cold start for the response to return. Can we do better?
We can use Cloud Storage FUSE. Cloud Storage FUSE is an open-source tool that lets you mount Google Cloud Storage buckets as a file system.
First, you need to download the model files and upload them to a Cloud Storage bucket (remember to update GCS_BUCKET).
```
# 1. Download model
MODEL=meta-llama/Llama-3.2-1B-Instruct
LOCAL_DIR=/mnt/project/google-cloudrun-gpu/gcs_folder/hub/Llama-3.2-1B-Instruct
GCS_BUCKET=gs://<YOUR_BUCKET_WITH_MODEL_WEIGHT>

huggingface-cli download $MODEL --exclude "*.bin" "*.pth" "*.gguf" ".gitattributes" --local-dir $LOCAL_DIR

# 2. Copy to GCS
gsutil -o GSUtil:parallel_composite_upload_threshold=150M -m cp -e -r $LOCAL_DIR $GCS_BUCKET
```
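Before wiring the bucket into Cloud Run, it’s worth confirming that the weights actually landed where the service will look for them. A simple listing is enough; the path below assumes the directory name used in the copy command above:

```
# Expect config.json, tokenizer files, and *.safetensors shards
gsutil ls -r $GCS_BUCKET/Llama-3.2-1B-Instruct
```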
Now, we will create a new Cloud Run service using the deployment script below (remember to update BUCKET_NAME). You may also need to update the network and subnet names.
```
export LOCATION=us-central1
export CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
export SERVICE_NAME=cloudrun-gpu-fuse-llama32-1b-instruct
export VOLUME_NAME=fuse
export BUCKET_NAME=<YOUR_BUCKET_WITH_MODEL_WEIGHT>
export MOUNT_PATH=/mnt/fuse

gcloud beta run deploy $SERVICE_NAME \
  --image=$CONTAINER_URI \
  --args="--model-id=$MOUNT_PATH/Llama-3.2-1B-Instruct,--max-concurrent-requests=1" \
  --port=8080 \
  --cpu=8 \
  --memory=32Gi \
  --no-cpu-throttling \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --max-instances=3 \
  --concurrency=64 \
  --region=$LOCATION \
  --network=default \
  --subnet=default \
  --vpc-egress=all-traffic \
  --no-allow-unauthenticated \
  --update-env-vars=HF_HUB_OFFLINE=1 \
  --add-volume=name=$VOLUME_NAME,type=cloud-storage,bucket=$BUCKET_NAME \
  --add-volume-mount=volume=$VOLUME_NAME,mount-path=$MOUNT_PATH
```
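To check whether the FUSE-backed service actually shortens cold starts, time a first request after the service has scaled to zero and compare it with the earlier deployment. This is a rough, single-request measurement rather than a benchmark:

```
FUSE_URL=$(gcloud run services describe $SERVICE_NAME --region=$LOCATION --format='value(status.url)')

# Run after the service has been idle long enough to scale to zero
time curl $FUSE_URL/v1/chat/completions \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H 'Content-Type: application/json' \
  -d '{"model": "tgi", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'
```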
Next Steps
To learn more about Cloud Run with NVIDIA GPUs and to deploy your own open-source model from Hugging Face, check out the Cloud Run GPU documentation and the Hugging Face Text Generation Inference documentation.