GCP – Announcing Gemma 3 on Vertex AI
Today, we're sharing that the new Gemma 3 model is available on Vertex AI Model Garden, giving you immediate access for fine-tuning and deployment. You can quickly adapt Gemma 3 to your use case using Vertex AI's pre-built containers and deployment tools.
In this post, you’ll learn how to fine-tune Gemma 3 on Vertex AI and deploy it as a production-ready endpoint.
Gemma 3 on Vertex AI: PEFT and vLLM deployment
Tuning and deploying large language models can be computationally expensive and time-consuming. That’s why we’re excited to announce Gemma 3 support for Parameter-Efficient Fine-Tuning (PEFT) and optimized deployment using vLLM on Vertex AI Model Garden.
Gemma 3 fine-tuning allows you to achieve performance gains with significantly less computational resources compared to full fine-tuning. Our vLLM-based deployment is easy-to-use and fast. vLLM’s optimized inference engine maximizes throughput and minimizes latency, ensuring a responsive and scalable endpoint for your Gemma 3 applications on Vertex AI.
Let’s look at how you can fine-tune and deploy your Gemma 3 model on Vertex AI.
Fine-tuning Gemma 3 on Vertex AI
In Vertex AI Model Garden, you can fine-tune and deploy Gemma 3 using PEFT (LoRA) from Hugging Face in only a few steps. Before you run the notebook, make sure you complete all of the initial setup steps it describes.
Fine-tuning Gemma 3 on Vertex AI for your use case requires a custom dataset. The recommended format is a JSONL file, where each line is a valid JSON string. Here’s an example inspired by the timdettmers/openassistant-guanaco dataset:
```json
{"text": "### Human: Hola### Assistant: ¡Hola! ¿En qué puedo ayudarte hoy?"}
```
The JSON object has a key text, which should match train_column; the value should be one training data point, i.e. a string. You can upload your dataset to Google Cloud Storage (preferred) or to Hugging Face datasets.
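If you prepare the dataset yourself, a minimal sketch of writing the JSONL file and uploading it to Cloud Storage might look like the following. The bucket name, object path, and second example record are placeholders for illustration only.

```python
# Minimal sketch: write a JSONL training file and upload it to Cloud Storage.
# Bucket name and paths below are placeholders; adapt them to your project.
import json
from google.cloud import storage

examples = [
    {"text": "### Human: Hola### Assistant: ¡Hola! ¿En qué puedo ayudarte hoy?"},
    {"text": "### Human: What is Vertex AI?### Assistant: Vertex AI is Google Cloud's ML platform."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        # One valid JSON object per line, as expected by the trainer.
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

client = storage.Client()
bucket = client.bucket("your-dataset-bucket")  # placeholder bucket name
bucket.blob("gemma3/train.jsonl").upload_from_filename("train.jsonl")
```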
Choose the Gemma 3 variant that best suits your needs. For example, to use the 1B parameter model:
```python
base_model_id = "gemma-3-1b-pt"
```
You have the flexibility to customize model parameters and job arguments. Let's explore some key settings. LoRA (Low-Rank Adaptation) is a PEFT technique that significantly reduces the number of trainable parameters. The following parameters control LoRA's behavior: lora_rank controls the dimensionality of the update matrices (a smaller rank means fewer trainable parameters), lora_alpha scales the LoRA updates, and lora_dropout adds regularization. The following settings are a reasonable starting point.
```python
lora_rank = 16
lora_alpha = 32
lora_dropout = 0.05
```
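The managed training container wires these values into Hugging Face PEFT for you. Conceptually, they correspond to a LoraConfig roughly like the sketch below; this is an illustration, not the container's actual code.

```python
# Rough illustration of how the flags above map onto Hugging Face PEFT's LoraConfig.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,               # lora_rank: dimensionality of the LoRA update matrices
    lora_alpha=32,      # scaling factor applied to the LoRA updates
    lora_dropout=0.05,  # dropout on the LoRA layers for regularization
    task_type="CAUSAL_LM",
)
```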
When fine-tuning large language models (LLMs), precision is a key consideration, impacting both memory usage and performance. Lower precision training, such as 4-bit quantization, reduces the memory footprint. However, this can come with a slight performance trade-off compared to higher precisions like 8-bit or float16. The train_precision parameter dictates the numerical precision used during the training process. Choosing the right precision involves balancing resource limitations with desired model accuracy.
```python
finetuning_precision_mode = "4bit"
train_precision = "bfloat16"
```
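As a rough illustration of what these two settings mean in Hugging Face terms, 4-bit quantized base weights with bfloat16 compute can be expressed with a BitsAndBytesConfig like the sketch below. The Vertex AI training container applies the equivalent configuration from the flags above; this snippet is not its actual code.

```python
# Illustrative only: 4-bit quantized base weights with bfloat16 compute,
# mirroring finetuning_precision_mode="4bit" and train_precision="bfloat16".
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4 bits
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the actual compute in bfloat16
)
```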
Optimizing model performance involves tuning training parameters that impact speed, stability, and capabilities. Essential parameters include per_device_train_batch_size, which determines the batch size per GPU, with larger sizes accelerating training but demanding more memory. gradient_accumulation_steps allows simulating larger batch sizes by accumulating gradients over smaller batches, providing a memory-efficient alternative at the cost of increased training time. The learning_rate dictates the optimization step size, where a rate that is too high can lead to divergence, while a rate that is too low can slow down convergence. The lr_scheduler_type dynamically adjusts the learning rate throughout training, such as through linear decay, fostering better convergence and accuracy. And the total training duration is defined by either max_steps, specifying the total number of training steps, or num_train_epochs, with max_steps taking precedence if both are specified. Below is the full training recipe you will find in the official notebook.
```python
train_job_args = [
    "--config_file=vertex_vision_model_garden_peft/deepspeed_zero2_8gpu.yaml",
    "--task=instruct-lora",
    "--input_masking=True",
    "--pretrained_model_name_or_path=gg-hf-g/gemma-3-1b",
    "--train_dataset=timdettmers/openassistant-guanaco",
    "--train_split=train",
    "--train_column=text",
    "--output_dir=gs://your-adapter-repo",
    "--merge_base_and_lora_output_dir=gs://merged-model-repo",
    "--per_device_train_batch_size=1",
    "--gradient_accumulation_steps=4",
    "--lora_rank=16",
    "--lora_alpha=32",
    "--lora_dropout=0.05",
    "--max_steps=-1",
    "--max_seq_length=4096",
    "--learning_rate=5e-05",
    "--lr_scheduler_type=cosine",
    "--precision_mode=4bit",
    "--train_precision=bfloat16",
    "--gradient_checkpointing=True",
    "--num_train_epochs=1.0",
    "--attn_implementation=eager",
    "--optimizer=paged_adamw_32bit",
    "--warmup_ratio=0.01",
    "--report_to=tensorboard",
    "--logging_output_dir=gs://your-logs-repo",
    "--save_steps=10",
    "--logging_steps=10",
    "--train_template=openassistant-guanaco",
    "--huggingface_access_token=your-token",
    "--eval_dataset=timdettmers/openassistant-guanaco",
    "--eval_column=text",
    "--eval_template=openassistant-guanaco",
    "--eval_split=test",
    "--eval_steps=10",
    "--eval_metric_name=loss,perplexity,bleu",
    "--metric_for_best_model=perplexity"
]
```
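As a quick sanity check on the recipe above, the effective global batch size is the per-device batch size times the gradient accumulation steps times the number of GPUs. Assuming the 8-GPU DeepSpeed ZeRO-2 config referenced in --config_file, it works out as follows.

```python
# Effective global batch size for the recipe above. The GPU count of 8 is an
# assumption based on the deepspeed_zero2_8gpu.yaml config file.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_gpus = 8
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 32
```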
Finally, create and run the CustomContainerTrainingJob to start the fine-tuning job.
```python
train_job = aiplatform.CustomContainerTrainingJob(
    display_name=job_name,
    container_uri=TRAIN_DOCKER_URI,
    labels=labels,
)

train_job.run(
    args=train_job_args,
    replica_count=replica_count,
    machine_type=training_machine_type,
    accelerator_type=training_accelerator_type,
    accelerator_count=per_node_accelerator_count,
    boot_disk_size_gb=500,
    service_account=SERVICE_ACCOUNT,
    base_output_dir=base_output_dir,
    sync=False,
    **dws_kwargs,
)

train_job.wait_for_resource_creation()
```
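For reference, the variables used above (job name, training container image, machine shape, and so on) are defined earlier in the notebook. A sketch with placeholder values, which you would adapt to your own project, might look like this; take the actual training image URI and recommended machine shapes from the Model Garden notebook.

```python
# All values below are placeholders for illustration; the real training container
# URI and machine recommendations come from the Model Garden notebook.
from google.cloud import aiplatform

aiplatform.init(
    project="your-project",
    location="us-central1",
    staging_bucket="gs://your-staging-bucket",
)

job_name = "gemma-3-1b-peft-finetune"
labels = {"model": "gemma-3-1b"}
TRAIN_DOCKER_URI = "us-docker.pkg.dev/..."      # Model Garden PEFT training image (see notebook)
replica_count = 1
training_machine_type = "a3-highgpu-8g"         # example machine type (assumption)
training_accelerator_type = "NVIDIA_H100_80GB"  # example accelerator (assumption)
per_node_accelerator_count = 8
SERVICE_ACCOUNT = "your-service-account@your-project.iam.gserviceaccount.com"
base_output_dir = "gs://your-output-bucket/gemma3-finetune"
dws_kwargs = {}  # optional Dynamic Workload Scheduler settings; empty if unused
```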
You can monitor the fine-tuning progress using TensorBoard. Once the job is complete, you can upload the tuned model to the Vertex AI Model Registry and deploy it as an endpoint for inference. Let's dive into deployment next.
Deploying Gemma 3 on Vertex AI
Deploying Gemma 3 on Vertex AI requires only three steps as described in this notebook.
First, you need to provision a dedicated endpoint for your Gemma 3 model. This provides a scalable and managed environment for hosting your model. You use the create function to set the endpoint name (display_name) and to ensure dedicated resources for your model (dedicated_endpoint_enabled).
```python
from google.cloud import aiplatform as vertex_ai

endpoint = vertex_ai.Endpoint.create(
    display_name="gemma3-endpoint",
    dedicated_endpoint_enabled=True,
)
```
Next, register the Gemma 3 model within the Vertex AI Model Registry. Think of the Model Registry as a central hub for managing your models. It keeps track of different versions of your Gemma 3 model (in case you make improvements later), and is the central place from which you’ll deploy.
```python
vllm_serving_image_uri = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250312_0916_RC01"

env_vars = {
    "MODEL_ID": "google/gemma-3-1b-it",
    "DEPLOY_SOURCE": "notebook",
    "HF_TOKEN": "your-hf-token"
}

vllm_args = [
    "python",
    "-m",
    "vllm.entrypoints.api_server",
    "--host=0.0.0.0",
    "--port=8080",
    "--model=gs://vertex-model-garden-restricted-us/gemma3/gemma-3-1b-it",
    "--tensor-parallel-size=1",
    "--swap-space=16",
    "--gpu-memory-utilization=0.95",
    "--max-model-len=32768",
    "--dtype=auto",
    "--max-loras=1",
    "--max-cpu-loras=8",
    "--max-num-seqs=256",
    "--disable-log-stats",
    "--trust-remote-code",
    "--enforce-eager",
    "--enable-lora",
    "--enable-chunked-prefill",
    "--enable-prefix-caching"
]

model = aiplatform.Model.upload(
    display_name="gemma-3-1b",
    serving_container_image_uri=vllm_serving_image_uri,
    serving_container_args=vllm_args,
    serving_container_ports=[8080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
    serving_container_environment_variables=env_vars,
    serving_container_shared_memory_size_mb=(16 * 1024),
    serving_container_deployment_timeout=7200,
    model_garden_source_model_name="publishers/google/models/gemma3",
)
```
This step involves a few important configurations, including the serving container used to deploy Gemma 3.
To serve Gemma 3 on Vertex AI, use the Vertex AI Model Garden vLLM pre-built Docker image for fast and efficient model serving. The vLLM arguments define how vLLM will serve Gemma 3: --tensor-parallel-size lets you spread the model across multiple GPUs if you need extra compute resources, --gpu-memory-utilization controls how much of the GPU memory you want to use, and --max-model-len sets the maximum length of text the model can process at once. You also have some advanced settings like --enable-chunked-prefill and --enable-prefix-caching to optimize performance, especially when dealing with longer pieces of text.
There are also some deployment configurations Vertex AI requires to serve the model, including the port the serving container will listen on (8080 in our case), the URL path for making prediction requests (e.g., "/generate"), and the URL path for health checks (e.g., "/ping"), which allows Vertex AI to monitor the model's status.
Finally, use upload() to take this configuration (the serving container, your model-specific settings, and instructions for how to run the model) and bundle it into a single, manageable unit within the Vertex AI Model Registry. This makes deployment and version control much easier.
Now you're ready to deploy the model. To deploy the registered model to the endpoint, use the deploy method as shown below.
```python
model.deploy(
    endpoint=endpoint,
    machine_type="a3-highgpu-2g",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    deploy_request_timeout=1800,
)
```
This is where you choose the computing power for your deployment, including the type of virtual machine (machine_type, e.g. "a3-highgpu-2g"), the kind of accelerator (accelerator_type, e.g. "NVIDIA_L4" GPUs), and how many accelerators to use (accelerator_count).
Deploying the model takes some time, and you can monitor the status of the deployment in Cloud Logging. Once the endpoint is running, you can use the ChatCompletion API to call the model and integrate it into your applications as shown below.
```python
import google.auth
import google.auth.transport.requests
import openai

creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

user_message = "How is your day going?"
max_tokens = 50
temperature = 1.0
stream = False

# Replace with your dedicated endpoint domain and endpoint resource name.
BASE_URL = "https://your-dedicated-endpoint/v1beta1/your-endpoint-name"

client = openai.OpenAI(base_url=BASE_URL, api_key=creds.token)

model_response = client.chat.completions.create(
    model="",
    messages=[{"role": "user", "content": user_message}],
    temperature=temperature,
    max_tokens=max_tokens,
    stream=stream,
)

print(model_response)
# I'm doing well, thanks for asking!...
```
Depending on the Gemma model you deploy, you can use the ChatCompletion API to call the model with multimodal inputs (images). You can find more in the “Deploy Gemma 3 4B, 12B and 27B multimodal models with vLLM on GPU” section of the model card notebook.
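As a rough sketch (the image URL is a placeholder, and exact behavior depends on the deployed variant and serving container), a multimodal request through the same OpenAI-compatible client could look like this:

```python
# Illustrative multimodal request against a vision-capable Gemma 3 deployment
# (4B/12B/27B); reuses the `client` created above. The image URL is a placeholder.
multimodal_response = client.chat.completions.create(
    model="",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
            ],
        }
    ],
    max_tokens=100,
)
print(multimodal_response.choices[0].message.content)
```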
What’s next?
Visit the Gemma 3 model card on Vertex AI Model Garden to get started today. For a deeper understanding of the model’s architecture and performance, check out this developer guide on Gemma 3.