AWS Key Management Service (KMS) is announcing support for on-demand rotation of symmetric encryption KMS keys with imported key material. This new capability enables you to rotate the cryptographic key material of Bring Your Own Keys (BYOK) keys without changing the key identifier (key ARN). Rotating keys helps you meet compliance requirements and security best practices that mandate periodic key rotation.
Organizations can now better align key rotation with their internal security policies when using imported keys in AWS KMS. The new on-demand rotation capability supports both immediate and scheduled rotation. As with flexible rotation for standard KMS keys, it offers a seamless transition to new key material within an existing KMS key ARN and key alias, with zero downtime and full backward compatibility with existing data protected under the key.
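For example, once new key material has been imported for a BYOK key, an on-demand rotation can be triggered from the AWS CLI. The commands below are a rough sketch; the key ARN is a placeholder, and the exact prerequisites for keys with imported key material are described in the KMS documentation.

# Trigger an immediate, on-demand rotation of a symmetric KMS key.
# For keys with imported (BYOK) key material, the new key material must be
# imported before the rotation can take effect.
aws kms rotate-key-on-demand \
    --key-id arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab

# Check the key's rotation status and configuration.
aws kms get-key-rotation-status \
    --key-id arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab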
The pace of innovation in open-source AI is breathtaking, with models like Meta's Llama4 and DeepSeek AI's DeepSeek family. However, deploying and optimizing large, powerful models can be complex and resource-intensive. Developers and machine learning (ML) engineers need reproducible, verified recipes that lay out the steps for trying these models on available accelerators.
Today, we're excited to announce enhanced support and new, optimized recipes for the latest Llama4 and DeepSeek models, leveraging our cutting-edge AI Hypercomputer platform. AI Hypercomputer helps build a strong AI infrastructure foundation using a set of purpose-built infrastructure components that are designed to work well together for AI workloads like training and inference. It is a systems-level approach that draws from our years of experience serving AI experiences to billions of users, and combines purpose-built hardware, optimized software and frameworks, and flexible consumption models. Our AI Hypercomputer resources repository on GitHub, your hub for these recipes, continues to grow.
In this blog, we’ll show you how to access Llama4 and DeepSeek models today on AI Hypercomputer.
Added support for new Llama4 models
Meta recently released the Scout and Maverick models in the Llama4 herd of models. Llama 4 Scout is a 17-billion-active-parameter model with 16 experts, and Llama 4 Maverick is a 17-billion-active-parameter model with 128 experts. Both deliver innovations and optimizations built on a Mixture of Experts (MoE) architecture, and both support multimodal inputs and long context lengths.
But serving these models can present challenges in terms of deployment and resource management. To help simplify this process, we’re releasing new recipes for serving Llama4 models on Google Cloud Trillium TPUs and A3 Mega and A3 Ultra GPUs.
JetStream, Google's throughput- and memory-optimized engine for LLM inference on XLA devices, now supports Llama-4-Scout-17B-16E and Llama-4-Maverick-17B-128E inference on Trillium, the sixth-generation TPU. New recipes provide the steps to deploy these models using JetStream and MaxText on a Trillium TPU GKE cluster. vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs; new recipes demonstrate how to use vLLM to serve the Llama4 Scout and Maverick models on A3 Mega and A3 Ultra GPU GKE clusters.
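As a rough illustration of the GPU serving path, the recipes boil down to launching a vLLM server against the published checkpoints; the model ID, parallelism, and context-length flags below are assumptions, and the published recipe is the validated reference.

# Illustrative only: serve Llama 4 Scout with vLLM on a single 8-GPU A3 node.
pip install vllm
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 8192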
For serving the Maverick model on TPUs, we utilize Pathways on Google Cloud. Pathways is a system which simplifies large-scale machine learning computations by enabling a single JAX client to orchestrate workloads across multiple large TPU slices. In the context of inference, Pathways enables multi-host serving across multiple TPU slices. Pathways is used internally at Google to train and serve large models like Gemini.
MaxText provides high-performance, highly scalable, open-source LLM reference implementations for OSS models, written in pure Python/JAX and targeting Google Cloud TPUs and GPUs for training and inference. MaxText now includes reference implementations for the Llama4 Scout and Maverick models, along with information on how to perform checkpoint conversion, training, and decoding for Llama4 models.
Added support for DeepSeek Models
Earlier this year, DeepSeek released two open-source models: DeepSeek-V3, followed by DeepSeek-R1. The V3 model delivers innovations and optimizations built on an MoE architecture. The R1 model provides reasoning capabilities through a chain-of-thought thinking process.
To help simplify deployment and resource management, we’re releasing new recipes for serving DeepSeek models on Google Cloud Trillium TPUs and A3 Mega and A3 Ultra GPUs.
JetStream now supports DeepSeek-R1-Distill-Llama-70B inference on Trillium. A new recipe provides the steps to deploy DeepSeek-R1-Distill-Llama-70B using JetStream and MaxText on a Trillium TPU VM. With the recent ability to work with Google Cloud TPUs, vLLM users can leverage the performance-cost benefits of TPUs with a few configuration changes. vLLM on TPU now supports all DeepSeek R1 distilled models on Trillium. Here's a recipe that demonstrates how to use vLLM, a high-throughput inference engine, to serve the DeepSeek distilled Llama model on Trillium TPUs.
You can also deploy DeepSeek models using the SGLang inference stack on our A3 Ultra VMs, powered by eight NVIDIA H200 GPUs, with this recipe. A recipe for A3 Mega VMs with SGLang is also available, showing how to deploy multihost inference across two A3 Mega nodes. Cloud GPU users on the vLLM inference engine can also deploy DeepSeek models on A3 Mega (recipe) and A3 Ultra (recipe) VMs.
MaxText now also includes support for architectural innovations from DeepSeek, such as Multi-Head Latent Attention (MLA), MoE shared and routed experts with loss-free load balancing, dropless expert parallelism, mixed decoder layers (dense and MoE), and YaRN RoPE embeddings. The reference implementations for the DeepSeek family of models allow you to rapidly experiment with your models by incorporating some of these newer architectural enhancements.
Recipe example
The reproducible recipes show the steps to deploy and benchmark inference with the new Llama4 and DeepSeek models. For example, this TPU recipe outlines the steps to deploy the Llama-4-Scout-17B-16E model with the JetStream MaxText engine on Trillium TPUs. The recipe shows how to provision the TPU cluster, download the model weights, and set up JetStream and MaxText. It then shows you how to convert the checkpoint to a format compatible with MaxText, deploy it on a JetStream server, and run your benchmarks.
You can deploy the Llama4 Scout and Maverick models or the DeepSeek V3/R1 models today using inference recipes from the AI Hypercomputer GitHub repository. These recipes provide a starting point for deploying and experimenting with these models on Google Cloud. Explore the recipes and resources linked below, and stay tuned for future updates. We hope you have fun building, and please share your feedback!
When you deploy open models like DeepSeek and Llama, you are responsible for their security and legal compliance. You should follow responsible AI best practices, adhere to each model's specific licensing terms, and ensure your deployment is secure and compliant with all regulations in your area.
Looking to fine-tune multimodal AI models for your specific domain but facing infrastructure and implementation challenges? This guide demonstrates how to overcome the multimodal implementation gap using Google Cloud and Axolotl, with a complete hands-on example fine-tuning Gemma 3 on the SIIM-ISIC Melanoma dataset. Learn how to scale from concept to production while addressing the typical challenges of managing GPU resources, data preparation, and distributed training.
Filling in the Gap
Organizations across industries are rapidly adopting multimodal AI to transform their operations and customer experiences. Gartner analysts predict 40% of generative AI solutions will be multimodal (text, image, audio and video) by 2027, up from just 1% in 2023, highlighting the accelerating demand for solutions that can process and understand multiple types of data simultaneously.
Healthcare providers are already using these systems to analyze medical images alongside patient records, speeding up diagnosis. Retailers are building shopping experiences where customers can search with images and get personalized recommendations. Manufacturing teams are spotting quality issues by combining visual inspections with technical data. Customer service teams are deploying agents that process screenshots and photos alongside questions, reducing resolution times.
Multimodal AI applications powerfully mirror human thinking. We don’t experience the world in isolated data types – we combine visual cues, text, sound, and context to understand what’s happening. Training multimodal models on your specific business data helps bridge the gap between how your teams work and how your AI systems operate.
Key challenges organizations face in production deployment
Moving from prototype to production with multimodal AI isn’t easy. PwC survey data shows that while companies are actively experimenting, most expect fewer than 30% of their current experiments to reach full scale in the next six months. The adoption rate for customized models remains particularly low, with only 20-25% of organizations actively using custom models in production.
The following technical challenges consistently stand in the way of success:
Infrastructure complexity: Multimodal fine-tuning demands substantial GPU resources – often 4-8x more than text-only models. Many organizations lack access to the necessary hardware and struggle to configure distributed training environments efficiently.
Data preparation hurdles: Preparing multimodal training data is fundamentally different from text-only preparation. Organizations struggle with properly formatting image-text pairs, handling diverse file formats, and creating effective training examples that maintain the relationship between visual and textual elements.
Training workflow management: Configuring and monitoring distributed training across multiple GPUs requires specialized expertise most teams don’t have. Parameter tuning, checkpoint management, and optimization for multimodal models introduce additional layers of complexity.
These technical barriers create what we call “the multimodal implementation gap” – the difference between recognizing the potential business value and successfully delivering it in production.
How Google Cloud and Axolotl together solve these challenges
Our collaboration brings together complementary strengths to directly address these challenges. Google Cloud provides the enterprise-grade infrastructure foundation necessary for demanding multimodal workloads. Our specialized hardware accelerators, such as NVIDIA B200 Tensor Core GPUs and Ironwood TPUs, are optimized for these tasks, while our managed services like Google Cloud Batch, Vertex AI Training, and GKE Autopilot minimize the complexities of provisioning and orchestrating multi-GPU environments. This infrastructure seamlessly integrates with the broader ML ecosystem, creating smooth end-to-end workflows while maintaining the security and compliance controls required for production deployments.
Axolotl complements this foundation with a streamlined fine-tuning framework that simplifies implementation. Its configuration-driven approach abstracts away technical complexity, allowing teams to focus on outcomes rather than infrastructure details. Axolotl supports multiple open source and open weight foundation models and efficient fine-tuning methods like QLoRA. This framework includes optimized implementations of performance-enhancing techniques, backed by community-tested best practices that continuously evolve through real-world usage.
Together, we enable organizations to implement production-grade multimodal fine-tuning without reinventing complex infrastructure or developing custom training code. This combination accelerates time-to-value, turning what previously required months of specialized development into weeks of standardized implementation.
Solution Overview
Our multimodal fine-tuning pipeline consists of five essential components:
Foundational model: Choose a base model that meets your task requirements. Axolotl supports a variety of open source and open weight multimodal models including Llama 4, Pixtral, LLaVA-1.5, Mistral-Small-3.1, Qwen2-VL, and others. For this example, we’ll use Gemma 3, our latest open and multimodal model family.
Data preparation: Create properly formatted multimodal training data that maintains the relationship between images and text. This includes organizing image-text pairs, handling file formats, and splitting data into training/validation sets.
Training configuration: Define your fine-tuning parameters using Axolotl’s YAML-based approach, which simplifies settings for adapters like QLoRA, learning rates, and model-specific optimizations.
Infrastructure orchestration: Select the appropriate compute environment based on your scale and operational requirements. Options include Google Cloud Batch for simplicity, Google Kubernetes Engine for flexibility, or Vertex AI Custom Training for MLOps integration.
Production integration: Streamlined pathways from fine-tuning to deployment.
The pipeline structure above represents the conceptual components of a complete multimodal fine-tuning system. In our hands-on example later in this guide, we’ll demonstrate these concepts through a specific implementation tailored to the SIIM-ISIC Melanoma dataset, using GKE for orchestration. While the exact implementation details may vary based on your specific dataset characteristics and requirements, the core components remain consistent.
Selecting the Right Google Cloud Environment
Google Cloud offers multiple approaches to orchestrating multimodal fine-tuning workloads. Let’s explore three options with different tradeoffs in simplicity, flexibility, and integration:
Google Cloud Batch
Google Cloud Batch is best for teams seeking maximum simplicity for GPU-intensive training jobs with minimal infrastructure management. It handles all resource provisioning, scheduling, and dependencies automatically, eliminating the need for container orchestration or complex setup. This fully managed service balances performance and cost effectiveness, making it ideal for teams who need powerful computing capabilities without operational overhead.
Vertex AI Custom Training
Vertex AI Custom Training is best for teams prioritizing integration with Google Cloud’s MLOps ecosystem and managed experiment tracking. Vertex AI Custom Training jobs automatically integrate with Experiments for tracking metrics, the Model Registry for versioning, Pipelines for workflow orchestration, and Endpoints for deployment.
Google Kubernetes Engine (GKE)
GKE is best for teams seeking flexible integration with containerized workloads. It enables unified management of training jobs alongside other services in your container ecosystem while leveraging Kubernetes' sophisticated scheduling capabilities. GKE offers fine-grained control over resource allocation, making it ideal for complex ML pipelines. For our hands-on example, we'll use GKE in Autopilot mode, which maintains these integration benefits while Google Cloud automates infrastructure management including node provisioning and scaling. This lets you focus on your ML tasks rather than cluster administration, combining the flexibility of Kubernetes with the operational simplicity of a managed service.
Take a look at our code sample here for a complete implementation that demonstrates how to orchestrate a multimodal fine-tuning job on GKE:
This repository includes ready-to-use Kubernetes manifests for deploying Axolotl training jobs on GKE in Autopilot mode, covering automated cluster setup with GPUs, persistent storage configuration, job specifications, and monitoring integration.
Hands-on example: Fine-tuning Gemma 3 on the SIIM-ISIC Melanoma dataset
Our hands-on example uses the SIIM-ISIC Melanoma Classification dataset, which contains dermoscopic images of skin lesions labeled as malignant or benign. With melanoma accounting for 75% of skin cancer deaths despite its relative rarity, early and accurate detection is critical for patient survival. By applying multimodal AI to this challenge, we unlock the potential to help dermatologists improve diagnostic accuracy and potentially save lives through faster, more reliable identification of dangerous lesions. So, let's walk through a complete example of fine-tuning Gemma 3 on this dataset.
For this implementation, we’ll leverage GKE in Autopilot mode to orchestrate our training job and monitoring, allowing us to focus on the ML workflow while Google Cloud handles the infrastructure management.
Data Preparation
The SIIM-ISIC Melanoma Classification dataset requires specific formatting for multimodal fine-tuning with Axolotl. Our data preparation process involves two main steps: (1) efficiently transferring the dataset to Cloud Storage using Storage Transfer Service, and (2) processing the raw data into the format required by Axolotl. To start, transfer the dataset.
Create a TSV file that contains the URLs for the ISIC dataset files:
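A Storage Transfer Service URL list is a TSV file that begins with a format header, followed by one object URL per line (optionally with size and MD5 columns). The entries below are illustrative placeholders rather than the actual ISIC download URLs:

TsvHttpData-1.0
https://example.com/isic/ISIC_2020_Training_JPEG.zip
https://example.com/isic/ISIC_2020_Training_GroundTruth.csv
https://example.com/isic/ISIC_2020_Test_JPEG.zip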
Set up appropriate IAM permissions for the Storage Transfer Service:
# Get your current project ID
export PROJECT_ID=$(gcloud config get-value project)

# Get your project number
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")

# Enable the Storage Transfer API
echo "Enabling Storage Transfer API..."
gcloud services enable storagetransfer.googleapis.com --project=${PROJECT_ID}

# Important: The Storage Transfer Service account is created only after you access the service.
# Access the Storage Transfer Service in the Google Cloud Console to trigger its creation:
# https://console.cloud.google.com/transfer/cloud
echo "IMPORTANT: Before continuing, please visit the Storage Transfer Service page in the Google Cloud Console"
echo "Go to: https://console.cloud.google.com/transfer/cloud"
echo "This ensures the Storage Transfer Service account is properly created."
echo "After visiting the page, wait approximately 60 seconds for account propagation, then continue."
echo ""
echo "Press Enter once you've completed this step..."
read -p ""

# Grant Storage Transfer Service the necessary permissions
export STS_SERVICE_ACCOUNT_EMAIL="project-${PROJECT_NUMBER}@storage-transfer-service.iam.gserviceaccount.com"
echo "Granting permissions to Storage Transfer Service account: ${STS_SERVICE_ACCOUNT_EMAIL}"

gcloud storage buckets add-iam-policy-binding gs://${GCS_BUCKET_NAME} \
  --member=serviceAccount:${STS_SERVICE_ACCOUNT_EMAIL} \
  --role=roles/storage.objectViewer \
  --condition=None

gcloud storage buckets add-iam-policy-binding gs://${GCS_BUCKET_NAME} \
  --member=serviceAccount:${STS_SERVICE_ACCOUNT_EMAIL} \
  --role=roles/storage.objectUser \
  --condition=None
Set up a storage transfer job using the URL list:
Navigate to Cloud Storage > Transfer
Click “Create Transfer Job”
Select “URL list” as Source type and “Google Cloud Storage” as Destination type
Enter the path to your TSV file: gs://<GCS_BUCKET_NAME>/melanoma_dataset_urls.tsv
Select your destination bucket
Use the default job settings and click Create
The transfer will download approximately 32 GB of data from the ISIC Challenge repository directly to your Cloud Storage bucket. Once the transfer is complete, you'll need to extract the ZIP files before proceeding to the next step, where we'll format this data for Axolotl. See the notebook in the GitHub repository for a full walkthrough of how to format the data for Axolotl.
Preparing Multimodal Training Data
For multimodal models like Gemma 3, we need to structure our data following the extended chat_template format, which defines conversations as a series of messages with both text and image content.
Below is an example of a single training input example:
{
  "messages": [
    {
      "role": "system",
      "content": [
        {"type": "text", "text": "You are a dermatology assistant that helps identify potential melanoma from skin lesion images."}
      ]
    },
    {
      "role": "user",
      "content": [
        {"type": "image", "path": "/path/to/image.jpg"},
        {"type": "text", "text": "Does this appear to be malignant melanoma?"}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "Yes, this appears to be malignant melanoma."}
      ]
    }
  ]
}
We split the data into training (80%), validation (10%), and test (10%) sets, while maintaining the class distribution in each split using stratified sampling.
This format allows Axolotl to properly process both the images and their corresponding labels, maintaining the relationship between visual and textual elements during training.
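As a minimal sketch of these two steps, the snippet below performs a stratified 80/10/10 split and writes chat_template-style records; the CSV column names and file paths are illustrative, and the repository notebook contains the full preparation logic.

import json
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative: ground-truth CSV with image names and benign/malignant labels.
df = pd.read_csv("ISIC_2020_Training_GroundTruth.csv")

# 80% train, 10% validation, 10% test, preserving the class distribution.
train_df, holdout_df = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)
val_df, test_df = train_test_split(
    holdout_df, test_size=0.5, stratify=holdout_df["target"], random_state=42
)

def to_messages(row):
    """Convert one labeled image into the chat_template format shown above."""
    label = ("Yes, this appears to be malignant melanoma."
             if row["target"] == 1
             else "No, this appears to be a benign lesion.")
    return {"messages": [
        {"role": "system", "content": [{"type": "text", "text":
            "You are a dermatology assistant that helps identify potential melanoma from skin lesion images."}]},
        {"role": "user", "content": [
            {"type": "image", "path": f"images/{row['image_name']}.jpg"},
            {"type": "text", "text": "Does this appear to be malignant melanoma?"}]},
        {"role": "assistant", "content": [{"type": "text", "text": label}]},
    ]}

with open("train.jsonl", "w") as f:
    for _, row in train_df.iterrows():
        f.write(json.dumps(to_messages(row)) + "\n")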
Creating the Axolotl Configuration File
Next, we’ll create a configuration file for Axolotl that defines how we’ll fine-tune Gemma 3. We’ll use QLoRA (Quantized Low-Rank Adaptation) with 4-bit quantization to efficiently fine-tune the model while keeping memory requirements manageable. While A100 40GB GPUs have substantial memory, the 4-bit quantization with QLoRA allows us to train with larger batch sizes or sequence lengths if needed, providing additional flexibility for our melanoma classification task. The slight reduction in precision is typically an acceptable tradeoff, especially for fine-tuning tasks where we’re adapting a pre-trained model rather than training from scratch.
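The full gemma3-melanoma.yaml lives in the accompanying repository; the sketch below shows the general shape of such a config. The key names follow common Axolotl conventions, but the model ID, template name, and multimodal-specific settings are assumptions, so treat the repository version as authoritative.

# Illustrative Axolotl config sketch, not the exact file from the repository.
base_model: google/gemma-3-4b-it        # assumed model ID; choose the Gemma 3 size you need
load_in_4bit: true                      # 4-bit quantization for QLoRA
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true

chat_template: gemma3                   # assumed template name
datasets:
  - path: data/train.jsonl
    type: chat_template
sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 1
learning_rate: 0.0002
optimizer: adamw_torch
bf16: true
output_dir: /workspace/outputs/gemma3-melanoma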
This configuration sets up QLoRA fine-tuning with parameters optimized for our melanoma classification task. Next, we’ll set up our GKE Autopilot environment to run the training.
Setting up GKE Autopilot for GPU Training
Now that we have our configuration file ready, let’s set up the GKE Autopilot cluster we’ll use for training. As mentioned earlier, Autopilot mode lets us focus on our ML task while Google Cloud handles the infrastructure management.
Let’s create our GKE Autopilot cluster:
# Set up environment variables for cluster configuration
export PROJECT_ID=$(gcloud config get-value project)
export REGION=us-central1
export CLUSTER_NAME=melanoma-training-cluster
export RELEASE_CHANNEL=regular

# Enable required Google APIs
echo "Enabling required Google APIs..."
gcloud services enable container.googleapis.com --project=${PROJECT_ID}
gcloud services enable compute.googleapis.com --project=${PROJECT_ID}

# Create a GKE Autopilot cluster in the same region as your data
echo "Creating GKE Autopilot cluster ${CLUSTER_NAME}..."
gcloud container clusters create-auto ${CLUSTER_NAME} \
  --location=${REGION} \
  --project=${PROJECT_ID} \
  --release-channel=${RELEASE_CHANNEL}

# Install kubectl if not already installed
if ! command -v kubectl &> /dev/null; then
  echo "Installing kubectl..."
  gcloud components install kubectl
fi

# Install the GKE auth plugin required for kubectl
echo "Installing GKE auth plugin..."
gcloud components install gke-gcloud-auth-plugin

# Configure kubectl to use the cluster
echo "Configuring kubectl to use the cluster..."
gcloud container clusters get-credentials ${CLUSTER_NAME} \
  --location=${REGION} \
  --project=${PROJECT_ID}

# Verify kubectl is working correctly
echo "Verifying kubectl connection to cluster..."
kubectl get nodes
Now set up Workload Identity Federation for GKE to securely authenticate with Google Cloud APIs without using service account keys:
# Set variables for Workload Identity Federation
export PROJECT_ID=$(gcloud config get-value project)
export NAMESPACE="axolotl-training"
export KSA_NAME="axolotl-training-sa"
export GSA_NAME="axolotl-training-sa"

# Create a Kubernetes namespace for the training job
kubectl create namespace ${NAMESPACE} || echo "Namespace ${NAMESPACE} already exists"

# Create a Kubernetes ServiceAccount
kubectl create serviceaccount ${KSA_NAME} \
  --namespace=${NAMESPACE} || echo "ServiceAccount ${KSA_NAME} already exists"

# Create an IAM service account
if ! gcloud iam service-accounts describe ${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com &>/dev/null; then
  echo "Creating IAM service account ${GSA_NAME}..."
  gcloud iam service-accounts create ${GSA_NAME} \
    --display-name="Axolotl Training Service Account"

  # Wait for IAM propagation
  echo "Waiting for IAM service account creation to propagate..."
  sleep 15
else
  echo "IAM service account ${GSA_NAME} already exists"
fi

# Grant necessary permissions to the IAM service account
echo "Granting storage.objectAdmin role to IAM service account..."
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"

# Wait for IAM propagation
echo "Waiting for IAM policy binding to propagate..."
sleep 10

# Allow the Kubernetes ServiceAccount to impersonate the IAM service account
echo "Binding Kubernetes ServiceAccount to IAM service account..."
gcloud iam service-accounts add-iam-policy-binding ${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"

# Annotate the Kubernetes ServiceAccount
echo "Annotating Kubernetes ServiceAccount..."
kubectl annotate serviceaccount ${KSA_NAME} \
  --namespace=${NAMESPACE} \
  iam.gke.io/gcp-service-account=${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com --overwrite

# Verify the configuration
echo "Verifying Workload Identity Federation setup..."
kubectl get serviceaccount ${KSA_NAME} -n ${NAMESPACE} -o yaml
Now create a PersistentVolumeClaim for our model outputs. In Autopilot mode, Google Cloud manages the underlying storage classes, so we don’t need to create our own:
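A minimal sketch of what model-storage-pvc.yaml might contain follows; the claim name, namespace, and size are illustrative.

# model-storage-pvc.yaml (illustrative)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: axolotl-training
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi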
# Apply the PVC configuration
kubectl apply -f model-storage-pvc.yaml
Deploying the Training Job to GKE Autopilot
In Autopilot mode, we specify our GPU requirements using annotations and resource requests within the Pod template section of our Job definition. We’ll create a Kubernetes Job that requests a single A100 40GB GPU:
Create a ConfigMap with our Axolotl configuration:
# Create the ConfigMap
kubectl create configmap axolotl-config --from-file=gemma3-melanoma.yaml -n ${NAMESPACE}
Create a Secret with Hugging Face credentials:
# Create a Secret with your Hugging Face token
# This token is required to access the Gemma 3 model from Hugging Face Hub
# Generate a Hugging Face token at https://huggingface.co/settings/tokens if you don't have one
kubectl create secret generic huggingface-credentials -n ${NAMESPACE} --from-literal=token=YOUR_HUGGING_FACE_TOKEN
Apply training job YAML to start the training process:
# Start the training job
kubectl apply -f axolotl-training-job.yaml
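For reference, here is a hedged sketch of the kind of manifest axolotl-training-job.yaml defines. The container image, entrypoint, and mount paths are assumptions; the manifests in the repository are the ones to use.

# axolotl-training-job.yaml (illustrative sketch)
apiVersion: batch/v1
kind: Job
metadata:
  name: gemma3-melanoma-training
  namespace: axolotl-training
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: axolotl-training-sa
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-a100   # request an A100 40GB in Autopilot
      containers:
        - name: axolotl
          image: axolotlai/axolotl:main-latest                # assumed image tag
          command: ["axolotl", "train", "/config/gemma3-melanoma.yaml"]   # assumed entrypoint
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-credentials
                  key: token
          volumeMounts:
            - name: config
              mountPath: /config
            - name: model-storage
              mountPath: /workspace/outputs
      volumes:
        - name: config
          configMap:
            name: axolotl-config
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-storage
      restartPolicy: Never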
Monitor the Training Process
Fetch the pod name to monitor progress:
# Get the pod name for the training job
POD_NAME=$(kubectl get pods -n ${NAMESPACE} --selector=job-name=gemma3-melanoma-training -o jsonpath='{.items[0].metadata.name}')

# Monitor logs in real-time
kubectl describe pod $POD_NAME -n ${NAMESPACE}
kubectl logs -f $POD_NAME -n ${NAMESPACE}
To visualize training metrics, deploy TensorBoard and retrieve its external IP:

# Deploy TensorBoard
kubectl apply -f tensorboard.yaml

# Get the external IP to access TensorBoard
kubectl get service tensorboard -n ${NAMESPACE}
Model Export and Evaluation Setup
After training completes, we need to export our fine-tuned model and evaluate its performance against the base model. First, let’s export the model from our training environment to Cloud Storage:
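One common pattern is a short-lived Job that mounts the same PersistentVolumeClaim and copies the output directory to Cloud Storage. A hedged sketch of what model-export.yaml could look like follows; the image, paths, and bucket name are illustrative.

# model-export.yaml (illustrative sketch)
apiVersion: batch/v1
kind: Job
metadata:
  name: gemma3-melanoma-export
  namespace: axolotl-training
spec:
  template:
    spec:
      serviceAccountName: axolotl-training-sa    # uses Workload Identity for bucket access
      containers:
        - name: export
          image: google/cloud-sdk:latest
          command: ["bash", "-c"]
          args:
            - gcloud storage cp -r /workspace/outputs/gemma3-melanoma gs://YOUR_BUCKET_NAME/models/
          volumeMounts:
            - name: model-storage
              mountPath: /workspace/outputs
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-storage
      restartPolicy: Never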
After creating the model-export.yaml file, apply it:
# Export the model
kubectl apply -f model-export.yaml
This will start the export process, which copies the fine-tuned model from the Kubernetes PersistentVolumeClaim to your Cloud Storage bucket for easier access and evaluation.
Once exported, we have several options for evaluating our fine-tuned model. You can deploy both the base and fine-tuned models to their own Vertex AI endpoints for systematic testing via API calls, which works well for high-volume automated testing and production-like evaluation. Alternatively, for exploratory analysis and visualization, a GPU-enabled notebook environment such as a Vertex AI Workbench instance or Colab Enterprise offers significant advantages, allowing real-time visualization of results, interactive debugging, and rapid iteration on evaluation metrics.
In this example, we use a notebook environment to leverage its visualization capabilities and interactive nature. Our evaluation approach involves:
Loading both the base and fine-tuned models
Running inference on a test set of dermatological images from the SIIM-ISIC dataset
Computing standard classification metrics (accuracy, precision, recall, etc.)
Analyzing the confusion matrices to understand error patterns
Generating visualizations to highlight performance differences
For the complete evaluation code and implementation details, check out our evaluation notebook in the GitHub repository.
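As a rough sketch of the metrics step (model loading and inference are elided, and labels and predictions are assumed to be 0/1 for benign/malignant):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             confusion_matrix)

def summarize(y_true, y_pred, name):
    """Print the classification metrics we compare across models."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    specificity = tn / (tn + fp)
    print(f"{name}: accuracy={accuracy_score(y_true, y_pred):.3f} "
          f"precision={precision_score(y_true, y_pred):.3f} "
          f"recall={recall_score(y_true, y_pred):.3f} "
          f"specificity={specificity:.3f}")

# y_true: ground-truth labels from the test split
# base_preds / tuned_preds: parsed yes/no answers from each model's responses
# summarize(y_true, base_preds, "base Gemma 3")
# summarize(y_true, tuned_preds, "fine-tuned Gemma 3")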
Performance Results
Our evaluation demonstrated that domain-specific fine-tuning can transform a general-purpose multimodal model into a much more effective tool for specialized tasks like medical image classification. The improvements were significant across multiple dimensions of model performance.
The most notable finding was the base model’s tendency to over-diagnose melanoma. It showed perfect recall (1.000) but extremely poor specificity (0.011), essentially labeling almost every lesion as melanoma. This behavior is problematic in clinical settings where false positives lead to unnecessary procedures, patient anxiety, and increased healthcare costs.
Fine-tuning significantly improved the model’s ability to correctly identify benign lesions, reducing false positives from 3,219 to 1,438. While this came with a decrease in recall (from 1.000 to 0.603), the tradeoff resulted in much better overall diagnostic capability, with balanced accuracy improving substantially.
In our evaluation, we also included results from the newly announced MedGemma, a collection of Gemma 3 variants trained specifically for medical text and image comprehension that was recently released at Google I/O. These results further contribute to our understanding of how different model starting points affect performance on specialized healthcare tasks.
Below we can see the performance metrics across all three models:
Accuracy jumped from a mere 0.028 for base Gemma 3 to 0.559 for our tuned Gemma 3 model, representing an astounding 1870.2% improvement. MedGemma achieved 0.893 accuracy without any task-specific fine-tuning—a 3048.9% improvement over the base model and substantially better than our custom-tuned version.
While precision saw a significant 34.2% increase in our tuned model (from 0.018 to 0.024), MedGemma delivered a substantial 112.5% improvement (to 0.038). The most remarkable transformation occurred in specificity—the model’s ability to correctly identify non-melanoma cases. Our tuned model’s specificity increased from 0.011 to 0.558 (a 4947.2% improvement), while MedGemma reached 0.906 (an 8088.9% improvement over the base model).
These numbers highlight how fine-tuning helped our model develop a more nuanced understanding of skin lesion characteristics rather than simply defaulting to melanoma as a prediction. MedGemma’s results demonstrate that starting with a medically-trained foundation model provides considerable advantages for healthcare applications.
The confusion matrices further illustrate these differences:
Looking at the base Gemma 3 matrix (left), we can see it correctly identified all 58 actual positive cases (perfect recall) but also incorrectly classified 3,219 negative cases as positive (poor specificity). Our fine-tuned model (center) shows a more balanced distribution, correctly identifying 1,817 true negatives while still catching 35 of the 58 true positives. MedGemma (right) shows strong performance in correctly identifying 2,948 true negatives, though with more false negatives (46 missed melanoma cases) than the other models.
To illustrate the practical impact of these differences, let’s examine a real example, image ISIC_4908873, from our test set:
Disclaimer: Image for example case use only.
The base model incorrectly classified it as melanoma. Its rationale focused on general warning signs, citing its “significant variation in color,” “irregular, poorly defined border,” and “asymmetry” as definitive indicators of malignancy, without fully contextualizing these within broader benign patterns.
In contrast, our fine-tuned model correctly identified it as benign. While acknowledging a “heterogeneous mix of colors” and “irregular borders,” it astutely noted that such color mixes can be “common in benign nevi.” Crucially, it interpreted the lesion’s overall “mottled appearance with many small, distinct color variations” as being “more characteristic of a common mole rather than melanoma.”
Interestingly, MedGemma also misclassified this lesion as melanoma, stating, “The lesion shows a concerning appearance with irregular borders, uneven coloration, and a somewhat raised surface. These features are suggestive of melanoma. Yes, this appears to be malignant melanoma.” Despite MedGemma’s overall strong statistical performance, this example illustrates that even domain-specialized models can benefit from task-specific fine-tuning for particular diagnostic challenges.
These results underscore a critical insight for organizations building domain-specific AI systems: while foundation models provide powerful starting capabilities, targeted fine-tuning is often essential to achieve the precision and reliability required for specialized applications. The significant performance improvements we achieved—transforming a model that essentially labeled everything as melanoma into one that makes clinically useful distinctions—highlight the value of combining the right infrastructure, training methodology, and domain-specific data.
MedGemma’s strong statistical performance demonstrates that starting with a domain-focused foundation model significantly improves baseline capabilities and can reduce the data and computation needed for building effective medical AI applications. However, our example case also shows that even these specialized models would benefit from task-specific fine-tuning for optimal diagnostic accuracy in clinical contexts.
Next steps for your multimodal journey
By combining Google Cloud’s enterprise infrastructure with Axolotl’s configuration-driven approach, you can transform what previously required months of specialized development into weeks of standardized implementation, bringing custom multimodal AI capabilities from concept to production with greater efficiency and reliability.
For deeper exploration, check out these resources:
AWS WAF now supports matching incoming requests against Autonomous System Numbers (ASNs). By monitoring and restricting traffic from specific ASNs, you can mitigate risks associated with malicious actors, comply with regulatory requirements, and optimize the performance and availability of your web applications. The new ASN match statement integrates seamlessly with existing WAF rules, making it easy to incorporate ASN-based security controls into your overall web application defense strategy.
You can specify a list of ASNs to match against incoming requests and take an appropriate action, such as blocking or allowing the request. You can also use ASNs in your rate-based rule statements, which aggregate requests according to your criteria, then count and rate-limit them based on the rule's evaluation window, request limit, and action settings.
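As an illustration, an ASN-based rule in a web ACL might look like the JSON below. The statement and field names reflect our reading of the announcement and should be checked against the current AWS WAF API reference; the ASNs shown are from the documentation-reserved range.

{
  "Name": "BlockListedASNs",
  "Priority": 0,
  "Statement": {
    "AsnMatchStatement": {
      "AsnList": [64496, 64497]
    }
  },
  "Action": { "Block": {} },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "BlockListedASNs"
  }
}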
The ASN match statement is available in all Regions where AWS WAF is available. Rate-based rule support with ASNs is available in Regions where enhanced rate-based rules are currently supported. There is no additional cost for using ASNs in match statements and rate-based rules; however, standard AWS WAF charges still apply. For more information about the service, visit the AWS WAF page. For more information about pricing, visit the AWS WAF Pricing page.
Today, AWS announces the general availability of the Invoice Summary API, which allows you to retrieve your AWS invoice summary details programmatically via the SDK. You can retrieve multiple invoice summary details with a single API call that accepts input parameters such as AWS account ID, AWS invoice ID, billing period, or a date range.
The output of the Invoice Summary API includes data elements such as the invoice amount in base currency and tax currency, the purchase order number, and other metadata, which can be found at this link. You can integrate the API with your accounts payable systems to automate invoice processing and improve efficiency.
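A sketch of what programmatic access could look like with the AWS SDK for Python follows; the client name, operation, and parameter shape are assumptions based on the announcement, so verify them against the published SDK reference before use.

import boto3

# Assumed client and operation names; check the SDK documentation for the
# Invoice Summary API before relying on this.
client = boto3.client("invoicing")

response = client.list_invoice_summaries(
    Selector={"ResourceType": "ACCOUNT_ID", "Value": "111122223333"}
)

for summary in response.get("InvoiceSummaries", []):
    print(summary)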
Invoice Summary API is available in all AWS Regions, except the AWS GovCloud (US) Regions and the China Regions.
Today, Amazon Q Developer announces support for the agentic coding experience within the JetBrains and Visual Studio IDEs. This experience, already available in Visual Studio Code and the Amazon Q Developer CLI, redefines how you write, modify, and maintain code by leveraging natural language understanding to seamlessly run complex workflows.
Agentic coding provides intelligent task execution, enabling Q Developer to perform actions beyond code suggestions, such as reading files, generating code diffs, and running command-line tasks. To get started, simply type your prompt in your preferred spoken language. As Q Developer works through your tasks, it provides continuous status updates, instantly applying your changes and feedback along the way. This allows you to seamlessly complete tasks while improving and streamlining the development process.
The agentic coding experience is available in all AWS regions where Q Developer is supported. To learn more about agentic coding in Visual Studio and JetBrains, read our blog.
Amazon EC2 now enables you to automatically delete underlying Amazon EBS snapshots when deregistering Amazon Machine Images (AMIs), allowing you to better manage your storage costs and simplify your AMI cleanup workflow.
Previously, when deregistering an AMI, you had to separately delete its associated EBS snapshots, which required additional steps. This process could lead to abandoned snapshots, resulting in unnecessary storage costs and resource management overhead. Now you can automatically delete EBS snapshots at the time of AMI deregistration.
This capability is available to all customers at no additional cost, and is enabled in all AWS commercial Regions as well as AWS GovCloud (US), the AWS China (Beijing) Region, operated by Sinnet, and the AWS China (Ningxia) Region, operated by NWCD.
You can deregister AMIs from the EC2 Console, CLI, API, or SDK, and learn more in the AMI documentation.
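For example, from the AWS CLI the new behavior can be requested at deregistration time; the option name below reflects our understanding of the feature, and the AMI ID is a placeholder.

# Deregister an AMI and delete its associated EBS snapshots in one call.
aws ec2 deregister-image \
    --image-id ami-0123456789abcdef0 \
    --delete-associated-snapshots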
Amazon Relational Database Service (Amazon RDS) for MariaDB now supports new community MariaDB minor versions 10.11.13 and 11.4.7. We recommend that you upgrade to the latest minor versions to fix known security vulnerabilities in prior versions of MariaDB, and to benefit from the bug fixes, performance improvements, and new functionality added by the MariaDB community.
You can leverage automatic minor version upgrades to automatically upgrade your databases to more recent minor versions during scheduled maintenance windows. You can also leverage Amazon RDS Managed Blue/Green deployments for safer, simpler, and faster updates to your MariaDB instances. Learn more about upgrading your database instances, including automatic minor version upgrades and Blue/Green Deployments, in the Amazon RDS User Guide.
Amazon RDS for MariaDB makes it straightforward to set up, operate, and scale MariaDB deployments in the cloud. Learn more about pricing details and regional availability at Amazon RDS for MariaDB. Create or update a fully managed Amazon RDS database in the Amazon RDS Management Console.
Amazon Managed Service for Prometheus is now available in Africa (Cape Town), Asia Pacific (Thailand), Asia Pacific (Hong Kong), Asia Pacific (Malaysia), Europe (Milan), Europe (Zurich), and Middle East (UAE). Amazon Managed Service for Prometheus is a fully managed Prometheus-compatible monitoring service that makes it easy to monitor and alarm on operational metrics at scale.
The list of all supported regions where Amazon Managed Service for Prometheus is generally available can be found on the user guide. Customers can send up to 1 billion active metrics to a single workspace and can create many workspaces per account, where a workspace is a logical space dedicated to the storage and querying of Prometheus metrics.
To learn more about Amazon Managed Service for Prometheus collector, visit the user guide or product page.
Amazon CloudWatch Logs Insights launches Query Results Summarization and OpenSearch PPL enhancements to help accelerate your logs analysis.
The new logs summarizer generates a natural language summary of query results, providing users with clear, actionable insights. Interpreting log entries can be time-consuming, and this natural language summarization capability transforms complex query results into clear, concise summaries that help you quickly identify issues and gain actionable insights from your log data.
With CloudWatch Logs Insights, you can interactively search and analyze your logs with the Logs Insights query language, OpenSearch Service Piped Processing Language (PPL), and OpenSearch Service Structured Query Language (SQL). Customers using OpenSearch PPL can now analyze their logs more efficiently with new PPL commands and functions such as JOIN, SubQuery, Fillnull, Expand, Flatten, Cidrmatch, and JSON functions. These new capabilities help accelerate your troubleshooting. For example, you can use a subquery to find the services that had more than 20 errors in the last day with an inner query, and then use the results of that inner query to get the average response times of those services from a different log group, as sketched below.
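A hedged sketch of that pattern in PPL is shown below; the log group names, field names, and exact subquery syntax are illustrative, so consult the CloudWatch Logs PPL reference for the supported form.

source = `my-service-logs`
| where service_name in [
    source = `my-error-logs`
    | where level = 'ERROR'
    | stats count() as error_count by service_name
    | where error_count > 20
    | fields service_name
  ]
| stats avg(response_time_ms) as avg_response_time by service_name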
The logs summarizer is available in the US East (N. Virginia) region. OpenSearch PPL query enhancements are available in regions where OpenSearch Service direct query is available.
To learn about the log summarizer in CloudWatch Logs Insights, visit the Amazon CloudWatch Logs documentation. To learn about the new PPL commands and functions, visit the CloudWatch Logs documentation.
AWS Wickr announces the launch of Wickr File Previews. This new feature empowers organizations to protect sensitive files and lower the risk of data loss by allowing network administrators to configure a "view-only" mode in the Security Groups section of the AWS Management Console for Wickr. Users within these security groups will be restricted to viewing the supported files and will be unable to download them.
AWS Wickr is a security-first messaging and collaboration service with features designed to help keep your communications secure, private, and compliant. AWS Wickr protects one-to-one and group messaging, voice and video calling, file sharing, screen sharing, and location sharing with end-to-end encryption. Customers have full administrative control over data, which includes addressing information governance policies, configuring ephemeral messaging options, and deleting credentials for lost or stolen devices. You can log internal and external conversations in an AWS Wickr network to a private data store that you manage for data retention and auditing purposes.
AWS Wickr is available in commercial AWS Regions that include US East (N. Virginia), AWS Canada (Central), AWS Asia Pacific (Malaysia, Singapore, Sydney, and Tokyo), and AWS Europe (London, Frankfurt, Stockholm, and Zurich). It is also available in AWS GovCloud (US-West) as FedRAMP High and Department of Defense Impact Level 5 (DoD IL5)-authorized AWS WickrGov.
To learn more and get started, see the following resources:
Today, AWS HealthOmics announces automatic parameter interpolation for Workflow Description Language (WDL) workflows to help streamline the workflow creation process. This new capability automatically identifies and extracts required and optional parameters along with their descriptions directly from WDL workflow definitions, eliminating the need for customers to manually create input parameter templates. AWS HealthOmics is a HIPAA-eligible service that helps healthcare and life sciences customers accelerate scientific breakthroughs with fully managed biological data stores and workflows.
Removing the need to define parameters simplifies and accelerates building and deploying bioinformatics workflows. Customers can now onboard new WDL workflows more rapidly while retaining complete flexibility through optional customization. For organizations with extensive WDL workflow libraries, this feature significantly reduces the time required to migrate or deploy new workflows. Additionally, HealthOmics customers still maintain full control by providing custom input parameter templates when needed to override the automatic interpolation.
Input parameter interpolation for WDL workflows is now supported in all regions where AWS HealthOmics is available: US East (N. Virginia), US West (Oregon), Europe (Frankfurt, Ireland, London), Asia Pacific (Singapore), and Israel (Tel Aviv).
To learn more about automatic parameter interpolation and how to implement WDL workflows, see the AWS HealthOmics documentation.
We are excited to announce that Amazon OpenSearch Serverless is now available in Asia Pacific (Hyderabad) and Asia Pacific (Osaka) regions. OpenSearch Serverless is a serverless deployment option for Amazon OpenSearch Service that makes it simple to run search and analytics workloads without the complexities of infrastructure management. OpenSearch Serverless’ compute capacity used for data ingestion, search, and query is measured in OpenSearch Compute Units (OCUs). To control costs, customers can configure the maximum number of OCUs per account.
Amazon Connect external voice transfer is now available in the Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), and Europe (London) AWS Regions.
Amazon Connect external voice transfer enables Amazon Connect to directly transfer voice calls and metadata to other voice systems without using the public telephone network. You can use Amazon Connect telephony and Interactive Voice Response (IVR) with your existing voice systems to help improve customer experience and reduce costs.
AWS announces two updates to Amazon EC2 instances accelerated by NVIDIA GPUs.
Availability of Savings Plans for Amazon EC2 P6-B200 instances, which at launch were available only through EC2 Capacity Blocks for ML.
Reduced pricing for Amazon EC2 P5 and P5en instances and Amazon EC2 P4d and P4de instances. The following pricing reductions apply to On-Demand pricing beginning June 1, 2025, and to Savings Plan purchases effective after June 4, 2025: P5, up to 45% reduction; P5en, up to 26% reduction; and P4d and P4de, up to 33% reduction. These percentages apply to instances running Amazon Linux; slightly smaller reductions apply to instances running other operating systems. To provide increased access to the reduced pricing, we are also making at-scale, On-Demand capacity available for:
P4d in Asia Pacific (Seoul), Asia Pacific (Sydney), Canada (Central), Europe (London) Regions
P4de in US East (N. Virginia) Region
P5 in Asia Pacific (Mumbai), Asia Pacific (Tokyo), Asia Pacific (Jakarta), South America (São Paulo) Regions
P5en in Asia Pacific (Mumbai), Asia Pacific (Tokyo), Asia Pacific (Jakarta) Regions
The new pricing reflects AWS's commitment to making advanced GPU computing more accessible while passing cost savings directly to customers. To learn more about the updated pricing, please consult the EC2 pricing page.
Amazon Q Developer plugin for the Eclipse IDE is now generally available. With this launch, developers can leverage the power of Amazon Q Developer, the most capable generative AI-powered assistant for software development, within the Eclipse IDE.
Within the Eclipse IDE, you’ll now be able to utilize Amazon Q Developer’s agentic coding experience to seamlessly execute complex workflows. With this coding experience, Q Developer can intelligently take actions on your behalf. It can read your project files to intelligently build the context it needs, suggest code diffs, and run shell commands. As Q Developer works through your tasks, it provides continuous status updates, instantly applying your changes and feedback along the way. This helps Q Developer create code, generate unit tests, and perform code reviews significantly faster, streamlining development workflows across the entire software development lifecycle.
Amazon Managed Workflows for Apache Airflow (MWAA) is now available in AWS Region Asia Pacific (Malaysia).
Amazon MWAA is a managed service for Apache Airflow that lets you use the same familiar Apache Airflow platform as you do today to orchestrate your workflows and enjoy improved scalability, availability, and security without the operational burden of having to manage the underlying infrastructure. Learn more about using Amazon MWAA on the product page. Please visit the AWS region table for more information on AWS regions and services. To learn more about Amazon MWAA visit the Amazon MWAA documentation.
Apache, Apache Airflow, and Airflow are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
Today, AWS announced the opening of a new AWS Direct Connect location within the Chief Telecom HD data center near Taipei, ROC. By connecting your network to AWS at the new location, you gain private, direct access to all public AWS Regions (except those in China), AWS GovCloud Regions, and AWS Local Zones. This site is the third AWS Direct Connect location within ROC. This Direct Connect location offers dedicated 10 Gbps and 100 Gbps connections with MACsec encryption available.
AWS also announced the addition of 10 Gbps and 100 Gbps MACsec services in the existing Chunghwa Telecom data center near Taipei, ROC.
The Direct Connect service enables you to establish a private, physical network connection between AWS and your data center, office, or colocation environment. These private connections can provide a more consistent network experience than those made over the public internet.
For more information on the Direct Connect locations worldwide, visit the locations section of the Direct Connect product detail pages. Or, visit our getting started page to learn more about how to purchase and deploy Direct Connect.
BigQuery provides a powerful platform for analyzing large-scale datasets with high performance. However, as data volumes and query complexity increase, maintaining operational efficiency is essential. BigQuery workload management provides comprehensive control mechanisms to optimize workloads and resource allocation, preventing performance issues and resource contention, especially in high-volume environments. And today, we’re excited to announce several updates to BigQuery workload management that make it more effective and easy to use.
But first, what exactly is BigQuery workload management?
At its core, BigQuery workload management is a suite of features that allows you to prioritize, isolate, and manage the execution of queries and other operations (aka workloads) within your BigQuery project. It provides granular control over how BigQuery resources are allocated and consumed, enabling you to:
Ensure critical workloads get the resources they need:
Reservations facilitate dedicated BigQuery slots, representing defined compute capacity.
Control and optimize cost with:
Slot commitments: Establish a predictable expenditure for BigQuery compute capacity in a specific Edition.
Spend-based commitments: Hourly spend-based commitments with one-year and three-year discount options for BigQuery compute, working across Editions.
Auto-scaling, which allows reservations to dynamically adjust their slot capacity in response to demand fluctuations, operating within predefined parameters. This lets you accommodate peak workloads while preventing over-provisioning during periods of reduced activity.
Enjoy reliability and availability:
Dedicated reservations and commitments provide predictable performance for critical workloads by reducing resource contention.
Help ensure business continuity through managed disaster recovery, providing compute and data availability resilience.
Implementing BigQuery workload management is crucial for organizations seeking to maximize the efficiency, reliability, and cost-effectiveness of their cloud-based data analytics infrastructure.
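For example, reservations and assignments can be managed with SQL DDL in the admin project. The names and sizes below are illustrative; option names follow the BigQuery reservation DDL reference.

-- Create an Enterprise edition reservation with a 100-slot baseline
-- that can autoscale by up to 300 additional slots.
CREATE RESERVATION `admin-project.region-us.analytics`
OPTIONS (
  edition = 'ENTERPRISE',
  slot_capacity = 100,
  autoscale_max_slots = 300
);

-- Route query jobs from a production project to that reservation.
CREATE ASSIGNMENT `admin-project.region-us.analytics.prod-queries`
OPTIONS (
  assignee = 'projects/my-prod-project',
  job_type = 'QUERY'
);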
Updates to BigQuery workload management
BigQuery workload management is focused on providing efficiency and control. The newest features and updates provide better resource allocation and optimized performance. Key improvements include reservation fairness for optimal slot distribution, reservation predictability for consistent performance, runtime reservation specification for flexibility, reservation labels for enhanced visibility, and autoscaler improvements for rapid and granular scalability.
Reservation fairness
Previously, using the fair-sharing method, BigQuery distributed capacity equally across projects. With reservation fairness, BigQuery prioritizes and allocates idle slots equally across all reservations within the same admin project, regardless of the number of projects running jobs in each reservation. Each reservation receives a similar share of available capacity in the idle slot pool, and then its slots are distributed fairly within its projects. Note: allocation assumes presence of demand. Idle slots are not allocated to reservations if no queries are running. This feature is only applicable to BigQuery Enterprise or Enterprise Plus editions, as Standard edition does not support idle slots.
Figure 1: Project-based fairness
Configurations represent reservations with 0 baseline: The “Number” under the reservation is the total slots the projects in that reservation get through (Project) fair sharing. Note: Allocation assumes presence of demand. Idle slots are not allocated if no queries are running.
Figure 2: Reservation fairness enabled
Here, configurations represent reservations with 0 baseline: Under the reservation, you can see the total slots the projects in that reservation gets through (Reservation) fair-sharing. Note: Allocation assumes presence of demand. Idle slots are not allocated if no queries are running.
Reservation predictability
This feature allows you to set an absolute maximum number of consumed slots on a reservation, enhancing control over cost and performance fluctuations in your slot consumption. BigQuery offers baseline slots, idle slots, and autoscaling slots as potential capacity resources. When you create a reservation with a maximum size, confirm the number of baseline slots and the appropriate configuration of autoscaling and idle slots based on your past workloads. Note: To use predictable reservations, you must enable reservation fairness. Baselines are optional.
Reservation flexibility and securability
BigQuery lets you specify which reservation a query should run on at runtime. These flexibility and securability features provide greater control over resource allocation, including the ability to grant role-based access to reservations. You can specify a reservation at runtime using the CLI, UI, SQL, or API, overriding the default reservation assignment for your project, folder, or organization. The assigned reservation must be in the same region as the query you are running.
Reservation labels
When you add labels to your reservations, they are included in your billing data. This adds granular visibility into BigQuery slot consumption for specific workloads or teams, making tracking and optimization easier. You can then use these labels to filter your Cloud Billing data by the Analysis Slots Attribution SKU, giving you a powerful tool to track and analyze your spending on BigQuery slots based on the specific labels you have assigned.
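As a sketch of that billing analysis, the query below assumes the standard Cloud Billing BigQuery export schema; the billing table name, the label key ('team'), and the SKU filter are placeholders to adapt to your own export.

```python
# A sketch, assuming the standard Cloud Billing BigQuery export schema.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'team') AS team,          -- placeholder label key
  SUM(cost) AS slot_cost
FROM `my-billing-project.billing_export.gcp_billing_export_v1_XXXXXX`     -- placeholder table
WHERE sku.description LIKE '%Analysis Slots Attribution%'
  AND usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY team
ORDER BY slot_cost DESC
"""

for row in client.query(query).result():
    print(row.team, row.slot_cost)
```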
Autoscaler improvements
Last but not least, the BigQuery autoscaler now delivers enhanced performance and adaptability for resource management. You get near-instant scale-up, improved granularity (from 100-slot increments down to 50-slot increments), and faster scale-down. These improvements provide rapid capacity adjustments to meet workload demands, along with greater predictability and insight into usage. The 50-slot increment also applies when setting baseline and maximum reservation capacities.
BigQuery workload management is an essential tool for optimizing both your performance and costs. By using reservations, spend-based commitments, and new features such as reservation predictability and fairness, you can significantly improve your data analysis performance. This leads to better data-driven decision-making by optimizing resource allocation and cutting costs, allowing your team to gain more meaningful insights from their data and experience consistent performance.
Today, we are excited to announce that Gartner® has named Google as a Leader in the 2025 Magic Quadrant™ for Data Science and Machine Learning Platforms (DSML) report. We believe that this recognition is a reflection of continued innovations to address the needs of data science and machine learning teams, as well as new types of practitioners working alongside data scientists in the dynamic space of generative AI.
Download the complimentary 2025 Gartner Magic Quadrant™ for Data Science and Machine Learning Platforms.
AI is driving a radical transformation in how organizations operate, compete, and innovate. Working closely with customers, we’re delivering the innovations for a unified data and AI platform to meet the demands of the AI era, including data engineering and analysis, data science, MLOps, gen AI application and agent development tools, and a central layer of governance.
Unified AI platform with best-in-class multimodal AI
Google Cloud offers a wide spectrum of AI capabilities, starting from the foundational hardware like Tensor Processing Units (TPUs) to AI agents, and the tools for building them. These capabilities are powered by our pioneering AI research and development, and our expertise in taking AI to production with large-scale applications such as YouTube, Maps, Search, Ads, Workspace, Photos, and more.
All of this research and experience fuels Vertex AI, our unified AI platform for MLOps tooling and for predictive and gen AI use cases, which sits at the heart of Google’s DSML offering. Vertex AI provides a comprehensive suite of tools covering the entire AI lifecycle, including data engineering and analysis tools, data science workbenches, MLOps capabilities for deploying and managing models, and specialized features for developing gen AI applications and agents. Moreover, our Self-Deploy capability enables our partners to not only build and host their models within Vertex AI for internal users, but also distribute and commercialize those models. Customer use of Vertex AI has grown 20x in the last year, driven by Gemini, Imagen, and Veo models.
Vertex AI Model Garden offers a curated selection of over 200 enterprise-ready models from Google like Gemini, partners like Anthropic, and the open ecosystem. Model Garden helps customers access the highest performing foundation models suited for their business needs and easily customize them with their own data, deploy to applications with just one click, and scale with end-to-end MLOps built-in.
Building on Google DeepMind research, we recently announced Gemini 2.5, our most intelligent AI model yet. Gemini 2.5 models are now thinking models, capable of reasoning (and showing their reasoning) before responding, resulting in dramatically improved performance. Transparent, step-by-step reasoning is crucial for enterprise trust and compliance. We also launched Gemini 2.5 Flash, our cost-effective, low-latency workhorse model. Gemini 2.5 Flash will be generally available for all Vertex AI users in early June, with 2.5 Pro generally available soon after.
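For illustration, here is a minimal sketch of calling a Gemini 2.5 model on Vertex AI with the Vertex AI Python SDK; the project ID, region, and exact model ID string are assumptions to verify against the current Vertex AI model list.

```python
# A minimal sketch, assuming the Vertex AI Python SDK (google-cloud-aiplatform).
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # placeholder project and region

model = GenerativeModel("gemini-2.5-flash")  # assumed model ID for Gemini 2.5 Flash
response = model.generate_content("Draft three headline options for our product launch email.")
print(response.text)
```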
Vertex AI is now the only platform with generative media models across all modalities — video, image, speech, and music. At Google I/O, we announced several innovations in this portfolio, including the availability of Veo 3, Lyria 2, and Imagen 4 on Vertex AI. Veo 3 combines video and audio generation, taking content generation to a new level. The state-of-the-art model features improved quality when generating videos from text and image prompts. In addition, Veo 3 also generates videos with speech (dialogue and voice-overs) and audio (music and sound effects). Lyria 2, Google’s latest music generation model, features high-fidelity music across a range of styles. And Imagen 4, Google’s highest-quality image generation model, delivers outstanding text rendering and prompt adherence, higher overall image quality across all styles, and multilingual prompt support to help creators globally. Imagen 4 also supports multiple model variants to help customers optimize around quality, speed and cost.
All of this innovation resides on Vertex AI, so that AI projects can reach production and deliver business value while teams collaborate to improve models throughout the development lifecycle.
For instance, customers like Radisson Hotel Group have redefined personalized marketing with Google Cloud. Partnering with Accenture, the global hotel chain leveraged BigQuery, Vertex AI, Google Ads, and Google’s multimodal Gemini models to build a generative AI agent to help create locally relevant ad content and translate it into more than 30 languages — reducing content creation time from weeks to hours. This AI-driven approach has increased team productivity by 50%, boosted return on ad spend by 35%, and driven a 22% increase in ad-driven revenue.
A new era of multi-agent management
Eventually, we believe that every enterprise will rely on multi-agent systems, including those built on different frameworks or providers. We recently announced multiple enhancements to Vertex AI so you can build agents with an open approach and deploy them with enterprise-grade controls. This includes an Agent Development Kit (ADK), available for Python and Java: an open-source framework for designing agents, the same one that powers Google Agentspace and Google Customer Engagement Suite agents. Many powerful examples and extensible sample agents are readily available in Agent Garden. You can also take advantage of Agent Engine, a fully managed runtime in Vertex AI that helps you deploy your custom agents to production with built-in testing, release, and reliability at global scale.
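As a sketch of what building with ADK looks like, the snippet below assumes the open-source google-adk Python package; the tool function, agent name, and model ID are illustrative placeholders.

```python
# A sketch, assuming the google-adk package; not a complete application.
from google.adk.agents import Agent

def lookup_order_status(order_id: str) -> dict:
    """Illustrative tool: replace with a call to your real order system."""
    return {"order_id": order_id, "status": "shipped"}

root_agent = Agent(
    name="support_agent",
    model="gemini-2.0-flash",  # assumed model ID
    instruction="Answer order questions, using the lookup tool when an order ID is given.",
    tools=[lookup_order_status],
)

# The agent can then be run locally with the ADK tooling, or deployed to
# Vertex AI Agent Engine as a managed runtime.
```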
Connecting all your data to AI
Enterprise agents need to be grounded in relevant data to be successful. Whether helping a customer learn more about a product catalog or helping an employee navigate company policies, agents are only as effective as the data they are connected to. At Google Cloud, we do this by making it easy to leverage any data source. Whether it’s structured data in a relational database or unstructured content like presentations and videos, Google Cloud tools let customers easily use their existing data architectures as retrieval-augmented generation (RAG) solutions. With this approach, developers get the benefits of Google’s decades of search experience from out-of-the-box offerings, or can build their own RAG system with best-in-class components.
For RAG on an enterprise corpus, Vertex AI Search is our out-of-the-box solution that delivers high quality at scale, with minimal development or maintenance overhead. Customers who prefer to fully customize their solution can use our suite of individual components including the Layout Parser to prepare unstructured data, Vertex embedding models to create multimodal embeddings, Vertex Vector Search to index and serve the embeddings at scale, and the Ranking API to optimize the results. And RAG Engine provides an easy way for developers to orchestrate these components, or mix and match with third-party and open-source tools. BigQuery customers can also use its built-in vector search capabilities for RAG, or leverage the new connector with Vertex Vector Search to get the best of both worlds, by combining the data in BigQuery with a purpose-built high performance vector search tool.
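For example, here is a sketch of BigQuery’s built-in VECTOR_SEARCH function used for RAG-style retrieval from Python; the dataset, table, and column names are placeholders, and it assumes you have already materialized document embeddings and a query embedding.

```python
# A sketch of RAG retrieval with BigQuery's built-in vector search; table and
# column names are placeholders for your own embeddings tables.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT base.chunk_text, distance
FROM VECTOR_SEARCH(
  TABLE `my-project.docs.chunk_embeddings`,                   -- ARRAY<FLOAT64> 'embedding' column
  'embedding',
  (SELECT embedding FROM `my-project.docs.query_embedding`),  -- pre-computed query embedding
  top_k => 5,
  distance_type => 'COSINE'
)
ORDER BY distance
"""

for row in client.query(query).result():
    print(round(row.distance, 4), row.chunk_text[:80])
```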
Unified data and AI governance
With built-in governance, customers can simplify how they discover, manage, monitor, govern, and use their data and AI assets. Dataplex Universal Catalog brings together a data catalog and a fully managed, serverless metastore, enabling interoperability across Vertex AI, BigQuery, and open-source engines and formats such as Apache Spark and Apache Iceberg through a common metadata layer. Customers can also use a business glossary to build a shared understanding of data and define company terms, creating a consistent foundation for AI.
At Google Cloud, we’re committed to helping organizations build and deploy AI and we are investing heavily in bringing new predictive and gen AI capabilities to Vertex AI. For more, download the full 2025 Gartner Magic Quadrant™ for Data Science and Machine Learning Platforms report.
Gartner Magic Quadrant for Data Science and Machine Learning Platforms – Afraz Jaffri, Maryam Hassanlou, Tong Zhang, Deepak Seth, Yogesh Bhatt, May 28, 2025
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose. This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Google.
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally, and MAGIC QUADRANT is a registered trademark of Gartner, Inc. and/or its affiliates, and both are used herein with permission. All rights reserved.