GCP – Announcing smaller machine types for A3 High VMs
Today, an increasing number of organizations are using GPUs to run inference¹ on their AI/ML models. Since the number of GPUs needed to serve a single inference workload varies, organizations need more granularity in the number of GPUs in their virtual machines (VMs) to keep costs low while scaling with user demand.
A3 High VMs powered by NVIDIA H100 80GB GPUs are now generally available in four machine types: 1 (new), 2 (new), 4 (new), and 8 GPUs.
Accessing smaller H100 machine types
All A3 machine types are available through the fully managed Vertex AI, as nodes through Google Kubernetes Engine (GKE), and as VMs through Google Compute Engine.
The 1, 2, and 4 A3 High GPU machine types are available as Spot VMs and through Dynamic Workload Scheduler (DWS) Flex Start mode.
A3 VM portfolio powered by NVIDIA H100 GPUs

| Machine type (GPU count, GPU memory) | Vertex AI | Google Kubernetes Engine | Google Compute Engine |
| --- | --- | --- | --- |
| a3-highgpu-1g NEW (1 GPU, 80 GB) | ✓ᵃ | ✓ | ✓ |
| a3-highgpu-2g NEW (2 GPUs, 160 GB) | ✓ᵃ | ✓ | ✓ |
| a3-highgpu-4g NEW (4 GPUs, 320 GB) | ✓ᵃ | ✓ | ✓ |
| a3-highgpu-8g (8 GPUs, 640 GB) | ✓ | ✓ | ✓ |
| a3-megagpu-8g (8 GPUs, 640 GB) | ✓ | ✓ | ✓ |

ᵃ Available only through Model Garden owned capacity.
Google Kubernetes Engine
For almost a decade, GKE has been the platform of choice for running web applications and microservices, and now it provides a cost-efficient, highly scalable, and open platform for training and serving AI workloads. GKE Autopilot reduces operational cost and offers workload-level SLAs, making it a fantastic choice for inference workloads: bring your workload and let Google do the rest. You can use the 1, 2, and 4 A3 High GPU machine types through both the GKE Standard and GKE Autopilot modes of operation.
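On Autopilot, there are no node pools to manage; you request the accelerator directly in the pod spec and GKE provisions matching A3 High capacity for you. Here is a minimal sketch (the pod name and image are placeholders), assuming the standard cloud.google.com/gke-accelerator node selector:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: h100-inference        # placeholder name
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-h100-80gb
  containers:
  - name: inference
    image: IMAGE              # your serving image
    resources:
      limits:
        nvidia.com/gpu: 1     # one H100, matching the new 1-GPU shape
EOF

Because you request only the GPU count your workload needs, Autopilot can place the pod on the smallest matching A3 High shape.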
On GKE Standard, below are two examples of creating node pools in your cluster with the a3-highgpu-1g machine type, using Spot VMs and Dynamic Workload Scheduler Flex Start mode.
Using Spot VMs with GKE
Here’s how to request and deploy an a3-highgpu-1g Spot VM node pool on GKE using the gcloud CLI.
gcloud container node-pools create NODEPOOL_NAME \
  --cluster CLUSTER_NAME \
  --region CLUSTER_REGION \
  --node-locations GPU_ZONE1,GPU_ZONE2 \
  --machine-type a3-highgpu-1g \
  --accelerator type=nvidia-h100-80gb,count=1,gpu-driver-version=latest \
  --image-type COS_CONTAINERD \
  --spot
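Once the node pool is created, a quick way to confirm that the H100 nodes have joined the cluster is to filter on the accelerator label that GKE applies to GPU nodes:

kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-h100-80gb

GKE also labels Spot capacity with cloud.google.com/gke-spot=true, which you can use in node selectors to steer interruption-tolerant pods onto these nodes.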
Using Dynamic Workload Scheduler Flex Start mode with GKE
Here’s how to request a3-highgpu-1g capacity using Dynamic Workload Scheduler Flex Start mode with GKE.
gcloud beta container node-pools create NODEPOOL_NAME \
  --cluster CLUSTER_NAME \
  --region CLUSTER_REGION \
  --node-locations GPU_ZONE1,GPU_ZONE2 \
  --enable-queued-provisioning \
  --machine-type=a3-highgpu-1g \
  --accelerator type=nvidia-h100-80gb,count=1,gpu-driver-version=latest \
  --enable-autoscaling \
  --num-nodes=0 \
  --total-max-nodes TOTAL_MAX_NODES \
  --location-policy=ANY \
  --reservation-affinity=none \
  --no-enable-autorepair
This creates a GKE node pool with Dynamic Workload Scheduler enabled that initially contains zero nodes, so you aren't billed for GPU nodes until capacity is provisioned. You can then run your workloads with Dynamic Workload Scheduler.
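Workloads then obtain capacity through the ProvisioningRequest API that queued provisioning is built on (Kueue can create these objects for you automatically). Below is a minimal sketch under that assumption, with placeholder object names and image:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PodTemplate
metadata:
  name: a3-1g-template        # placeholder name
template:
  spec:
    nodeSelector:
      cloud.google.com/gke-nodepool: NODEPOOL_NAME
    containers:
    - name: inference
      image: IMAGE            # your serving image
      resources:
        limits:
          nvidia.com/gpu: 1   # one H100 per a3-highgpu-1g node
---
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: ProvisioningRequest
metadata:
  name: a3-1g-request         # placeholder name
spec:
  provisioningClassName: queued-provisioning.gke.io
  podSets:
  - count: 1
    podTemplateRef:
      name: a3-1g-template
EOF

When Dynamic Workload Scheduler provisions the request, pods that reference it through the cluster autoscaler's consume-provisioning-request annotation are scheduled onto the newly created nodes.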
Vertex AI
Vertex AI is Google Cloud’s fully managed, unified AI development platform for building and using predictive and generative AI. With the new 1, 2, and 4 A3 High GPU machine types, Model Garden customers can deploy hundreds of open models cost-effectively and with strong performance.
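For illustration, here is a hedged sketch of deploying a model to a Vertex AI endpoint on one of the new shapes with the gcloud CLI. The endpoint, model, and region values are placeholders, the H100 accelerator string is an assumption based on the Compute Engine naming, and, per the table above, the new shapes surface on Vertex AI through Model Garden owned capacity:

gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=REGION \
  --model=MODEL_ID \
  --display-name=DEPLOYMENT_NAME \
  --machine-type=a3-highgpu-1g \
  --accelerator=type=nvidia-h100-80gb,count=1 \
  --min-replica-count=1 \
  --max-replica-count=2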
What our customers are saying
“We use Google Kubernetes Engine to run the backend for our AI-assisted software development product. Smaller A3 machine types have enabled us to reduce the latency of our real-time code assist models by 36% compared to A2 machine types, significantly improving user experience.” – Eran Dvey Aharon, VP R&D, Tabnine
Get started today
At Google Cloud, our goal is to provide you with the flexibility you need to run inference for your AI and ML models cost-effectively and with great performance. The availability of A3 High VMs with NVIDIA H100 80GB GPUs in smaller machine types gives you the granularity you need to scale with user demand while keeping costs in check.
1. AI or ML inference is the process by which a trained AI model applies what it learned during training to generate outputs or make predictions for new data points or scenarios.