GCP – The global endpoint offers improved availability for Anthropic’s Claude on Vertex AI
Anthropic’s Claude models on Vertex AI now offer improved overall availability through the global endpoint, which is now generally available. The global endpoint dynamically routes your requests to any supported region with available capacity for the Claude model you’re using, helping you deploy Claude-powered applications and agents with more uptime and dependability.
During the preview period, customers like Replicate experienced firsthand the benefits of the global endpoint. Zeke Sikelianos, founding designer at Replicate, noted: “people use Replicate because they want to deploy AI models at scale. Claude on Vertex AI fits perfectly with that — we get one of the best language models available, with Google’s solid infrastructure and the global endpoint that delivers fast responses worldwide. It just works.”
The global endpoint is launching with support for pay-as-you-go traffic for the following Claude models:
- Claude Opus 4
- Claude Sonnet 4
- Claude Sonnet 3.7
- Claude Sonnet 3.5 v2
What are global endpoints and when should you use them?
When you send a request to Anthropic’s Claude models on Vertex AI, you typically specify a region (e.g., us-central1). This is a regional endpoint, which keeps your data and processing within that geographical boundary—ideal for applications with strict data residency requirements.
The global endpoint, by contrast, does not tie your request to a single region. Instead, it directs traffic to a global entry point that dynamically routes your request to a region with available capacity. This multi-region approach is designed to maximize availability and reduce errors that can arise from high traffic in a given region.
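As a sketch of the difference, the two endpoint styles differ only in the host and the `locations/` path segment. The project ID and model ID below are placeholders chosen for illustration, not values from this post:

```python
# Sketch: constructing regional vs. global endpoint URLs for Claude on
# Vertex AI. PROJECT_ID and MODEL are placeholders for illustration.
PROJECT_ID = "my-project"
MODEL = "claude-sonnet-4"  # assumed model ID; check the Model Garden for yours

def regional_endpoint(region: str) -> str:
    # Regional: the region appears in both the host and the path,
    # keeping processing within that geographical boundary.
    return (f"https://{region}-aiplatform.googleapis.com/v1/"
            f"projects/{PROJECT_ID}/locations/{region}/"
            f"publishers/anthropic/models/{MODEL}:rawPredict")

def global_endpoint() -> str:
    # Global: no region in the host, and "global" as the location,
    # letting Vertex AI route the request to a region with capacity.
    return (f"https://aiplatform.googleapis.com/v1/"
            f"projects/{PROJECT_ID}/locations/global/"
            f"publishers/anthropic/models/{MODEL}:rawPredict")

print(regional_endpoint("us-central1"))
print(global_endpoint())
```

Note that only the location changes; the request body and authentication are the same in both cases.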
So, when is the global endpoint the right choice?
- Your application requires the highest possible availability and your data is not subject to residency restrictions.
- Your services are facing regional capacity limits, or you are architecting for maximum resilience against regional disruptions.
However, if you have data residency requirements (specifically for ML processing), you should continue to use regional endpoints, as the global endpoint does not guarantee that requests will be processed in any specific location. Here’s a simple breakdown of global versus regional endpoints:
Global versus regional endpoint
|  | Global endpoint | Regional endpoint |
| --- | --- | --- |
| Availability | Maximized by leveraging multi-region resources | Dependent on single-region capacity and quota |
| Latency | May be higher in some cases due to dynamic global routing | Optimized for low latency within the specified region |
| Quota | Uses a separate, independent global quota | Uses the quota assigned to the specific region |
| Use case | High-availability applications without data residency needs | Applications with strict data residency requirements |
| Traffic type | Pay-as-you-go | Pay-as-you-go & Provisioned Throughput (PT) |
By giving you the choice between global and regional endpoints, Vertex AI empowers you to build more sophisticated, resilient, and scalable generative AI applications and agents that meet your specific architectural and business needs.
Prompt caching and pay-as-you-go pricing
As part of this launch, prompt caching is fully supported with global endpoints. When a prompt is cached, subsequent identical requests will be routed to the region holding the cache for the lowest latency. If that region is at capacity, the system will automatically try the next available region to serve the request. This integration ensures that users of global endpoints still receive the benefits of prompt caching (lower latency and lower costs).
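For illustration, prompt caching on Vertex AI uses the same `cache_control` request fields as the Anthropic Messages API. A minimal request body might look like the following sketch; the system text is a placeholder, and `vertex-2023-10-16` is the Anthropic API version string used for Vertex requests:

```python
import json

# Sketch of a Messages-style request body with prompt caching enabled.
# The "cache_control" marker asks the service to cache the prompt prefix
# up to and including that block, so later requests with an identical
# prefix can be served from the cache (lower latency and cost).
body = {
    "anthropic_version": "vertex-2023-10-16",
    "max_tokens": 256,
    "system": [
        {
            "type": "text",
            "text": "You are a support assistant. <long shared context>",  # placeholder
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    "messages": [
        {"role": "user", "content": "Summarize the context above."}
    ],
}
print(json.dumps(body, indent=2))
```

With the global endpoint, requests carrying the same cached prefix are routed to the region holding the cache when it has capacity, as described above.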
Note that at this point, the global endpoint for Claude models supports pay-as-you-go traffic only. Provisioned Throughput (PT) is available on regional endpoints only.
Global endpoint requests are charged the same price as regional endpoint requests.
Best practices
To get the most out of this new feature, we recommend routing your primary traffic to the global endpoint. Use regional endpoints as a secondary option, specifically for workloads that must adhere to data residency rules. To ensure the best performance and avoid unnecessary cost, please do not submit the same request to both a global and a regional endpoint simultaneously.
A new, separate global quota is available for this feature. You can view and manage this quota on the “Quotas & System Limits” page in the Google Cloud console and request an increase if needed.
How to get started
To get started with the global endpoint for Anthropic’s Claude models on Vertex AI, there are only two steps:
Step 1: Select and enable a global endpoint supported Claude model on Vertex AI (Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.5 v2).
Step 2: In the configuration, set “GLOBAL” as the location variable value, and use the global endpoint URL:

```
https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/PUBLISHER_NAME/models/MODEL_NAME
```
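As a sketch, the pieces of a call to the global endpoint can be assembled like this. The project ID, model ID, and access token are placeholders; in practice the token would come from `gcloud auth print-access-token` or a Google auth library, and nothing is sent over the network here:

```python
import json

def build_global_request(project_id: str, model: str, prompt: str,
                         access_token: str):
    # Assemble the URL, headers, and body for a rawPredict call to the
    # global endpoint. Pass the result to your HTTP client of choice.
    url = (f"https://aiplatform.googleapis.com/v1/"
           f"projects/{project_id}/locations/global/"
           f"publishers/anthropic/models/{model}:rawPredict")
    headers = {
        "Authorization": f"Bearer {access_token}",  # placeholder token
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "anthropic_version": "vertex-2023-10-16",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, headers, body

url, headers, body = build_global_request(
    "my-project", "claude-sonnet-4", "Hello!", "PLACEHOLDER_TOKEN")
print(url)
```

The same function with a regional URL template would produce a regional call; only the location segment differs.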
Please visit our documentation for detailed instructions and start building directly in the Vertex AI console.