Announcing new Vertex AI Prediction Dedicated Endpoints
For AI developers building cutting-edge applications with large model sizes, a reliable foundation is non-negotiable. You need your AI to perform consistently, delivering results without hiccups, even under pressure. This means having dedicated resources that won't get bogged down by other users' activity. While existing Vertex AI Prediction Endpoints – managed pools of resources to deploy AI models for online inference – provide a capable serving solution, developers need stronger guarantees of consistent performance and resource isolation when shared resources come under contention.
Today, we are pleased to announce Vertex AI Prediction Dedicated Endpoints, a new family of Vertex AI Prediction endpoints designed to address the needs of modern AI applications, including those built on large-scale generative AI models.
Dedicated endpoints architected for generative AI and large models
Serving generative AI and other large-scale models introduces unique challenges related to payload size, inference time, interactivity, and performance demands. The new Vertex AI Prediction Dedicated Endpoints have been specifically engineered to help you build more reliably with the following new integrated features:
- Native support for streaming inference: Essential for interactive applications like chatbots or real-time content generation, Vertex AI Endpoints now provide native support for streaming, simplifying development and architecture, via the following APIs:
  - streamRawPredict: Use this dedicated API method for bidirectional streaming to send prompts and receive sequences of responses (e.g., tokens) as they become available (see the first sketch after this list).
  - OpenAI Chat Completion: To facilitate interoperability and ease migration, endpoints serving compatible models can optionally expose an interface conforming to the widely used OpenAI Chat Completions streaming API standard (see the second sketch after this list).
- gRPC protocol support: For latency-sensitive applications or high-throughput scenarios often encountered with large models, endpoints now natively support gRPC. Leveraging HTTP/2 and Protocol Buffers, gRPC can offer performance advantages over standard REST/HTTP (a combined gRPC and timeout sketch follows this list).
- Customizable request timeouts: Large models can have significantly longer inference times. We now provide the flexibility, via API, to configure custom timeouts for prediction requests, accommodating a wider range of model processing durations beyond the default settings (see the gRPC sketch below).
- Optimized resource handling: The underlying infrastructure is designed to better handle the resource demands (CPU/GPU, memory, network bandwidth) of large models, contributing to overall stability and performance, especially when paired with Private Endpoints.
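As a concrete illustration, here is a minimal sketch of consuming streamRawPredict over a dedicated endpoint's DNS name with plain HTTPS. The endpoint ID, project values, and request payload are placeholders, and the exact body schema depends on the model server deployed behind the endpoint.

```python
# Minimal sketch: streaming responses from a Dedicated Endpoint via streamRawPredict.
# All IDs below are placeholders; the payload format is model-server specific.
import requests
import google.auth
import google.auth.transport.requests

# Obtain an access token via Application Default Credentials.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

ENDPOINT_ID = "1234567890"       # placeholder
PROJECT_NUMBER = "000000000000"  # placeholder
PROJECT_ID = "my-project"        # placeholder
REGION = "us-central1"           # placeholder

# Dedicated Endpoints are reachable through their own DNS name of this shape.
url = (
    f"https://{ENDPOINT_ID}.{REGION}-{PROJECT_NUMBER}.prediction.vertexai.goog"
    f"/v1/projects/{PROJECT_ID}/locations/{REGION}"
    f"/endpoints/{ENDPOINT_ID}:streamRawPredict"
)

# Assumption: the deployed model server accepts a JSON body with a prompt
# and a streaming flag. Adjust to your model server's schema.
payload = {"prompt": "Tell me about Vertex AI.", "stream": True}

with requests.post(
    url,
    headers={
        "Authorization": f"Bearer {credentials.token}",
        "Content-Type": "application/json",
    },
    json=payload,
    stream=True,  # consume the response incrementally as chunks arrive
) as response:
    response.raise_for_status()
    for chunk in response.iter_lines():
        if chunk:
            print(chunk.decode("utf-8"))
```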
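For the OpenAI-compatible surface, a sketch like the following might apply, using the openai Python client with a Google Cloud access token standing in for an API key. The base_url shape and model name are placeholders; check your deployment's documentation for the exact path it exposes.

```python
# Minimal sketch: streaming chat completions against a Dedicated Endpoint's
# OpenAI-compatible interface. base_url and model are placeholders.
import google.auth
import google.auth.transport.requests
from openai import OpenAI

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

client = OpenAI(
    # Placeholder: the dedicated endpoint's OpenAI-compatible base URL.
    base_url="https://ENDPOINT_ID.REGION-PROJECT_NUMBER.prediction.vertexai.goog/v1beta1",
    api_key=credentials.token,  # a GCP access token instead of an OpenAI key
)

# stream=True yields chunks as the model produces tokens.
stream = client.chat.completions.create(
    model="my-deployed-model",  # placeholder model identifier
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```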
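And for gRPC with a longer deadline, a sketch using the google-cloud-aiplatform GAPIC client (whose default transport is gRPC) might look like this. The `timeout` kwarg shown here is the standard client-side per-request deadline; the endpoint values and instance schema are placeholders.

```python
# Minimal sketch: a gRPC prediction with a raised client-side timeout,
# for long-running large-model inferences. IDs are placeholders.
from google.cloud import aiplatform_v1
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

PROJECT_ID = "my-project"    # placeholder
REGION = "us-central1"       # placeholder
ENDPOINT_ID = "1234567890"   # placeholder

# PredictionServiceClient speaks gRPC by default.
client = aiplatform_v1.PredictionServiceClient(
    client_options={"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}
)

endpoint = client.endpoint_path(PROJECT_ID, REGION, ENDPOINT_ID)

# The instance schema is model-specific; this assumes a simple JSON instance.
instance = json_format.ParseDict({"prompt": "Tell me about Vertex AI."}, Value())

response = client.predict(
    endpoint=endpoint,
    instances=[instance],
    timeout=600.0,  # allow up to 10 minutes instead of the default deadline
)
print(response.predictions)
```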
The newly integrated capabilities of Vertex AI Prediction Dedicated Endpoints offer a unified, robust serving solution tailored for demanding modern AI workloads. Starting today, Vertex AI Model Garden uses Vertex AI Prediction Dedicated Endpoints as the standard serving method for self-deployed models.
Optimized networking via Private Service Connect (PSC)
While public Dedicated Endpoints remain available for models accessible over the public internet, we are also enhancing networking options with Dedicated Endpoints that use Google Cloud Private Service Connect (PSC). The new private Dedicated Endpoints (via PSC) provide a secure, performance-optimized path for prediction requests. Because PSC routes traffic entirely within Google Cloud's network, it offers significant benefits:
- Enhanced security: Requests originate from within your Virtual Private Cloud (VPC) network, eliminating public internet exposure for the endpoint.
- Improved performance consistency: Bypassing the public internet reduces latency variability.
- Reduced performance interference: PSC facilitates better network traffic isolation, mitigating potential "noisy neighbor" effects and leading to more predictable performance, especially for demanding workloads.
For production workloads with strict security requirements and predictable latency, Private Endpoints using Private Service Connect are the recommended configuration.
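As a minimal sketch, creating a PSC-based private endpoint with the Vertex AI Python SDK might look like the following, assuming the PrivateServiceConnectConfig surface exposed by the SDK; project and allowlist values are placeholders.

```python
# Minimal sketch: creating a PSC-based private endpoint. Assumes the SDK's
# PrivateEndpoint/PrivateServiceConnectConfig surface; values are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

# Only projects in the allowlist can establish PSC connections to the endpoint.
psc_endpoint = aiplatform.PrivateEndpoint.create(
    display_name="my-psc-endpoint",
    private_service_connect_config=aiplatform.PrivateEndpoint.PrivateServiceConnectConfig(
        project_allowlist=["my-project"],
    ),
)
print(psc_endpoint.resource_name)
```

Once a model is deployed to it, the endpoint is reached from inside your VPC through a Private Service Connect connection that targets the endpoint's service attachment, so prediction traffic never traverses the public internet.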
How Sojern is using the new Vertex AI Prediction Dedicated Endpoints to serve models at scale
Sojern is a marketing company focused on the hospitality industry, matching potential customers to travel businesses around the globe. As part of their growth plans, Sojern turned to Vertex AI. Leaving their self-managed ML stack behind lets Sojern focus more on innovation while scaling far beyond their historical footprint.
Given the nature of Sojern's business, their ML deployments follow a unique pattern: several high-throughput endpoints must be available and agile at all times, allowing for constant model evolution. Using Public Endpoints would cause rate limiting and ultimately degrade the user experience, while moving to a Shared VPC model would have required a major design change for existing consumers of the models.
With Private Service Connect (PSC) and Dedicated Endpoints, Sojern avoided hitting the quotas and limits enforced on Public Endpoints, while also avoiding a network redesign to accommodate Shared VPC.
The ability to quickly promote tested models, take advantage of Dedicated Endpoints' enhanced feature set, and improve latency for their customers aligned strongly with Sojern's goals. The Sojern team continues to onboard new models, continually improving accuracy and customer satisfaction, powered by Private Service Connect and Dedicated Endpoints.
Get started
Are you struggling to scale your prediction workloads on Vertex AI? Check out the resources below to start using the new Vertex AI Prediction Dedicated Endpoints:
- Documentation
- GitHub samples
Your experience and feedback are important as we continue to evolve Vertex AI. We encourage you to explore these new endpoint capabilities and share your insights through the Google Cloud community forum.