We are excited to announce that AWS Deadline Cloud is now available in Asia Pacific (Seoul) and Europe (London). Deadline Cloud is a fully managed service that simplifies render management for teams creating computer-generated graphics and visual effects for films, television, broadcasting, web content, and design. Customers can now use Deadline Cloud to scale their render farms in regions that are close to their creative teams, enabling better integration with existing AWS services and creative pipelines.
Deadline Cloud is now available in 10 AWS Regions worldwide: US East (N. Virginia and Ohio), US West (Oregon), Asia Pacific (Seoul, Singapore, Sydney, and Tokyo), and Europe (Frankfurt, Ireland, and London). For details on where Deadline Cloud is available, see the AWS Region table or the AWS Regional Services List. To learn more, visit the AWS Deadline Cloud product page.
Today, AWS announces enhanced scanning capabilities for GuardDuty Malware Protection for Amazon S3. This launch raises the maximum file size limit from 5 GB to 100 GB and expands archive processing capacity to handle up to 10,000 files per archive, up from the previous limit of 1,000 files.
GuardDuty Malware Protection for S3 is a fully managed threat detection service that automatically scans objects uploaded to S3 buckets and alerts customers of malware, viruses, and other malicious code before they can impact workloads or downstream processes. With this launch, GuardDuty S3 malware scanning now offers customers even better protection for large files and comprehensive archive collections stored in Amazon S3.
The enhanced scanning capabilities are automatically enabled in all AWS Regions where GuardDuty Malware Protection for S3 is supported. To learn more about GuardDuty Malware Protection for S3 and its features, please visit the AWS Documentation.
Amazon Relational Database Service (RDS) Proxy now supports end-to-end IAM authentication for connections to Amazon Aurora and RDS database instances. This feature allows you to connect from your applications to your databases through RDS Proxy using AWS Identity and Access Management (IAM) authentication. End-to-end IAM authentication simplifies credential management, reduces credential rotation overhead, and enables you to leverage IAM’s robust authentication and authorization capabilities throughout your database connection path.
With end-to-end IAM authentication, you can now connect to your databases through RDS Proxy without needing to register or store credentials in Secrets Manager. End-to-end IAM authentication is available for MySQL and PostgreSQL database engines in all AWS Regions where RDS Proxy is supported.
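To illustrate what this looks like from application code, here is a minimal sketch (assuming a PostgreSQL target and placeholder endpoint, user, and Region values) that generates an IAM authentication token with boto3 and uses it as the password when connecting through the proxy endpoint; see the RDS Proxy documentation for the required IAM policy and TLS settings.

import boto3
import psycopg2

# Hypothetical values -- replace with your proxy endpoint, database user, and Region.
PROXY_ENDPOINT = "my-proxy.proxy-abc123xyz.us-east-1.rds.amazonaws.com"
DB_USER = "app_user"
REGION = "us-east-1"

# Generate a short-lived IAM authentication token for the proxy endpoint.
rds = boto3.client("rds", region_name=REGION)
token = rds.generate_db_auth_token(
    DBHostname=PROXY_ENDPOINT,
    Port=5432,
    DBUsername=DB_USER,
    Region=REGION,
)

# Connect through RDS Proxy using the token as the password; TLS is required
# when using IAM authentication.
conn = psycopg2.connect(
    host=PROXY_ENDPOINT,
    port=5432,
    dbname="postgres",
    user=DB_USER,
    password=token,
    sslmode="require",
)
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone())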
Many applications, including those built on modern serverless architectures, may need a high number of open database connections or may frequently open and close connections, exhausting database memory and compute resources. Amazon RDS Proxy allows applications to pool and share database connections, improving database efficiency as well as application scalability, resiliency, and security.
For information on supported database engine versions and regional availability of RDS Proxy, refer to the RDS and Aurora documentation.
Generative AI is no longer just an experiment. The real challenge now is quantifying its value. For leaders, the path is clear: make AI projects drive business growth, not just incur costs. Today, we’ll share a simple three-part plan to help you measure the impact and demonstrate the true value of your AI initiatives.
This methodology connects your technology solution to a concrete business outcome. It creates a logical narrative that justifies investment and measures success.
1. Define what success looks like (the value)
The first step is to define the project’s desired outcome by identifying its “value drivers.” For any AI initiative, these drivers typically fall into four universal business categories:
Operational efficiency & cost savings: This involves quantifying improvements to core business processes. Value is measured by reducing manual effort, optimizing resource allocation, lowering error rates in production or operations, or streamlining complex supply chains.
Revenue & growth acceleration: While many organizations initially focus on efficiency, true market leadership is achieved through growth. This category of value drivers is the critical differentiator, as it focuses on top-line impact. Value can come from accelerating time-to-market for new products, identifying new revenue streams through data analysis, or improving sales effectiveness and customer lifetime value.
Experience & engagement: This captures the enhancement of human interaction with technology. It applies broadly to improving customer satisfaction (CX), boosting employee productivity and morale with intelligent tools (EX), or creating more seamless partner experiences.
Strategic advancement & risk mitigation: This covers long-term competitive advantages and downside protection. Value drivers include accelerating R&D cycles, gaining market-differentiating insights from proprietary data, strengthening operational resiliency, or ensuring regulatory compliance and reducing fraud.
2. Specify what it costs to succeed (your investment)
The second part of the framework demands transparency regarding the investment. This requires a complete view of the Total Cost of Ownership (TCO), which extends beyond service fees to include model training, infrastructure, and the operational support needed to maintain the system. For a detailed guide, we encourage a review of our post, How to calculate your AI costs on Google Cloud.
3. State the ROI
This is the synthesis of the first two steps. The ROI calculation makes the business case explicit by stating the time required to pay back the initial investment and the ongoing financial return the project will generate.
The framework in action: An AI chatbot for customer service
Now, let’s apply the universal framework to a specific use case. Consider an e-commerce company implementing an AI chatbot. Here, the four general value drivers become tailored to the world of customer service.
Step 1: Define success (the value)
The team uses the customer-service-specific quadrants to build a comprehensive value estimate.
Quadrant 1: Operational efficiency
Reduced agent handling time: By automating 60% of routine inquiries, the company frees up thousands of agent hours. This enables agents to serve more customers or perhaps provide better quality service to premium customers.
Estimated hours saved: ~725 hours (let’s say this equates to $15,660 in value)
Lower onboarding & training costs: New agents become productive faster as the AI handles the most common questions, reducing the burden of repetitive training.
Estimated monthly value: $1,000
Quadrant 2: Revenue growth
24/7 Sales & support: The chatbot assists customers and captures sales leads around the clock, converting shoppers who would otherwise leave.
Estimated monthly value: $5,000
Improved customer retention: Faster resolution and a better experience lead to a small, measurable increase in customer loyalty and repeat purchases.
Estimated monthly value: $1,000
Quadrant 3: Customer and employee experience
Enhanced agent experience & retention: Human agents are freed from monotonous tasks to focus on complex, rewarding problems. This improves morale and reduces costly agent turnover.
Estimated monthly value: $500
Quadrant 4: Strategic enablement
Expanding business to more languages: Enabling human agents to provide support in 15+ additional languages, thanks to the translation service built into the system.
Estimated revenue increase: $1,750
Total estimated monthly value = $15,660 + $1,000 + $5,000 + $1,000 + $500 + $1,750 = $24,910
Step 2: Define the cost (the investment)
Following a TCO analysis from our earlier blog post, we calculated that the total ongoing monthly cost for the fully managed AI solution on Google Cloud would be approximately $2,700.
Step 3: State the ROI
The final story was simple and powerful. With a monthly value of around $25,000 and a cost of only $2,700, the project generated significant positive cash flow. The initial setup cost was paid back in less than two weeks, securing an instant “yes” from leadership.
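To make the arithmetic explicit, here is a small sketch that computes the net monthly benefit, ROI, and payback period from the figures above; the one-time setup cost is a placeholder assumption rather than a number from this example.

# Illustrative ROI math using the example figures from this post.
monthly_value = 15_660 + 1_000 + 5_000 + 1_000 + 500 + 1_750   # = $24,910
monthly_cost = 2_700          # ongoing TCO from step 2
one_time_setup = 10_000       # placeholder assumption -- not from the post

net_monthly_benefit = monthly_value - monthly_cost               # $22,210
roi_pct = net_monthly_benefit / monthly_cost * 100               # ~823% per month
payback_months = one_time_setup / net_monthly_benefit            # ~0.45 months (~2 weeks)

print(f"Net monthly benefit: ${net_monthly_benefit:,.0f}")
print(f"Monthly ROI: {roi_pct:.0f}%")
print(f"Payback period: {payback_months:.2f} months")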
For large enterprises adopting a cloud platform, managing network connectivity across VPCs, on-premises data centers, and other clouds is critical. However, traditional models often lack scalability and increase management overhead. Google Cloud’s Network Connectivity Center is a compelling alternative.
As a centralized hub-and-spoke service for connecting and managing network resources, Network Connectivity Center offers a scalable and resilient network foundation. In this post, we explore Network Connectivity Center’s architecture, availability model, and design principles, highlighting its value and design considerations for maximizing resilience and minimizing the “blast radius” of issues. Armed with this information, you’ll be better able to evaluate how Network Connectivity Center fits within your organization, and to get started.
The challenges of large-scale enterprise networks
Large-scale VPC networks consistently face three core challenges: scalability, complexity, and the need for centralized management. Network Connectivity Center is engineered specifically to address these pain points head-on, thanks to:
Massively scalable connectivity: Scale far beyond traditional limits and VPC Peering quotas. Network Connectivity Center supports up to 250 VPC spokes per hub and millions of VMs, while enhanced cross-cloud connectivity and upcoming features like firewall insertion help ensure your network is prepared for future demands.
Smooth workload mobility and service networking: Easily migrate workloads between VPCs. Network Connectivity Center natively solves transitivity challenges through features like producer VPC spoke integration to support private service access (PSA) and Private Service Connect (PSC) propagation, streamlining service sharing across your organization.
Reduced operational overhead: Network Connectivity Center offers a single control point for VPC and on-premises connections, automating full-mesh connectivity between spokes to dramatically reduce operational burdens.
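As a rough illustration of the hub-and-spoke model described above, the sketch below creates a global hub and attaches a VPC spoke with the google-cloud-network-connectivity Python client; the project ID, network URI, and exact client field names are assumptions to verify against the current API reference.

# A minimal sketch (not a full deployment): create a global NCC hub and attach
# a VPC spoke. Resource names and fields below are illustrative placeholders.
from google.cloud import networkconnectivity_v1

PROJECT = "my-project"                                      # placeholder
VPC_URI = f"projects/{PROJECT}/global/networks/prod-vpc"    # placeholder VPC

client = networkconnectivity_v1.HubServiceClient()
parent = f"projects/{PROJECT}/locations/global"

# Create the hub (long-running operation).
hub_op = client.create_hub(
    request=networkconnectivity_v1.CreateHubRequest(
        parent=parent,
        hub_id="central-hub",
        hub=networkconnectivity_v1.Hub(description="Org-wide connectivity hub"),
    )
)
hub = hub_op.result()

# Attach a VPC network as a spoke of the hub.
spoke_op = client.create_spoke(
    request=networkconnectivity_v1.CreateSpokeRequest(
        parent=parent,
        spoke_id="prod-vpc-spoke",
        spoke=networkconnectivity_v1.Spoke(
            hub=hub.name,
            linked_vpc_network=networkconnectivity_v1.LinkedVpcNetwork(uri=VPC_URI),
        ),
    )
)
print(spoke_op.result().name)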
Under the hood: Architected for resilience
Let’s home in on how Network Connectivity Center stays resilient. A key part of that is its architecture, which is built on three distinct, decoupled planes.
A very simplified view of the Network Connectivity Center & Google Cloud networking stack
Management plane: This is your interaction layer — the APIs, gcloud commands, and Google Cloud console actions you use to configure your network. It’s where you create hubs, attach spokes, and manage settings.
Control plane: This is the brains of the operation. It takes your configuration from the management plane and programs the underlying network. It’s a distributed, sharded system responsible for the software-defined networking (SDN) that makes everything work.
Data plane: This is where your actual traffic flows. It’s the collection of network hardware and individual hosts that move packets based on the instructions programmed by the control plane.
A core principle that Network Connectivity Center uses across this architecture is fail-static behavior. This means that if a higher-level plane (like the management or control plane) experiences an issue, the planes below it continue to operate based on the last known good configuration, and existing traffic flows are preserved. This helps ensure that, say, a control plane issue doesn’t bring down your entire network.
How Network Connectivity Center handles failures
A network’s strength is revealed by how it behaves under pressure. Network Connectivity Center’s design is fundamentally geared towards stability, so that potential issues are contained and their impact is minimized. Consider the following Network Connectivity Center design points:
Contained infrastructure impact: An underlying infrastructure issue such as a regional outage only affects resources within that specific scope. Because Network Connectivity Center hubs are global resources, a single regional failure won’t bring down your entire network hub. Connectivity between all other unaffected spokes remains intact.
Isolated configuration faults: We intentionally limit the “blast radius” of a configuration error with careful fault isolation. A mistake made on one spoke or hub is isolated and will not cascade to cause failures in other parts of your network. This fault isolation is a crucial advantage over intricate VPC peering topologies, where a single routing misconfiguration can have far-reaching consequences.
Uninterrupted data flows: The fail-static principle ensures that existing data flows are highly insulated from management or control plane disruptions. In the event of a failure, the network continues to forward traffic based on the last successfully programmed state, maintaining stability and continuity for your applications.
Managing the blast radius of configuration changes
Even if an infrastructure outage affects resources in its scope, Network Connectivity Center connectivity in other zones or regions remains functional. Critically, Network Connectivity Center configuration errors are isolated to the specific hub or spoke being changed and don’t cascade to unrelated parts of the network — a key advantage over complex VPC peering approaches.
To further enhance stability and operational efficiency, we also streamlined configuration management in Network Connectivity Center. Updates are handled dynamically by the underlying SDN, eliminating the need for traditional maintenance windows for configuration changes. Changes are applied transparently at the API level and are designed to be backward-compatible, for smooth and non-disruptive network evolution.
Connecting multiple regional hubs
A Network Connectivity Center hub is a global resource. A multi-region resilient design may involve regional deployments with a dedicated hub per region, which requires connectivity across multiple hubs. Though Network Connectivity Center does not offer native hub-to-hub connectivity, alternative methods allow communication across Network Connectivity Center hubs, fulfilling specific controlled-access needs:
Cloud VPN or Cloud Interconnect: Use dedicated HA VPN tunnels or VLAN attachments to connect Network Connectivity Center hubs.
Private Service Connect (PSC): Leverage a producer/consumer model with PSC to provide controlled, service-specific access across Network Connectivity Center hubs.
Multi-NIC VMs: Route traffic between Network Connectivity Center hubs using VMs with network interfaces in spokes of different hubs.
Full-mesh VPC Peering: For specific use cases like database synchronization, establish peering between spokes of different Network Connectivity Center hubs.
Frequently asked questions
What happens to traffic if the Network Connectivity Center control plane fails? Due to the fail-static design, existing data flows continue to function based on the last known successful configuration. Dynamic routing updates will stop, but existing routes remain active.
Does adding a new VPC spoke impact existing connections? No. When a new spoke is added, the process is dynamic and existing data flows should not be interrupted.
Is there a performance penalty for traffic traversing between VPCs via Network Connectivity Center? No. Traffic between VPCs connected by Network Connectivity Center experiences the same performance as VPC peering.
Best practices for resilience
While Network Connectivity Center is a powerful and resilient platform, designing a network for maximum availability requires careful planning on your part. Consider the following best practices:
Leverage redundancy: Data plane availability is localized. To survive a localized infrastructure failure, be sure to deploy critical applications across multiple zones and regions.
Plan your topology carefully: Choosing your hub topology is a critical design decision. A single global hub offers operational simplicity and is the preferred approach for most use cases. Consider multiple regional hubs only if strict regional isolation or minimizing control plane blast radius is a primary requirement, and be aware of the added complexity. Finally, even in a multi-hub design, Network Connectivity Center hubs are still global resources; that means in the event of a global outage, management plane operations may be impacted independent of regional availability.
Choose Network Connectivity Center for transitive connectivity: For large-scale networks that require transitive connectivity for shared services, choosing Network Connectivity Center over traditional VPC peering can simplify operations and allow you to leverage features like PSC/PSA propagation.
Embrace infrastructure-as-code: Use tools like Terraform to manage your Network Connectivity Center configuration, which reduces the risk of manual errors and makes your network deployments repeatable and reliable.
Plan for scale: Be aware of Network Connectivity Center’s high, but finite, scale limits (e.g., 250 VPC spokes per hub) and plan your network growth accordingly.
A simple approach to scalable, resilient networking
Network Connectivity Center removes much of the complexity from enterprise networking, providing a simple, scalable and resilient foundation for your organization. By understanding its layered architecture, fail-static behavior, and design principles, you can build a network that not only meets your needs today but is ready for the challenges of tomorrow.
Security leaders are clear about their priorities: After AI, cloud security is the top training topic for decision-makers. As threats against cloud workloads become more sophisticated, organizations are looking for highly-skilled professionals to help defend against these attacks.
To help organizations meet their need for experts who can manage a modern security team’s advanced tools, Google Cloud’s new Professional Security Operations Engineer (PSOE) certification can help train specialists to detect and respond to new and emerging threats.
Unlock your potential as a security operations expert
Earning a Google Cloud certification can be a powerful catalyst for career advancement. Eight in 10 learners report that having a Google Cloud certification contributes to faster career advancement, and 85% say that cloud certifications equip them with the skills to fill in-demand roles, according to an Ipsos study published in 2025 and commissioned by Google Cloud.
Foresite, a leading Google Cloud managed security service provider (MSSP), said that the certification has been instrumental in helping them provide security excellence to their clients.
“As a leader at Foresite, our commitment is to deliver unparalleled security outcomes for our clients using the power of Google Cloud. The Google Cloud Professional Security Operations Engineer (PSOE) certification is fundamental to that mission. For us, it’s the definitive validation that our engineers have mastered the advanced Google Security Operations platform we use to protect our clients’ businesses. Having a team of PSOE-certified experts provides our clients with direct assurance of our capabilities and expertise. It solidifies our credibility as a premier Google Cloud MSSP and gives us a decisive edge in the market. Ultimately, it’s a benchmark of the excellence we deliver daily,” said Brad Thomas, director, Security Engineering.
The PSOE certification can help validate practical skills needed to protect a company’s data and infrastructure in real-world scenarios, a key ingredient for professional success. It also can help security operations engineers demonstrate their ability to directly address evolving and daily challenges.
Gain a decisive edge with certified security talent
For organizations, including MSSPs and other Google Cloud partners, this certification is a powerful way to help ensure that your security professionals are qualified to effectively implement, respond to, and remediate security events using Google Cloud’s suite of solutions.
Hiring managers are increasingly looking for a specific skill set. The Ipsos study also found that eight in 10 leaders prefer to recruit and hire professionals who hold cloud certifications, seeing them as a strong indicator of expertise.
“We are excited about Google’s new Professional Security Operations Engineer certification, which will help Accenture demonstrate our leading expertise in security engineering and operations to clients. This validation is important because it gives our clients confidence in knowing Accenture has certified professionals with structured training as they choose the best service partner for their security transformations. For our teams, this new certification offers a clear path for professional development and career advancement. Google’s Professional Security Operations Engineer certification will enable Accenture to support clients better as they successfully adopt and get the most out of the Google Security Operations and Security Command Center platforms,” said Rex Thexton, chief technology officer, Accenture Cybersecurity.
Demonstrate comprehensive expertise across Google Cloud Security tools
A Google Cloud-certified PSOE can effectively use Google Cloud security solutions to detect, monitor, investigate, and respond to security threats across an enterprise environment. This role encompasses identity, workloads, services, infrastructure, and more.
Plus, PSOEs can perform critical tasks such as writing detection rules, remediating misconfigurations, investigating threats, and developing orchestration workflows. The PSOE certification validates the candidate’s abilities with Google Cloud security tools and services, including:
Google Security Operations
Google Threat Intelligence
Security Command Center
Specifically, the exam assesses ability across six key domains:
Platform operations (~14%): Enhancing detection and response with the right telemetry sources and tools, and configuring access authorization.
Data management (~14%): Ingesting logs for security tooling and identifying a baseline of user, asset, and entity context.
Threat hunting (~19%): Performing threat hunting across environments and using threat intelligence for threat hunting.
Detection engineering (~22%): Developing and implementing mechanisms to detect risks and identify threats, and using threat intelligence for detection.
Incident response (~21%): Containing and investigating security incidents; building, implementing, and using response playbooks; and implementing the case management lifecycle.
Observability (~10%): Building and maintaining dashboards and reports to provide insights, and configuring health monitoring and alerting.
While there are no formal prerequisites to take the exam, we recommend that candidates have:
At least three years of security industry experience.
At least one year of hands-on experience using Google Cloud security tooling.
The certification is relevant for experienced professionals, including those in advanced career stages and roles, such as security architects.
Your path to security operations starts here
To prepare for the exam, Google Cloud offers resources that include online training and hands-on labs. The official Professional Security Operations Engineer Exam Guide provides a complete list of topics covered, helping candidates align their skills with the exam content. Candidates can also start preparing through the recommended learning path.
Amazon Elastic Container Service (Amazon ECS), a fully managed container orchestration service, now makes it easier to create and update task definitions in the AWS Management Console with generative AI assistance from Amazon Q Developer.
This new capability helps customers complete their task definitions faster and more efficiently using AI-generated code suggestions. You can use the inline chat capability to ask Amazon Q Developer to generate, explain, or refactor task definition JSON through a conversational interface, insert the generated suggestions at any point in the task definition, and accept or reject the proposed changes. Amazon ECS has also enhanced its existing inline suggestions feature to use Amazon Q Developer: in addition to property-based inline suggestions, Amazon Q Developer can now autocomplete whole blocks of sample code.
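For reference, the console assistance ultimately produces standard task definition JSON. A minimal Fargate-style definition, registered here programmatically with boto3 and using placeholder names and sizing, looks roughly like this:

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")  # placeholder Region

# A minimal Fargate task definition, comparable to the JSON you would edit
# (or have Amazon Q Developer draft) in the console task definition editor.
response = ecs.register_task_definition(
    family="web-app",                      # placeholder family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    containerDefinitions=[
        {
            "name": "web",
            "image": "public.ecr.aws/nginx/nginx:latest",
            "essential": True,
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
        }
    ],
)
print(response["taskDefinition"]["taskDefinitionArn"])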
AWS announces VPC endpoints for Amazon CloudWatch Observability Access Manager (OAM). CloudWatch OAM enables you to programmatically manage cross-account observability settings within a region. The new VPC endpoints enhance your security posture by keeping traffic between your VPC and CloudWatch OAM within the AWS network, eliminating the need to traverse the public internet.
You can use Observability Access Manager to create and manage links between source accounts and monitoring accounts, enabling you to monitor and troubleshoot applications that span multiple accounts within a Region. With the new VPC endpoints, you can establish secure, private, and reliable connections between your VPC and CloudWatch Observability Access Manager. This allows you to maintain private connectivity while managing cross-account observability links and sinks, even from VPCs without internet access. This feature supports both IPv4 and IPv6 addressing, and you can use AWS PrivateLink’s built-in security controls—like security groups and VPC endpoint policies—to help secure access to your observability resources.
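As an illustration, the sketch below uses boto3 to create an interface VPC endpoint for CloudWatch OAM; the VPC, subnet, and security group IDs are placeholders, and the service name shown is an assumption you should confirm for your Region.

import boto3

REGION = "us-east-1"  # placeholder
ec2 = boto3.client("ec2", region_name=REGION)

# Confirm the OAM endpoint service name available in your Region first.
services = ec2.describe_vpc_endpoint_services()["ServiceNames"]
print([s for s in services if ".oam" in s])  # e.g. 'com.amazonaws.us-east-1.oam' (assumed naming)

# Create an interface endpoint so OAM API calls stay on the AWS network.
endpoint = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                 # placeholder
    ServiceName=f"com.amazonaws.{REGION}.oam",     # assumed service name
    SubnetIds=["subnet-0123456789abcdef0"],        # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],     # placeholder
    PrivateDnsEnabled=True,
)
print(endpoint["VpcEndpoint"]["VpcEndpointId"])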
CloudWatch Observability Access Manager VPC endpoints are now available in all commercial AWS regions, the AWS GovCloud (US) Regions, and the China Regions.
Amazon EventBridge is expanding the availability of its API destinations feature to the AWS Asia Pacific (Melbourne) and AWS Asia Pacific (Thailand) Regions.
EventBridge API destinations are HTTPS endpoints that you can invoke as the target of an event bus rule, similar to how you invoke an AWS service or resource as a target. API destinations provides flexible authentication options for HTTPS endpoints, such as API key and OAuth, storing and managing credentials securely in AWS Secrets Manager on your behalf.
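To sketch the building blocks, the example below creates a connection with API key authentication and an API destination pointing at a hypothetical HTTPS endpoint; the names, endpoint URL, and key value are placeholders.

import boto3

events = boto3.client("events", region_name="ap-southeast-4")  # Melbourne, for example

# 1. Create a connection; EventBridge stores the credential in Secrets Manager for you.
connection = events.create_connection(
    Name="partner-api-connection",            # placeholder
    AuthorizationType="API_KEY",
    AuthParameters={
        "ApiKeyAuthParameters": {
            "ApiKeyName": "x-api-key",
            "ApiKeyValue": "replace-with-real-key",   # placeholder
        }
    },
)

# 2. Create the API destination that an event bus rule can use as a target.
destination = events.create_api_destination(
    Name="partner-webhook",                                     # placeholder
    ConnectionArn=connection["ConnectionArn"],
    InvocationEndpoint="https://example.com/webhooks/orders",   # placeholder endpoint
    HttpMethod="POST",
    InvocationRateLimitPerSecond=10,
)
print(destination["ApiDestinationArn"])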
To get started, visit the Amazon EventBridge documentation to learn more about configuring API destinations.
AWS launches LocalStack integration in Visual Studio Code (VS Code), enabling developers to easily test and debug serverless applications in their local IDE. With this new integration, developers can use LocalStack to locally emulate and test their serverless applications using the familiar VS Code interface without switching between tools or managing complex setup, simplifying their local serverless development process.
LocalStack, an AWS Partner Network (APN) partner, enables developers to emulate AWS services such as AWS Lambda, Amazon SQS, Amazon API Gateway, and DynamoDB for local application development and testing. Previously, to use LocalStack to emulate AWS services in VS Code, developers had to manually configure ports, make code changes, and switch context between the IDE and the LocalStack interface. Now, with the LocalStack integration in VS Code, developers can connect to the LocalStack environment from their IDE without manual configuration or code changes. This gives developers access to emulated AWS resources in the IDE, making it easy to build and test serverless applications locally. For example, they can now easily test and debug Lambda functions and their interactions with AWS services in a LocalStack-emulated environment from their IDE.
This integration is now available to developers using the AWS Toolkit for VS Code (v3.74.0 or later). There is no additional cost from AWS for using this integration. To get started, follow the guided AWS Walkthrough in VS Code, which automatically installs the LocalStack CLI, guides you through LocalStack account setup, and creates a LocalStack profile. Then, switch to the LocalStack profile and deploy applications directly to the LocalStack environment. To learn more, visit the AWS News Blog, AWS Toolkit documentation, and the Lambda Developer Guide.
Amazon Managed Service for Prometheus collector, a fully-managed agentless collector for Prometheus metrics, adds support for vending logs to Amazon CloudWatch Logs.
With Amazon Managed Service for Prometheus collector logs, you can now troubleshoot issues in your setup: context on the Prometheus target discovery process (including authentication issues), scraping status and errors such as timeouts, and information on ingesting collected metrics into your Amazon Managed Service for Prometheus workspace, for example remote-write failures due to workspace issues.
Amazon Managed Service for Prometheus collector logs are now generally available in all regions where Amazon Managed Service for Prometheus is available.
Please visit the Amazon CloudWatch pricing page to learn more about logs pricing. Get started with Managed Service for Prometheus collector logs by visiting our user guide.
Amazon Athena announces single sign-on support for its JDBC and ODBC drivers through AWS IAM Identity Center’s trusted identity propagation. This makes it simpler for organizations to manage end users’ access to data when using third-party tools and to implement identity-based data governance policies with a seamless sign-on experience.
With this new capability, data teams can seamlessly access data through their preferred third-party tools using their organizational credentials. When analysts run queries using the updated Athena JDBC (3.6.0) and ODBC (2.0.5.0) drivers, their access permissions defined in Lake Formation are applied and their actions are logged. This streamlined workflow eliminates credential management overhead while ensuring consistent security policies, allowing data teams to focus on insights rather than access management. For example, data analysts using third-party BI tools or SQL clients can now connect to Athena using their corporate credentials, and their access to data will be restricted based on policies defined for their user identity or group membership in Lake Formation.
This feature is available in Regions where Amazon Athena and AWS IAM Identity Center’s trusted identity propagation are supported. To learn more about configuring identity support when using Athena drivers, see the Amazon Athena driver documentation.
AWS Cloud Development Kit (CDK) CLI now enables safe infrastructure refactoring through the new ‘cdk refactor’ command, available in preview. This feature allows developers to rename constructs, move resources between stacks, and reorganize CDK applications while preserving the state of deployed resources. By leveraging AWS CloudFormation’s refactor capabilities with automated mapping computation, CDK Refactor eliminates the risk of unintended resource replacement during code restructuring.
Previously, infrastructure-as-code maintenance often required reorganizing resources and improving code structure, but these changes traditionally risked replacing existing resources due to logical ID changes. With the CDK Refactor feature, developers can confidently implement architectural improvements like breaking down monolithic stacks, introducing inheritance patterns, or upgrading to higher-level constructs without complex migration procedures or risking downtime of stateful resources. This allows teams to continuously evolve their infrastructure code while maintaining the stability of their production environments.
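As a rough sketch of the workflow using the Python CDK (construct and stack names are illustrative), suppose a bucket originally defined in a monolithic stack is moved into a dedicated storage stack; after the code change, running the preview ‘cdk refactor’ command computes the logical ID mapping so the existing bucket is preserved rather than replaced.

# Before: the bucket lived in a monolithic stack; after: it moves to StorageStack.
# Running `cdk refactor` (preview) after this change maps the existing resource
# to its new location instead of replacing it. Names here are illustrative.
from aws_cdk import App, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct


class StorageStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Previously defined inside a monolithic "AppStack"; moving it here
        # changes its logical ID, which `cdk refactor` reconciles for you.
        self.assets_bucket = s3.Bucket(self, "AssetsBucket")


app = App()
StorageStack(app, "StorageStack")
app.synth()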
The AWS CDK Refactor feature is available in all AWS Regions where the AWS CDK is supported.
For more information and a walkthrough of the feature, check out the blog post and the documentation. You can read more about the AWS CDK here.
Our inference solution is based on AI Hypercomputer, a system built on our experience running models like Gemini and Veo 3, which serve over 980 trillion tokens a month to more than 450 million users. AI Hypercomputer services provide intelligent and optimized inferencing, including resource management, workload optimization and routing, and advanced storage for scale and performance, all co-designed to work together with industry leading GPU and TPU accelerators.
Today, GKE Inference Gateway is generally available, and we are launching new capabilities that deliver even more value. This underscores our commitment to helping companies deliver more intelligence, with increased performance and optimized costs for both training and serving.
Let’s take a look at the new capabilities we are announcing.
Efficient model serving and load balancing
A user’s experience of a generative AI application highly depends on both a fast initial response to a request and a smooth streaming of the response through to completion. With these new features, we’ve improved time-to-first-token (TTFT) and time-per-output-token (TPOT) on AI Hypercomputer. TTFT is based on the prefill phase, a compute-bound process where a full pass through the model creates a key-value (KV) cache. TPOT is based on the decode phase, a memory-bound process where tokens are generated using the KV cache from the prefill stage.
We improve both of these in a variety of ways. Generative AI applications like chatbots and code generation often reuse the same prefix in API calls. To optimize for this, GKE Inference Gateway now offers prefix-aware load balancing. This new, generally available feature improves TTFT latency by up to 96% at peak throughput for prefix-heavy workloads compared with other clouds, intelligently routing requests with the same prefix to the same accelerators while balancing the load to prevent hotspots and latency spikes.
Consider a chatbot for a financial services company that helps users with account inquiries. A user starts a conversation to ask about a recent credit card transaction. Without prefix-aware routing, when the user asks follow up questions, such as the date of the charge or the confirmation number, the LLM has to re-read and re-process the entire initial query before it can answer the follow up question. The re-computation of the prefill phase is very inefficient and adds unnecessary latency, with the user experiencing delays between each question. With prefix-aware routing, the system intelligently reuses the data from the initial query by routing the request back to the same KV cache. This bypasses the prefill phase, allowing the model to answer almost instantly. Less computation also means fewer accelerators for the same workload, providing significant cost savings.
To further optimize inference performance, you can now also run disaggregated serving using AI Hypercomputer, which can improve throughput by 60%. Enhancements in GKE Inference Gateway, llm-d, and vLLM work together to enable dynamic selection of prefill and decode nodes based on query size. This significantly improves both TTFT and TPOT by increasing the utilization of compute and memory resources at scale.
Take an example of an AI-based code completion application, which needs to provide low-latency responses to maintain interactivity. When a developer submits a completion request, the application must first process the input codebase; this is referred to as the prefill phase. Next, the application generates a code suggestion token by token; this is referred to as the decode phase. These tasks have dramatically different demands on accelerator resources — compute-intensive vs. memory-intensive processing. Running both phases on a single node results in neither being fully optimized, causing higher latency and poor response times. Disaggregated serving assigns these phases to separate nodes, allowing for independent scaling and optimization of each phase. For example, if your developer user base submits a lot of requests based on large codebases, you can scale the prefill nodes. This improves latency and throughput, making the entire system more efficient.
Just as prefix-aware routing optimizes the reuse of conversational context, and disaggregated serving enhances performance by intelligently separating the computational demands of model prefill and token decoding, we have also addressed the fundamental challenge of getting these massive models running in the first place. As generative AI models grow to hundreds of gigabytes in size, they can often take over ten minutes to load, leading to slow startup and scaling. To solve this, we now support the Run:ai model streamer with Google Cloud Storage and Anywhere Cache for vLLM, with support for SGLang coming soon. This enables 5.4 GiB/s of direct throughput to accelerator memory, reducing model load times by over 4.9x, resulting in a better end user experience.
vLLM Model Load Time
Get started faster with data-driven decisions
Finding the ideal technology stack for serving AI models is a significant industry challenge. Historically, customers have had to navigate rapidly evolving technologies, the switching costs that impact hardware choices, and hundreds of thousands of possible deployment architectures. This inherent complexity makes it difficult to quickly achieve the best price-performance for your inference environment.
The GKE Inference Quickstart, now generally available, can save you time, improve performance, and reduce costs when deploying AI workloads by suggesting the best accelerators, model server, and scaling configuration for your AI/ML inference applications. New improvements to GKE Inference Quickstart include cost insights and benchmarked performance best practices, so you can easily compare costs and understand latency profiles, saving you months of evaluation and qualification.
GKE Inference Quickstart’s recommendations are grounded in a living repository of model and accelerator performance data that we generate by benchmarking our GPU and TPU accelerators against leading large language models like Llama, Mixtral, and Gemma more than 100 times per week. This extensive performance data is then enriched with the same storage, network, and software optimizations that power AI inferencing on Google’s global-scale services like Gemini, Search, and YouTube.
Let’s say you’re tasked with deploying a new, public-facing chatbot. The goal is to provide fast, high-quality responses at the lowest cost. Until now, finding the most cost-effective solution for deploying AI models was a significant challenge. Developers and engineers had to rely on a painstaking process of trial and error, manually benchmarking countless combinations of models, accelerators, and serving architectures, with all the data logged into a spreadsheet to calculate the cost per query for each scenario. This manual project could take weeks or even months, was prone to human error, and offered no guarantee that the best possible solution was ever found.
Using Google Colab and the built-in optimizations in the Google Cloud console, GKE Inference Quickstart lets you choose the most cost-effective accelerators for, say, serving a Llama 3-based chatbot application that needs a TTFT of less than 500ms. These recommendations are deployable manifests, making it easy to choose a technology stack that you can provision from GKE in your Google Cloud environment. With GKE Inference Quickstart, your evaluation and qualification effort has gone from months to days.
Views from the Google Colab that helps the engineer with their evaluation.
Try these new capabilities for yourself. To get started with GKE Inference Quickstart, from the Google Cloud console, go to Kubernetes Engine > AI/ML, and select “+ Deploy Models” near the top of the screen. Use the Filter to select Optimized > Values = True. This shows you all of the models that have price/performance optimizations to select from. Once you select a model, you’ll see a sliding bar to select latency. The compatible accelerators in the drop-down change to ones that match the latency you select, and the cost per million output tokens also changes based on your selections.
Then, via Google Colab, you can plot and view the price/performance of leading AI models on Google Cloud. Chatbot Arena ratings are integrated to help you determine the best model for your needs based on model size, rating, and price per million tokens. You can also pull your organization’s in-house quality measures into the Colab and join them with Google’s comprehensive benchmarks to make data-driven decisions.
Dedicated to optimizing inference
At Google Cloud, we are committed to helping companies deploy and improve their AI inference workloads at scale. Our focus is on providing a comprehensive platform that delivers unmatched performance and cost-efficiency for serving large language models and other generative AI applications. By leveraging a codesigned stack of industry-leading hardware and software innovations — including the AI Hypercomputer, GKE Inference Gateway, and purpose-built optimizations like prefix-aware routing, disaggregated serving, and model streaming — we ensure that businesses can deliver more intelligence with faster, more responsive user experiences and lower total cost of ownership. Our solutions are designed to address the unique challenges of inference, from model loading times to resource utilization, enabling you to deliver on the promise of generative AI. To learn more and get started, visit our AI Hypercomputer site.
As generative AI becomes more widespread, it’s important for developers and ML engineers to be able to easily configure infrastructure that supports efficient AI inference, i.e., using a trained AI model to make predictions or decisions based on new, unseen data. While GPUs are great at training models, traditional GPU-based serving architectures struggle with the “multi-turn” nature of inference, characterized by back-and-forth conversations where the model must maintain context and understand user intent. Further, deploying large generative AI models can be both complex and resource-intensive.
At Google Cloud, we’re committed to providing customers with the best choices for their AI needs. That’s why we are excited to announce a new recipe for disaggregated inferencing with NVIDIA Dynamo, a high-performance, low-latency platform for a variety of AI models. Disaggregated inference separates out model processing phases, offering a significant leap in performance and cost-efficiency.
Specifically, this recipe makes it easy to deploy NVIDIA Dynamo on Google Cloud’s AI Hypercomputer, including Google Kubernetes Engine (GKE), vLLM inference engine, and A3 Ultra GPU-accelerated instances powered by NVIDIA H200 GPUs. By running the recipe on Google Cloud, you can achieve higher performance and greater inference efficiency while meeting your AI applications’ latency requirements. You can find this recipe, along with other resources, in our growing AI Hypercomputer resources repository on GitHub.
Let’s take a look at how to deploy it.
The two phases of inference
LLM inference is not a monolithic task; it’s a tale of two distinct computational phases. First is the prefill (or context) phase, where the input prompt is processed. Because this stage is compute-bound, it benefits from access to massive parallel processing power. Following prefill is the decode (or generation) phase, which generates a response, token by token, in an autoregressive loop. This stage is bound by memory bandwidth, requiring extremely fast access to the model’s weights and the KV cache.
In traditional architectures, these two phases run on the same GPU, creating resource contention. A long, compute-heavy prefill can block the rapid, iterative decode steps, leading to poor GPU utilization, higher inference costs, and increased latency for all users.
A specialized, disaggregated inference architecture
Our new solution tackles this challenge head-on by disaggregating, or physically separating, the prefill and decode stages across distinct, independently managed GPU pools.
Here’s how the components work in concert:
A3 Ultra instances and GKE: The recipe uses GKE to orchestrate separate node pools of A3 Ultra instances, powered by NVIDIA H200 GPUs. This creates specialized resource pools — one optimized for compute-heavy prefill tasks and another for memory-bound decode tasks.
NVIDIA Dynamo: Acting as the inference server, NVIDIA Dynamo’s modular front end and KV cache-aware router processes incoming requests. It then pairs GPUs from the prefill and decode GKE node pools and orchestrates workload execution between them, transferring the KV cache that’s generated in the prefill pool to the decode pool to begin token generation.
vLLM: Running on pods within each GKE pool, the vLLM inference engine helps ensure best-in-class performance for the actual computation, using innovations like PagedAttention to maximize throughput on each individual node.
This disaggregated approach allows each phase to scale independently based on real-time demand, helping to ensure that compute-intensive prompt processing doesn’t interfere with fast token generation. Dynamo supports popular inference engines including SGLang, TensorRT-LLM and vLLM. The result is a dramatic boost in overall throughput and maximized utilization of every GPU.
Experiment with Dynamo Recipes for Google Cloud
The reproducible recipe shows the steps to deploy disaggregated inference with NVIDIA Dynamo on the A3 Ultra (H200) VMs on Google Cloud using GKE for orchestration and vLLM as the inference engine. The single node recipe demonstrates disaggregated inference with one node of A3 Ultra using four GPUs for prefill and four GPUs for decode. The multi-node recipe demonstrates disaggregated inference with one node of A3 Ultra for prefill and one node of A3 Ultra for decode for the Llama-3.3-70B-Instruct Model.
Future recipes will provide support for additional NVIDIA GPUs (e.g. A4, A4X) and inference engines with expanded coverage of models.
The recipe highlights the following key steps:
Perform initial setup – This sets up environment variables and secrets; it needs to be done only once.
Install Dynamo Platform and CRDs – This sets up the various Dynamo Kubernetes components; it also needs to be done only once.
Deploy inference backend for a specific model workload – This deploys vLLM/SGLang as the inference backend for Dynamo disaggregated inference for a specific model workload. Repeat this step for every new model inference workload deployment.
Process inference requests – Once the model is deployed for inference, incoming queries are processed to provide responses to users.
Once the server is up, you will see the prefill and decode workers along with the frontend pod which acts as the primary interface to serve the requests.
We can verify if everything works as intended by sending a request to the server like this. The response is generated and truncated to max_tokens.
curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
          {
            "role": "user",
            "content": "what is the meaning of life ?"
          }
        ],
        "stream": false,
        "max_tokens": 30
      }' | jq -r '.choices[0].message.content'

---
The question of the meaning of life is a complex and deeply philosophical one that has been debated by scholars, theologians, philosophers, and scientists for
Get started today
By moving beyond the constraints of traditional serving, the new disaggregated inference recipe represents the future of efficient, scalable LLM inference. It enables you to right-size resources for each specific task, unlocking new performance paradigms and significant cost savings for your most demanding generative AI applications. We are excited to see how you will leverage this recipe to build the next wave of AI-powered services. We encourage you to try out our Dynamo Disaggregated Inference Recipe which provides a starting point with recommended configurations and easy steps. We hope you have fun experimenting and share your feedback!
Amazon Interactive Video Service (Amazon IVS) now supports media ingest via interface VPC endpoints powered by AWS PrivateLink. With this launch, you can securely broadcast RTMP(S) streams to IVS Low-Latency channels or IVS Real-Time stages without sending traffic over the public internet. You can create interface VPC endpoints to privately connect your applications to Amazon IVS from within your VPC or from on-premises environments over AWS Direct Connect. This provides private, reliable connectivity for your live video workflows.
Amazon IVS support for media ingest via interface VPC endpoints is available today in the US West (Oregon), Europe (Frankfurt), and Europe (Ireland) AWS Regions. Standard AWS PrivateLink pricing applies. See the AWS PrivateLink pricing page for details.
Today, AWS announced new capabilities for native anomaly detection in AWS IoT SiteWise. This release includes automated model retraining, flexible promotion modes, and exposed model metrics, all designed to enhance the anomaly detection feature.
The automated retraining capability allows models to be automatically retrained on a schedule ranging from a minimum of 30 days to a maximum of one year, eliminating the need to manually retrain models. This feature ensures that models stay up-to-date with changing equipment conditions or configurations, thereby maintaining optimal performance over time.
Additionally, flexible promotion modes give customers the choice between service-managed and customer-managed model promotion. Automatic promotion enables AWS IoT SiteWise to evaluate and promote the best-performing model without customer intervention, while manual promotion allows customers to review comprehensive, exposed model metrics—including precision, recall, and Area Under the ROC Curve (AUC)—before deciding which model version to activate. This flexibility lets customers choose between a hands-off approach and human oversight.
Multivariate anomaly detection is available in the US East (N. Virginia), Europe (Ireland), and Asia Pacific (Sydney) AWS Regions where AWS IoT SiteWise is offered. To learn more, read the launch blog and user guide.
Data centers are the engines of the cloud, processing and storing the information that powers our daily lives. As digital services grow, so do our data centers, and we are working to manage them responsibly. Google thinks of infrastructure at the full-stack level, not just as hardware but as hardware abstracted through software, allowing us to innovate.
We have previously shared how we’re working to reduce the embodied carbon impact at our data centers by optimizing our technical infrastructure hardware. In this post, we shine a spotlight on our “central fleet” program, which has helped us shift our internal resource management system from a machine economy to a more sustainable resource and performance economy.
What is Central Fleet?
At its core, our central fleet program is a resource distribution approach that allows us to manage and allocate computing resources, like processing power, memory, and storage in a more efficient and sustainable way. Instead of individual teams or product teams within Google ordering and managing their own physical machines, our central fleet acts as a centralized pool of resources that can be dynamically distributed to where they are needed most.
Think of it like a shared car service. Rather than each person owning a car they might only use for a couple of hours a day, a shared fleet allows for fewer cars to be used more efficiently by many people. Similarly, our central fleet program ensures our computing resources are constantly in use, minimizing waste and reducing the need to procure new machines.
How it works: A shift to a resource economy
The central fleet approach fundamentally changes how we provision and manage resources. When a team needs more computing power, instead of ordering specific hardware, they place an order for “quota” from the central fleet. This makes the computing resources fungible, that is, interchangeable and flexible. For instance, a team will ask for a certain amount of processing power or storage capacity, not a particular server model.
This “intent-based” ordering system provides flexibility in how demand is fulfilled. Our central fleet can intelligently fulfill requests from existing inventory or procure at scale, which can lower cost and environmental impact. It also facilitates the return of unneeded resources that can then be reallocated to other teams, further reducing waste.
All of this is possible with our full-stack infrastructure and built on the Borg cluster management system to abstract away the physical hardware into a single, fungible resource pool. This software-level intelligence allows us to treat our infrastructure as a fluid, optimizable system rather than a collection of static machines, unlocking massive efficiency gains.
The sustainability benefits of central fleet
The central fleet approach aligns with Google’s broader dedication to sustainability and a circular economy. By optimizing the use of our existing hardware, we can achieve carbon savings. For example, in 2024, our central fleet program helped avoid procurement of new components and machines with an embodied impact equivalent to approximately 260,000 metric tons of CO2e. This roughly equates to avoiding 660 million miles driven by an average gasoline-powered passenger vehicle.1
This fulfillment flexibility leads to greater resource efficiency and a reduced carbon footprint in several ways:
Reduced electronic waste: By extending the life of our machines through reallocation and reuse, we minimize the need to manufacture new hardware and reduce the amount of electronic waste.
Lower embodied carbon: The manufacturing of new servers carries an embodied carbon footprint. By avoiding the creation of new machines, we avoid these associated CO2e emissions.
Increased energy efficiency: Central fleet allows for the strategic placement of workloads on the most power-efficient hardware available, optimizing energy consumption across our data centers.
A more circular economy: This model is a prime example of circular economy principles in action, shifting from a linear “take-make-dispose” model to one that emphasizes reuse and longevity.
The central fleet initiative is more than an internal efficiency project; it’s a tangible demonstration of embedding sustainability into our core business decisions. By rethinking how we manage our infrastructure, we can meet growing AI and cloud demand while simultaneously paving the way for a more sustainable future. Learn more at sustainability.google.
1. Estimated avoided emissions were calculated by applying internal LCA emissions factors to machines and component resources saved through our central fleet initiative in 2024. We input the estimated avoided emissions into the EPA’s Greenhouse Gas Equivalencies Calculator to calculate the equivalent number of miles driven by an average gasoline-powered passenger vehicle (accessed August 2025). The data and claims have not been verified by an independent third party.
Consumer search behavior is shifting, with users now entering longer, more complex questions into search bars in pursuit of more relevant results. For instance, instead of a simple “best kids snacks,” queries have evolved to “What are some nutritious snack options for a 7-year-old’s birthday party?”
However, many digital platforms have yet to adapt to this new era of discovery, leaving shoppers frustrated as they sift through extensive catalogs and manually apply filters. This results in quick abandonment and lost transactions, including an estimated annual global loss of $2 trillion.
We are excited to announce the general availability of Google Cloud’s Conversational Commerce agent, designed to engage shoppers in natural, human-like conversations that guide them from initial intent to a completed purchase. Companies like Albertsons Cos., which was a marquee collaborator on this product and is using the Conversational Commerce agent within its Ask AI tool, are already seeing an impact. Early results show customers using Ask AI often add one or more additional items to their cart, uncovering products they might not have found otherwise.
You can access Conversational Commerce agent today in the Vertex AI console.
Shoppers can ask complex questions in their own words and find exactly what they’re looking for through back-and-forth conversation that drives them to purchase.
Introducing the next generation of retail experiences
Go beyond traditional keyword search to deliver a personalized, streamlined shopping experience that drives revenue. Conversational Commerce agent integrates easily into your website and applications, guiding customers from discovery to purchase.
Conversational Commerce agent turns e-commerce challenges into opportunities through a more intuitive shopping experience:
Turn your search into a sales associate: Unlike generic chatbots, our agent is built to sell. Its intelligent intent classifier understands how your customers are shopping and tailors their experience. Just browsing? Guide them with personalized, conversational search that inspires them to find—and buy—items they wouldn’t have found otherwise. Know exactly what they want? The agent defaults to traditional search results for simple queries.
Drive revenue with natural conversation: Our agent leverages the power of Gemini to understand complex and ambiguous requests, suggest relevant products from your catalog, answer questions on product details, and even provide helpful details such as store hours.
Re-engage returning shoppers: The agent retains context across site interactions and devices. This allows returning customers to pick up exactly where they left off, creating a simplified journey that reduces friction and guides them back to their cart.
Safety and responsibility built-in: You have complete control to boost, bury, or restrict products and categories from conversations. There are also safety controls in place, ensuring all interactions are helpful and brand-appropriate.
Coming soon: Unlock new methods of discovery for your customers. Shoppers can soon search with images and video, locate in-store products, find store hours, and connect with customer support.
Albertsons Cos. is leading the way in AI-powered product discovery
Albertsons Cos. is redefining how customers discover, plan, and shop for groceries with Conversational Commerce agent. When Albertsons Cos. customers interacted with the Ask AI platform, more than 85% of conversations started with open-ended or exploratory questions, demonstrating the need for personalized guidance.
“At Albertsons Cos., we are focused on giving our customers the best experience possible for when and how they choose to shop,” said Jill Pavlovich, SVP, Digital Customer Experience for Albertsons Cos. “By collaborating with Google Cloud to bring Conversational Commerce agent to market, we are delivering a more personalized interaction to help make our customers’ lives easier. Now they can digitally shop across aisles, plan quick meal ideas, discover new products, and even get recommendations for unexpected items that pair well together.”
The Ask AI tool is accessible now via the search bar in all Albertsons Cos. banner apps, to help customers build smarter, faster baskets through simplified product discovery, personalized recommendations and a more intuitive shopping experience.
Get started
Conversational Commerce agent guides customers to purchase, is optimized for revenue-per-visitor, and is available 24/7. Built on Vertex AI, onboarding is quick and easy, requiring minimal development effort.
Amazon Bedrock AgentCore Gateway now supports AWS PrivateLink invocation and invocation logging through Amazon CloudWatch, Amazon S3 and Amazon Data Firehose. Amazon Bedrock AgentCore Gateway provides an easy and secure way for developers to build, deploy, discover, and connect to agent tools at scale. With the PrivateLink support and invocation logging, you can apply network and governance requirements to agents and tools through AgentCore Gateway.
The AWS PrivateLink support allows users and agents in a virtual private cloud (VPC) to access AgentCore Gateway without going through the public internet. With invocation logging, you gain visibility into each invocation and can dive deep into issues or audit activity.
Amazon Bedrock AgentCore is currently in preview and is available in US East (N. Virginia), US West (Oregon), Asia Pacific (Sydney), and Europe (Frankfurt). Learn more about the features in the AWS documentation. Learn more about Amazon Bedrock AgentCore and its services in the News Blog.