Amazon Athena announces single sign-on support for its JDBC and ODBC drivers through AWS IAM Identity Center’s trusted identity propagation. This makes it simpler for organizations to manage end users’ access to data when using third-party tools and to implement identity-based data governance policies with a seamless sign-on experience.
With this new capability, data teams can seamlessly access data through their preferred third-party tools using their organizational credentials. When analysts run queries using the updated Athena JDBC (3.6.0) and ODBC (2.0.5.0) drivers, their access permissions defined in Lake Formation are applied and their actions logged. This streamlined workflow eliminates credential management overhead while ensuring consistent security policies, allowing data teams to focus on insights rather than access management. For example, data analysts using third-party BI tools or SQL clients can now connect to Athena using their corporate credentials, and their access to data will be restricted based on policies defined for their respective user identity or group membership in Lake Formation.
This feature is available in regions where Amazon Athena and AWS IAM Identity Center’s trusted identity propagation are supported. To learn more about configuring identity support when using Athena drivers, see the Amazon Athena driver documentation.
AWS Cloud Development Kit (CDK) CLI now enables safe infrastructure refactoring through the new ‘cdk refactor’ command in preview. This feature allows developers to rename constructs, move resources between stacks, and reorganize CDK applications while preserving the state of deployed resources. By leveraging AWS CloudFormation’s refactor capabilities with automated mapping computation, CDK Refactor eliminates the risk of unintended resource replacement during code restructuring. Previously, maintaining infrastructure as code often required reorganizing resources and improving code structure, but these changes risked replacing existing resources due to logical ID changes. With the CDK Refactor feature, developers can confidently implement architectural improvements like breaking down monolithic stacks, introducing inheritance patterns, or upgrading to higher-level constructs without complex migration procedures or risking downtime of stateful resources. This allows teams to continuously evolve their infrastructure code while maintaining the stability of their production environments.
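As a rough sketch of what this workflow could look like (the preview command may require additional flags not shown here; consult the CDK documentation), suppose you have renamed a construct or moved it to another stack in your CDK code:

# Hedged sketch: after changing construct names or locations in code, let CDK
# compute the logical-ID mapping and apply it via CloudFormation's refactor
# capability, then deploy the remaining changes as usual.
cdk refactor   # the preview release may require extra flags; see the CDK docs
cdk deploy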
The AWS CDK Refactor feature is available in all AWS Regions where the AWS CDK is supported.
For more information and a walkthrough of the feature, check out the blog post and the documentation. You can read more about the AWS CDK here.
Our inference solution is based on AI Hypercomputer, a system built on our experience running models like Gemini and Veo 3, which serve over 980 trillion tokens a month to more than 450 million users. AI Hypercomputer services provide intelligent and optimized inferencing, including resource management, workload optimization and routing, and advanced storage for scale and performance, all co-designed to work together with industry leading GPU and TPU accelerators.
Today, GKE Inference Gateway is generally available, and we are launching new capabilities that deliver even more value. This underscores our commitment to helping companies deliver more intelligence, with increased performance and optimized costs for both training and serving.
Let’s take a look at the new capabilities we are announcing.
Efficient model serving and load balancing
A user’s experience of a generative AI application highly depends on both a fast initial response to a request and a smooth streaming of the response through to completion. With these new features, we’ve improved time-to-first-token (TTFT) and time-per-output-token (TPOT) on AI Hypercomputer. TTFT is based on the prefill phase, a compute-bound process where a full pass through the model creates a key-value (KV) cache. TPOT is based on the decode phase, a memory-bound process where tokens are generated using the KV cache from the prefill stage.
We improve both of these in a variety of ways. Generative AI applications like chatbots and code generation often reuse the same prefix in API calls. To optimize for this, GKE Inference Gateway now offers prefix-aware load balancing. This new, generally available feature improves TTFT latency by up to 96% at peak throughput for prefix-heavy workloads over other clouds by intelligently routing requests with the same prefix to the same accelerators, while balancing the load to prevent hotspots and latency spikes.
Consider a chatbot for a financial services company that helps users with account inquiries. A user starts a conversation to ask about a recent credit card transaction. Without prefix-aware routing, when the user asks follow up questions, such as the date of the charge or the confirmation number, the LLM has to re-read and re-process the entire initial query before it can answer the follow up question. The re-computation of the prefill phase is very inefficient and adds unnecessary latency, with the user experiencing delays between each question. With prefix-aware routing, the system intelligently reuses the data from the initial query by routing the request back to the same KV cache. This bypasses the prefill phase, allowing the model to answer almost instantly. Less computation also means fewer accelerators for the same workload, providing significant cost savings.
To further optimize inference performance, you can now also run disaggregated serving using AI Hypercomputer, which can improve throughput by 60%. Enhancements in GKE Inference Gateway, llm-d, and vLLM, work together to enable dynamic selection of prefill and decode nodes based on query size. This significantly improves both TTFT and TPOT by increasing the utilization of compute and memory resources at scale.
Take an example of an AI-based code completion application, which needs to provide low-latency responses to maintain interactivity. When a developer submits a completion request, the application must first process the input codebase; this is referred to as the prefill phase. Next, the application generates a code suggestion token by token; this is referred to as the decode phase. These tasks have dramatically different demands on accelerator resources — compute-intensive vs. memory-intensive processing. Running both phases on a single node results in neither being fully optimized, causing higher latency and poor response times. Disaggregated serving assigns these phases to separate nodes, allowing for independent scaling and optimization of each phase. For example, if your developers submit many requests based on large codebases, you can scale up the prefill nodes. This improves latency and throughput, making the entire system more efficient.
Just as prefix-aware routing optimizes the reuse of conversational context, and disaggregated serving enhances performance by intelligently separating the computational demands of model prefill and token decoding, we have also addressed the fundamental challenge of getting these massive models running in the first place. As generative AI models grow to hundreds of gigabytes in size, they can often take over ten minutes to load, leading to slow startup and scaling. To solve this, we now support the Run:ai model streamer with Google Cloud Storage and Anywhere Cache for vLLM, with support for SGLang coming soon. This enables 5.4 GiB/s of direct throughput to accelerator memory, reducing model load times by over 4.9x, resulting in a better end user experience.
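As a hedged illustration, a vLLM server can be launched with the Run:ai Model Streamer load format; the model path and tuning value below are illustrative assumptions, and in the GKE setup the weights would live in Cloud Storage (optionally accelerated by Anywhere Cache):

# Sketch only: serve a model with vLLM using the Run:ai Model Streamer loader.
# The path and concurrency value are assumptions; tune them for your environment.
vllm serve /models/Llama-3.3-70B-Instruct \
  --load-format runai_streamer \
  --model-loader-extra-config '{"concurrency": 16}'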
[Chart: vLLM model load time]
Get started faster with data-driven decisions
Finding the ideal technology stack for serving AI models is a significant industry challenge. Historically, customers have had to navigate rapidly evolving technologies, the switching costs that impact hardware choices, and hundreds of thousands of possible deployment architectures. This inherent complexity makes it difficult to quickly achieve the best price-performance for your inference environment.
The GKE Inference Quickstart, now generally available, can save you time, improve performance, and reduce costs when deploying AI workloads by suggesting the best accelerator, model server, and scaling configuration for your AI/ML inference applications. New improvements to GKE Inference Quickstart include cost insights and benchmarked performance best practices, so you can easily compare costs and understand latency profiles, saving you months of evaluation and qualification.
GKE Inference Quickstart’s recommendations are grounded in a living repository of model and accelerator performance data that we generate by benchmarking our GPU and TPU accelerators against leading large language models like Llama, Mixtral, and Gemma more than 100 times per week. This extensive performance data is then enriched with the same storage, network, and software optimizations that power AI inferencing on Google’s global-scale services like Gemini, Search, and YouTube.
Let’s say you’re tasked with deploying a new, public-facing chatbot. The goal is to provide fast, high-quality responses at the lowest cost. Until now, finding the most optimal and cost-effective solution for deploying AI models was a significant challenge. Developers and engineers had to rely on a painstaking process of trial and error. This involved manually benchmarking countless combinations of different models, accelerators, and serving architectures, with all the data logged into a spreadsheet to calculate the cost per query for each scenario. This manual, weeks-long, or even months-long, project was prone to human error and offered no guarantee that the best possible solution was ever found.
Using Google Colab and the built-in optimizations in the Google Cloud console, GKE Inference Quickstart lets you choose the most cost-effective accelerators for, say, serving a Llama 3-based chatbot application that needs a TTFT of less than 500ms. These recommendations are deployable manifests, making it easy to choose a technology stack that you can provision from GKE in your Google Cloud environment. With GKE Inference Quickstart, your evaluation and qualification effort has gone from months to days.
Views from the Google Colab that helps the engineer with their evaluation.
Try these new capabilities for yourself. To get started with GKE Inference Quickstart, from the Google Cloud console go to Kubernetes Engine > AI/ML and select “+ Deploy Models” near the top of the screen. Use the filter to select Optimized > Values = True. This shows all of the models that have price/performance optimizations to select from. Once you select a model, you’ll see a slider to select latency. The compatible accelerators in the drop-down change to match the latency you select, and the cost per million output tokens also updates based on your selections.
Then, via Google Colab, you can plot and view the price/performance of leading AI models on Google Cloud. Chatbot Arena ratings are integrated to help you determine the best model for your needs based on model size, rating, and price per million tokens. You can also bring your organization’s in-house quality measures into the Colab and join them with Google’s comprehensive benchmarks to make data-driven decisions.
Dedicated to optimizing inference
At Google Cloud, we are committed to helping companies deploy and improve their AI inference workloads at scale. Our focus is on providing a comprehensive platform that delivers unmatched performance and cost-efficiency for serving large language models and other generative AI applications. By leveraging a codesigned stack of industry-leading hardware and software innovations — including the AI Hypercomputer, GKE Inference Gateway, and purpose-built optimizations like prefix-aware routing, disaggregated serving, and model streaming — we ensure that businesses can deliver more intelligence with faster, more responsive user experiences and lower total cost of ownership. Our solutions are designed to address the unique challenges of inference, from model loading times to resource utilization, enabling you to deliver on the promise of generative AI. To learn more and get started, visit our AI Hypercomputer site.
As generative AI becomes more widespread, it’s important for developers and ML engineers to be able to easily configure infrastructure that supports efficient AI inference, i.e., using a trained AI model to make predictions or decisions based on new, unseen data. While traditional GPU-based serving architectures are great at training models, they struggle with the “multi-turn” nature of inference, characterized by back-and-forth conversations where the model must maintain context and understand user intent. Further, deploying large generative AI models can be both complex and resource-intensive.
At Google Cloud, we’re committed to providing customers with the best choices for their AI needs. That’s why we are excited to announce a new recipe for disaggregated inferencing with NVIDIA Dynamo, a high-performance, low-latency platform for a variety of AI models. Disaggregated inference separates out model processing phases, offering a significant leap in performance and cost-efficiency.
Specifically, this recipe makes it easy to deploy NVIDIA Dynamo on Google Cloud’s AI Hypercomputer, including Google Kubernetes Engine (GKE), vLLM inference engine, and A3 Ultra GPU-accelerated instances powered by NVIDIA H200 GPUs. By running the recipe on Google Cloud, you can achieve higher performance and greater inference efficiency while meeting your AI applications’ latency requirements. You can find this recipe, along with other resources, in our growing AI Hypercomputer resources repository on GitHub.
Let’s take a look at how to deploy it.
The two phases of inference
LLM inference is not a monolithic task; it’s a tale of two distinct computational phases. First is the prefill (or context) phase, where the input prompt is processed. Because this stage is compute-bound, it benefits from access to massive parallel processing power. Following prefill is the decode (or generation) phase, which generates a response, token by token, in an autoregressive loop. This stage is bound by memory bandwidth, requiring extremely fast access to the model’s weights and the KV cache.
In traditional architectures, these two phases run on the same GPU, creating resource contention. A long, compute-heavy prefill can block the rapid, iterative decode steps, leading to poor GPU utilization, higher inference costs, and increased latency for all users.
A specialized, disaggregated inference architecture
Our new solution tackles this challenge head-on by disaggregating, or physically separating, the prefill and decode stages across distinct, independently managed GPU pools.
Here’s how the components work in concert:
A3 Ultra instances and GKE: The recipe uses GKE to orchestrate separate node pools of A3 Ultra instances, powered by NVIDIA H200 GPUs. This creates specialized resource pools — one optimized for compute-heavy prefill tasks and another for memory-bound decode tasks.
NVIDIA Dynamo: Acting as the inference server, NVIDIA Dynamo’s modular front end and KV cache-aware router processes incoming requests. It then pairs GPUs from the prefill and decode GKE node pools and orchestrates workload execution between them, transferring the KV cache that’s generated in the prefill pool to the decode pool to begin token generation.
vLLM: Running on pods within each GKE pool, the vLLM inference engine helps ensure best-in-class performance for the actual computation, using innovations like PagedAttention to maximize throughput on each individual node.
This disaggregated approach allows each phase to scale independently based on real-time demand, helping to ensure that compute-intensive prompt processing doesn’t interfere with fast token generation. Dynamo supports popular inference engines including SGLang, TensorRT-LLM and vLLM. The result is a dramatic boost in overall throughput and maximized utilization of every GPU.
Experiment with Dynamo Recipes for Google Cloud
The reproducible recipe shows the steps to deploy disaggregated inference with NVIDIA Dynamo on the A3 Ultra (H200) VMs on Google Cloud using GKE for orchestration and vLLM as the inference engine. The single node recipe demonstrates disaggregated inference with one node of A3 Ultra using four GPUs for prefill and four GPUs for decode. The multi-node recipe demonstrates disaggregated inference with one node of A3 Ultra for prefill and one node of A3 Ultra for decode for the Llama-3.3-70B-Instruct Model.
Future recipes will provide support for additional NVIDIA GPUs (e.g. A4, A4X) and inference engines with expanded coverage of models.
The recipe highlights the following key steps:
Perform initial setup – This sets up environment variables and secrets; it needs to be done only once.
Install Dynamo Platform and CRDs – This sets up the various Dynamo Kubernetes components; it also needs to be done only once.
Deploy inference backend for a specific model workload – This deploys vLLM/SGLang as the inference backend for Dynamo disaggregated inference for a specific model workload. Repeat this step for every new model inference workload deployment.
Process inference requests – Once the model is deployed for inference, incoming queries are processed to provide responses to users.
Once the server is up, you will see the prefill and decode workers along with the frontend pod which acts as the primary interface to serve the requests.
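A minimal sketch of that check might look like the following; the pod and service names are illustrative placeholders, so use the names produced by your deployment:

# Confirm the frontend, prefill, and decode pods are running (names are illustrative).
kubectl get pods

# Forward the frontend service locally so test requests can target localhost:8000.
kubectl port-forward svc/dynamo-frontend 8000:8000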
We can verify if everything works as intended by sending a request to the server like this. The response is generated and truncated to max_tokens.
curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "what is the meaning of life ?"
      }
    ],
    "stream": false,
    "max_tokens": 30
  }' | jq -r '.choices[0].message.content'

Example response (truncated to max_tokens):

The question of the meaning of life is a complex and deeply philosophical one that has been debated by scholars, theologians, philosophers, and scientists for
Get started today
By moving beyond the constraints of traditional serving, the new disaggregated inference recipe represents the future of efficient, scalable LLM inference. It enables you to right-size resources for each specific task, unlocking new performance paradigms and significant cost savings for your most demanding generative AI applications. We are excited to see how you will leverage this recipe to build the next wave of AI-powered services. We encourage you to try out our Dynamo Disaggregated Inference Recipe which provides a starting point with recommended configurations and easy steps. We hope you have fun experimenting and share your feedback!
Amazon Interactive Video Service (Amazon IVS) now supports media ingest via interface VPC endpoints powered by AWS PrivateLink. With this launch, you can securely broadcast RTMP(S) streams to IVS Low-Latency channels or IVS Real-Time stages without sending traffic over the public internet. You can create interface VPC endpoints to privately connect your applications to Amazon IVS from within your VPC or from on-premises environments over AWS Direct Connect. This provides private, reliable connectivity for your live video workflows.
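As a hedged sketch, an interface endpoint can be created with the AWS CLI; the service name shown is an assumption, so list the exact IVS service name available in your Region first, and substitute your own resource IDs:

# Discover the IVS endpoint service name in your Region (assumption: it contains "ivs").
aws ec2 describe-vpc-endpoint-services --query "ServiceNames[?contains(@, 'ivs')]"

# Create the interface endpoint in your VPC (IDs below are placeholders).
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-west-2.ivs \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0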
Amazon IVS support for media ingest via interface VPC endpoints is available today in the US West (Oregon), Europe (Frankfurt), and Europe (Ireland) AWS Regions. Standard AWS PrivateLink pricing applies. See the AWS PrivateLink pricing page for details.
Today, AWS announced new capabilities for native anomaly detection in AWS IoT SiteWise. This release includes automated model retraining, flexible promotion modes, and exposed model metrics, all designed to enhance the anomaly detection feature.
The automated retraining capability allows models to be automatically retrained on a schedule ranging from a minimum of 30 days to a maximum of one year, eliminating the need to manually retrain models. This feature ensures that models stay up-to-date with changing equipment conditions or configurations, thereby maintaining optimal performance over time.
Additionally, flexible promotion modes give customers the choice between service-managed and customer-managed model promotion. Automatic promotion enables AWS IoT SiteWise to evaluate and promote the best-performing model without customer intervention, while manual promotion allows customers to review comprehensive, exposed model metrics—including precision, recall, and Area Under the ROC Curve (AUC)—before deciding which model version to activate. This flexibility lets customers choose between a hands-off approach and one with human oversight.
Multivariate anomaly detection is available in the US East (N. Virginia), Europe (Ireland), and Asia Pacific (Sydney) AWS Regions where AWS IoT SiteWise is offered. To learn more, read the launch blog and user guide.
Data centers are the engines of the cloud, processing and storing the information that powers our daily lives. As digital services grow, so do our data centers, and we are working to manage them responsibly. Google thinks of infrastructure at the full-stack level, not just as hardware but as hardware abstracted through software, allowing us to innovate.
We have previously shared how we’re working to reduce the embodied carbon impact at our data centers by optimizing our technical infrastructure hardware. In this post, we shine a spotlight on our “central fleet” program, which has helped us shift our internal resource management system from a machine economy to a more sustainable resource and performance economy.
What is Central Fleet?
At its core, our central fleet program is a resource distribution approach that allows us to manage and allocate computing resources, like processing power, memory, and storage in a more efficient and sustainable way. Instead of individual teams or product teams within Google ordering and managing their own physical machines, our central fleet acts as a centralized pool of resources that can be dynamically distributed to where they are needed most.
Think of it like a shared car service. Rather than each person owning a car they might only use for a couple of hours a day, a shared fleet allows for fewer cars to be used more efficiently by many people. Similarly, our central fleet program ensures our computing resources are constantly in use, minimizing waste and reducing the need to procure new machines.
How it works: A shift to a resource economy
The central fleet approach fundamentally changes how we provision and manage resources. When a team needs more computing power, instead of ordering specific hardware, they place an order for “quota” from the central fleet. This makes the computing resources fungible, that is, interchangeable and flexible. For instance, a team will ask for a certain amount of processing power or storage capacity, not a particular server model.
This “intent-based” ordering system provides flexibility in how demand is fulfilled. Our central fleet can intelligently fulfill requests from existing inventory or procure at scale, which can lower cost and environmental impact. It also facilitates the return of unneeded resources that can then be reallocated to other teams, further reducing waste.
All of this is made possible by our full-stack infrastructure, built on the Borg cluster management system, which abstracts away the physical hardware into a single, fungible resource pool. This software-level intelligence allows us to treat our infrastructure as a fluid, optimizable system rather than a collection of static machines, unlocking massive efficiency gains.
The sustainability benefits of central fleet
The central fleet approach aligns with Google’s broader dedication to sustainability and a circular economy. By optimizing the use of our existing hardware, we can achieve carbon savings. For example, in 2024, our central fleet program helped avoid procurement of new components and machines with an embodied impact equivalent to approximately 260,000 metric tons of CO2e. This roughly equates to avoiding 660 million miles driven by an average gasoline-powered passenger vehicle.1
This fulfillment flexibility leads to greater resource efficiency and a reduced carbon footprint in several ways:
Reduced electronic waste: By extending the life of our machines through reallocation and reuse, we minimize the need to manufacture new hardware and reduce the amount of electronic waste.
Lower embodied carbon: The manufacturing of new servers carries an embodied carbon footprint. By avoiding the creation of new machines, we avoid these associated CO2e emissions.
Increased energy efficiency: Central fleet allows for the strategic placement of workloads on the most power-efficient hardware available, optimizing energy consumption across our data centers.
Promotion of a circular economy: This model is a prime example of circular economy principles in action, shifting from a linear “take-make-dispose” model to one that emphasizes reuse and longevity.
The central fleet initiative is more than an internal efficiency project; it’s a tangible demonstration of embedding sustainability into our core business decisions. By rethinking how we manage our infrastructure, we can meet growing AI and cloud demand while simultaneously paving the way for a more sustainable future. Learn more at sustainability.google.
1. Estimated avoided emissions were calculated by applying internal LCA emissions factors to machines and component resources saved through our central fleet initiative in 2024. We input the estimated avoided emissions into the EPA’s Greenhouse Gas Equivalencies Calculator to calculate the equivalent number of miles driven by an average gasoline-powered passenger vehicle (accessed August 2025). The data and claims have not been verified by an independent third party.
Consumer search behavior is shifting, with users now entering longer, more complex questions into search bars in pursuit of more relevant results. For instance, instead of a simple “best kids snacks,” queries have evolved to “What are some nutritious snack options for a 7-year-old’s birthday party?”
However, many digital platforms have yet to adapt to this new era of discovery, leaving shoppers frustrated as they find themselves sifting through extensive catalogs and manually applying filters. This results in quick abandonment and lost transactions, including an estimated annual global loss of $2 trillion.
We are excited to announce the general availability of Google Cloud’s Conversational Commerce agent, designed to engage shoppers in natural, human-like conversations that guide them from initial intent to a completed purchase. Companies like Albertsons Cos., a marquee collaborator on this product that is using the Conversational Commerce agent within its Ask AI tool, are already seeing an impact. Early results show customers using Ask AI often add one or more additional items to their cart, uncovering products they might not have found otherwise.
You can access Conversational Commerce agent today in the Vertex AI console.
Shoppers can ask complex questions in their own words and find exactly what they’re looking for through back-and-forth conversation that drives them to purchase.
Introducing the next generation of retail experiences
Go beyond traditional keyword search to deliver a personalized and streamlined shopping experience to drive revenue. Conversational Commerce agent integrates easily into your website and applications, guiding customers from discovery to purchase.
Conversational Commerce agent turns e-commerce challenges into opportunities through a more intuitive shopping experience:
Turn your search into a sales associate: Unlike generic chatbots, our agent is built to sell. Its intelligent intent classifier understands how your customers are shopping and tailors their experience. Just browsing? Guide them with personalized, conversational search that inspires them to find—and buy—items they wouldn’t have found otherwise. Know exactly what they want? The agent defaults to traditional search results for simple queries.
Drive revenue with natural conversation: Our agent leverages the power of Gemini to understand complex and ambiguous requests, suggest relevant products from your catalog, answer questions on product details, and even provide helpful details such as store hours.
Re-engage returning shoppers: The agent retains context across site interactions and devices. This allows returning customers to pick up exactly where they left off, creating a simplified journey that reduces friction and guides them back to their cart.
Safety and responsibility built-in: You have complete control to boost, bury, or restrict products and categories from conversations. There are also safety controls in place, ensuring all interactions are helpful and brand-appropriate.
Coming soon: Unlock new methods of discovery for your customers. Shoppers can soon search with images and video, locate in-store products, find store hours, and connect with customer support.
Albertsons Cos. is leading the way in AI-powered product discovery
Albertsons Cos. is redefining how customers discover, plan, and shop for groceries with the Conversational Commerce agent. When Albertsons Cos. customers interacted with the Ask AI platform, more than 85% of conversations started with open-ended or exploratory questions, demonstrating the need for personalized guidance.
“At Albertsons Cos., we are focused on giving our customers the best experience possible for when and how they choose to shop,” said Jill Pavlovich, SVP, Digital Customer Experience for Albertsons Cos. “By collaborating with Google Cloud to bring Conversational Commerce agent to market, we are delivering a more personalized interaction to help make our customers’ lives easier. Now they can digitally shop across aisles, plan quick meal ideas, discover new products, and even get recommendations for unexpected items that pair well together.”
The Ask AI tool is accessible now via the search bar in all Albertsons Cos. banner apps, to help customers build smarter, faster baskets through simplified product discovery, personalized recommendations and a more intuitive shopping experience.
Get started
Conversational Commerce agent guides customers to purchase, is optimized for revenue-per-visitor, and is available 24/7. Built on Vertex AI, onboarding is quick and easy, requiring minimal development effort.
Amazon Bedrock AgentCore Gateway now supports AWS PrivateLink invocation and invocation logging through Amazon CloudWatch, Amazon S3 and Amazon Data Firehose. Amazon Bedrock AgentCore Gateway provides an easy and secure way for developers to build, deploy, discover, and connect to agent tools at scale. With the PrivateLink support and invocation logging, you can apply network and governance requirements to agents and tools through AgentCore Gateway.
The AWS PrivateLink support allows users and agents from a virtual private cloud (VPC) network to access AgentCore Gateway without going through the public internet. With invocation logging, you gain visibility into each invocation log and can deep dive into issues or audit activities.
Amazon Bedrock AgentCore is currently in preview and is available in US East (N. Virginia), US West (Oregon), Asia Pacific (Sydney), and Europe (Frankfurt). Learn more about the features from the AWS documentation. Learn more about Amazon Bedrock AgentCore and its services in the News Blog.
Find and fix security vulnerabilities. Deploy your app to the cloud. All without leaving your command line.
Today, we’re closing the gap between your terminal and the cloud with a first look at the future of Gemini CLI, delivered through two new extensions: the Security extension and the Cloud Run extension. These extensions are designed to handle critical parts of your workflows with simple, intuitive commands:
1) /security:analyze performs a comprehensive scan right in your local repository, with support for GitHub pull requests coming soon. This makes security a natural part of your development cycle.
2) /deploy deploys your application to Cloud Run, our fully managed serverless platform, in just a few minutes.
These commands are the first expression of a new extensibility framework for Gemini CLI. While we’ll be sharing more about the full Gemini CLI extension world soon, we couldn’t wait to get these capabilities into your hands. Consider this a sneak peek of what’s coming next!
Security extension: automate security analysis with /security:analyze
To help teams address software vulnerabilities early in the development lifecycle, we are launching the Gemini CLI Security extension. This new open-source tool automates security analysis, enabling you to proactively catch and fix issues using the /security:analyze command at the terminal or through an upcoming GitHub Actions integration.
Integrated directly into your local development workflow and CI/CD pipeline, this extension:
Analyzes code changes: When triggered, the extension automatically takes the git diff of your local changes or pull request.
Identifies vulnerabilities: Using a specialized prompt and tools, Gemini CLI analyzes the changes for a wide range of potential vulnerabilities, such as hardcoded secrets, injection vulnerabilities, broken access control, and insecure data handling.
Provides actionable feedback: Gemini returns a detailed, easy-to-understand report directly in your terminal or as a comment on your pull request. This report doesn’t just flag issues; it explains the potential risks and provides concrete suggestions for remediation, helping you fix issues quickly and learn as you go.
And after the report is generated, you can also ask Gemini CLI to save it to disk or even implement fixes for each issue.
Getting started with /security:analyze
Integrating security analysis into your workflow is simple. First, download the Gemini CLI and install the extension (requires Gemini CLI v0.4.0+):
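A likely form of the install command is shown below; the repository URL is an assumption, so confirm it against the Security extension’s published documentation:

gemini extensions install https://github.com/gemini-cli-extensions/security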
Locally: After making local changes, simply run /security:analyze in the Gemini CLI.
In CI/CD (Coming Soon): We’re bringing security analysis directly into your CI/CD workflow. Soon, you’ll be able to configure the GitHub Action to automatically review pull requests as they are opened.
This is just the beginning. The team is actively working on further enhancing the extension’s capabilities, and we are also inviting the community to contribute to this open source project by reporting bugs, suggesting features, continuously improving security practices and submitting code improvements.
Cloud Run extension: automate deployment with /deploy
The /deploy command in Gemini CLI automates the entire deployment pipeline for your web applications. You can now deploy a project directly from your local workspace. Once you issue the command, Gemini returns a public URL for your live application.
The /deploy command automates a full CI/CD pipeline to deploy web applications and cloud services from the command line using the Cloud Run MCP server. What used to be a multi-step process of building, containerizing, pushing, and configuring is now a single, intuitive command from within the Gemini CLI.
You can access this feature across three different surfaces – in Gemini CLI in the terminal, in VS Code via Gemini Code Assist agent mode, and in Gemini CLI in Cloud Shell.
Using the /deploy command in Gemini CLI at the terminal to deploy an application to Cloud Run
Get started with /deploy:
For existing Google Cloud users, getting started with /deploy is straightforward in Gemini CLI at the terminal:
Prerequisites: You’ll need the gcloud CLI installed and configured on your machine, and an existing app (or use Gemini CLI to create one).
Step 1: Install the Cloud Run extension. The /deploy command is enabled through a Model Context Protocol (MCP) server, which is included in the Cloud Run extension. To install the Cloud Run extension (requires Gemini CLI v0.4.0+), run this command:
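A plausible form of the command is shown below; the repository URL is an assumption, so confirm it against the Cloud Run extension’s documentation:

gemini extensions install https://github.com/GoogleCloudPlatform/cloud-run-mcp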
Step 3: Deploy your app. Navigate to your application’s root directory in your terminal and type gemini to launch Gemini CLI. Once inside, type /deploy to deploy your app to Cloud Run.
That’s it! In a few moments, Gemini CLI will return a public URL where you can access your newly deployed application. You can also visit the Google Cloud Console to see your new service running in Cloud Run.
Besides Gemini CLI at the terminal, this feature can also be accessed in VS Code via Gemini Code Assist agent mode, powered by Gemini CLI, and in Gemini CLI in Cloud Shell, where the authentication step will be automatically handled out of the box.
Using the /deploy command to deploy an application to Cloud Run in VS Code via Gemini Code Assist agent mode.
Building a robust extension ecosystem
The Security and Cloud Run extensions are two of the first extensions from Google built on our new framework, which is designed to create a rich and open ecosystem for the Gemini CLI. We are building a platform that will allow any developer to extend and customize the CLI’s capabilities, and this is just an early preview of the full platform’s potential. We will be sharing a more comprehensive look at our extensions platform soon, including how you can start building and sharing your own.
AWS HealthImaging now supports OAuth 2.0-compatible identity providers for authentication of DICOMweb requests using OpenID Connect (OIDC). With OIDC authentication, you can manage secure access to DICOM resources using your organization’s standard procedures for creating, enabling, and disabling user accounts.
With this launch, you can now use existing identity providers (IdPs)—such as Amazon Cognito, Okta, or Auth0—to issue JSON Web Tokens (JWTs) that authorize secure access to your DICOMweb endpoints. This launch makes it simpler to integrate AWS HealthImaging into existing medical imaging applications and expands HealthImaging’s support of DICOMweb standard interfaces that rely on OAuth 2.0-compatible authentication. Support for OIDC is limited to DICOMweb REST API requests. HealthImaging includes native support for AWS Identity and Access Management (IAM) users and roles for authentication of all API requests.
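As a hedged sketch, a DICOMweb QIDO-RS search could be authorized with an IdP-issued JWT as shown below; the endpoint host format and datastore ID are illustrative assumptions, so take the exact DICOMweb URL for your data store from the HealthImaging documentation:

# Placeholder values: substitute your Region, datastore ID, and a JWT from your IdP
# (for example Amazon Cognito, Okta, or Auth0).
TOKEN="<JWT issued by your OIDC identity provider>"
curl -s \
  -H "Authorization: Bearer ${TOKEN}" \
  "https://dicom-medical-imaging.us-east-1.amazonaws.com/datastore/<datastoreId>/studies?PatientName=DOE%5EJOHN"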
Support for OpenID Connect (OIDC) is available in all AWS Regions where AWS HealthImaging is generally available: US East (N. Virginia), US West (Oregon), Asia Pacific (Sydney), and Europe (Ireland).
With flow monitors in Amazon CloudWatch Network Monitoring, you can now monitor network performance of traffic flowing between AWS Regions across the AWS global network. Flow monitors provide near real-time visibility of network performance for workloads between compute instances such as Amazon EC2 and Amazon EKS, and AWS services such as Amazon S3 and Amazon DynamoDB. Flow monitors provide metrics to help you rapidly detect and attribute network-driven impairments for your workloads.
With this release, flow monitors now help you to assess whether network performance issues on the AWS global network between a local and a remote Region are impacting your workloads. Because the flow monitor’s network health indicator (NHI) now also captures the health of the AWS global network on your workload’s network paths between Regions, you can quickly identify whether impairments in a local Region, in the AWS global network, or in the remote Region are affecting your workloads. This feature extends network visibility for flows to a remote Region’s public IP address, and for private traffic flowing to a remote Region over Amazon VPC peering or AWS Transit Gateway peering.
For the full list of the AWS Regions where Network Monitoring for AWS workloads is available, visit the Regions list. To learn more, visit the Amazon CloudWatch Network Monitoring documentation.
AWS is announcing the general availability of Amazon EC2 Storage Optimized I8g instances in the US East (Ohio) Region. I8g instances offer the best performance in Amazon EC2 for storage-intensive workloads. I8g instances are powered by AWS Graviton4 processors that deliver up to 60% better compute performance compared to previous-generation I4g instances. I8g instances use the latest third-generation AWS Nitro SSDs, local NVMe storage that delivers up to 65% better real-time storage performance per TB while offering up to 50% lower storage I/O latency and up to 60% lower storage I/O latency variability. These instances are built on the AWS Nitro System, which offloads CPU virtualization, storage, and networking functions to dedicated hardware and software, enhancing the performance and security of your workloads.
Amazon EC2 I8g instances are designed for I/O-intensive workloads that require rapid data access and real-time latency from storage. These instances excel at handling transactional, real-time, distributed databases, including MySQL, PostgreSQL, HBase, and NoSQL solutions like Aerospike, MongoDB, ClickHouse, and Apache Druid. They’re also optimized for real-time analytics platforms such as Apache Spark, data lakehouses, and AI/LLM pre-processing for training. I8g instances are available in 10 different sizes, up to 48xlarge (including one metal size), with up to 1.5 TiB of memory and 45 TB of local instance storage. They deliver up to 100 Gbps of network performance bandwidth and 60 Gbps of dedicated bandwidth for Amazon Elastic Block Store (EBS).
AWS Backup now lets you choose whether to include Access Control Lists (ACLs) and ObjectTags when backing up your Amazon S3 buckets.
Previously, AWS Backup included these metadata components for all objects by default. This new capability lets you customize your backup approach based on your recovery needs, so you can include only the metadata you need.
This capability is available in all AWS Regions where AWS Backup for Amazon S3 is available. For pricing and regional availability information, see the AWS Backup pricing page.
AWS Elastic Beanstalk now supports dual-stack configuration for both Application Load Balancers (ALB) and Network Load Balancers (NLB), allowing environments to serve both IPv4 and IPv6 protocols. You can now set the IpAddressType option to “dualstack,” and Elastic Beanstalk will automatically configure your load balancer with dual-stack support, creating both A and AAAA DNS records. You can seamlessly update existing IPv4 environments to dual-stack or revert back as needed.
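A hedged sketch of enabling this on an existing environment with the AWS CLI follows; the option namespace shown is an assumption, so verify it against the Elastic Beanstalk option settings reference for your load balancer type:

# Environment name and namespace below are illustrative.
aws elasticbeanstalk update-environment \
  --environment-name my-env \
  --option-settings Namespace=aws:elbv2:loadbalancer,OptionName=IpAddressType,Value=dualstack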
This capability helps you reach users on IPv6-only networks while maintaining full IPv4 compatibility, supporting global accessibility requirements and IPv6 adoption mandates. The feature automatically handles DNS record management, simplifying IPv6 deployment for your applications and ensuring optimal performance for all users.
This feature is available in all AWS regions that support Elastic Beanstalk and Application and Network Load Balancers.
Amazon Managed Service for Prometheus is now available in the AWS GovCloud (US) Regions. Amazon Managed Service for Prometheus is a fully managed Prometheus-compatible monitoring service that makes it easy to monitor and alarm on operational metrics at scale.
The list of all supported regions where Amazon Managed Service for Prometheus is generally available can be found in the user guide. Customers can send up to 1 billion active metrics to a single workspace and can create multiple workspaces per account, where a workspace is a logical space dedicated to the storage and querying of Prometheus metrics.
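As a minimal sketch, a workspace can be created in a GovCloud Region with the AWS CLI; the alias, Region, and workspace ID below are illustrative placeholders:

# Create a workspace, then look up its ingestion and query details.
aws amp create-workspace --alias demo-workspace --region us-gov-west-1
aws amp describe-workspace --workspace-id ws-EXAMPLE --region us-gov-west-1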
To learn more about Amazon Managed Service for Prometheus, visit the product page.
AWS Fault Injection Service (FIS) is a fully managed service for running controlled fault injection experiments to improve application performance, observability, and resilience. Customers can test how their applications and people respond to real-world scenarios, including AZ Availability: Power Interruption and Cross-Region: Connectivity. Customers can create experiment templates in FIS to integrate experiments with continuous integration and release testing. Customers can also generate detailed reports of their FIS experiments and store them in Amazon S3, enabling them to audit and demonstrate compliance with both organizational and regulatory resilience testing requirements.
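As a hedged example, an experiment defined by a template can be kicked off from a release pipeline with the AWS CLI; the template ID below is a placeholder:

# List existing experiment templates, then start an experiment from one of them.
aws fis list-experiment-templates
aws fis start-experiment --experiment-template-id EXT123456789012ABC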
With this launch, FIS expands to 24 regions, including: US East (Ohio and N. Virginia), US West (N. California, Oregon), Europe (Spain, Stockholm, Paris, Frankfurt, Ireland, London and Milan), Asia Pacific (Hong Kong, Mumbai, Seoul, Singapore, Sydney and Tokyo), Middle East (Bahrain), Canada (Central), South America (São Paulo), Africa (Cape Town), AWS GovCloud (US-East, US-West), and now Europe (Zurich).
Starting today, Amazon Elastic Compute Cloud (Amazon EC2) C6in instances are available in the AWS Asia Pacific (Thailand) Region. These sixth-generation network-optimized instances, powered by 3rd Generation Intel Xeon Scalable processors and built on the AWS Nitro System, deliver up to 200 Gbps of network bandwidth, 2x more than comparable fifth-generation instances.
Customers can use C6in instances to scale the performance of applications such as network virtual appliances (firewalls, virtual routers, load balancers), Telco 5G User Plane Function (UPF), data analytics, high performance computing (HPC), and CPU-based AI/ML workloads. C6in instances are available in 10 different sizes with up to 128 vCPUs, including a bare metal size. These sixth-generation x86-based network-optimized instances deliver up to 100 Gbps of Amazon Elastic Block Store (Amazon EBS) bandwidth and up to 400K IOPS. C6in instances offer Elastic Fabric Adapter (EFA) networking support on the 32xlarge and metal sizes.
C6in instances are available in these AWS Regions: US East (Ohio, N. Virginia), US West (N. California, Oregon), Europe (Frankfurt, Ireland, London, Milan, Paris, Spain, Stockholm, Zurich), Middle East (Bahrain, UAE), Israel (Tel Aviv), Asia Pacific (Hong Kong, Hyderabad, Jakarta, Malaysia, Melbourne, Mumbai, Osaka, Seoul, Singapore, Sydney, Tokyo, Thailand), Africa (Cape Town), South America (Sao Paulo), Canada (Central), Canada West (Calgary), and AWS GovCloud (US-West, US-East). To learn more, see the Amazon EC2 C6in instances. To get started, see the AWS Management Console, AWS Command Line Interface (AWS CLI), and AWS SDKs.
At Google Cloud, our services are built with interoperability and openness in mind to enable customer choice and multicloud strategies. We pioneered a multicloud data warehouse, enabling workloads to run across clouds. We were the first company to provide digital sovereignty solutions for European governments and to waive exit fees for customers who stop using Google Cloud.
We continue this open approach with the launch today of our new Data Transfer Essentials service for customers in the European Union and the United Kingdom. Built in response to the principles of cloud interoperability and choice outlined in the EU Data Act, Data Transfer Essentials is a new, simple solution for data transfers between Google Cloud and other cloud service providers. Although the Act allows cloud providers to pass through costs to customers, Data Transfer Essentials is available today at no cost to customers.
Designed for “in-parallel” processing of workloads belonging to the same organization that are distributed across two or more cloud providers, Data Transfer Essentials enables you to build flexible, multicloud strategies and use the best-of-breed solutions across different cloud providers. This can foster greater digital operational resilience – without incurring outbound data transfer costs from Google Cloud.
To get started, please read our configuration guide to learn how to opt in and specify your multicloud traffic. Qualifying multicloud traffic will be metered separately, and will appear on your bill at a zero charge, while all other traffic will continue to be billed at existing Network Service Tier rates.
The original promise of the cloud is one that is open, elastic, and free from artificial lock-ins. Google Cloud continues to embrace this openness and the ability for customers to choose the cloud service provider that works best for their workload needs. Read more about Data Transfer Essentials here.
Amazon Bedrock now supports synchronous inference for TwelveLabs’ Marengo 2.7, expanding the capabilities of this multimodal embedding model to deliver low-latency text and image embeddings directly within the API response. This update enables developers to build more responsive, interactive search and retrieval experiences while maintaining the same powerful video understanding capabilities that have made Marengo 2.7 a breakthrough in multimodal AI.
Since its introduction to Amazon Bedrock earlier this year, Marengo 2.7 has transformed how organizations work with video content through asynchronous inference—ideal for processing large video, audio, and image files. The model generates sophisticated multi-vector embeddings, enabling precise temporal and semantic retrieval across long-form content. Now with synchronous inference support, users can leverage these advanced embedding capabilities for text and image inputs with significantly reduced latency. This makes it perfect for applications such as instant video search where users find specific scenes using natural language queries, or interactive product discovery through image similarity search. For generating embeddings from video, audio, and large-scale image files, continue using asynchronous inference for optimal performance.
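As a hedged sketch, a synchronous text-embedding call might look like the following with the AWS CLI; the model ID and request-body fields are assumptions, so check the Bedrock model catalog and the TwelveLabs Marengo request schema before using them:

# All identifiers and body fields below are illustrative assumptions.
aws bedrock-runtime invoke-model \
  --model-id twelvelabs.marengo-embed-2-7-v1:0 \
  --content-type application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{"inputType": "text", "inputText": "red trail-running shoes"}' \
  marengo-embedding.json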
Marengo 2.7 with synchronous inference is now available in Amazon Bedrock in US East (N. Virginia), Europe (Ireland), and Asia Pacific (Seoul). To get started, visit the Amazon Bedrock console and request model access. To learn more, read the blog, product page, Amazon Bedrock pricing, and documentation.