Starting today, Amazon EC2 High Memory U7i instances with 6TB of memory (u7i-6tb.112xlarge) are available in the Europe (London) region. U7i-6tb instances are part of the AWS 7th generation of instances and are powered by custom fourth-generation Intel Xeon Scalable processors (Sapphire Rapids). U7i-6tb instances offer 6TB of DDR5 memory, enabling customers to scale transaction processing throughput in fast-growing data environments.
U7i-6tb instances offer 448 vCPUs, support up to 100 Gbps of Amazon Elastic Block Store (EBS) bandwidth for faster data loading and backups, deliver up to 100 Gbps of network bandwidth, and support ENA Express. U7i instances are ideal for customers running mission-critical in-memory databases like SAP HANA, Oracle, and SQL Server.
Amazon CloudWatch Database Insights expands the availability of its on-demand analysis experience to the RDS for SQL Server database engine. CloudWatch Database Insights is a monitoring and diagnostics solution that helps database administrators and developers optimize database performance by providing comprehensive visibility into database metrics, query analysis, and resource utilization patterns. This feature leverages machine learning models to help identify performance bottlenecks during the selected time period and provides recommendations on next steps.
Previously, database administrators had to manually analyze performance data, correlate metrics, and investigate root causes, a process that is time-consuming and requires deep database expertise. With this launch, you can now analyze database performance monitoring data for any time period with automated intelligence. The feature automatically compares your selected time period against normal baseline performance, identifies anomalies, and provides specific remediation advice. Through intuitive visualizations and clear explanations, you can quickly identify performance issues and receive step-by-step guidance for resolution. This automated analysis and recommendation system reduces mean time to diagnosis from hours to minutes.
You can get started with this feature by enabling the Advanced mode of CloudWatch Database Insights on your RDS for SQL Server databases using the RDS service console, AWS APIs, the AWS SDK, or AWS CloudFormation. Please refer to RDS documentation and Aurora documentation for information regarding the availability of Database Insights across different regions, engines and instance classes.
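As a rough sketch of the programmatic path, the snippet below uses the AWS SDK for Python (boto3) and assumes the DatabaseInsightsMode parameter on ModifyDBInstance; the instance identifier and region are placeholders, and Advanced mode also requires Performance Insights with long-term retention.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")  # placeholder region

# Switch an existing RDS for SQL Server instance to the Advanced mode of
# CloudWatch Database Insights (assumes the DatabaseInsightsMode parameter).
rds.modify_db_instance(
    DBInstanceIdentifier="my-sqlserver-instance",    # hypothetical instance name
    DatabaseInsightsMode="advanced",
    EnablePerformanceInsights=True,
    PerformanceInsightsRetentionPeriod=465,          # long-term retention, in days
    ApplyImmediately=True,
)
```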
Amazon Connect can now automatically initiate follow-up evaluations to analyze specific situations identified during initial evaluations. For example, when an initial customer service evaluation detects customer interest in a product, Amazon Connect can automatically trigger a follow-up evaluation focused on the agent’s sales performance. This enables managers to maintain consistent evaluation standards across agent cohorts and over time, while capturing deeper insights on specific scenarios such as sales opportunities, escalations, and other critical interaction moments.
This feature is available in all regions where Amazon Connect is offered. To learn more, please visit our documentation and our webpage.
Amazon Bedrock Data Automation (BDA) now supports AVI, MKV, and WEBM file formats along with the AV1 and MPEG-4 Visual (Part 2) codecs, enabling you to generate structured insights across a broader range of video content. Additionally, BDA delivers up to 50% faster image processing.
BDA automates the generation of insights from unstructured multimodal content such as documents, images, audio, and videos for your GenAI-powered applications. With support for AVI, MKV, and WEBM formats, you can now analyze content from archival footage, high-quality video archives with multiple audio tracks and subtitles, and web-based and open-source video content. This expanded video format and codec support enables you to process video content directly in the formats your organization uses, streamlining your workflows and accelerating time-to-insight. With faster image processing on BDA, you can extract insights from visual content faster than ever before. You can now analyze larger volumes of images in less time, helping you scale your AI applications and deliver value to your customers more quickly.
Amazon Bedrock Data Automation is available in 8 AWS Regions: Europe (Frankfurt), Europe (London), Europe (Ireland), Asia Pacific (Mumbai), Asia Pacific (Sydney), US West (Oregon), US East (N. Virginia), and AWS GovCloud (US-West).
Amazon Nova models now support the customization of content moderation settings for approved business use cases that require processing or generating sensitive content.
Organizations with approved business use cases can adjust content moderation settings across four domains: safety, sensitive content, fairness, and security. These controls allow customers to adjust the specific settings relevant to their business requirements. Amazon Nova enforces essential, non-configurable controls to ensure responsible use of AI, such as controls to prevent harm to children and preserve privacy.
Customization of content moderation settings is available for Amazon Nova Lite and Amazon Nova Pro in the US East (N. Virginia) region.
To learn more about Amazon Nova, visit the Amazon Nova product page. To learn about responsible use of AI with Amazon Nova, visit the AWS AI Service Cards or see the User Guide. To see if your business model is appropriate for customizing content moderation settings, contact your AWS Account Manager.
Amazon Elastic Container Service (Amazon ECS) now supports AWS CloudTrail data events, providing detailed visibility into Amazon ECS Agent API activities. This new capability enables customers to monitor, audit, and troubleshoot container instance operations.
With CloudTrail data event support, security and operations teams can now maintain comprehensive audit trails of ECS Agent API activities, detect unusual access patterns, and troubleshoot agent communication issues more effectively. Customers can opt in to receive detailed logging through the new data event resource type AWS::ECS::ContainerInstance for ECS agent activities, including when the ECS agent polls for work (ecs:Poll), starts telemetry sessions (ecs:StartTelemetrySession), and submits ECS Managed Instances logs (ecs:PutSystemLogEvents). This enhanced visibility enables teams to better understand how container instance roles are utilized, meet compliance requirements for API activity monitoring, and quickly diagnose operational issues related to agent communications.
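As an illustrative sketch, the opt-in can be expressed as an advanced event selector on an existing trail; the example below uses boto3 with a hypothetical trail name.

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Opt in to ECS Agent API data events by adding an advanced event selector
# for the new AWS::ECS::ContainerInstance resource type to an existing trail.
cloudtrail.put_event_selectors(
    TrailName="my-trail",  # hypothetical trail name
    AdvancedEventSelectors=[
        {
            "Name": "ECS container instance data events",
            "FieldSelectors": [
                {"Field": "eventCategory", "Equals": ["Data"]},
                {"Field": "resources.type", "Equals": ["AWS::ECS::ContainerInstance"]},
            ],
        }
    ],
)
```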
This new feature is available for Amazon ECS on EC2 in all AWS Regions and ECS Managed Instances in select regions. Standard CloudTrail data event charges apply. To learn more, visit the Developer Guide.
AI Agents are now a reality, moving beyond chatbots to understand intent, collaborate, and execute complex workflows. This leads to increased efficiency, lower costs, and improved customer and employee experiences. This is a key opportunity for System Integrator (SI) Partners to deliver Google Cloud’s advanced AI to more customers. This post details how to build, scale, and manage enterprise-grade agentic systems using Google Cloud AI products to enable SI Partners to offer these transformative solutions to enterprise clients.
Enterprise challenges
The limitations of traditional, rule-based automation are becoming increasingly apparent in the face of today’s complex business challenges. Its inherent rigidity often leads to protracted approval processes, outdated risk models, and a critical lack of agility, thereby impeding the ability to seize new opportunities and respond effectively to operational demands.
These challenges are further compounded by the fragmented IT landscapes of modern enterprises, characterized by legacy systems and siloed data, which collectively hinder seamless integration and scalable growth. Furthermore, static systems are ill-equipped to adapt instantaneously to market volatility or unforeseen “black swan” events. They also fall short in delivering the personalization and operational optimization required to manage escalating complexity—such as in cybersecurity and resource allocation—at scale. In this dynamic environment, AI agents offer the necessary paradigm shift to overcome these persistent limitations.
How SI Partners are solving business challenges with AI agents
Let’s discuss how SIs are working with Google Cloud to solve some of these business challenges:
Deloitte: A major retail client sought to enhance inventory accuracy and streamline reconciliation across its diverse store locations. The client needed various users—Merchants, Supply Chain, Marketing, and Inventory Controls—to interact with inventory data through natural language prompts. This interaction would enable them to check inventory levels, detect anomalies, research reconciliation data, and execute automated actions.
Deloitte leveraged Google Cloud AI Agents and Gemini Enterprise to create a solution that generates insights, identifies discrepancies, and offers actionable recommendations based on inventory data. This solution utilizes Agentic AI to integrate disparate data sources and deliver real-time recommendations, ultimately aiming to foster trust and confidence in the underlying inventory data.
Quantiphi: To improve customer experience and optimize sales operations, a furniture manufacturer partnered with Quantiphi to deploy generative AI and create a dynamic, intelligent assistant on Google Cloud. The multi-agent system automates quotation response creation, significantly accelerating the process. At its core is an orchestrator, built with the Agent Development Kit (ADK) and an Agent-to-Agent (A2A) framework, that seamlessly coordinates between agents to deliver the right response – whether you’re researching market trends, asking about product details, or analyzing sales data. Leveraging the cutting-edge capabilities of Google Cloud’s Gemini models and BigQuery, the assistant delivers unparalleled insights, transforming how users access data and make decisions.
These examples represent just a fraction of the numerous use cases spanning diverse industry verticals, including healthcare, manufacturing, and financial services, that are being deployed in the field by SIs working in close collaboration with Google Cloud.
Architecture and design patterns used by SIs
The strong partnership between Google Cloud and SIs is instrumental in delivering true business value to customers. Let’s examine the scalable architecture patterns employed by Google Cloud SIs in the field to tackle Agentic AI challenges.
To comprehend Agentic AI architectures, it’s crucial to first understand what an AI agent is. An AI agent is a software entity endowed with the capacity to plan, reason, and execute complex actions for users with minimal human intervention. AI agents leverage advanced AI models for reasoning and informed decision-making, while utilizing tools to fetch data from external sources for real-time and grounded information. Agents typically operate within a compute runtime. The diagram below illustrates the basic components of an agent:
Base AI Agent Components
The snippet below also demonstrates how an Agent’s code appears in the Python programming language:
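The original post shows this snippet as an image; the minimal sketch below reconstructs the idea with the ADK's Python Agent class, a hypothetical inventory-lookup tool, and an illustrative model name.

```python
from google.adk.agents import Agent

def get_inventory_level(sku: str) -> dict:
    """Hypothetical tool: look up current inventory for a product SKU."""
    # In a real agent this would call an inventory API or database.
    return {"sku": sku, "units_on_hand": 1280}

inventory_agent = Agent(
    name="inventory_agent",                                    # Name
    model="gemini-2.5-flash",                                  # LLM (illustrative)
    description="Answers questions about inventory levels.",   # Description
    instruction=(
        "Use the available tools to answer inventory questions "
        "concisely and cite the SKU you looked up."            # Instruction
    ),
    tools=[get_inventory_level],                               # Tools
)
```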
Code snippet of an AI Agent
This agent code snippet showcases the components depicted in the first diagram, where we observe the Agent with a Name, Large Language Model (LLM), Description, Instruction and Tools, all of which are utilized to enable the agent to perform its designated functions.
To build enterprise-grade agents at scale, several factors must be considered during their ground-up development. Google Cloud has collaborated closely with its Partner ecosystem to employ cutting-edge Google Cloud products to build scalable and enterprise-ready agents.
A key consideration in agent development is the framework. Without it, developers would be compelled to build everything from scratch, including state management, tool handling, and workflow orchestration. This often results in systems that are complex, difficult to debug, insecure, and ultimately unscalable. Google Cloud Agent Development Kit (ADK) provides essential scaffolding, tools, and patterns for efficient and secure enterprise agent development at scale. It offers developers the flexibility to customize agents to suit nearly every applicable use case.
Agent development with any framework, especially multi-agent architectures in enterprises, necessitates robust compute resources and scalable infrastructure. This includes strong security measures, comprehensive tracing, logging, and monitoring capabilities, as well as rigorous evaluation of the agent’s decisions and output.
Furthermore, agents typically lack inherent memory, meaning they cannot recall past interactions or maintain context for effective operation. While frameworks like ADK offer ephemeral memory storage for agents, enterprise-grade agents demand persistent memory. This persistent memory is vital for equipping agents with the necessary context to enhance their performance and the quality of their output.
Google Cloud’s Vertex AI Agent Engine provides a secure runtime for agents that manages their lifecycle, orchestrates tools, and drives reasoning. It features built-in security, observability, and critical building blocks such as a memory bank, session service, and sandbox. Agent Engine is accessible to SIs and customers on Google Cloud. Alternative options for running agents at scale include Cloud Run or GKE.
Customers often opt for these alternatives when they already have existing investments in Cloud Run or GKE infrastructure on Google Cloud, or when they require configuration flexibility concerning compute, storage, and networking, as well as flexible cost management. However, when choosing Cloud Run or GKE, functions like memory and session management must be built and managed from the ground up.
Model Context Protocol (MCP) is a crucial element for modern AI agent architectures. This open protocol standardizes how applications provide context to LLMs, thereby improving agent responses by connecting agents and underlying AI models to various data sources and tools. It’s important to note that Agents also communicate with enterprise systems using APIs, which are referred to as Tools when employed with agents. MCP enables agents to access fresh external data.
When developing enterprise agents at scale, it is recommended to deploy the MCP servers separately on a serverless platform like Cloud Run or GKE on Google Cloud, with agents running on Agent Engine configured as clients. The sample architecture illustrates the recommended deployment model for MCP integration with ADK agents;
AI agent tool integration with MCP
The reference architecture demonstrates how ADK-built agents can integrate with MCP to connect data sources and provide context to underlying LLM models. The MCP utilizes Get, Invoke, List, and Call functions to enable tools to connect agents to external data sources. In this scenario, the agent can interact with a Graph database through application APIs using MCP, allowing the agent and the underlying LLM to access up-to-date data for generating meaningful responses.
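As a rough sketch of this pattern (not the exact code behind the architecture), an ADK agent can attach a remote MCP server as a toolset; the toolset class and connection-parameter names vary across ADK releases, and the Cloud Run URL below is hypothetical.

```python
from google.adk.agents import Agent
# Import paths and connection-parameter names differ between ADK releases;
# this assumes the SSE-based MCP toolset exposed by the Python ADK.
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, SseServerParams

MCP_SERVER_URL = "https://graph-mcp-server-abc123.a.run.app/sse"  # hypothetical Cloud Run endpoint

graph_agent = Agent(
    name="graph_data_agent",
    model="gemini-2.5-flash",
    instruction=(
        "Answer questions using the graph-database tools exposed over MCP; "
        "list the available tools first if you are unsure which to call."
    ),
    tools=[
        # The toolset discovers (List) and invokes (Call) the remote MCP
        # server's tools on the agent's behalf.
        MCPToolset(connection_params=SseServerParams(url=MCP_SERVER_URL)),
    ],
)
```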
Furthermore, when building multi-agent architectures that demand interoperability and communication among agents from different systems, a key consideration is how to facilitate Agent-to-Agent communication. This addresses complex use cases that require workflow execution across various agents from different domains.
Google Cloud launched the Agent-to-Agent Protocol (A2A) with native support within Agent Engine to tackle the challenge of inter-agent communication at scale. Learn how to implement A2A from this blog.
Google Cloud has collaborated with SIs on agentic architecture and design considerations to build multiple agents, assisting clients in addressing various use cases across industry domains such as Retail, Manufacturing, Healthcare, Automotive, and Financial Services. The reference architecture below consolidates these considerations.
Reference architecture – Agentic AI system with ADK, MCP, A2A and Agent Engine
This reference architecture depicts an enterprise-grade Agent built on Google Cloud to address a supply chain use case. In this architecture, all agents are built with the ADK framework and deployed on Agent Engine. Agent Engine provides a secure compute runtime with authentication, context management using managed sessions and memory, and quality assurance through Example Store and Evaluation Services, while also offering observability into the deployed agents. Agent Engine delivers all these features and many more as a managed service at scale on GCP.
This architecture outlines an Agentic supply chain featuring an orchestration agent (Root) and three dedicated sub-agents: Tracking, Distributor, and Order Agents. Each of these agents is powered by Gemini. For optimal performance and tailored responses, especially in specific use cases, we recommend tuning your model with domain-specific data before integrating it with an agent. Model tuning can also help optimize responses for conciseness, potentially leading to reduced token size and lower operational costs.
For instance, a user might send a request such as “show me the inventory levels for men’s backpack.” The Root agent receives this request and is capable of routing it to the Order agent, which is responsible for inventory and order operations. This routing is seamless because the A2A protocol utilizes agent cards to advertise the capabilities of each respective agent. A2A is configured with a few steps as a wrapper for your agents for Agent Engine deployment.
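To make the routing concrete, here is a trimmed, hypothetical agent card for the Order agent; the field names follow the public A2A specification, while the values and endpoint are illustrative only. The Root agent reads cards like this one to decide which sub-agent should receive a given request.

```python
# Hypothetical A2A agent card advertising the Order agent's capabilities.
order_agent_card = {
    "name": "order_agent",
    "description": "Handles inventory lookups and replenishment orders.",
    "url": "https://agent-engine.example.com/order-agent",  # illustrative endpoint
    "version": "1.0.0",
    "capabilities": {"streaming": True},
    "skills": [
        {
            "id": "check_inventory",
            "name": "Check inventory",
            "description": "Returns current stock levels for a product.",
        },
        {
            "id": "place_order",
            "name": "Place replenishment order",
            "description": "Creates a purchase order with a supplier.",
        },
    ],
}
```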
In this example, inventory and order details are stored in BigQuery. Therefore, the agent uses its tool configuration to leverage the MCP server to fetch the inventory details from the BigQuery data warehouse. The response is then returned to the underlying LLM, which generates a formatted natural language response and provides the inventory details for men’s backpacks to the Root agent and subsequently to the user. Based on this response, the user can, for example, place an order to replenish the inventory.
When such a request is made, the Root agent routes it to the Distributor agent. This agent possesses knowledge of all suppliers who provide stock to the business. Depending on the item being requested, the agent will use its tools to initiate an MCP server connection to the correct external API endpoints for the respective supplier to place the order. If the suppliers have agents configured, the A2A protocol can also be utilized to send the request to the supplier’s agent for processing. Any acknowledgment of the order is then sent back to the Distributor agent.
In this reference architecture, when the Distributor agent receives acknowledgment, A2A enables the agent to detect the presence of a Tracking agent that monitors new orders until delivery. The Distributor agent will pass the order details to the Tracking agent and also send updates back to the user. The Tracking agent will then send order updates to the user via messaging, utilizing the public API endpoint of the supplier. This is merely one example of a workflow that could be built with this reference architecture.
This modular architecture can be adapted to solve various use cases with Agentic AI built with ADK and deployed to Agent Engine.
The reference architecture allows this multi-agent system to be consumed via a chat interface through a website or a custom-built user interface. It is also possible to integrate this agentic AI architecture with Google Cloud Gemini Enterprise.
Learn how enterprises can start by using Gemini Enterprise as the front door to Google Cloud AI from this blog from Alphabet CEO Sundar Pichai. This approach helps enterprises start small with low-code, out-of-the-box agents. As they mature, they can implement complex use cases with advanced, high-code AI agents using this reference architecture.
Getting started
This blog post has explored the design patterns for building intelligent enterprise AI agents. For enterprise decision makers, use the 5 essential elements to guide your strategy and decision-making as you start implementing agentic solutions and running enterprise agents at scale.
We encourage you to embark on this journey today by collaborating with Google Cloud Partner Ecosystem to understand your enterprise landscape and identify complex use cases that can be effectively addressed with AI Agents. Utilize these design patterns as your guide and leverage the ADK to transform your enterprise use case into a powerful, scalable solution that delivers tangible business value on Agent Engine with Google Cloud.
Google Cloud Dataproc is a managed service for Apache Spark and Hadoop, providing a fast, easy-to-use, and cost-effective platform for big data analytics. In June, we announced the general availability (GA) of the Dataproc 2.3 image on Google Compute Engine, whose lightweight design offers enhanced security and operational efficiency.
“With Dataproc 2.3, we have a cutting edge, high performance and trusted platform that empowers our machine learning scientists and analysts to innovate at scale.” – Sela Samin, Machine Learning Manager, Booking.com
The Dataproc 2.3 image represents a deliberate shift towards a more streamlined and secure environment for your big data workloads. Today, let’s take a look at what makes this lightweight approach so impactful:
1. Reduced attack surface and enhanced security
Dataproc on Google Compute Engine 2.3 is a FedRAMP High-compliant image designed for superior security and efficiency.
At its core, we designed Dataproc 2.3 to be lightweight, meaning it contains only the essential core components required for Spark and Hadoop operations. This minimalist approach drastically reduces the exposure to Common Vulnerabilities and Exposures (CVEs). For organizations with strict security and compliance requirements, this is a game-changer, providing a robust and hardened environment for sensitive data.
We maintain a robust security posture through a dual-pronged approach to CVE (Common Vulnerabilities and Exposures) remediation, so that our images consistently meet compliance standards. This involves a combination of automated processes and targeted manual intervention:
Automated remediation: We use a continuous scanning system to automatically build and patch our images with fixes for known vulnerabilities, enabling us to handle issues efficiently at scale.
Manual intervention: For complex issues where automation could cause breaking changes or has intricate dependencies, our engineers perform deep analysis and apply targeted fixes to guarantee stability and security.
2. On-demand flexibility for optional components
While the 2.3 image is lightweight, it doesn’t sacrifice functionality. Instead of pre-packaging every possible component, Dataproc 2.3 adopts an on-demand model for optional components. If your workload requires specific tools like Apache Flink, Hive WebHCat, Hudi, Pig, Docker, Ranger, Solr, or Zeppelin, you can simply deploy them when creating your cluster. This keeps your clusters lean by default while still offering the full breadth of Dataproc’s capabilities when you need them.
3. Faster cluster creation (with custom images)
When you deploy optional components on-demand, they are downloaded and installed while the cluster is being created, which may increase the startup time a bit. However, Dataproc 2.3 offers a powerful solution to this: custom images. You can now create custom Dataproc images with your required optional components pre-installed. This allows you to combine the security benefits of the lightweight base image with the speed and convenience of pre-configured environments, drastically reducing cluster provisioning and setup time for your specific use cases.
Getting started with Dataproc 2.3
Using the new lightweight Dataproc 2.3 image is straightforward. When creating your Dataproc clusters, simply specify 2.3 (or a specific sub-minor version like 2.3.10-debian12, 2.3.10-ubuntu22, or 2.3.10-rocky9).
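For example, a minimal cluster-creation sketch using the Dataproc Python client might look like the following; the project, region, and cluster names are placeholders, and the two optional components are chosen purely for illustration.

```python
from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "lightweight-23-cluster",
    "config": {
        "software_config": {
            # Lightweight 2.3 base image; optional components are installed
            # on demand only if you list them here.
            "image_version": "2.3-debian12",
            "optional_components": [
                dataproc_v1.Component.FLINK,
                dataproc_v1.Component.ZEPPELIN,
            ],
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is ready
```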
The Dataproc 2.3 image sets a new standard for big data processing on Google Cloud by prioritizing a lightweight, secure and efficient foundation. By minimizing the included components by default and offering flexible on-demand installation or custom image creation, Dataproc 2.3 can help you achieve higher security compliance and optimized cluster performance.
Start leveraging the enhanced security and operational efficiency of Dataproc 2.3 today and experience a new level of confidence in your big data initiatives!
Unlocking real value with AI in the enterprise calls for more than just intelligence. It requires a seamless, end-to-end platform where your model and operational controls are fully integrated. This is the core of our strategy at Google Cloud: combining the most powerful models with the scale and security required for production.
Today, we are excited to announce that Google has been recognized as a Leader for our Gemini model family in the 2025 IDC MarketScape for Worldwide GenAI Life-Cycle Foundation Model Software (doc # US53007225, October 2025) report.
We believe the result validates our multi-year commitment to building the most capable, multimodal AI and delivering it to the enterprise through the Vertex AI platform. It is this combined approach that leads organizations, from innovative startups to the most demanding enterprises, to choose Google Cloud for their critical generative AI deployments.
Source: “IDC MarketScape: Worldwide GenAI Life-Cycle Foundation Model Software 2025 Vendor Assessment,” Doc. #US53007225
Gemini 2.5: adaptive thinking and cost control
For companies moving AI workloads into production, the focus quickly shifts from raw intelligence to optimization, speed, and cost control. That’s why in August, we announced General Availability (GA) of the Gemini 2.5 model family, dramatically increasing both intelligence and enterprise readiness. Our pace of innovation hasn’t slowed; we quickly followed up in September with an improved Gemini 2.5 Flash and Flash-Lite release.
Gemini 2.5 models are thinking models, meaning they can perform complex, internal reasoning to solve multi-step problems with better accuracy. This advanced capability addresses the need for depth of reasoning while still offering tools to manage compute costs:
Thinking budgets: We introduced thinking budgets for models like Gemini 2.5 Flash and Gemini 2.5 Flash-Lite. Developers can now set a maximum computational effort, allowing for fine-grained control over cost and latency. You get the full power of a thinking model when the task demands it, and maximum speed for high-volume, low-latency tasks.
Thought summaries: Developers also gain transparency with thought summaries in the API and Vertex AI, providing a clear, structured view of the model’s reasoning process. This is essential for auditability. A short sketch combining both controls follows this list.
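Here is a minimal sketch of both controls, assuming the google-genai SDK and an illustrative prompt; the budget value is arbitrary.

```python
from google import genai
from google.genai import types

client = genai.Client()  # or genai.Client(vertexai=True, project=..., location=...)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Plan a three-step rollout for a new feature-flag system.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=1024,    # cap the internal reasoning effort
            include_thoughts=True,   # return a thought summary for auditability
        )
    ),
)

for part in response.candidates[0].content.parts:
    # Parts flagged as thoughts carry the summarized reasoning; the rest is the answer.
    label = "THOUGHT SUMMARY" if getattr(part, "thought", False) else "ANSWER"
    print(f"[{label}] {part.text}")
```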
Model choice and flexibility
With an open ecosystem of multimodal models, enterprises can choose to deploy the best model for any task, and the right modality for any use case.
Vertex AI Model Garden ensures you always have access to the latest intelligence. This includes our first-party models, leading open source options, and powerful third-party models like Anthropic’s Claude Sonnet 4.5, which we made available upon its release. This empowers you to pick the right tool for every use case.
Native multimodality: Gemini’s core strength is its native multimodal capability, or the ability to understand and combine information across text, code, images, and audio (see the sketch after this list).
Creative control with Nano Banana: Nano Banana (Gemini 2.5 Flash Image) provides creators and developers sharp control for visual tasks, enabling conversational editing and maintaining character and product consistency across multiple generations.
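As a small illustration of native multimodality, the sketch below sends an image plus a text question in one request via the google-genai SDK; the file name, project, and model choice are placeholders.

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")  # placeholders

with open("warehouse_photo.png", "rb") as f:   # hypothetical local image
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        # Image and text are combined in a single multimodal request.
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "List any safety issues visible in this warehouse photo.",
    ],
)
print(response.text)
```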
Building AI agents: Code, speed, and the CLI
To accelerate the transition to AI agents that can execute complex tasks, we prioritized investment in coding performance and tooling for developers:
Coding performance leap: Gemini 2.5 Pro now excels at complex code generation and problem-solving, offering developers a dramatically improved resource for high-quality software development.
Agentic developer tools: The launch of the Gemini Command Line Interface (CLI) brings powerful, agentic problem-solving directly to the terminal. This provides developers with the kind of immediate, interactive coding assistance necessary to close gaps and accelerate development velocity.
Unlocking value with Vertex AI
In addition to powerful models, organizations need a managed, governed platform to move AI projects from pilot to production and achieve real business value. That’s why Vertex AI is the critical component for enterprise AI workloads.
Vertex AI provides the secure, end-to-end environment that transforms Gemini’s intelligence into a scalable business solution. It is the single place for developers to manage the full AI lifecycle, allowing companies to stop managing infrastructure and start building innovative agentic AI applications.
We focus on three core pillars:
Customization for differentiation: Tailor model behavior using techniques like Supervised Fine-Tuning (SFT) to embed your unique domain expertise directly into the model’s knowledge.
Grounding for accuracy: Easily connect Gemini to your enterprise data – whether structured data in BigQuery, internal documents via Vertex AI Search, or web data from Google Search or Google Maps – to ensure model responses are accurate, relevant, and trusted. A short grounding sketch follows this list.
Security, governance, and compliance: Maintain control over data and models with enterprise-grade security, governance, and data privacy controls built directly into the platform, ensuring stability and protection for your mission-critical applications.
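As one example of the grounding pillar, the sketch below grounds a Gemini response in Google Search results using the google-genai SDK on Vertex AI; the project, location, and prompt are placeholders.

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")  # placeholders

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize this week's changes to EU data-residency guidance.",
    config=types.GenerateContentConfig(
        # Ground the answer in fresh web data via Google Search.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```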
Get started today
Download the 2025 IDC MarketScape for Worldwide GenAI Life-Cycle Foundation Model Software excerpt to learn why organizations are choosing Google Cloud.
The IDC MarketScape vendor analysis model is designed to provide an overview of the competitive fitness of technology and suppliers in a given market. The research methodology utilizes a rigorous scoring methodology based on both qualitative and quantitative criteria that results in a single graphical illustration of each supplier’s position within a given market. The Capabilities score measures supplier product, go-to-market, and business execution in the short term. The Strategy score measures alignment of supplier strategies with customer requirements in a 3-5-year timeframe. Supplier market share is represented by the size of the icons.
The Google Cloud AI Hypercomputer combines AI-optimized hardware, leading software, and flexible consumption models to help you tackle any AI workload efficiently. Every three months we share a roundup of the latest AI Hypercomputer news, resources, events, learning opportunities, and more. Today, we’re excited to share the latest developments to make your AI journey faster, more efficient, and more insightful, starting with awesome news about inference.
Announcing the new vLLM TPU
For ML practitioners working with large language models (LLMs), serving inference workloads with amazing price-performance is the ultimate goal. That’s why we are thrilled to announce our biggest update this quarter: bringing the performance of JAX and our industry leading Cloud TPUs to vLLM, the most popular open-source LLM inference engine.
vLLM TPU is now powered by tpu-inference, an expressive and powerful new hardware plugin unifying JAX and PyTorch under a single runtime. It is not only faster than the previous generation of vLLM TPU, but also offers broader model coverage (e.g., Gemma, Llama, Qwen) and feature support. vLLM TPU is a framework for developers to:
Push the limits of TPU hardware performance in open source
Provide more flexibility to JAX and PyTorch users by running PyTorch model definitions performantly on TPU without any additional code changes, while also extending native support to JAX
Retain vLLM standardization: keep the same user experience, telemetry, and interface
Today, vLLM TPU is significantly more performant than the first TPU backend prototype that we released back in February 2025, with improved model support and feature coverage. With this new foundation in place, our customers can push the boundaries of open-source TPU inference performance further than ever before, with just a few configuration changes.
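To illustrate what "the same user experience" means in practice, here is a minimal offline-inference sketch using vLLM's standard Python API; the model choice is illustrative, and installing the tpu-inference plugin on a TPU VM is assumed to be handled separately.

```python
from vllm import LLM, SamplingParams

# Standard vLLM offline-inference API; with the TPU plugin installed, the same
# code is intended to run on Cloud TPU without modification.
llm = LLM(model="google/gemma-2-9b-it")   # illustrative model
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Explain why paged attention improves serving throughput."],
    sampling,
)
print(outputs[0].outputs[0].text)
```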
You can read more about the technical details in vLLM’s most recent blog post here.
More tools for your AI toolkit
Here are some additional AI Hypercomputer updates to give you more control, insight, and choice.
Find and fix bottlenecks faster with the improved XProf Profiler
Debugging performance is one of the most time-consuming parts of ML development. To make it easier, we’ve supercharged the XProf profiler and released the new Cloud Diagnostics XProf library. This gives you a unified, advanced profiling experience across JAX and PyTorch/XLA, helping you pinpoint model bottlenecks with powerful tools previously used only by internal Google teams. Spend less time hunting for performance issues and more time innovating.
Openness in action: A new recipe for NVIDIA Dynamo
We built AI Hypercomputer on the principle of choice, and want you to use the best tools for the job at hand. To that end, the new AI inference recipe for using NVIDIA Dynamo on AI Hypercomputer demonstrates how to deploy a disaggregated inference architecture, separating the “prefill” and “decode” phases across distinct GPU pools managed by GKE. It’s a powerful demonstration of how our open architecture allows you to combine best-in-class technologies from across the ecosystem to solve complex challenges.
Accelerate Reinforcement Learning with NVIDIA NeMo RL
Reinforcement Learning (RL) is rapidly becoming the essential training technique for complex AI agents and workflows requiring advanced reasoning. For teams pushing the boundaries with reinforcement learning, new reproducible recipes are available for getting started with NVIDIA NeMo RL on Google Cloud. NeMo RL is a high-performance framework designed to tackle the complex scaling and latency challenges inherent in RL workloads. It provides optimized implementations of key algorithms like GRPO and PPO, making it easier to train large models. The new recipes run on A4 VMs (powered by NVIDIA HGX B200) with GKE and vLLM, offering a simplified path to setting up and scaling your RL development cycle for models like Llama 3.1 8B and Qwen2.5 1.5B.
Scale high-performance inference cost-effectively
A user’s experience of a generative AI application highly depends on both a fast initial response to a request and a smooth streaming of the response through to completion. To streamline and standardize LLM serving, GKE Inference Gateway and Quickstart are now generally available. Inference Gateway simplifies serving with new features like prefix-aware load balancing, which dramatically improves latency for workloads with recurring prompts. The Inference Quickstart helps you find the optimal, most cost-effective hardware and software configuration for your specific model, saving you months of manual evaluation. With these new features, we’ve improved time-to-first-token (TTFT) and time-per-output-token (TPOT) on AI Hypercomputer.
Build your future on a comprehensive system
The progress we’ve shared today — from bringing vLLM to TPUs to enabling advanced profiling and third-party integrations — all stems from the premise that AI Hypercomputer is a supercomputing system, constantly evolving to meet the demands of the next generation of AI.
We’ll continue to update and optimize AI Hypercomputer based on our learnings from training Gemini to serving quadrillions of tokens a month. To learn more about using AI Hypercomputer for your own AI workloads, read here. Curious about the last quarterly roundup? Please see the previous post here. To stay up to date on our progress or ask us questions, join our community and access our growing AI Hypercomputer resources repository on GitHub. We can’t wait to see what you build with it.
Many of today’s multimodal workloads require a powerful mix of GPU-based accelerators, large GPU memory, and professional graphics to achieve the performance and throughput that they need. Today, we announced the general availability of the G4 VM, powered by NVIDIA’s RTX PRO 6000 Blackwell Server Edition GPUs. The addition of the G4 expands our comprehensive NVIDIA GPU portfolio, complementing the specialized scale of the A-series VMs, and the cost-efficiency of G2 VMs. The G4 VM is available now, bringing GPU availability to more Google Cloud regions than ever before, for applications that are latency sensitive or have specific regulatory requirements.
We also announced the general availability of NVIDIA Omniverse as a virtual machine image (VMI) on Google Cloud Marketplace. When run on G4, it’s easier than ever to develop and deploy industrial digital twin and physical AI simulation applications leveraging NVIDIA Omniverse libraries. G4 VMs provide the necessary infrastructure — up to 768 GB of GDDR7 memory, NVIDIA Tensor Cores, and fourth-generation Ray Tracing (RT) cores — to run the demanding real-time rendering and physically accurate simulations required for enterprise digital twins. Together, they provide a scalable cloud environment to build, deploy, and interact with applications for industrial digital twins or robotics simulation.
A universal GPU platform
The G4 VM offers a profound leap in performance, with up to 9x the throughput of G2 instances, enabling a step-change in results for a wide spectrum of workloads, from multimodal AI inference and photorealistic design and visualization to robotics simulation using applications developed on NVIDIA Omniverse. The G4 currently comes in 1, 2, 4, and 8 NVIDIA RTX PRO 6000 Blackwell GPU options, with fractional GPU options coming soon.
Here are some of the ways you can use G4 to innovate and accelerate your business:
AI training, fine-tuning, and inference
Generative AI acceleration and efficiency: With its FP4 precision support, G4’s high-efficiency compute accelerates LLM fine-tuning and inference, letting you create real-time generative AI applications such as multimodal and text-to-image creation models.
Resource optimization with Multi-Instance GPU (MIG) support: G4 allows a single GPU to be securely partitioned into up to four fully isolated MIG instances, each with its own high-bandwidth memory, compute cores, and dedicated media engines. This feature maximizes price-performance by enabling multiple smaller distinct workloads to run concurrently with guaranteed resources, isolation, and quality of service.
Flexible model capacity and scaling: Serve a wide range of models, from less than 30B to over 100B parameters, by leveraging advanced quantization techniques, MIG partitioning, and multi-GPU configurations.
NVIDIA Omniverse and simulation
NVIDIA Omniverse integration: Choose this foundation to build and connect simulation applications using physically-based simulation and OpenUSD that enable real-time interactivity and the development of AI-accelerated digital twins.
Large-scale digital twin acceleration: Accelerate proprietary or commercial computer-aided engineering and simulation software to run scenarios with billions of cells in complex digital twin environments.
Near-real-time physics analysis: Leverage the G4’s parallel compute power and memory to handle immense computational domains, enabling near-real-time computational fluid dynamics and complex physics analysis for high-fidelity simulations.
Robotics development: With NVIDIA Isaac Sim, an open-source, reference robotic simulation framework, customers are now able to create, train, and simulate AI-driven robots in physical and virtual environments. Isaac Sim is now available on the Google Cloud Marketplace.
AI-driven rendering, graphics and virtual workstations
AI-augmented content creation: Harness neural shaders and fifth-generation NVIDIA Tensor Cores to integrate AI directly into a programmable rendering pipeline, driving the next decade of AI-augmented graphics innovations, including real-time cinematic rendering and enhanced content creation.
Massive scene handling: Leverage massive memory (up to 96 GB per GPU on the G4) to create and render large complex 3D models and photorealistic visualizations with stunning detail and accuracy.
Virtual workstations: Fuel digital twins, simulation, and VFX workloads. The G4’s leap in capability is powered by full support for all NVIDIA DLSS 4 features, the latest NVENC/NVDEC encoders for video streaming and transcode, and fourth-generation RT Cores for real-time ray tracing.
Google Cloud scales NVIDIA RTX PRO 6000
Modern generative AI models often exceed the VRAM of a single GPU, requiring multi-GPU configurations to serve these workloads. While this approach is common, performance can be bottlenecked by the communication speed between the GPUs. We significantly boosted multi-GPU performance on G4 VMs by implementing an enhanced PCIe-based P2P data path that optimizes critical collective operations like All-Reduce, which is essential for splitting models across GPUs. Thanks to the G4’s enhanced peer-to-peer capabilities, you can expect up to 168% throughput gains and 41% lower latency (inter-token latency) when using tensor parallelism for model serving compared to standard non-P2P offerings.
For your generative AI applications, this technical differentiation translates into:
Faster user experience: Lower latency means quicker responses from your AI services, enabling more interactive and real-time applications.
Higher scalability: Increased throughput allows you to serve more concurrent users from a single virtual machine, significantly improving the price-performance and scalability of your service.
Google Cloud services integrated with G4 VMs
G4 VMs are fully integrated with several Google Cloud services, accelerating your AI workloads from day one.
Google Kubernetes Engine (GKE): G4 GPUs are generally available through GKE. Since GKE recently extended Autopilot to all qualifying clusters, including GKE Standard clusters, you can benefit from GKE’s container-optimized compute platform to rapidly scale your G4 GPUs, enabling you to optimize costs. By adding the GKE Inference Gateway, you can stretch the benefits of G4 even further to achieve lower AI serving latency and higher throughput.
Vertex AI: Both inference and training benefit significantly from G4’s large GPU memory (96 GB per GPU, 768 GB total), native FP4 precision support, and global presence.
Dataproc: G4 VMs are fully supported on the Dataproc managed analytics platform, letting you accelerate large-scale Spark and Hadoop workloads. This enables data scientists and data engineers to significantly boost performance for machine learning and large-scale data processing workloads.
Cloud Run: We’ve extended our serverless platform’s AI infrastructure options to include the NVIDIA RTX PRO 6000, so you can perform real-time AI inference with your preferred LLMs or media rendering using fully managed, simple, pay-per-use GPUs.
Hyperdisk ML, Managed Lustre, and Cloud Storage: When you need to expand beyond local storage for your HPC and large-scale AI/ML workloads, you can connect G4 to a variety of Google Cloud storage services. For low latency and up to 500K IOPS per instance, Hyperdisk ML is a great option. For high-performance file storage in the same zone, Managed Lustre offers a parallel file system ideal for persistent storage, with throughput up to 1 TB/s. Finally, if you need nearly unlimited global capacity, with powerful capabilities like Anywhere Cache for use cases like inference, choose Cloud Storage as your primary, highly available, and globally scalable storage platform for training datasets, model artifacts, and feature stores.
What customers are saying
Here’s how customers are using G4 to innovate and accelerate within their businesses:
“The combination of NVIDIA Omniverse on Google Cloud G4 VMs is the true engine for our creative transformation. It empowers our teams to compress weeks of traditional production into hours, allowing us to instantly generate photorealistic 3D advertising environments at a global scale while ensuring pixel-perfect brand compliance—a capability that redefines speed and personalization in digital marketing.” –Perry Nightingale, SVP Creative AI, WPP
“We’re excited to bring the power of Google Cloud G4 VMs into Altair One, so you can run your most demanding simulation and fluid dynamics workloads with the speed, scale, and visual fidelity needed to push innovation further.” – Yeshwant Mummaneni, Chief Engineer – Analytics, HPC, IoT & Digital Twin, Altair
The Google Cloud advantage
Choosing Google Cloud means selecting a platform engineered for tangible results. The new G4 VM is a prime example, with our custom P2P interconnect unlocking up to 168% more throughput from the underlying NVIDIA RTX PRO 6000 Blackwell GPUs. This focus on optimized performance extends across our comprehensive portfolio; the G4 perfectly complements our existing A-Series and G2 GPUs, ensuring you have the ideal infrastructure for any workload. Beyond raw performance, we deliver turnkey solutions to accelerate your time to value. With NVIDIA Omniverse now available on the Google Cloud Marketplace, you can immediately deploy enterprise-grade digital twin and simulation applications on a fully managed and scalable platform.
G4 capacity is immediately available. To get started, simply select G4 VMs from the Google Cloud console. NVIDIA Omniverse and Isaac Sim are qualified Google Cloud Marketplace solutions that can draw down on your Google Cloud commitments; for more information, please contact your Google Cloud sales team or reseller.
Today, we announced the general availability of the G4 VM family based on NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. Thanks to unique platform optimizations only available in Google Cloud, G4 VMs deliver the best performance of any commercially available NVIDIA RTX PRO 6000 Blackwell GPU offering for inference and fine-tuning on a wide range of models, from less than 30B to over 100B parameters. In this blog, we discuss the need for these platform optimizations, how they work, and how to use them in your own environment.
Collective communications performance matters
Large language models (LLMs) vary significantly in size, as characterized by their number of parameters: small (~7B), medium (~70B), and large (~350B+). LLMs often exceed the memory capacity of a single GPU, including the NVIDIA RTX PRO 6000 Blackwell’s, with its 96GB of GDDR7 memory. A common solution is to use tensor parallelism, or TP, which works by distributing individual model layers across multiple GPUs. This involves partitioning a layer’s weight matrices, allowing each GPU to perform a partial computation in parallel. However, a significant performance bottleneck arises from the subsequent need to combine these partial results using collective communication operations like All-Gather or All-Reduce.
The G4 family of GPU virtual machines utilizes a PCIe-only interconnect. We drew on our extensive infrastructure expertise to develop this high-performance, software-defined PCIe fabric that supports peer-to-peer (P2P) communication. Crucially, G4’s platform-level P2P optimization substantially accelerates collective communications for workloads that require multi-GPU scaling, resulting in a notable boost for both inference and fine-tuning of LLMs.
How G4 accelerates multi-GPU performance
Multi-GPU G4 VM shapes get their significantly enhanced PCIe P2P capabilities from a combination of both custom hardware and software. This advancement directly optimizes collective communications, including All-to-All, All-Reduce, and All-Gather collectives for managing GPU data exchange. The result is a low-latency data path that delivers a substantial performance increase for critical workloads like multi-GPU inference and fine-tuning.
In fact, across all major collectives, the enhanced G4 P2P capability provides an acceleration of up to 2.2x without requiring any changes to the code or workload.
Inference performance boost by P2P on G4
On G4 instances, enhanced peer-to-peer communication directly boosts multi-GPU workload performance, particularly for tensor parallel inference with vLLM, with up to 168% higher throughput, and up to 41% lower inter-token latency (ITL).
We observe these improvements when using tensor parallelism for model serving, especially when compared to standard non-P2P offerings.
At the same time, G4, coupled with software-defined PCIe and P2P innovation, significantly enhances inference throughput and reduces latency, giving you the control to optimize your inference deployment for your business needs.
Throughput or speed: G4 with P2P lets you choose
The platform-level optimizations on G4 VMs translate directly into a flexible and powerful competitive advantage. For interactive generative AI applications, where user experience is paramount, G4’s P2P technology delivers up to 41% less inter-token latency — the critical delay between generating each part of a response. This results in a noticeably snappier and more reactive end-user experience, increasing their satisfaction with your AI application.
Alternatively, for workloads where raw throughput is the priority, such as batch inference, G4 with P2P enables customers to serve up to 168% more requests than comparable offerings. This means you can either increase the number of users served by each model instance, or significantly improve the responsiveness of your AI applications. Whether your focus is on latency-sensitive interactions or high-volume throughput, G4 provides a superior return on investment compared to other NVIDIA RTX PRO 6000 offerings in the market.
Scale further with G4 and GKE Inference Gateway
While P2P optimizes performance for a single model replica, scaling to meet production demand often requires multiple replicas. This is where the GKE Inference Gateway really shines. It acts as an intelligent traffic manager for your models, using advanced features like prefix-cache-aware routing and custom scheduling to maximize throughput and slash latency across your entire deployment.
By combining the vertical scaling of G4’s P2P with the horizontal scaling of the Inference Gateway, you can build an end-to-end serving solution that is exceptionally performant and cost-effective for the most demanding generative AI applications. For instance, you can use G4’s P2P to efficiently run a 2-GPU Llama-3.1-70B model replica with 66% higher throughput, and then use GKE Inference Gateway to intelligently manage and autoscale multiple of these replicas to meet global user demand.
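A hedged sketch of that 2-GPU replica with vLLM follows; the model name, quantization choice, and prompt are illustrative, and because P2P acceleration is transparent, no extra flags are required.

```python
from vllm import LLM, SamplingParams

# Split each layer across the two RTX PRO 6000 GPUs of a g4-standard-96 VM.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model
    tensor_parallel_size=2,                     # tensor parallelism across 2 GPUs
    quantization="fp8",                         # one way to fit 70B weights in 2 x 96 GB
)

outputs = llm.generate(
    ["Draft a status update for the weekly supply-chain review."],
    SamplingParams(max_tokens=200),
)
print(outputs[0].outputs[0].text)
```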
G4 P2P supported VM Shapes
Peer-to-peer capabilities for NVIDIA RTX PRO 6000 Blackwell are available with the following multi-GPU G4 VM shapes:
| Machine Type | GPUs | Peer-to-Peer | GPU Memory (GB) | vCPUs | Host Memory (GB) | Local SSD (GB) |
| --- | --- | --- | --- | --- | --- | --- |
| g4-standard-96 | 2 | Yes | 192 | 96 | 360 | 3,000 |
| g4-standard-192 | 4 | Yes | 384 | 192 | 720 | 6,000 |
| g4-standard-384 | 8 | Yes | 768 | 384 | 1,440 | 12,000 |
For VM shapes smaller than 8 GPUs, our software-defined PCIe fabric ensures path isolation between GPUs assigned to different VMs on the same physical machine. PCIe paths are created dynamically at VM creation and are dependent on the VM shape, ensuring isolation on multiple levels of the platform stack to prevent communication between GPUs that are not assigned to the same VM.
Get started with P2P on G4
The G4 peer-to-peer capability is transparent to the workload, and requires no changes to the application code or to libraries such as the NVIDIA Collective Communications Library (NCCL). All peer-to-peer paths are automatically set up during VM creation. You can find more information about enabling peer-to-peer for NCCL-based workloads in the G4 documentation.
Try Google Cloud G4 VMs with P2P from the Google Cloud console today, and start building your inference platform with GKE Inference Gateway. For more information, please contact your Google Cloud sales team or reseller.
The Oklahoma Employment Security Commission (OESC) is responsible for managing critical services for its citizens, including unemployment insurance benefits and employer tax collection. However, its core operating system resided on an aging mainframe with a 40-year-old data structure. This legacy system created significant roadblocks. The data schema was unintelligible, making it nearly impossible to access, verify, or analyze information efficiently. A single data request could take a technical team six months to fulfill, delivering a static data dump that was often obsolete by the time it arrived.
This meant that operational decisions were being made with outdated information. Added to that, critical reporting for the Department of Labor (DoL) was a resource-intensive, time-consuming process. The agency needed to move beyond its aging mainframe to a more modern, future-ready foundation. They turned to Google Public Sector for help.
A modern data platform to drive insight and efficiency
To prepare for their mainframe migration, unlock value from their data, and create a modern data platform, OESC partnered with Google Public Sector and a consulting partner, Phase2. The goal was to transform the agency’s data into more accurate, accessible insights that could support its mission in near real-time. By leveraging the power of Google Cloud, OESC could empower its teams with self-service analytics to dramatically improve data accuracy and build a “gold standard” data model for the future.
A foundation built on BigQuery
The adoption of BigQuery was central to this transformation. It allowed for rapid iteration of the data and the development of the gold standard data model by enabling efficient ingest and processing. This, in turn, allowed for the development of higher-quality data sets, the ability to verify data quality, and rapid report development.
Critically, using various Google Cloud services including BigQuery, the team created the ability to track point-in-time data—a capability impossible with the legacy system. This feature has enabled powerful reporting, leading to:
Data quality improvements
Deep trend metrics
Report creation for any historical point in the past
Improved audit capabilities
Improved fraud detection
Oklahoma Employment Security Commission’s modern data journey on Google Cloud: A multi-step transformation
The Oklahoma Employment Security Commission’s initiative to move beyond the mainframe was not a single, abrupt migration but a multi-step modernization journey with a future-ready foundation built on trust and accessibility. This partnership has already delivered a massive shift in operational capabilities.
From weeks to hours: a new era for critical reporting
Google Cloud technology is powering a new wave of transformation and modernization. With BigQuery, the agency is able to ingest and process data more efficiently, allowing for rapid iteration and the development of higher-quality, reliable datasets. The most immediate impact has been on DoL reporting: processes that previously took weeks or even months of manual effort by understaffed IT teams have been reduced to just hours. Additionally, with Looker, business stakeholders can now self-serve these reports and gain access to critical data with vastly improved accuracy. This includes critical reports such as:
Claims and payment activities (monthly)
Advance weekly initial and continued claims report (weekly)
Time-lapse data for UI benefit payments (monthly)
Tax and employer core measures
Tasks that once took weeks or months due to inaccessible data are now delivered to business stakeholders within hours or days, or, where possible, through self-service dashboards. Key metrics that are now accessible include:
Number of active employers over time
Amount of outstanding taxes
Modern searching for application features, such as when a benefits claimant looks for a separating employer
Uncovering insights and improving services for Oklahoma employers
A critical capability unlocked by BigQuery was the ability to track point-in-time data, something impossible with the legacy system. Using years of daily snapshots, OESC conducted a historical analysis of employer tax rates. This enabled the agency to directly explore the data, helping to identify opportunities. These opportunities led to assisting in creating new legislation and driving down employer taxes. As one user stated, “Having access to this data is a gamechanger.”
Laying the groundwork for a mainframe-free future
The Oklahoma Employment Security Commission’s modernization journey was a deliberate, multi-step process designed to build a solid foundation. The data was reorganized from its mainframe-mimicking format into a modern, intuitive, and efficient schema within BigQuery. This new structure is the stepping stone that enables real-time reporting powered by Looker and lays the essential groundwork for a much smoother transition off the mainframe. The new platform also paves the way for advanced capabilities like accelerated fraud detection and future AI use cases.
Google Public Sector’s approach involved a multi-step process to successfully modernize this data:
Understanding the legacy: Significant manual effort was required to understand the complex, mainframe-mimicking schema in the existing Postgres database.
Reorganizing data into a modern schema: The data was then organized into a modern, intuitive, and efficient schema within BigQuery. This was a major stepping stone, moving away from a file-system architecture to a structure designed for analytic performance.
Enabling real-time access and reporting: Once the data was structured in BigQuery, it could be exposed through tools like Looker. This transformed the six-month reporting process into collaborative sessions where reports are created, providing access to near real-time data for operational decisions and helping improve service accuracy.
Preparing for the future (mainframe transition): Actively defining the future schema of the modern production database and populating it with new data alongside the old. This multi-year process will allow OESC to transition off the mainframe much more smoothly in the future.
A vision for secure, statewide data collaboration
OESC’s partnership with Google Public Sector and adoption of the latest data and AI innovations aligns with Oklahoma’s broader data management and sharing initiatives. Their modern data platform, built on a well-designed schema in BigQuery, allows their data to be easily and securely shared with other state agencies via Analytics Hub. This positions OESC to lead in creating a more connected, data-driven government while maintaining robust control and governance.
Continue your agency’s innovation journey
Google Public Sector is dedicated to helping public sector agencies understand and apply transformative cloud, AI, data and security solutions. Join us at the Google Public Sector Summit on October 29th in Washington D.C., to discover how your agency can leverage the latest technologies to tackle your biggest challenges and drive your agency forward.
COLDRIVER, a Russian state-sponsored threat group known for targeting high-profile individuals at NGOs, policy advisors, and dissidents, swiftly shifted operations after the May 2025 public disclosure of its LOSTKEYS malware, operationalizing new malware families five days later. It is unclear how long COLDRIVER had this malware in development, but GTIG has not observed a single instance of LOSTKEYS since publication. Instead, GTIG has seen the new malware used more aggressively than in any previous malware campaign we have attributed to COLDRIVER (also known as UNC4057, Star Blizzard, and Callisto).
The new malware, which GTIG attributes directly to COLDRIVER, has undergone multiple iterations since discovery, indicating a rapidly increased development and operations tempo. It is a collection of related malware families connected via a delivery chain. A recent Zscaler blog post covered part of this infection chain; GTIG seeks to build on those details by sharing a wider view of the infection chain and related malware.
Malware Development Overview
This re-tooling began with a new malicious DLL called NOROBOT, delivered via an updated COLDCOPY “ClickFix” lure that pretends to be a custom CAPTCHA. This is similar to previous LOSTKEYS deployments by COLDRIVER, but updates the infection by having the user execute the malicious DLL via rundll32 instead of the older multi-stage PowerShell method.
Figure 1: Malware development overview
While the earliest version of NOROBOT led to the deployment of a cumbersome Python backdoor tracked as YESROBOT, COLDRIVER quickly abandoned YESROBOT for a more flexible and extensible PowerShell backdoor we track as MAYBEROBOT.
NOROBOT and its preceding infection chain have been subject to constant evolution—initially simplified to increase chances of successful deployment, before re-introducing complexity by splitting cryptography keys. The shift back to more complex delivery chains increases the difficulty of tracking their campaigns. This constant development highlights the group’s efforts to evade detection systems for their delivery mechanism for continued intelligence collection against high-value targets.
Delivery via “ClickFix” and Rundll32
This new malware infection chain contains three distinct components which are delivered via a new variant of the COLDCOPY “ClickFix” lure (c4d0fba5aaafa40aef6836ed1414ae3eadc390e1969fdcb3b73c60fe7fb37897) previously seen delivering LOSTKEYS. The new variant of COLDCOPY tries to get the user to download and execute a DLL using rundll32, while trying to disguise itself as a CAPTCHA by including text to verify that the user is not a robot. The DLL first observed was named “iamnotarobot.dll” and the export was named “humanCheck” – both of which play into the CAPTCHA theme of the page and partially inspired the ROBOT-themed naming convention for the malware that follows.
Figure 2: COLDCOPY attempting to lure the user to execute NOROBOT
NOROBOT
NOROBOT, also disclosed as BAITSWITCH by Zscaler, is a DLL that has been observed undergoing regular development from May through September 2025. In all of the versions observed, the main purpose of NOROBOT has been to retrieve the next stage from a hardcoded command-and-control (C2) address and prepare the system for the final payload. The earliest version of NOROBOT (2e74f6bd9bf73131d3213399ed2f669ec5f75392de69edf8ce8196cd70eb6aee) made use of cryptography in which the key was split across multiple components and needed to be recombined in a specific way in order to successfully decrypt the final payload. This was likely done to make it more difficult to reconstruct the infection chain: if one of the downloaded components was missing, the final payload would not decrypt properly. Despite this attempt to hinder analysis, the earliest version of NOROBOT included fetching and extracting a full Python 3.8 installation, a noisy artifact that is likely to raise suspicions.
Details of the files retrieved by this version of NOROBOT are:
SFX RAR containing a Python 3.8 installation for Windows
Command to store part of a crypto key in the registry
Persistence via scheduled task
Command to run bitsadmin to retrieve a file named libsystemhealthcheck.py
Command to run bitsadmin to retrieve a file named libcryptopydatasize.py
Both files retrieved using bitsadmin are from inspectguarantee[.]org:
libsystemhealthcheck.py contains part of an AES key that is combined with the key stored in the registry and decrypts libcryptopydatasize.py, which we have named YESROBOT.
YESROBOT
The decrypted version of YESROBOT is a Python backdoor which uses HTTPS to retrieve commands from a hardcoded C2. The commands are AES encrypted with a hardcoded key. System information and username are encoded in the User-Agent header of the request. YESROBOT is a minimal backdoor that requires all commands to be valid Python, which makes typical functionality, such as downloading and executing files or retrieving documents, more cumbersome to implement. A typical approach would include the retrieval and execution logic in the backdoor and only require the operator to send the URL. This makes YESROBOT difficult to extend and operate, and hints that the deployment of YESROBOT was a hastily made choice. GTIG observed only two instances of YESROBOT deployment over a two-week period in late May before it was abandoned in favor of a different backdoor, MAYBEROBOT. For these reasons, GTIG assesses that YESROBOT was hastily deployed as a stopgap mechanism after our publication on LOSTKEYS.
Figure 3: Main loop of YESROBOT, limited to Python command execution only
MAYBEROBOT
In early June 2025, GTIG observed a variant of NOROBOT (3b49904b68aedb6031318438ad2ff7be4bf9fd865339330495b177d5c4be69d1) that was drastically simplified from earlier versions. This version fetches a single file, which we observed to be a single command that sets up a logon script for persistence. The logon script was a PowerShell command that downloaded and executed the next stage, which we call MAYBEROBOT, also known as SIMPLEFIX by Zscaler.
The file fetched by the logon script was a heavily obfuscated PowerShell script (b60100729de2f468caf686638ad513fe28ce61590d2b0d8db85af9edc5da98f9) that uses a hardcoded C2 and a custom protocol supporting three commands:
Download and execute from a specified URL
Execute the specified command using cmd.exe
Execute the specified PowerShell block
In all cases an acknowledgement is sent to the C2 at a different path, while for commands 2 and 3, output is sent to a third path.
GTIG assesses that MAYBEROBOT was developed to replace YESROBOT because it does not need a Python installation to execute, and because the protocol is extensible and allows attackers more flexibility when achieving objectives on target systems. While increased flexibility was certainly achieved, it is worth noting that MAYBEROBOT still has minimal built-in functionality and relies upon the operator to provide more complex commands like YESROBOT before it.
The ROBOTs Continue to Evolve
As GTIG continued to monitor and respond to COLDRIVER attempts to deliver NOROBOT to targets of interest from June through September 2025, we observed changes to both NOROBOT and the malware execution chain that indicate COLDRIVER was increasing their development tempo. GTIG has observed multiple versions of NOROBOT over time with varying degrees of simplicity. The specific changes made between NOROBOT variants highlight the group’s persistent effort to evade detection systems while ensuring continued intelligence collection against high-value targets. However, by simplifying the NOROBOT downloader, COLDRIVER inadvertently made it easier for GTIG to track their activity.
GTIG’s insight into the NOROBOT malware’s evolution aligned with our observation of their movement away from the older YESROBOT backdoor in favor of the newer MAYBEROBOT backdoor. GTIG assesses that COLDRIVER may have changed the final backdoor for several reasons: YESROBOT’s requirement for a full Python interpreter makes it more likely to be detected than MAYBEROBOT, and the YESROBOT backdoor was not easily extensible.
As MAYBEROBOT became the more commonly observed final backdoor in these operations, the NOROBOT infection chain that delivers it continued evolving. Over this period, COLDRIVER simplified their malware infection chain and implemented basic evasion techniques, such as rotating infrastructure and file naming conventions, changing the paths files were retrieved from and how those paths were constructed, and changing the export name and the DLL name. Along with these minor changes, COLDRIVER re-introduced the need to collect crypto keys and intermediate downloader stages to be able to properly reconstruct the full infection chain. Adding complexity back in may increase operational security for the operation, as it makes reconstructing their activity more difficult: network defenders need to collect multiple files and crypto keys to reconstruct the full attack chain, whereas in the simplified NOROBOT chain they only need the URL from the logon script to retrieve the final payload.
GTIG has observed multiple versions of NOROBOT indicating consistent development efforts, but the final backdoor of MAYBEROBOT has not changed. This indicates that COLDRIVER is interested in evading detection of their delivery mechanism while having high confidence that MAYBEROBOT is less likely to be detected.
Phishing or Malware?
It is currently not known why COLDRIVER chooses to deploy malware over the more traditional phishing they are known for, but it is clear that they have spent significant development effort to re-tool and deploy their malware to specific targets. One hypothesis is that COLDRIVER attempts to deploy NOROBOT and MAYBEROBOT on significant targets which they may have previously compromised via phishing and already stolen emails and contacts from, and are now looking to acquire additional intelligence value from information on their devices directly.
As COLDRIVER continues to develop and deploy this chain we believe that they will continue their aggressive deployment against high-value targets to achieve their intelligence collection requirements.
Protecting the Community
As part of our efforts to combat threat actors, we use the results of our research to improve the safety and security of Google’s products. Upon discovery, all identified malicious websites, domains and files are added to Safe Browsing to protect users from further exploitation. We also send targeted Gmail and Workspace users government-backed attacker alerts notifying them of the activity and encouraging potential targets to enable Enhanced Safe Browsing for Chrome and ensure that all devices are updated.
We are committed to sharing our findings with the security community to raise awareness and with companies and individuals that might have been targeted by these activities. We hope that improved understanding of tactics and techniques will enhance threat hunting capabilities and lead to stronger user protections across the industry.
Indicators of compromise (IOCs) and YARA rules are included in this post, and are also available as a GTI collection and rule pack.
NOROBOT – machinerie.dll – Latest sample from late August 2025
YARA Rules
import "pe"

rule G_APT_MALWARE_NOROBOT_1 {
    meta:
        author = "Google Threat Intelligence"
        description = "DLL which pulls down and executes next stages"
    strings:
        $path = "/konfiguration12/" wide
        $file0 = "arbeiter" wide
        $file1 = "schlange" wide
        $file2 = "gesundheitA" wide
        $file3 = "gesundheitB" wide
        $new_file0 = "/reglage/avec" wide
        $new_file1 = "/erreur" wide
    condition:
        $path or
        all of ($file*) or
        all of ($new_file*) or
        (
            for any s in ("checkme.dll", "iamnotarobot.dll", "machinerie.dll"): (pe.dll_name == s) and
            for any s in ("humanCheck", "verifyme"): (pe.exports(s))
        )
}
rule G_APT_BACKDOOR_YESROBOT_1 {
    meta:
        author = "Google Threat Intelligence Group (GTIG)"
    strings:
        $s0 = "return f'Mozilla/5.0 {base64.b64encode(str(get_machine_name()).encode()).decode()} {base64.b64encode(str(get_username()).encode()).decode()} {uuid} {get_windows_version()} {get_machine_locale()}'"
        $s1 = "'User-Agent': obtainUA(),"
        $s2 = "url = f\"https://{target}/connect\""
        $s3 = "print(f'{target} is not availible')"
        $s4 = "tgtIp = check_targets(tgtList)"
        $s5 = "cmd_url = f'https://{tgtIp}/command'"
        $s6 = "print('There is no availible servers...')"
    condition:
        4 of them
}
rule G_APT_BACKDOOR_MAYBEROBOT_1 {
    meta:
        author = "Google Threat Intelligence Group (GTIG)"
    strings:
        $replace = "-replace '\\n', ';' -replace '[^\\x20-\\x7E]', '' -replace '(?i)x[0-9A-Fa-f]{4}', '' -split \"\\n\""
    condition:
        all of them
}
How do you know if your agent is actually working? It’s one of the most complex but critical questions in development. In our latest episode of the Agent Factory podcast, we dedicated the entire session to breaking down the world of agent evaluation. We’ll cover what agent evaluation really means, what you should measure, and how to measure it using ADK and Vertex AI. You’ll also learn about more advanced evaluation in multi-agent systems.
This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.
Deconstructing Agent Evaluation
We start by defining what makes agent evaluation so different from other forms of testing.
Beyond Unit Tests: Why Agent Evaluation is Different
The first thing to understand is that evaluating an agent isn’t like traditional software testing.
Traditional software tests are deterministic; you expect the same input to produce the same output every time (A always equals B).
LLM evaluation is like a school exam. It tests static knowledge with Q&A pairs to see if a model “knows” things.
Agent evaluation, on the other hand, is more like a job performance review. We’re not just checking a final answer. We’re assessing a complex system’s behavior, including its autonomy, reasoning, tool use, and ability to handle unpredictable situations. Because agents are non-deterministic, you can give the same prompt twice and get two different–but equally valid–outcomes.
So, if we’re not just looking at the final output, what should we be measuring? The short answer is: everything. We need a full-stack approach that looks at four key layers of the agent’s behavior:
Final Outcome: Did the agent achieve its goal? This goes beyond a simple pass/fail to look at the quality of the output. Was it coherent, accurate, and safe? Did it avoid hallucinations?
Chain of Thought (Reasoning): How did the agent arrive at its answer? We need to check if it broke the task into logical steps and if its reasoning was consistent. An agent that gets the right answer by luck won’t be reliable.
Tool Utilization: Did the agent pick the right tool for the job and pass the correct parameters? Crucially, was it efficient? We’ve all seen agents get stuck in costly, redundant API call loops, and this is where you catch that.
Memory & Context Retention: Can the agent recall information from earlier in the conversation when needed? If new information conflicts with its existing knowledge, can it resolve that conflict correctly?
How to Measure: Ground Truth, LLM-as-a-Judge, and Human-in-the-Loop
Once you know what to measure, the next question is how. We covered three popular methods, each with its own pros and cons:
Ground Truth Checks: These are fast, cheap, and reliable for objective measures. Think of them as unit tests for your agent’s outputs: “Is this valid JSON?” or “Does the format match the schema?” Their limitation is that they can’t capture nuance.
LLM-as-a-Judge: Here, you use a powerful LLM to score subjective qualities, like the coherence of an agent’s plan. This approach scales incredibly well, but its judgments are only as good as the model’s training and biases.
Human-in-the-Loop: This is the gold standard, where domain experts review agent outputs. It’s the most accurate method for capturing nuance but is also the slowest and most expensive.
The key takeaway is not to pick just one. The best strategy is to combine them in a calibration loop: start with human experts to create a small, high-quality “golden dataset,” then use that data to fine-tune an LLM-as-a-judge until its scores align with your human reviewers. This gives you the best of both worlds: human-level accuracy at an automated scale.
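As a minimal sketch of what that calibration loop can look like in code, here is a hypothetical example: the golden examples and the judge_score helper stand in for your own human-scored data and your LLM-as-a-judge call.

# Hypothetical calibration-loop sketch: measure how closely an LLM judge
# tracks human reviewers on a small golden dataset before trusting it at scale.

golden_set = [
    {"output": "Plan A: check inventory, then compare prices...", "human_score": 5},
    {"output": "Plan B: guess the answer without using any tools.", "human_score": 1},
    # ...a small, high-quality set curated by domain experts
]

def judge_score(output: str) -> int:
    """Stand-in for an LLM-as-a-judge call that returns a 1-5 coherence score."""
    return 3  # replace with a real judge model plus a scoring rubric prompt

def calibration_gap(golden: list[dict]) -> float:
    """Mean absolute difference between judge and human scores (0.0 = perfect agreement)."""
    diffs = [abs(judge_score(ex["output"]) - ex["human_score"]) for ex in golden]
    return sum(diffs) / len(diffs)

# Tune the judge's prompt and rubric until the gap is acceptably small,
# then let the judge score new agent outputs automatically.
print(calibration_gap(golden_set))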
The Factory Floor: Evaluating an Agent in 5 Steps
The Factory Floor is our segment for getting hands-on. Here, we moved from high-level concepts to a practical demo using the Agent Development Kit (ADK).
The ADK Web UI is perfect for fast, interactive testing during development. We walked through a five-step “inner loop” workflow to debug a simple product research agent that was using the wrong tool.
1. Test and Define the “Golden Path.” We gave the agent a prompt (“Tell me about the A-phones”) and saw it return the wrong information (an internal SKU instead of a customer description). We then corrected the response in the Eval tab to create our first “golden” test case.
2. Evaluate and Identify Failure. With the test case saved, we ran the evaluation. As expected, it failed immediately.
3. Find the Root Cause. This is where we got into the evaluation. We jumped into the Trace view, which shows the agent’s step-by-step reasoning process. We could instantly see that it chose the wrong tool (lookup_product_information instead of get_product_details).
4. Fix the Agent. The root cause was an ambiguous instruction. We updated the agent’s code to be more specific about which tool to use for customer-facing requests versus internal data.
5. Validate the Fix. After the ADK server hot-reloaded our code, we re-ran the evaluation, and this time, the test passed. The agent provided the correct customer-facing description. (See the sketch after these steps for what this kind of golden test case check can look like in code.)
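In the sketch below, the run_agent helper and the test-case structure are hypothetical placeholders rather than the ADK's evaluation API; it simply compares the tool the agent actually called against the tool the golden case expects.

# Hypothetical golden test case check: did the agent pick the expected tool?
# run_agent() is a placeholder for however you invoke the agent and capture its trace.

golden_case = {
    "prompt": "Tell me about the A-phones",
    "expected_tool": "get_product_details",  # the customer-facing lookup
}

def run_agent(prompt: str) -> dict:
    """Placeholder: run the agent and return its trace (tool calls + final answer)."""
    return {"tool_calls": ["lookup_product_information"], "answer": "Internal SKU record"}

def tool_trajectory_passes(case: dict) -> bool:
    trace = run_agent(case["prompt"])
    return case["expected_tool"] in trace["tool_calls"]

if tool_trajectory_passes(golden_case):
    print("PASS: agent used the expected tool")
else:
    print("FAIL: agent picked the wrong tool for a customer-facing request")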
From Development to Production
This ADK workflow is fantastic for development, but it doesn’t scale. For that, you need to move to a production-grade platform.
From the Inner Loop to the Outer Loop: ADK and Vertex AI
ADK for the Inner Loop: It’s built for the fast, manual, and interactive debugging you do during development.
Vertex AI for the Outer Loop: When you need to run evaluations at scale with richer metrics (like LLM-as-a-judge), you need a production-grade platform like Vertex AI’s Gen AI evaluation services. It’s designed to handle complex, qualitative evaluations for agents at scale and produce results you can build monitoring dashboards with.
Both of these workflows require a dataset, but what if you don’t have one? This is the “cold start problem,” and we solve it with synthetic data generation. We walked through a four-step recipe (sketched in code after the list):
Generate Tasks: Ask an LLM to generate realistic user tasks.
Create Perfect Solutions: Have an “expert” agent produce the ideal, step-by-step solution for each task.
Generate Imperfect Attempts: Have a weaker or different agent try the same tasks, giving you a set of flawed attempts.
Score Automatically: Use an LLM-as-a-judge to compare the imperfect attempts against the perfect solutions and score them.
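In the sketch below, generate_with_llm, expert_agent, weak_agent, and judge are hypothetical stand-ins for your own model and agent calls; only the overall flow of the recipe matters.

# Hypothetical sketch of the four-step synthetic data recipe.
# The helpers below are placeholders for real LLM/agent calls.

def generate_with_llm(prompt: str) -> str:
    return "placeholder response"  # replace with a real LLM call

def expert_agent(task: str) -> str:
    return generate_with_llm(f"Produce the ideal, step-by-step solution for: {task}")

def weak_agent(task: str) -> str:
    return generate_with_llm(f"Attempt this task quickly: {task}")

def judge(task: str, ideal: str, attempt: str) -> int:
    return 3  # replace with an LLM-as-a-judge call returning a 1-5 score

# 1. Generate realistic user tasks.
tasks = [generate_with_llm("Write a realistic user task for a product research agent")
         for _ in range(5)]

# 2-4. Ideal solutions, flawed attempts, and automatic scoring.
eval_dataset = []
for task in tasks:
    ideal = expert_agent(task)            # 2. "expert" agent's perfect solution
    attempt = weak_agent(task)            # 3. weaker agent's imperfect attempt
    score = judge(task, ideal, attempt)   # 4. LLM-as-a-judge comparison
    eval_dataset.append({"task": task, "ideal": ideal, "attempt": attempt, "score": score})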
Once you have evaluation data, the developer’s next challenge is clear: how do you use it to design tests that scale? You can’t just manually check every output forever. We approach this problem with a three-tier testing strategy.
Tier 1: Unit Tests. This is the ground floor. Just like in traditional coding, you test the smallest pieces of your agent in isolation. For example, verifying that a specific tool, like fetch_product_price, correctly extracts data from a sample input without running the whole agent (see the sketch after this list).
Tier 2: Integration Tests. This is the agent’s “test drive.” Here, you evaluate the entire, multi-step journey for a single agent. You give it a complete task and verify that it can successfully chain its reasoning and tools together to produce the final, expected outcome.
Tier 3: End-to-End Human Review. This is the ultimate sanity check where automation meets human judgment. For complex tasks, a human expert evaluates the agent’s final output for quality, nuance, and correctness. This creates a “human-in-the-loop” feedback system to continuously calibrate and improve the agent’s performance. It’s also at this stage that you begin testing how multiple agents interact within a larger system.
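For Tier 1, that unit test can be as small as the following sketch; fetch_product_price, its toy implementation, and the sample payload are hypothetical stand-ins for the real tool.

# Hypothetical Tier 1 unit test: exercise one tool in isolation, no agent involved.

def fetch_product_price(product_record: dict) -> float:
    """Toy stand-in for the real tool: extract the price from a product record."""
    return float(product_record["price"]["amount"])

def test_fetch_product_price_extracts_amount():
    sample = {"name": "A-phone", "price": {"amount": "799.00", "currency": "USD"}}
    assert fetch_product_price(sample) == 799.0

You can run this with pytest like any other unit test; Tiers 2 and 3 then layer the full agent run and human review on top.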
As we move from single agents to multi-agent systems, evaluation has to evolve. Judging an agent in isolation doesn’t tell you much about the overall system’s performance.
We used an example of a customer support system with two agents: Agent A for initial contact and Agent B for processing refunds. If a customer asks for a refund, Agent A’s job is to gather the info and hand it off to Agent B.
If you evaluate Agent A alone, its task completion score might be zero because it didn’t actually issue the refund. But in reality, it performed its job perfectly by successfully handing off the task. Conversely, if Agent A passes the wrong information, the system as a whole fails, even if Agent B’s logic is perfect.
This shows why, in multi-agent systems, what really matters is the end-to-end evaluation. We need to measure how smoothly agents hand off tasks, share context, and collaborate to achieve the final goal.
We wrapped up by touching on some of the biggest open challenges in agent evaluation today:
Cost-Scalability Tradeoff: Human evaluation is high-quality but expensive; LLM-as-a-judge is scalable but requires careful calibration. Finding the right balance is key.
Benchmark Integrity: As models get more powerful, there’s a risk that benchmark questions leak into their training data, making scores less meaningful.
Evaluating Subjective Attributes: How do you objectively measure qualities like creativity, proactivity, or even humor in an agent’s output? These are still open questions the community is working to solve.
Your Turn to Build
This episode was packed with concepts, but the goal was to give you a practical framework for thinking about and implementing a robust evaluation strategy. From the fast, iterative loop in the ADK to scaled-up pipelines in Vertex AI, having the right evaluation mindset is what turns a cool prototype into a production-ready agent.
We encourage you to watch the full episode to see the demos in action and start applying these principles to your own projects.
AWS Parallel Computing Service (PCS) now supports Slurm v25.05, so you can create AWS PCS clusters running this newer Slurm release.
The release of Slurm v25.05 in PCS provides new Slurm functionalities including enhanced multi-cluster sackd configuration and improved requeue behavior for instance launch failures. With this release, login nodes can now control multiple clusters without requiring sackd reconfiguration or restart. This enables administrators to pre-configure access to multiple clusters for their users. The new requeue behavior enables more resilient job scheduling by automatically retrying failed instance launches during capacity shortages, thus increasing overall cluster reliability.
AWS PCS is a managed service that makes it easier for you to run and scale your high performance computing (HPC) workloads on AWS using Slurm. To learn more about PCS, refer to the service documentation and AWS Region Table.
Amazon CloudWatch Database Insights now supports tag-based access control for database and per-query metrics powered by RDS Performance Insights. You can implement access controls across a logical grouping of database resources without managing individual resource-level permissions.
Previously, tags defined on RDS and Aurora instances did not apply to metrics powered by Performance Insights, creating significant overhead in manually configuring metric-related permissions at the database resource level. With this launch, those instance tags are now automatically evaluated to authorize metrics powered by Performance Insights. This allows you to define IAM policies using tag-based access conditions, resulting in improved governance and security consistency.
Please refer to RDS and Aurora documentation to get started with defining IAM policies with tag-based access control on database and per-query metrics. This feature is available in all AWS regions where CloudWatch Database Insights is available.
CloudWatch Database Insights delivers database health monitoring aggregated at the fleet level, as well as instance-level dashboards for detailed database and SQL query analysis. It offers vCPU-based pricing – see the pricing page for details. For further information, visit the Database Insights User Guide.
We designed BigQuery Studio to give data analysts, data engineers, and data scientists a comprehensive analytics experience within a single, purpose-built platform, helping them transform data into powerful insights.
Today, we’re thrilled to unveil a significant update to BigQuery Studio, with a simplified and organized console interface to streamline your workflows, enhance your productivity, and give you greater control. Start your day ready to dive into data in an environment built for efficiency, free from time-consuming sifting through countless queries or searching for the right tables. Come with us on a tour of the new interface, including a new:
Additional Explorer view to simplify data discovery and exploration
Reference panel that brings all the context you need, without context switching
Decluttered UI that gives you more control
Finding your way with the Explorer
Your journey begins with an expanded view of the Explorer, which lets you find and access resources using a full tab with more information about each resource. To view resources within a project, pick the project in the Explorer and choose the resource type you want to explore. A list of the resources shows up in a tab where you can filter or drill down to find what you’re looking for. To see all of your starred resources across projects, simply click “Starred” at the top of the Explorer pane to open the list of starred items. Alongside the new Explorer view, the full resource tree view is still available in the Classic Explorer, accessible by clicking the middle icon at the top of the pane.
As your projects grow, so does the need for efficient searching. The new search capabilities in BigQuery Studio allow you to easily find BigQuery resources. Use the search box in the new Explorer pane to search across all of your BigQuery resources within your organization. Then, filter the results by project and resource type to pinpoint exactly what you need.
To reduce tab proliferation and give you more control over your workspace, clicking on a resource now consistently opens it within the same BigQuery Studio tab. To open multiple results in separate tabs, use ctrl+click (or cmd+click). To prevent the current tab from getting its content replaced, double-click the tab name (you’ll notice that its name changes from italicized to regular font).
Context at your fingertips with the Reference panel
Writing complex queries often involves switching between tabs or running exploratory queries just to remember schema details or column names. The Reference panel eliminates this hassle. It dynamically displays context-aware information about tables and schemas directly within your editors as you write code. This means you have quick access to crucial details, so you can write your query with fewer interruptions.
The Reference panel also lets you generate code without having to copy and paste things like table or column names. To quickly start a new SQL query on a table, click the actions menu at the top of the Reference panel and select “insert query snippet”. The query code is automatically added to your editor. You can also click any field name in the table schema to insert it into your code.
Beyond the highlights: Less clutter and more efficiency
These updates are part of a broader effort to provide you with a clean workspace over which you have more control. In addition, the new BigQuery Studio includes a dedicated Job history tab, accessible from the new Explorer pane, providing a bigger view of jobs and reducing clutter by removing the bottom panel. You can also fully collapse the Explorer panel to gain more space to focus on your code.
Ready to experience the difference? We invite you to log in to BigQuery Studio and try the new interface. Check out the Home tab in BigQuery Studio to learn more about these changes. For more details and to deepen your understanding, be sure to explore our documentation. Any feedback? Email us at bigquery-explorer-feedback@google.com.
For too long, network data analysis has felt less like a science and more like deciphering cryptic clues. To help close that gap, we’re introducing a new Mandiant Academy course from Google Cloud, designed to replace frustration with clarity and confidence.
We’ve designed the course specifically for cybersecurity professionals who need to quickly and effectively enhance network traffic analysis skills. You’ll learn to cut through the noise, identify malicious fingerprints with higher accuracy, and fortify your organization’s defenses by integrating critical cyber threat intelligence (CTI).
What you’ll learn
This track includes four courses that provide practical methods to analyze networks and operationalize CTI. Students will explore five proven methodologies for network analysis:
Packet capture (PCAP)
Network flow (netflow)
Protocol analysis
Baseline and behavioral
Historical analysis
Incorporating common tools, we demonstrate how to enrich each methodology by adding CTI, and how analytical tradecraft enhances investigations.
The first course, Decoding Network Defense, refreshes foundational CTI principles and the five core network traffic analysis methodologies.
The second course, Analyzing the Digital Battlefield, investigates PCAP, netflow, and protocol before exploring how CTI enriches new evidence.
In the third course, Insights into Adversaries, students learn to translate complex human behaviors into detectable signatures.
The final course, The Defender’s Arsenal, introduces essential tools for those on the frontline, protecting their network’s perimeter.
Who should attend this course?
“Protecting the Perimeter” was developed for practitioners whose daily work is to interpret network telemetry from multiple data sources and identify anomalous behavior. This track’s format is designed for professionals who possess enough knowledge and skill to defend networks, but have limited time to continue their education and enhance their abilities.
This training track is the second release from Mandiant Academy’s new approach to on-demand training which concentrates complex security concepts into short-form courses.
Sign up today
To learn more about and register for the course, please visit the Mandiant Academy website. You can also access Mandiant Academy’s on-demand, instructor-led, and experiential training options. We hope this course proves helpful in your efforts to defend your organization against cyber threats.
Deploying LLM workloads can be complex and costly, often involving a lengthy, multi-step process. To solve this, Google Kubernetes Engine (GKE) offers Inference Quickstart.
With Inference Quickstart, you can replace months of manual trial-and-error with out-of-the-box manifests and data-driven insights. Inference Quickstart integrates with the Gemini CLI through native Model Context Protocol (MCP) support to offer tailored recommendations for your LLM workload cost and performance needs. Together, these tools empower you to analyze, select, and deploy your LLMs on GKE in a matter of minutes. Here’s how.
1. Select and serve your LLM on GKE via Gemini CLI
You can install the Gemini CLI and the gke-mcp server by following their setup documentation.
Here are some example prompts that you can give Gemini CLI to select an LLM workload and generate the manifest needed to deploy the model to a GKE cluster:
1. What are the 3 cheapest models available on GKE Inference Quickstart? Can you provide all of the related performance data and accelerators they ran on?
2. How does this model’s performance compare when it was run on different accelerators?
3. How do I choose between these 2 models?
4. I’d like to generate a manifest for this model on this accelerator and save it to the current directory.
The video below shows an end-to-end example of how you can quickly identify and deploy your optimal LLM workload to a pre-existing GKE cluster via this Gemini CLI setup:
2. Compare cost and performance across accelerators
Choosing the right hardware for your inference workload means balancing performance and cost. The trade-off is nonlinear. To simplify this complex trade-off, Inference Quickstart provides performance and cost insights across various accelerators, all backed by Google’s benchmarks.
For example, as shown in the graph below, minimizing latency for a model like Gemma 3 4b on vLLM dramatically increases cost. This is because achieving ultra-low latency requires sacrificing the efficiency of request batching, which leaves your accelerators underutilized. Request load, model size, architecture, and workload characteristics can all impact which accelerator is optimal for your specific use case.
To make an informed decision, you can get instant, data-driven recommendations by asking Gemini CLI or using the Inference Quickstart Colab notebook.
3. Calculate cost per input/output token
When you host your own model on a platform like GKE, you are billed for accelerator time, not for each individual token. Inference Quickstart calculates cost per token using the accelerator’s hourly cost and the input/output throughput.
The cost attribution splits the accelerator’s total hourly cost across the input and output tokens it serves in that hour, assuming an output token costs four times as much as an input token. The reason for this heuristic is that the prefill phase (processing input tokens) is a highly parallel operation, whereas the decode phase (generating output tokens) is a sequential, auto-regressive process. You can ask Gemini CLI to change this ratio to fit your workload’s expected input/output ratio.
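As a rough sketch of how such an attribution can be computed, the function below is an illustration under the 4:1 assumption above with made-up throughput numbers, not the exact formula Inference Quickstart uses.

# Illustrative cost-per-token attribution under a 4:1 output:input weighting.
# Not the exact Inference Quickstart formula; throughputs are tokens per second.

def cost_per_token(accelerator_cost_per_hour: float,
                   input_tokens_per_sec: float,
                   output_tokens_per_sec: float,
                   output_to_input_ratio: float = 4.0):
    cost_per_sec = accelerator_cost_per_hour / 3600.0
    # Weight output tokens more heavily, then split the per-second cost.
    weighted_rate = input_tokens_per_sec + output_to_input_ratio * output_tokens_per_sec
    cost_per_input_token = cost_per_sec / weighted_rate
    cost_per_output_token = output_to_input_ratio * cost_per_input_token
    return cost_per_input_token, cost_per_output_token

# Example with made-up numbers: a $2/hour accelerator serving
# 5,000 input and 1,000 output tokens per second.
print(cost_per_token(2.0, 5000.0, 1000.0))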
The key to cost-effective LLM inference is to take a data-driven approach. By relying on benchmarks for your workloads and using metrics like cost per token, you can make informed decisions that directly impact your budget and performance.
Next steps
GKE Inference Quickstart goes beyond cost insights and Gemini CLI integration, including optimizations for storage, autoscaling, and observability. Run your LLM workloads today with GKE Inference Quickstart to see how it can expedite and optimize your LLMs on GKE.