We are excited to announce that AWS Deadline Cloud is now available in Asia Pacific (Seoul) and Europe (London). Deadline Cloud is a fully managed service that simplifies render management for teams creating computer-generated graphics and visual effects for films, television, broadcasting, web content, and design. Customers can now use Deadline Cloud to scale their render farms in regions that are close to their creative teams, enabling better integration with existing AWS services and creative pipelines.
Deadline Cloud is now available in 10 AWS Regions worldwide: US East (N. Virginia and Ohio), US West (Oregon), Asia Pacific (Seoul, Singapore, Sydney, and Tokyo), and Europe (Frankfurt, Ireland, and London). For details on where Deadline Cloud is available, see the AWS Region table or the AWS Regional Services List. To learn more, visit the AWS Deadline Cloud product page.
Today, AWS announces enhanced scanning capabilities for GuardDuty Malware Protection for Amazon S3. This launch raises the maximum file size limit from 5 GB to 100 GB and expands archive processing capacity to handle up to 10,000 files per archive, up from the previous limit of 1,000 files.
GuardDuty Malware Protection for S3 is a fully managed threat detection service that automatically scans objects uploaded to S3 buckets and alerts customers of malware, viruses, and other malicious code before they can impact workloads or downstream processes. With this launch, GuardDuty S3 malware scanning now offers customers even better protection for large files and comprehensive archive collections stored in Amazon S3.
The enhanced scanning capabilities are automatically enabled in all AWS Regions where GuardDuty Malware Protection for S3 is supported. To learn more about GuardDuty Malware Protection for S3 and its features, please visit the AWS Documentation.
Amazon Relational Database Service (RDS) Proxy now supports end-to-end IAM authentication for connections to Amazon Aurora and RDS database instances. This feature allows you to connect from your applications to your databases through RDS Proxy using AWS Identity and Access Management (IAM) authentication. End-to-end IAM authentication simplifies credential management, reduces credential rotation overhead, and enables you to leverage IAM’s robust authentication and authorization capabilities throughout your database connection path.
With end-to-end IAM authentication, you can now connect to your databases through RDS Proxy without needing to register or store credentials in Secrets Manager. End-to-end IAM authentication is available for MySQL and PostgreSQL database engines in all AWS Regions where RDS Proxy is supported.
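To illustrate what this looks like from application code, here is a minimal sketch (assuming a PostgreSQL target and placeholder endpoint, user, and Region values) that generates an IAM authentication token with boto3 and uses it as the password when connecting through the proxy endpoint; see the RDS Proxy documentation for the required IAM policy and TLS settings.

import boto3
import psycopg2

# Hypothetical values -- replace with your proxy endpoint, database user, and Region.
PROXY_ENDPOINT = "my-proxy.proxy-abc123xyz.us-east-1.rds.amazonaws.com"
DB_USER = "app_user"
REGION = "us-east-1"

# Generate a short-lived IAM authentication token for the proxy endpoint.
rds = boto3.client("rds", region_name=REGION)
token = rds.generate_db_auth_token(
    DBHostname=PROXY_ENDPOINT,
    Port=5432,
    DBUsername=DB_USER,
    Region=REGION,
)

# Connect through RDS Proxy using the token as the password; TLS is required
# when using IAM authentication.
conn = psycopg2.connect(
    host=PROXY_ENDPOINT,
    port=5432,
    dbname="postgres",
    user=DB_USER,
    password=token,
    sslmode="require",
)
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone())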
Many applications, including those built on modern serverless architectures, may need a high number of open database connections or may frequently open and close connections, exhausting database memory and compute resources. Amazon RDS Proxy allows applications to pool and share database connections, improving database efficiency as well as application scalability, resiliency, and security.
For information on supported database engine versions and regional availability of RDS Proxy, refer to the RDS and Aurora documentation.
Generative AI is no longer just an experiment. The real challenge now is quantifying its value. For leaders, the path is clear: make AI projects drive business growth, not just incur costs. Today, we’ll share a simple three-part plan to help you measure the impact and demonstrate the true value of your AI initiatives.
This methodology connects your technology solution to a concrete business outcome. It creates a logical narrative that justifies investment and measures success.
1. Define what success looks like (the value)
The first step is to define the project’s desired outcome by identifying its “value drivers.” For any AI initiative, these drivers typically fall into four universal business categories:
Operational efficiency & cost savings: This involves quantifying improvements to core business processes. Value is measured by reducing manual effort, optimizing resource allocation, lowering error rates in production or operations, or streamlining complex supply chains.
Revenue & growth acceleration: While many organizations initially focus on efficiency, true market leadership is achieved through growth. This category of value drivers is the critical differentiator, as it focuses on top-line impact. Value can come from accelerating time-to-market for new products, identifying new revenue streams through data analysis, or improving sales effectiveness and customer lifetime value.
Experience & engagement: This captures the enhancement of human interaction with technology. It applies broadly to improving customer satisfaction (CX), boosting employee productivity and morale with intelligent tools (EX), or creating more seamless partner experiences.
Strategic advancement & risk mitigation: This covers long-term competitive advantages and downside protection. Value drivers include accelerating R&D cycles, gaining market-differentiating insights from proprietary data, strengthening operational resiliency, or ensuring regulatory compliance and reducing fraud.
2. Specify what it costs to succeed (your investment)
The second part of the framework demands transparency regarding the investment. This requires a complete view of the Total Cost of Ownership (TCO), which extends beyond service fees to include model training, infrastructure, and the operational support needed to maintain the system. For a detailed guide, we encourage a review of our post, How to calculate your AI costs on Google Cloud.
3. State the ROI
This is the synthesis of the first two steps. The ROI calculation makes the business case explicit by stating the time required to pay back the initial investment and the ongoing financial return the project will generate.
The framework in action: An AI chatbot for customer service
Now, let’s apply the universal framework to a specific use case. Consider an e-commerce company implementing an AI chatbot. Here, the four general value drivers become tailored to the world of customer service.
Step 1: Define success (the value)
The team uses the customer-service-specific quadrants to build a comprehensive value estimate.
Quadrant 1: Operational efficiency
Reduced agent handling time: By automating 60% of routine inquiries, the company frees up thousands of agent hours. This enables agents to serve more customers or perhaps provide better quality service to premium customers.
Estimated hours saved: ~725 hours (let’s say this equates to $15,660 in value)
Lower onboarding & training costs: New agents become productive faster as the AI handles the most common questions, reducing the burden of repetitive training.
Estimated monthly value: $1,000
Quadrant 2: Revenue growth
24/7 Sales & support: The chatbot assists customers and captures sales leads around the clock, converting shoppers who would otherwise leave.
Estimated monthly value: $5,000
Improved customer retention: Faster resolution and a better experience lead to a small, measurable increase in customer loyalty and repeat purchases.
Estimated monthly value: $1,000
Quadrant 3: Customer and employee experience
Enhanced agent experience & retention: Human agents are freed from monotonous tasks to focus on complex, rewarding problems. This improves morale and reduces costly agent turnover.
Estimated monthly value: $500
Quadrant 4: Strategic enablement
Expanding business to more languages: Enabling human agents to provide support in 15+ additional languages, thanks to the translation service built into the system.
Estimated revenue increase: $1,750
Total estimated monthly value = $15,660 + $1,000 + $5,000 + $1,000 + $500 + $1,750 = $24,910
Step 2: Define the cost (the investment)
Following a TCO analysis from our earlier blog post, we calculated that the total ongoing monthly cost for the fully managed AI solution on Google Cloud would be approximately $2,700.
Step 3: State the ROI
The final story was simple and powerful. With a monthly value of around $25,000 and a cost of only $2,700, the project generated significant positive cash flow. The initial setup cost was paid back in less than two weeks, securing an instant “yes” from leadership.
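To make the arithmetic explicit, here is a small sketch that computes the net monthly benefit, ROI, and payback period from the figures above; the one-time setup cost is a placeholder assumption rather than a number from this example.

# Illustrative ROI math using the example figures from this post.
monthly_value = 15_660 + 1_000 + 5_000 + 1_000 + 500 + 1_750   # = $24,910
monthly_cost = 2_700          # ongoing TCO from step 2
one_time_setup = 10_000       # placeholder assumption -- not from the post

net_monthly_benefit = monthly_value - monthly_cost               # $22,210
roi_pct = net_monthly_benefit / monthly_cost * 100               # ~823% per month
payback_months = one_time_setup / net_monthly_benefit            # ~0.45 months (~2 weeks)

print(f"Net monthly benefit: ${net_monthly_benefit:,.0f}")
print(f"Monthly ROI: {roi_pct:.0f}%")
print(f"Payback period: {payback_months:.2f} months")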
For large enterprises adopting a cloud platform, managing network connectivity across VPCs, on-premises data centers, and other clouds is critical. However, traditional models often lack scalability and increase management overhead. Google Cloud’s Network Connectivity Center is a compelling alternative.
As a centralized hub-and-spoke service for connecting and managing network resources, Network Connectivity Center offers a scalable and resilient network foundation. In this post, we explore Network Connectivity Center’s architecture, availability model, and design principles, highlighting its value and design considerations for maximizing resilience and minimizing the “blast radius” of issues. Armed with this information, you’ll be better able to evaluate how Network Connectivity Center fits within your organization, and to get started.
The challenges of large-scale enterprise networks
Large-scale VPC networks consistently face three core challenges: scalability, complexity, and the need for centralized management. Network Connectivity Center is engineered specifically to address these pain points head-on, thanks to:
Massively scalable connectivity: Scale far beyond traditional limits and VPC Peering quotas. Network Connectivity Center supports up to 250 VPC spokes per hub and millions of VMs, while enhanced cross-cloud connectivity and upcoming features like firewall insertion help ensure your network is prepared for future demands.
Smooth workload mobility and service networking: Easily migrate workloads between VPCs. Network Connectivity Center natively solves transitivity challenges through features like producer VPC spoke integration to support private service access (PSA) and Private Service Connect (PSC) propagation, streamlining service sharing across your organization.
Reduced operational overhead: Network Connectivity Center offers a single control point for VPC and on-premises connections, automating full-mesh connectivity between spokes to dramatically reduce operational burdens.
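As a rough illustration of the hub-and-spoke model described above, the sketch below creates a global hub and attaches a VPC spoke with the google-cloud-network-connectivity Python client; the project ID, network URI, and exact client field names are assumptions to verify against the current API reference.

# A minimal sketch (not a full deployment): create a global NCC hub and attach
# a VPC spoke. Resource names and fields below are illustrative placeholders.
from google.cloud import networkconnectivity_v1

PROJECT = "my-project"                                      # placeholder
VPC_URI = f"projects/{PROJECT}/global/networks/prod-vpc"    # placeholder VPC

client = networkconnectivity_v1.HubServiceClient()
parent = f"projects/{PROJECT}/locations/global"

# Create the hub (long-running operation).
hub_op = client.create_hub(
    request=networkconnectivity_v1.CreateHubRequest(
        parent=parent,
        hub_id="central-hub",
        hub=networkconnectivity_v1.Hub(description="Org-wide connectivity hub"),
    )
)
hub = hub_op.result()

# Attach a VPC network as a spoke of the hub.
spoke_op = client.create_spoke(
    request=networkconnectivity_v1.CreateSpokeRequest(
        parent=parent,
        spoke_id="prod-vpc-spoke",
        spoke=networkconnectivity_v1.Spoke(
            hub=hub.name,
            linked_vpc_network=networkconnectivity_v1.LinkedVpcNetwork(uri=VPC_URI),
        ),
    )
)
print(spoke_op.result().name)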
Under the hood: Architected for resilience
Let’s home in on how Network Connectivity Center stays resilient. A key part of that is its architecture, which is built on three distinct, decoupled planes.
A very simplified view of the Network Connectivity Center & Google Cloud networking stack
Management plane: This is your interaction layer — the APIs, gcloud commands, and Google Cloud console actions you use to configure your network. It’s where you create hubs, attach spokes, and manage settings.
Control plane: This is the brains of the operation. It takes your configuration from the management plane and programs the underlying network. It’s a distributed, sharded system responsible for the software-defined networking (SDN) that makes everything work.
Data plane: This is where your actual traffic flows. It’s the collection of network hardware and individual hosts that move packets based on the instructions programmed by the control plane.
A core principle that Network Connectivity Center uses across this architecture is fail-static behavior. This means that if a higher-level plane (like the management or control plane) experiences an issue, the planes below it continue to operate based on the last known good configuration, and existing traffic flows are preserved. This helps ensure that, say, a control plane issue doesn’t bring down your entire network.
How Network Connectivity Center handles failures
A network’s strength is revealed by how it behaves under pressure. Network Connectivity Center’s design is fundamentally geared towards stability, so that potential issues are contained and their impact is minimized. Consider the following Network Connectivity Center design points:
Contained infrastructure impact: An underlying infrastructure issue such as a regional outage only affects resources within that specific scope. Because Network Connectivity Center hubs are global resources, a single regional failure won’t bring down your entire network hub. Connectivity between all other unaffected spokes remains intact.
Isolated configuration faults: We intentionally limit the “blast radius” of a configuration error with careful fault isolation. A mistake made on one spoke or hub is isolated and will not cascade to cause failures in other parts of your network. This fault isolation is a crucial advantage over intricate VPC peering topologies, where a single routing misconfiguration can have far-reaching consequences.
Uninterrupted data flows: The fail-static principle ensures that existing data flows are highly insulated from management or control plane disruptions. In the event of a failure, the network continues to forward traffic based on the last successfully programmed state, maintaining stability and continuity for your applications.
Managing the blast radius of configuration changes
Even if an infrastructure outage affects resources in its scope, Network Connectivity Center connectivity in other zones or regions remains functional. Critically, Network Connectivity Center configuration errors are isolated to the specific hub or spoke being changed and don’t cascade to unrelated parts of the network — a key advantage over complex VPC peering approaches.
To further enhance stability and operational efficiency, we also streamlined configuration management in Network Connectivity Center. Updates are handled dynamically by the underlying SDN, eliminating the need for traditional maintenance windows for configuration changes. Changes are applied transparently at the API level and are designed to be backward-compatible, for smooth and non-disruptive network evolution.
Connecting multiple regional hubs
A Network Connectivity Center hub is a global resource. A multi-region resilient design may involve regional deployments with a dedicated hub per region, which requires connectivity across multiple hubs. Though Network Connectivity Center does not offer native hub-to-hub connectivity, alternative methods allow communication across Network Connectivity Center hubs, fulfilling specific controlled-access needs:
Cloud VPN or Cloud Interconnect: Use dedicated HA VPN tunnels or VLAN attachments to connect Network Connectivity Center hubs.
Private Service Connect (PSC): Leverage a producer/consumer model with PSC to provide controlled, service-specific access across Network Connectivity Center hubs.
Multi-NIC VMs: Route traffic between Network Connectivity Center hubs using VMs with network interfaces in spokes of different hubs.
Full-mesh VPC Peering: For specific use cases like database synchronization, establish peering between spokes of different Network Connectivity Center hubs.
Frequently asked questions
What happens to traffic if the Network Connectivity Center control plane fails? Due to the fail-static design, existing data flows continue to function based on the last known successful configuration. Dynamic routing updates will stop, but existing routes remain active.
Does adding a new VPC spoke impact existing connections? No. When a new spoke is added, the process is dynamic and existing data flows should not be interrupted.
Is there a performance penalty for traffic traversing between VPCs via Network Connectivity Center? No. Traffic between VPCs connected by Network Connectivity Center experiences the same performance as VPC peering.
Best practices for resilience
While Network Connectivity Center is a powerful and resilient platform, designing a network for maximum availability requires careful planning on your part. Consider the following best practices:
Leverage redundancy: Data plane availability is localized. To survive a localized infrastructure failure, be sure to deploy critical applications across multiple zones and regions.
Plan your topology carefully: Choosing your hub topology is a critical design decision. A single global hub offers operational simplicity and is the preferred approach for most use cases. Consider multiple regional hubs only if strict regional isolation or minimizing control plane blast radius is a primary requirement, and be aware of the added complexity. Finally, even in a multi-hub design, Network Connectivity Center hubs are still global resources; that means in the event of a global outage, management plane operations may be impacted independent of regional availability.
Choose Network Connectivity Center for transitive connectivity: For large-scale networks that require transitive connectivity for shared services, choosing Network Connectivity Center over traditional VPC peering can simplify operations and allow you to leverage features like PSC/PSA propagation.
Embrace infrastructure-as-code: Use tools like Terraform to manage your Network Connectivity Center configuration, which reduces the risk of manual errors and makes your network deployments repeatable and reliable.
Plan for scale: Be aware of Network Connectivity Center’s high, but finite, scale limits (e.g., 250 VPC spokes per hub) and plan your network growth accordingly.
A simple approach to scalable, resilient networking
Network Connectivity Center removes much of the complexity from enterprise networking, providing a simple, scalable and resilient foundation for your organization. By understanding its layered architecture, fail-static behavior, and design principles, you can build a network that not only meets your needs today but is ready for the challenges of tomorrow.
Security leaders are clear about their priorities: After AI, cloud security is the top training topic for decision-makers. As threats against cloud workloads become more sophisticated, organizations are looking for highly-skilled professionals to help defend against these attacks.
To help organizations meet their need for experts who can manage a modern security team’s advanced tools, Google Cloud’s new Professional Security Operations Engineer (PSOE) certification can help train specialists to detect and respond to new and emerging threats.
Unlock your potential as a security operations expert
Earning a Google Cloud certification can be a powerful catalyst for career advancement. Eight in 10 learners report that having a Google Cloud certification contributes to faster career advancement, and 85% say that cloud certifications equip them with the skills to fill in-demand roles, according to an Ipsos study published in 2025 and commissioned by Google Cloud.
Foresite, a leading Google Cloud managed security service provider (MSSP), said that the certification has been instrumental in helping them provide security excellence to their clients.
“As a leader at Foresite, our commitment is to deliver unparalleled security outcomes for our clients using the power of Google Cloud. The Google Cloud Professional Security Operations Engineer (PSOE) certification is fundamental to that mission. For us, it’s the definitive validation that our engineers have mastered the advanced Google Security Operations platform we use to protect our clients’ businesses. Having a team of PSOE-certified experts provides our clients with direct assurance of our capabilities and expertise. It solidifies our credibility as a premier Google Cloud MSSP and gives us a decisive edge in the market. Ultimately, it’s a benchmark of the excellence we deliver daily,” said Brad Thomas, director, Security Engineering.
The PSOE certification can help validate practical skills needed to protect a company’s data and infrastructure in real-world scenarios, a key ingredient for professional success. It also can help security operations engineers demonstrate their ability to directly address evolving and daily challenges.
Gain a decisive edge with certified security talent
For organizations, including MSSPs and other Google Cloud partners, this certification is a powerful way to help ensure that your security professionals are qualified to effectively implement, respond to, and remediate security events using Google Cloud’s suite of solutions.
Hiring managers are increasingly looking for a specific skill set. The Ipsos study also found that eight in 10 leaders prefer to recruit and hire professionals who hold cloud certifications, seeing them as a strong indicator of expertise.
“We are excited about Google’s new Professional Security Operations Engineer certification, which will help Accenture demonstrate our leading expertise in security engineering and operations to clients. This validation is important because it gives our clients confidence in knowing Accenture has certified professionals with structured training as they choose the best service partner for their security transformations. For our teams, this new certification offers a clear path for professional development and career advancement. Google’s Professional Security Operations Engineer certification will enable Accenture to support clients better as they successfully adopt and get the most out of the Google Security Operations and Security Command Center platforms,” said Rex Thexton, chief technology officer, Accenture Cybersecurity.
Demonstrate comprehensive expertise across Google Cloud Security tools
A Google Cloud-certified PSOE can effectively use Google Cloud security solutions to detect, monitor, investigate, and respond to security threats across an enterprise environment. This role encompasses identity, workloads, services, infrastructure, and more.
Plus, PSOEs can perform critical tasks such as writing detection rules, remediating misconfigurations, investigating threats, and developing orchestration workflows. The PSOE certification validates the candidate’s abilities with Google Cloud security tools and services, including:
Google Security Operations
Google Threat Intelligence
Security Command Center
Specifically, the exam assesses ability across six key domains:
Platform operations (~14%): Enhancing detection and response with the right telemetry sources and tools, and configuring access authorization.
Data management (~14%): Ingesting logs for security tooling and identifying a baseline of user, asset, and entity context.
Threat hunting (~19%): Performing threat hunting across environments and using threat intelligence for threat hunting.
Detection engineering (~22%): Developing and implementing mechanisms to detect risks and identify threats, and using threat intelligence for detection.
Incident response (~21%): Containing and investigating security incidents; building, implementing, and using response playbooks; and implementing the case management lifecycle.
Observability (~10%): Building and maintaining dashboards and reports to provide insights, and configuring health monitoring and alerting.
While there are no formal prerequisites to take the exam, we recommend that candidates have:
At least three years of security industry experience.
At least one year of hands-on experience using Google Cloud security tooling.
The certification is relevant for experienced professionals, including those in advanced career stages and roles, such as security architects.
Your path to security operations starts here
To prepare for the exam, Google Cloud offers resources that include online training and hands-on labs. The official Professional Security Operations Engineer Exam Guide provides a complete list of topics covered, helping candidates align their skills with the exam content. Candidates can also start preparing through the recommended learning path.
Amazon Elastic Container Service (Amazon ECS), a fully managed container orchestration service, now makes it easier to create and update task definitions in the AWS Management Console with generative AI assistance from Amazon Q Developer.
This new capability helps customers complete their task definitions faster and more efficiently using AI-generated code suggestions. You can use the inline chat capability to ask Amazon Q Developer to generate, explain, or refactor task definition JSON through a conversational interface, insert the generated suggestions at any point in the task definition, and accept or reject the proposed changes. Amazon ECS has also enhanced its existing inline suggestions feature to use Amazon Q Developer: in addition to property-based inline suggestions, Amazon Q Developer can now autocomplete whole blocks of sample code.
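For reference, the console assistance ultimately produces standard task definition JSON. A minimal Fargate-style definition, registered here programmatically with boto3 and using placeholder names and sizing, looks roughly like this:

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")  # placeholder Region

# A minimal Fargate task definition, comparable to the JSON you would edit
# (or have Amazon Q Developer draft) in the console task definition editor.
response = ecs.register_task_definition(
    family="web-app",                      # placeholder family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    containerDefinitions=[
        {
            "name": "web",
            "image": "public.ecr.aws/nginx/nginx:latest",
            "essential": True,
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
        }
    ],
)
print(response["taskDefinition"]["taskDefinitionArn"])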
AWS announces VPC endpoints for Amazon CloudWatch Observability Access Manager (OAM). CloudWatch OAM enables you to programmatically manage cross-account observability settings within a region. The new VPC endpoints enhance your security posture by keeping traffic between your VPC and CloudWatch OAM within the AWS network, eliminating the need to traverse the public internet.
You can use Observability Access Manager to create and manage links between source accounts and monitoring accounts, enabling you to monitor and troubleshoot applications that span multiple accounts within a Region. With the new VPC endpoints, you can establish secure, private, and reliable connections between your VPC and CloudWatch Observability Access Manager. This allows you to maintain private connectivity while managing cross-account observability links and sinks, even from VPCs without internet access. This feature supports both IPv4 and IPv6 addressing, and you can use AWS PrivateLink’s built-in security controls—like security groups and VPC endpoint policies—to help secure access to your observability resources.
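As an illustration, the sketch below uses boto3 to create an interface VPC endpoint for CloudWatch OAM; the VPC, subnet, and security group IDs are placeholders, and the service name shown is an assumption you should confirm for your Region.

import boto3

REGION = "us-east-1"  # placeholder
ec2 = boto3.client("ec2", region_name=REGION)

# Confirm the OAM endpoint service name available in your Region first.
services = ec2.describe_vpc_endpoint_services()["ServiceNames"]
print([s for s in services if ".oam" in s])  # e.g. 'com.amazonaws.us-east-1.oam' (assumed naming)

# Create an interface endpoint so OAM API calls stay on the AWS network.
endpoint = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                 # placeholder
    ServiceName=f"com.amazonaws.{REGION}.oam",     # assumed service name
    SubnetIds=["subnet-0123456789abcdef0"],        # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],     # placeholder
    PrivateDnsEnabled=True,
)
print(endpoint["VpcEndpoint"]["VpcEndpointId"])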
CloudWatch Observability Access Manager VPC endpoints are now available in all commercial AWS regions, the AWS GovCloud (US) Regions, and the China Regions.
Amazon EventBridge is expanding the availability of its API destinations feature to the AWS Asia Pacific (Melbourne) and AWS Asia Pacific (Thailand) Regions.
EventBridge API destinations are HTTPS endpoints that you can invoke as the target of an event bus rule, similar to how you invoke an AWS service or resource as a target. API destinations provides flexible authentication options for HTTPS endpoints, such as API key and OAuth, storing and managing credentials securely in AWS Secrets Manager on your behalf.
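To sketch the building blocks, the example below creates a connection with API key authentication and an API destination pointing at a hypothetical HTTPS endpoint; the names, endpoint URL, and key value are placeholders.

import boto3

events = boto3.client("events", region_name="ap-southeast-4")  # Melbourne, for example

# 1. Create a connection; EventBridge stores the credential in Secrets Manager for you.
connection = events.create_connection(
    Name="partner-api-connection",            # placeholder
    AuthorizationType="API_KEY",
    AuthParameters={
        "ApiKeyAuthParameters": {
            "ApiKeyName": "x-api-key",
            "ApiKeyValue": "replace-with-real-key",   # placeholder
        }
    },
)

# 2. Create the API destination that an event bus rule can use as a target.
destination = events.create_api_destination(
    Name="partner-webhook",                                     # placeholder
    ConnectionArn=connection["ConnectionArn"],
    InvocationEndpoint="https://example.com/webhooks/orders",   # placeholder endpoint
    HttpMethod="POST",
    InvocationRateLimitPerSecond=10,
)
print(destination["ApiDestinationArn"])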
To get started, visit the Amazon EventBridge documentation to learn more about configuring API destinations.
AWS launches LocalStack integration in Visual Studio Code (VS Code), enabling developers to easily test and debug serverless applications in their local IDE. With this new integration, developers can use LocalStack to locally emulate and test their serverless applications using the familiar VS Code interface without switching between tools or managing complex setup, simplifying their local serverless development process.
LocalStack, an AWS Partner Network (APN) partner, enables developers to emulate AWS services such as AWS Lambda, Amazon SQS, Amazon API Gateway, and DynamoDB for local application development and testing. Previously, to use LocalStack to emulate AWS services in VS Code, developers had to manually configure ports, make code changes, and switch context between the IDE and the LocalStack interface. Now, with the LocalStack integration in VS Code, developers can connect to the LocalStack environment from their IDE without manual configuration or code changes. This gives developers access to emulated AWS resources in the IDE, making it easy to build and test serverless applications locally. For example, they can now easily test and debug Lambda functions and their interactions with AWS services in a LocalStack-emulated environment from their IDE.
This integration is now available to developers using the AWS Toolkit for VS Code (v3.74.0 or later). There is no additional cost from AWS for using this integration. To get started, follow the guided AWS Walkthrough in VS Code, which automatically installs the LocalStack CLI, guides you through LocalStack account setup, and creates a LocalStack profile. Then, switch to the LocalStack profile and deploy applications directly to the LocalStack environment. To learn more, visit the AWS News Blog, AWS Toolkit documentation, and the Lambda Developer Guide.
Amazon Managed Service for Prometheus collector, a fully-managed agentless collector for Prometheus metrics, adds support for vending logs to Amazon CloudWatch Logs.
With Amazon Managed Service for Prometheus collector logs, you can now troubleshoot issues in your setup: context on the Prometheus target discovery process (including authentication issues), scraping status and errors such as timeouts, and information on ingesting collected metrics into your Amazon Managed Service for Prometheus workspace, for example remote-write failures due to workspace issues.
Amazon Managed Service for Prometheus collector logs are now generally available in all regions where Amazon Managed Service for Prometheus is available.
Please visit the Amazon CloudWatch pricing page to learn more about logs pricing. Get started with Managed Service for Prometheus collector logs by visiting our user guide.
Amazon Athena announces single sign-on support for its JDBC and ODBC drivers through AWS IAM Identity Center’s trusted identity propagation. This makes it simpler for organizations to manage end users’ access to data when using third-party tools and to implement identity-based data governance policies with a seamless sign-on experience.
With this new capability, data teams can seamlessly access data through their preferred third-party tools using their organizational credentials. When analysts run queries using the updated Athena JDBC (3.6.0) and ODBC (2.0.5.0) drivers, their access permissions defined in Lake Formation are applied and their actions are logged. This streamlined workflow eliminates credential management overhead while ensuring consistent security policies, allowing data teams to focus on insights rather than access management. For example, data analysts using third-party BI tools or SQL clients can now connect to Athena using their corporate credentials, and their access to data will be restricted based on policies defined for their user identity or group membership in Lake Formation.
This feature is available in Regions where Amazon Athena and AWS IAM Identity Center’s trusted identity propagation are supported. To learn more about configuring identity support when using Athena drivers, see the Amazon Athena driver documentation.
AWS Cloud Development Kit (CDK) CLI now enables safe infrastructure refactoring through the new ‘cdk refactor’ command, available in preview. This feature allows developers to rename constructs, move resources between stacks, and reorganize CDK applications while preserving the state of deployed resources. By leveraging AWS CloudFormation’s refactor capabilities with automated mapping computation, CDK Refactor eliminates the risk of unintended resource replacement during code restructuring.
Previously, infrastructure-as-code maintenance often required reorganizing resources and improving code structure, but these changes traditionally risked replacing existing resources due to logical ID changes. With the CDK Refactor feature, developers can confidently implement architectural improvements like breaking down monolithic stacks, introducing inheritance patterns, or upgrading to higher-level constructs without complex migration procedures or risking downtime of stateful resources. This allows teams to continuously evolve their infrastructure code while maintaining the stability of their production environments.
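As a rough sketch of the workflow using the Python CDK (construct and stack names are illustrative), suppose a bucket originally defined in a monolithic stack is moved into a dedicated storage stack; after the code change, running the preview ‘cdk refactor’ command computes the logical ID mapping so the existing bucket is preserved rather than replaced.

# Before: the bucket lived in a monolithic stack; after: it moves to StorageStack.
# Running `cdk refactor` (preview) after this change maps the existing resource
# to its new location instead of replacing it. Names here are illustrative.
from aws_cdk import App, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct


class StorageStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Previously defined inside a monolithic "AppStack"; moving it here
        # changes its logical ID, which `cdk refactor` reconciles for you.
        self.assets_bucket = s3.Bucket(self, "AssetsBucket")


app = App()
StorageStack(app, "StorageStack")
app.synth()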
The AWS CDK Refactor feature is available in all AWS Regions where the AWS CDK is supported.
For more information and a walkthrough of the feature, check out the blog post and the documentation. You can read more about the AWS CDK here.
Our inference solution is based on AI Hypercomputer, a system built on our experience running models like Gemini and Veo 3, which serve over 980 trillion tokens a month to more than 450 million users. AI Hypercomputer services provide intelligent and optimized inferencing, including resource management, workload optimization and routing, and advanced storage for scale and performance, all co-designed to work together with industry leading GPU and TPU accelerators.
Today, GKE Inference Gateway is generally available, and we are launching new capabilities that deliver even more value. This underscores our commitment to helping companies deliver more intelligence, with increased performance and optimized costs for both training and serving.
Let’s take a look at the new capabilities we are announcing.
Efficient model serving and load balancing
A user’s experience of a generative AI application highly depends on both a fast initial response to a request and a smooth streaming of the response through to completion. With these new features, we’ve improved time-to-first-token (TTFT) and time-per-output-token (TPOT) on AI Hypercomputer. TTFT is based on the prefill phase, a compute-bound process where a full pass through the model creates a key-value (KV) cache. TPOT is based on the decode phase, a memory-bound process where tokens are generated using the KV cache from the prefill stage.
We improve both of these in a variety of ways. Generative AI applications like chatbots and code generation often reuse the same prefix in API calls. To optimize for this, GKE Inference Gateway now offers prefix-aware load balancing. This new, generally available feature improves TTFT latency by up to 96% at peak throughput for prefix-heavy workloads compared with other clouds, intelligently routing requests with the same prefix to the same accelerators while balancing the load to prevent hotspots and latency spikes.
Consider a chatbot for a financial services company that helps users with account inquiries. A user starts a conversation to ask about a recent credit card transaction. Without prefix-aware routing, when the user asks follow up questions, such as the date of the charge or the confirmation number, the LLM has to re-read and re-process the entire initial query before it can answer the follow up question. The re-computation of the prefill phase is very inefficient and adds unnecessary latency, with the user experiencing delays between each question. With prefix-aware routing, the system intelligently reuses the data from the initial query by routing the request back to the same KV cache. This bypasses the prefill phase, allowing the model to answer almost instantly. Less computation also means fewer accelerators for the same workload, providing significant cost savings.
To further optimize inference performance, you can now also run disaggregated serving using AI Hypercomputer, which can improve throughput by 60%. Enhancements in GKE Inference Gateway, llm-d, and vLLM work together to enable dynamic selection of prefill and decode nodes based on query size. This significantly improves both TTFT and TPOT by increasing the utilization of compute and memory resources at scale.
Take an example of an AI-based code completion application, which needs to provide low-latency responses to maintain interactivity. When a developer submits a completion request, the application must first process the input codebase; this is referred to as the prefill phase. Next, the application generates a code suggestion token by token; this is referred to as the decode phase. These tasks have dramatically different demands on accelerator resources — compute-intensive vs. memory-intensive processing. Running both phases on a single node results in neither being fully optimized, causing higher latency and poor response times. Disaggregated serving assigns these phases to separate nodes, allowing for independent scaling and optimization of each phase. For example, if your developer user base submits a lot of requests based on large codebases, you can scale the prefill nodes. This improves latency and throughput, making the entire system more efficient.
Just as prefix-aware routing optimizes the reuse of conversational context, and disaggregated serving enhances performance by intelligently separating the computational demands of model prefill and token decoding, we have also addressed the fundamental challenge of getting these massive models running in the first place. As generative AI models grow to hundreds of gigabytes in size, they can often take over ten minutes to load, leading to slow startup and scaling. To solve this, we now support the Run:ai model streamer with Google Cloud Storage and Anywhere Cache for vLLM, with support for SGLang coming soon. This enables 5.4 GiB/s of direct throughput to accelerator memory, reducing model load times by over 4.9x, resulting in a better end user experience.
vLLM Model Load Time
Get started faster with data-driven decisions
Finding the ideal technology stack for serving AI models is a significant industry challenge. Historically, customers have had to navigate rapidly evolving technologies, the switching costs that impact hardware choices, and hundreds of thousands of possible deployment architectures. This inherent complexity makes it difficult to quickly achieve the best price-performance for your inference environment.
The GKE Inference Quickstart, now generally available, can save you time, improve performance, and reduce costs when deploying AI workloads by suggesting the best accelerators, model server, and scaling configuration for your AI/ML inference applications. New improvements to GKE Inference Quickstart include cost insights and benchmarked performance best practices, so you can easily compare costs and understand latency profiles, saving you months of evaluation and qualification.
GKE Inference Quickstart’s recommendations are grounded in a living repository of model and accelerator performance data that we generate by benchmarking our GPU and TPU accelerators against leading large language models like Llama, Mixtral, and Gemma more than 100 times per week. This extensive performance data is then enriched with the same storage, network, and software optimizations that power AI inferencing on Google’s global-scale services like Gemini, Search, and YouTube.
Let’s say you’re tasked with deploying a new, public-facing chatbot. The goal is to provide fast, high-quality responses at the lowest cost. Until now, finding the most cost-effective solution for deploying AI models was a significant challenge. Developers and engineers had to rely on a painstaking process of trial and error, manually benchmarking countless combinations of models, accelerators, and serving architectures, with all the data logged into a spreadsheet to calculate the cost per query for each scenario. This manual project could take weeks or even months, was prone to human error, and offered no guarantee that the best possible solution was ever found.
Using Google Colab and the built-in optimizations in the Google Cloud console, GKE Inference Quickstart lets you choose the most cost-effective accelerators for, say, serving a Llama 3-based chatbot application that needs a TTFT of less than 500ms. These recommendations are deployable manifests, making it easy to choose a technology stack that you can provision from GKE in your Google Cloud environment. With GKE Inference Quickstart, your evaluation and qualification effort has gone from months to days.
Views from the Google Colab that helps the engineer with their evaluation.
Try these new capabilities for yourself. To get started with GKE Inference Quickstart, from the Google Cloud console, go to Kubernetes Engine > AI/ML, and select “+ Deploy Models” near the top of the screen. Use the Filter to select Optimized > Values = True. This shows you all of the models that have price/performance optimizations to select from. Once you select a model, you’ll see a sliding bar to select latency. The compatible accelerators in the drop-down change to ones that match the latency you select, and the cost per million output tokens also changes based on your selections.
Then, via Google Colab, you can plot and view the price/performance of leading AI models on Google Cloud. Chatbot Arena ratings are integrated to help you determine the best model for your needs based on model size, rating, and price per million tokens. You can also pull your organization’s in-house quality measures into the Colab and join them with Google’s comprehensive benchmarks to make data-driven decisions.
Dedicated to optimizing inference
At Google Cloud, we are committed to helping companies deploy and improve their AI inference workloads at scale. Our focus is on providing a comprehensive platform that delivers unmatched performance and cost-efficiency for serving large language models and other generative AI applications. By leveraging a codesigned stack of industry-leading hardware and software innovations — including the AI Hypercomputer, GKE Inference Gateway, and purpose-built optimizations like prefix-aware routing, disaggregated serving, and model streaming — we ensure that businesses can deliver more intelligence with faster, more responsive user experiences and lower total cost of ownership. Our solutions are designed to address the unique challenges of inference, from model loading times to resource utilization, enabling you to deliver on the promise of generative AI. To learn more and get started, visit our AI Hypercomputer site.
As generative AI becomes more widespread, it’s important for developers and ML engineers to be able to easily configure infrastructure that supports efficient AI inference, i.e., using a trained AI model to make predictions or decisions based on new, unseen data. While GPUs are great at training models, traditional GPU-based serving architectures struggle with the “multi-turn” nature of inference, characterized by back-and-forth conversations where the model must maintain context and understand user intent. Further, deploying large generative AI models can be both complex and resource-intensive.
At Google Cloud, we’re committed to providing customers with the best choices for their AI needs. That’s why we are excited to announce a new recipe for disaggregated inferencing with NVIDIA Dynamo, a high-performance, low-latency platform for a variety of AI models. Disaggregated inference separates out model processing phases, offering a significant leap in performance and cost-efficiency.
Specifically, this recipe makes it easy to deploy NVIDIA Dynamo on Google Cloud’s AI Hypercomputer, including Google Kubernetes Engine (GKE), vLLM inference engine, and A3 Ultra GPU-accelerated instances powered by NVIDIA H200 GPUs. By running the recipe on Google Cloud, you can achieve higher performance and greater inference efficiency while meeting your AI applications’ latency requirements. You can find this recipe, along with other resources, in our growing AI Hypercomputer resources repository on GitHub.
Let’s take a look at how to deploy it.
The two phases of inference
LLM inference is not a monolithic task; it’s a tale of two distinct computational phases. First is the prefill (or context) phase, where the input prompt is processed. Because this stage is compute-bound, it benefits from access to massive parallel processing power. Following prefill is the decode (or generation) phase, which generates a response, token by token, in an autoregressive loop. This stage is bound by memory bandwidth, requiring extremely fast access to the model’s weights and the KV cache.
In traditional architectures, these two phases run on the same GPU, creating resource contention. A long, compute-heavy prefill can block the rapid, iterative decode steps, leading to poor GPU utilization, higher inference costs, and increased latency for all users.
A specialized, disaggregated inference architecture
Our new solution tackles this challenge head-on by disaggregating, or physically separating, the prefill and decode stages across distinct, independently managed GPU pools.
Here’s how the components work in concert:
A3 Ultra instances and GKE: The recipe uses GKE to orchestrate separate node pools of A3 Ultra instances, powered by NVIDIA H200 GPUs. This creates specialized resource pools — one optimized for compute-heavy prefill tasks and another for memory-bound decode tasks.
NVIDIA Dynamo: Acting as the inference server, NVIDIA Dynamo’s modular front end and KV cache-aware router processes incoming requests. It then pairs GPUs from the prefill and decode GKE node pools and orchestrates workload execution between them, transferring the KV cache that’s generated in the prefill pool to the decode pool to begin token generation.
vLLM: Running on pods within each GKE pool, the vLLM inference engine helps ensure best-in-class performance for the actual computation, using innovations like PagedAttention to maximize throughput on each individual node.
This disaggregated approach allows each phase to scale independently based on real-time demand, helping to ensure that compute-intensive prompt processing doesn’t interfere with fast token generation. Dynamo supports popular inference engines including SGLang, TensorRT-LLM and vLLM. The result is a dramatic boost in overall throughput and maximized utilization of every GPU.
Experiment with Dynamo Recipes for Google Cloud
The reproducible recipe shows the steps to deploy disaggregated inference with NVIDIA Dynamo on the A3 Ultra (H200) VMs on Google Cloud using GKE for orchestration and vLLM as the inference engine. The single node recipe demonstrates disaggregated inference with one node of A3 Ultra using four GPUs for prefill and four GPUs for decode. The multi-node recipe demonstrates disaggregated inference with one node of A3 Ultra for prefill and one node of A3 Ultra for decode for the Llama-3.3-70B-Instruct Model.
Future recipes will provide support for additional NVIDIA GPUs (e.g. A4, A4X) and inference engines with expanded coverage of models.
The recipe highlights the following key steps:
Perform initial setup – This sets up environment variables and secrets; it needs to be done only once.
Install Dynamo Platform and CRDs – This sets up the various Dynamo Kubernetes components; it also needs to be done only once.
Deploy inference backend for a specific model workload – This deploys vLLM/SGLang as the inference backend for Dynamo disaggregated inference for a specific model workload. Repeat this step for every new model inference workload deployment.
Process inference requests – Once the model is deployed for inference, incoming queries are processed to provide responses to users.
Once the server is up, you will see the prefill and decode workers along with the frontend pod which acts as the primary interface to serve the requests.
We can verify if everything works as intended by sending a request to the server like this. The response is generated and truncated to max_tokens.
curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
          {
            "role": "user",
            "content": "what is the meaning of life ?"
          }
        ],
        "stream": false,
        "max_tokens": 30
      }' | jq -r '.choices[0].message.content'

---
The question of the meaning of life is a complex and deeply philosophical one that has been debated by scholars, theologians, philosophers, and scientists for
Get started today
By moving beyond the constraints of traditional serving, the new disaggregated inference recipe represents the future of efficient, scalable LLM inference. It enables you to right-size resources for each specific task, unlocking new performance paradigms and significant cost savings for your most demanding generative AI applications. We are excited to see how you will leverage this recipe to build the next wave of AI-powered services. We encourage you to try out our Dynamo Disaggregated Inference Recipe which provides a starting point with recommended configurations and easy steps. We hope you have fun experimenting and share your feedback!
Amazon Interactive Video Service (Amazon IVS) now supports media ingest via interface VPC endpoints powered by AWS PrivateLink. With this launch, you can securely broadcast RTMP(S) streams to IVS Low-Latency channels or IVS Real-Time stages without sending traffic over the public internet. You can create interface VPC endpoints to privately connect your applications to Amazon IVS from within your VPC or from on-premises environments over AWS Direct Connect. This provides private, reliable connectivity for your live video workflows.
Amazon IVS support for media ingest via interface VPC endpoints is available today in the US West (Oregon), Europe (Frankfurt), and Europe (Ireland) AWS Regions. Standard AWS PrivateLink pricing applies. See the AWS PrivateLink pricing page for details.
Today, AWS announced new capabilities for native anomaly detection in AWS IoT SiteWise. This release includes automated model retraining, flexible promotion modes, and exposed model metrics, all designed to enhance the anomaly detection feature.
The automated retraining capability allows models to be automatically retrained on a schedule ranging from a minimum of 30 days to a maximum of one year, eliminating the need to manually retrain models. This feature ensures that models stay up-to-date with changing equipment conditions or configurations, thereby maintaining optimal performance over time.
Additionally, flexible promotion modes give customers the choice between service-managed and customer-managed model promotion. Automatic promotion enables AWS IoT SiteWise to evaluate and promote the best-performing model without customer intervention, while manual promotion allows customers to review comprehensive, exposed model metrics—including precision, recall, and Area Under the ROC Curve (AUC)—before deciding which model version to activate. This flexibility lets customers choose between a hands-off approach and human oversight.
Multivariate anomaly detection is available in the US East (N. Virginia), Europe (Ireland), and Asia Pacific (Sydney) AWS Regions where AWS IoT SiteWise is offered. To learn more, read the launch blog and user guide.
Data centers are the engines of the cloud, processing and storing the information that powers our daily lives. As digital services grow, so do our data centers, and we are working to manage them responsibly. Google thinks of infrastructure at the full-stack level, not just as hardware but as hardware abstracted through software, allowing us to innovate.
We have previously shared how we’re working to reduce the embodied carbon impact at our data centers by optimizing our technical infrastructure hardware. In this post, we shine a spotlight on our “central fleet” program, which has helped us shift our internal resource management system from a machine economy to a more sustainable resource and performance economy.
What is Central Fleet?
At its core, our central fleet program is a resource distribution approach that allows us to manage and allocate computing resources, like processing power, memory, and storage in a more efficient and sustainable way. Instead of individual teams or product teams within Google ordering and managing their own physical machines, our central fleet acts as a centralized pool of resources that can be dynamically distributed to where they are needed most.
Think of it like a shared car service. Rather than each person owning a car they might only use for a couple of hours a day, a shared fleet allows for fewer cars to be used more efficiently by many people. Similarly, our central fleet program ensures our computing resources are constantly in use, minimizing waste and reducing the need to procure new machines.
How it works: A shift to a resource economy
The central fleet approach fundamentally changes how we provision and manage resources. When a team needs more computing power, instead of ordering specific hardware, they place an order for “quota” from the central fleet. This makes the computing resources fungible, that is, interchangeable and flexible. For instance, a team will ask for a certain amount of processing power or storage capacity, not a particular server model.
This “intent-based” ordering system provides flexibility in how demand is fulfilled. Our central fleet can intelligently fulfill requests from existing inventory or procure at scale, which can lower cost and environmental impact. It also facilitates the return of unneeded resources that can then be reallocated to other teams, further reducing waste.
All of this is possible with our full-stack infrastructure and built on the Borg cluster management system to abstract away the physical hardware into a single, fungible resource pool. This software-level intelligence allows us to treat our infrastructure as a fluid, optimizable system rather than a collection of static machines, unlocking massive efficiency gains.
The sustainability benefits of central fleet
The central fleet approach aligns with Google’s broader dedication to sustainability and a circular economy. By optimizing the use of our existing hardware, we can achieve carbon savings. For example, in 2024, our central fleet program helped avoid procurement of new components and machines with an embodied impact equivalent to approximately 260,000 metric tons of CO2e. This roughly equates to avoiding 660 million miles driven by an average gasoline-powered passenger vehicle.1
This fulfillment flexibility leads to greater resource efficiency and a reduced carbon footprint in several ways:
Reduced electronic waste: By extending the life of our machines through reallocation and reuse, we minimize the need to manufacture new hardware and reduce the amount of electronic waste.
Lower embodied carbon: The manufacturing of new servers carries an embodied carbon footprint. By avoiding the creation of new machines, we avoid these associated CO2e emissions.
Increased energy efficiency: Central fleet allows for the strategic placement of workloads on the most power-efficient hardware available, optimizing energy consumption across our data centers.
A more circular economy: This model is a prime example of circular economy principles in action, shifting from a linear “take-make-dispose” model to one that emphasizes reuse and longevity.
The central fleet initiative is more than an internal efficiency project; it’s a tangible demonstration of embedding sustainability into our core business decisions. By rethinking how we manage our infrastructure, we can meet growing AI and cloud demand while simultaneously paving the way for a more sustainable future. Learn more at sustainability.google.
1. Estimated avoided emissions were calculated by applying internal LCA emissions factors to machines and component resources saved through our central fleet initiative in 2024. We input the estimated avoided emissions into the EPA’s Greenhouse Gas Equivalencies Calculator to calculate the equivalent number of miles driven by an average gasoline-powered passenger vehicle (accessed August 2025). The data and claims have not been verified by an independent third party.
Consumer search behavior is shifting, with users now entering longer, more complex questions into search bars in pursuit of more relevant results. For instance, instead of a simple “best kids snacks,” queries have evolved to “What are some nutritious snack options for a 7-year-old’s birthday party?”
However, many digital platforms have yet to adapt to this new era of discovery, leaving shoppers frustrated as they sift through extensive catalogs and manually apply filters. This results in quick abandonment and lost transactions, including an estimated annual global loss of $2 trillion.
We are excited to announce the general availability of Google Cloud’s Conversational Commerce agent, designed to engage shoppers in natural, human-like conversations that guide them from initial intent to a completed purchase. Companies like Albertsons Cos., which was a marquee collaborator on this product and is using the Conversational Commerce agent within its Ask AI tool, are already seeing an impact. Early results show customers using Ask AI often add one or more additional items to their cart, uncovering products they might not have found otherwise.
You can access Conversational Commerce agent today in the Vertex AI console.
Shoppers can ask complex questions in their own words and find exactly what they’re looking for through back-and-forth conversation that drives them to purchase.
Introducing the next generation of retail experiences
Go beyond traditional keyword search to deliver a personalized, streamlined shopping experience that drives revenue. Conversational Commerce agent integrates easily into your website and applications, guiding customers from discovery to purchase.
Conversational Commerce agent turns e-commerce challenges into opportunities through a more intuitive shopping experience:
Turn your search into a sales associate: Unlike generic chatbots, our agent is built to sell. Its intelligent intent classifier understands how your customers are shopping and tailors their experience. Just browsing? Guide them with personalized, conversational search that inspires them to find—and buy—items they wouldn’t have found otherwise. Know exactly what they want? The agent defaults to traditional search results for simple queries.
Drive revenue with natural conversation: Our agent leverages the power of Gemini to understand complex and ambiguous requests, suggest relevant products from your catalog, answer questions on product details, and even provide helpful details such as store hours.
Re-engage returning shoppers: The agent retains context across site interactions and devices. This allows returning customers to pick up exactly where they left off, creating a simplified journey that reduces friction and guides them back to their cart.
Safety and responsibility built-in: You have complete control to boost, bury, or restrict products and categories from conversations. There are also safety controls in place, ensuring all interactions are helpful and brand-appropriate.
Coming soon: Unlock new methods of discovery for your customers. Shoppers can soon search with images and video, locate in-store products, find store hours, and connect with customer support.
Albertsons Cos. is leading the way in AI-powered product discovery
Albertsons Cos. is redefining how customers discover, plan, and shop for groceries with Conversational Commerce agent. When Albertsons Cos. customers interacted with the Ask AI platform, more than 85% of conversations started with open-ended or exploratory questions, demonstrating the need for personalized guidance.
“At Albertsons Cos., we are focused on giving our customers the best experience possible for when and how they choose to shop,” said Jill Pavlovich, SVP, Digital Customer Experience for Albertsons Cos. “By collaborating with Google Cloud to bring Conversational Commerce agent to market, we are delivering a more personalized interaction to help make our customers’ lives easier. Now they can digitally shop across aisles, plan quick meal ideas, discover new products, and even get recommendations for unexpected items that pair well together.”
The Ask AI tool is accessible now via the search bar in all Albertsons Cos. banner apps, to help customers build smarter, faster baskets through simplified product discovery, personalized recommendations and a more intuitive shopping experience.
Get started
Conversational Commerce agent guides customers to purchase, is optimized for revenue-per-visitor, and is available 24/7. Built on Vertex AI, onboarding is quick and easy, requiring minimal development effort.
Amazon Bedrock AgentCore Gateway now supports AWS PrivateLink invocation and invocation logging through Amazon CloudWatch, Amazon S3 and Amazon Data Firehose. Amazon Bedrock AgentCore Gateway provides an easy and secure way for developers to build, deploy, discover, and connect to agent tools at scale. With the PrivateLink support and invocation logging, you can apply network and governance requirements to agents and tools through AgentCore Gateway.
The AWS PrivateLink support allows users and agents in a virtual private cloud (VPC) to access AgentCore Gateway without going through the public internet. With invocation logging, you gain visibility into each invocation and can dive deep into issues or audit activity.
Amazon Bedrock AgentCore is currently in preview and is available in US East (N. Virginia), US West (Oregon), Asia Pacific (Sydney), and Europe (Frankfurt). Learn more about the features in the AWS documentation. Learn more about Amazon Bedrock AgentCore and its services in the News Blog.