When it comes to large-scale data analysis, BigQuery is a powerhouse, with fundamental aggregate functions like SUM, AVG, and COUNT allowing you to extract meaningful insights for all types of workloads. And today, we’re excited to take your data analytics to the next level with a suite of advanced aggregation features that unlock more complex and sophisticated use cases. These aggregate functions fall into three categories:
Group-by extensions (GROUP BY GROUPING SETS/CUBE, GROUP BY STRUCT and ARRAY, GROUP BY ALL)
User-defined aggregate functions (UDAFs)
Approximate aggregate functions (KLL quantile functions and Apache DataSketches)
We built these functions because they were top feature requests from our customer council group. Here is what The New York Times had to say:
“I just want to say thank you to BigQuery’s team who launched GROUP BY ROLLUP. We had a daily query taking more than 2 hours to run that now takes 10 minutes using this instead, and many other teams that want to use it now because of it. We saw slot consumption drop by about 96%!” – Edward Podojil, consultant, The New York Times
Let’s take a deeper look at these new aggregate functions, and how to use them in your data analytics workflows.
Group-by extensions
Grouping sets provide flexibility for calculating aggregations across multiple dimensions in a single statement, without having to use UNION ALL. GROUP BY STRUCT and GROUP BY ARRAY let you group directly by these commonly used BigQuery data types. GROUP BY ALL lets you group by all non-aggregated columns in the SELECT statement without having to list each column twice.
GROUP BY GROUPING SETS, CUBE (GA)
Users often need to slice and dice their data across multiple dimensions. Previously, you had to rely on repeated UNION ALL or CROSS JOIN operations to group your data, which can make queries cumbersome and hard to understand. GROUP BY GROUPING SETS allows you to group by different dimensions in a single statement. For example, the following query gives you the sum of Amount grouped by different combinations:
Date: total sales per day
Region: total sales per region
Product: total sales per product
Overall: total sales across all rows (the empty grouping set ())
SELECT Date, Region, Product, SUM(Amount) AS Total_Amount
FROM sales
GROUP BY GROUPING SETS ( (Date), (Region), (Product), () )
ORDER BY Date, Region, Product;
GROUP BY CUBE(x, y) is a shorthand syntax for GROUP BY GROUPING SETS ((x,y), x, y, ()), so you can GROUP BY all combinations of different dimensions.
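For instance, reusing the sales table from the grouping-sets example above, the following two queries are equivalent:

-- Shorthand: group by every combination of Date and Region
SELECT Date, Region, SUM(Amount) AS Total_Amount
FROM sales
GROUP BY CUBE (Date, Region);

-- Equivalent explicit form
SELECT Date, Region, SUM(Amount) AS Total_Amount
FROM sales
GROUP BY GROUPING SETS ( (Date, Region), (Date), (Region), () );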
GROUP BY STRUCT, ARRAY (GA)
STRUCT and ARRAY are among the most commonly used data types in BigQuery today. Working with STRUCT and ARRAY data in BigQuery just got easier! You can now use GROUP BY and SELECT DISTINCT on these data types directly. This means no more time-consuming workarounds like converting STRUCT/ARRAY values to JSON strings. This simplifies your queries and boosts performance, making complex analysis more efficient (documentation).
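Here's a minimal sketch of what this looks like, assuming a hypothetical orders table with a STRUCT column named customer and an ARRAY<STRING> column named tags:

-- Hypothetical schema: orders(customer STRUCT<name STRING, tier STRING>, tags ARRAY<STRING>, amount NUMERIC)
SELECT customer, tags, SUM(amount) AS total_amount
FROM orders
GROUP BY customer, tags;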
GROUP BY ALL (GA)
GROUP BY ALL deduces the non-aggregated columns from the SELECT clause, eliminating the need to list the same columns twice in SELECT and GROUP BY. It's especially useful in queries with many dimensions and few aggregations, where listing every column twice would be long and tedious. For example, you can group Chicago taxi trips by company, payment_type, and taxi_id with a simple GROUP BY ALL query.
SELECT company, payment_type, taxi_id, SUM(trip_total)
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
GROUP BY ALL
User-defined aggregate functions
User-defined aggregate functions, or UDAFs, let you define custom aggregations once and reuse them across projects and teams. You no longer need to rewrite the same logic again and again to unlock advanced functionality like weighted averages, merging JSON data, or even simulating geospatial functions.
JavaScript UDAF (GA)
JavaScript user-defined aggregate functions (JS UDAFs) let you create custom aggregation logic beyond built-in functions, so you can calculate metrics tailored precisely to your needs. For instance, you can craft UDAFs to compute weighted averages, specialized statistics, or even construct data sketches. Here’s an example of a JavaScript UDAF that simulates the mode() function, returning the most frequent value within a group.
CREATE OR REPLACE AGGREGATE FUNCTION udaf.mode(x INT64)
RETURNS STRUCT<itemSk INT64, itemCount INT64>
LANGUAGE js AS
"""
  export function initialState() {
    return {frequencyMap: new Map()};
  }
  export function aggregate(state, value) {
    var frequencyMap = state.frequencyMap;
    if (frequencyMap.has(value)) {
      frequencyMap.set(value, frequencyMap.get(value) + 1);
    } else {
      frequencyMap.set(value, 1);
    }
  }
  export function merge(state, partial_state) {
    var frequencyMap = state.frequencyMap;
    for (let [key, count] of partial_state.frequencyMap) {
      if (frequencyMap.has(key)) {
        frequencyMap.set(key, frequencyMap.get(key) + count);
      } else {
        frequencyMap.set(key, count);
      }
    }
  }
  export function finalize(state) {
    var maxCount = 0;
    var maxKey = 0;
    for (let [key, count] of state.frequencyMap) {
      if (count > maxCount) {
        maxKey = key;
        maxCount = count;
      }
    }
    return {itemSk: maxKey, itemCount: maxCount};
  }
""";
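As an illustrative usage sketch (assuming the udaf dataset above exists in your project), you could find the most frequent pickup community area in the Chicago taxi data:

-- Returns a STRUCT with the most frequent value (itemSk) and its count (itemCount)
SELECT udaf.mode(pickup_community_area) AS most_frequent_pickup_area
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`;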
SQL UDAF (GA)
SQL user-defined aggregate functions (SQL UDAFs) let you encapsulate complex aggregate expressions into a reusable function, for composability and reuse without writing the same code again and again. For example, you can wrap multiple aggregate function calls into a single UDAF using a struct constructor.
-- The FLOAT64 parameter type and the udaf dataset name are illustrative; adjust to your schema.
CREATE OR REPLACE AGGREGATE FUNCTION udaf.percentiles_struct(column_name FLOAT64)
RETURNS STRUCT<n_samples INT64, percentile_50 FLOAT64, percentile_95 FLOAT64>
AS (
  STRUCT(
    COUNT(column_name) AS n_samples,
    APPROX_QUANTILES(column_name, 100)[OFFSET(50)] AS percentile_50,
    APPROX_QUANTILES(column_name, 100)[OFFSET(95)] AS percentile_95
  )
);
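A hypothetical call over the Chicago taxi data might then look like this, returning the sample count and approximate percentiles of trip duration in a single expression:

SELECT udaf.percentiles_struct(CAST(trip_seconds AS FLOAT64)) AS trip_seconds_stats
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`;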
Approximate aggregate functions
Companies across ad tech, retail, and fintech often have massive amounts of multi-dimensional data. However, extracting insights from millions or billions of rows of user-behavior data can be expensive and time-consuming. Many companies are willing to accept approximate aggregated results, within defined error bounds, in exchange for faster responses. Sketches enable approximate estimates of distinct counts, quantiles, histograms, and other statistical measures, all with minimal memory and computational overhead, and with a single pass through the data at scale.
KLL quantile functions (Preview)
BigQuery supports quantile calculations using native KLL quantile functions. For example, you could estimate the median trip duration and trip distance for all taxi rides on a given day or month.
First, create daily KLL quantile sketches over trip_seconds and trip_miles from the `chicago_taxi_trips` data table.
CREATE TABLE sketch_table AS
SELECT
  DATE(trip_start_timestamp, "UTC") AS day,
  KLL_QUANTILES.INIT_INT64(trip_seconds) AS trip_seconds_sketch,
  KLL_QUANTILES.INIT_FLOAT64(trip_miles) AS trip_miles_sketch,
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
GROUP BY day;
Then, with EXTRACT_POINT_INT64/FLOAT64 functions, you can get the median trip_seconds and trip_miles for all trips on a specific day.
SELECT
  KLL_QUANTILES.EXTRACT_POINT_INT64(trip_seconds_sketch, 0.5) AS trip_seconds_median,
  KLL_QUANTILES.EXTRACT_POINT_FLOAT64(trip_miles_sketch, 0.5) AS trip_miles_median
FROM sketch_table
WHERE day = '2023-12-10';
You can also leverage the MERGE_POINT_INT64/FLOAT64 functions to estimate the median trip_seconds and trip_miles for all trips in a month. The query first merges the daily sketches, then calculates the quantile values over the month.
SELECT
  KLL_QUANTILES.MERGE_POINT_INT64(trip_seconds_sketch, 0.5) AS trip_seconds_median,
  KLL_QUANTILES.MERGE_POINT_FLOAT64(trip_miles_sketch, 0.5) AS trip_miles_median
FROM sketch_table
WHERE day >= '2023-12-01' AND day <= '2023-12-31';
Apache DataSketches (GA)
Besides native support for sketches, BigQuery also supports the open-source Apache DataSketches library, a high-performance collection of stochastic streaming algorithms initially developed at Yahoo. You can use these sketches directly through the public bigquery-utils UDF repo, powered by JS UDAFs. Here are a few examples (see more details in this blog post):
Theta Sketch: designed for cardinality estimation and set operations (union, intersection, difference)
Tuple Sketch: an extension of the Theta Sketch that supports associating values with the estimated unique items
The following examples use Theta Sketch over a public dataset, Chicago taxi trips, to demonstrate the power of the UDAF sketches library.
First, create a daily Theta Sketch to estimate the unique number of taxis running per day, using theta_sketch_agg_string.
CREATE OR REPLACE TEMP TABLE sketch_table AS
SELECT
  DATE(trip_start_timestamp, "UTC") AS day,
  bqutil.datasketches.theta_sketch_agg_string(taxi_id) AS taxi_id_sketch,
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
GROUP BY day;
Next, you can estimate the number of unique taxis that ran on both of two days (December 10 and 11, 2023, in the query below), using theta_sketch_intersection.
SELECT bqutil.datasketches.theta_sketch_get_estimate(
  bqutil.datasketches.theta_sketch_intersection(
    (SELECT taxi_id_sketch FROM sketch_table WHERE day = "2023-12-10"),
    (SELECT taxi_id_sketch FROM sketch_table WHERE day = "2023-12-11")
  )
);
You can also estimate the number of unique taxis that ran on either of these two days, using theta_sketch_union.
SELECT bqutil.datasketches.theta_sketch_get_estimate(
  bqutil.datasketches.theta_sketch_union(
    (SELECT taxi_id_sketch FROM sketch_table WHERE day = "2023-12-10"),
    (SELECT taxi_id_sketch FROM sketch_table WHERE day = "2023-12-11")
  )
);
Do more with advanced aggregation
Using BigQuery's advanced aggregation capabilities allows you to perform more complex analysis efficiently. Grouping sets provide flexibility in calculating aggregations across multiple dimensions. UDAFs empower you to define custom aggregations. Approximate aggregate functions provide speed, scalability, and performance when approximate results are acceptable. By leveraging these features, you can unlock deeper insights from your data and help your team make timely business decisions. We can't wait to hear about your use cases and how you plan to use these functions in your day-to-day analysis!
Accurate time-series forecasting is essential for many business scenarios such as planning, supply chain management, and resource allocation. BigQuery now embeds TimesFM, a state-of-the-art pre-trained model from Google Research, enabling powerful forecasting via the simple AI.FORECAST function.
Time-series analysis is used across a wide range of fields including retail, healthcare, finance, manufacturing, and the sciences. Through the use of forecasting algorithms, users can have a more thorough understanding of their data including the recognition of trends, seasonal variations, cyclical patterns, and stationarity.
BigQuery already natively supports the well-known ARIMA_PLUS and ARIMA_PLUS_XREG models for time-series analysis. More recently, building on the rapid progress and success of large pre-trained models, the Google Research team developed TimesFM, a foundation model specifically for the time-series domain.
The Time Series foundation model
TimesFM is a forecasting model pre-trained on a large time-series corpus of 400 billion real-world time points. A big advantage of this model is its ability to perform "zero-shot" forecasting, meaning it can make accurate predictions on unseen datasets without any additional training. Architecturally, TimesFM is a decoder-only transformer model that outputs batches of contiguous time-point segments at a time. The model has been evaluated on the GIFT-Eval benchmark and the Monash public datasets, which cover a variety of public benchmarks from different domains and granularities. While ARIMA_PLUS offers customizability and explainability, TimesFM is easy to use, generalizes well across many business domains, and often beats custom-trained statistical and deep learning models.
How BigQuery supports TimesFM
The latest TimesFM 2.0 is now a native model in BigQuery. The 500-million-parameter model runs inference directly on BigQuery infrastructure, so there are no models to train, endpoints to manage, connections to set up, or quotas to adjust. TimesFM in BigQuery is also fast and scalable: you can forecast millions of univariate time series in a few minutes with a single SQL query.
Examples of the new AI.FORECAST function
To demonstrate, consider a use case that relies on the public bigquery-public-data.san_francisco_bikeshare.bikeshare_trips table. This dataset contains information about individual bicycle trips taken using the San Francisco Bay Area’s bike-share program.
Example 1: Single time series
The following query aggregates the total number of bike-share trips on each day and forecasts the number of trips for the next 10 days (the default horizon).
SELECT *
FROM
  AI.FORECAST(
    (
      SELECT TIMESTAMP_TRUNC(start_date, DAY) AS trip_date, COUNT(*) AS num_trips
      FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
      GROUP BY 1
    ),
    timestamp_col => 'trip_date',
    data_col => 'num_trips');
The results look similar to:
The output includes the forecast timestamps and values in the forecast_timestamp and forecast_value columns. The confidence_level defaults to 0.95. The prediction_interval_lower_bound and prediction_interval_upper_bound columns show the bounds for each forecasted point.
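AI.FORECAST also accepts a confidence_level argument (check the AI.FORECAST documentation for the exact signature); assuming that argument, a minimal sketch requesting 90% prediction intervals would look like:

SELECT *
FROM
  AI.FORECAST(
    (
      SELECT TIMESTAMP_TRUNC(start_date, DAY) AS trip_date, COUNT(*) AS num_trips
      FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
      GROUP BY 1
    ),
    timestamp_col => 'trip_date',
    data_col => 'num_trips',
    confidence_level => 0.9);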
Example 2: Multiple time series
The AI.FORECAST function also lets you forecast multiple time series at a time as shown in the following example. The following query forecasts the number of bike share trips per subscriber type and per hour for the next month (approximately 720 hours), based on the previous four months of historical data.
SELECT *
FROM
  AI.FORECAST(
    (
      SELECT
        TIMESTAMP_TRUNC(start_date, HOUR) AS trip_hour,
        subscriber_type,
        COUNT(*) AS num_trips
      FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
      WHERE start_date >= TIMESTAMP('2018-01-01')
      GROUP BY 1, 2
    ),
    horizon => 720,
    timestamp_col => 'trip_hour',
    data_col => 'num_trips',
    id_cols => ['subscriber_type']);
The results look similar to the following:
In addition to the columns used by the single time series, there's the time series identifier column, which we defined earlier as subscriber_type.
Visualize the results
You can merge the historical data and forecasted data and visualize the results. The following graph shows the ‘Subscriber’ time series with the lower and upper bounds of its prediction interval:
You can see the detailed queries we used to generate this in the tutorial.
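As a rough sketch of the idea, assuming the single-series query from Example 1 and the output columns described above, you could union the history and the forecast into one result set for charting:

WITH history AS (
  SELECT TIMESTAMP_TRUNC(start_date, DAY) AS trip_date, COUNT(*) AS num_trips
  FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
  GROUP BY 1
)
SELECT
  trip_date AS ts,
  CAST(num_trips AS FLOAT64) AS value,
  'history' AS series,
  CAST(NULL AS FLOAT64) AS lower_bound,
  CAST(NULL AS FLOAT64) AS upper_bound
FROM history
UNION ALL
SELECT
  forecast_timestamp,
  forecast_value,
  'forecast',
  prediction_interval_lower_bound,
  prediction_interval_upper_bound
FROM AI.FORECAST(
  (SELECT * FROM history),
  timestamp_col => 'trip_date',
  data_col => 'num_trips');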
When to use TimesFM vs ARIMA_PLUS
For quick, out-of-the-box forecasts, establishing baselines, or identifying general trends with minimal setup, use TimesFM. If you need to model specific patterns, fine-tune forecasts for seasonality or holidays, incorporate external regressors (multivariate forecasting with ARIMA_PLUS_XREG), obtain explainable results, or leverage a longer historical context, ARIMA_PLUS is the more suitable choice.
Take the next step
The TimesFM 2.0 model is now available in BigQuery in preview. For more details, please see the tutorial and the documentation.
Today, we’re announcing Sol, a new transatlantic subsea cable system to connect the U.S., Bermuda, the Azores, and Spain. Sol translates to “sun” in Spanish and Portuguese, a nod to the cable system’s landing points in warmer climates. Alongside the Nuvem subsea cable, Sol completes our unique investment in transatlantic resiliency — with the two systems interconnecting terrestrially in the U.S. and in Iberia, as well as in Bermuda and the Azores.
The Sol cable will be manufactured in the U.S. and, once operational, bolster capacity and reliability for our growing network of 42 Google Cloud regions around the world, helping meet growing customer demand for Google Cloud and AI services across the U.S., Europe, and beyond.
Palm Coast, Florida, will anchor the Sol cable in the U.S. We’ll partner with DC BLOX to land the cable and establish a new connectivity hub in the Sunshine State. We’ll also develop a terrestrial route linking Palm Coast to our South Carolina cloud region.
When it’s complete, Sol will be the only in-service fiber-optic cable between Florida and Europe. In Spain, we’ll partner with Telxius to provide the necessary infrastructure to land the Sol cable in Santander, further integrating the Google Cloud region in Madrid into our global network.
“Florida is once again at the forefront of strengthening our nation’s digital infrastructure. The anchoring of the Sol subsea cable in our state is a testament to our state’s commitment to fostering an environment that is ripe for technological investment and innovation. We are proud to partner with companies like Google and DC BLOX on strategic infrastructure projects that will support the future of AI innovation and bring tangible benefits to our communities and the entire Sunshine State.” – J. Alex Kelly, Florida Secretary of Commerce
“This is a landmark moment for Palm Coast, Flagler Beach, and Flagler County, and it’s a clear signal that we are a community of the future, investing in our economic development and vitality. The Sol subsea cable is more than just infrastructure; it’s a gateway to unprecedented global connectivity that will attract further high-caliber industries that our residents deserve. We are not just putting Palm Coast and our community on the map; we are building a direct route to the world’s digital economy, ensuring a prosperous and dynamic future for our community.”– Vice Mayor Theresa Carli Pontieri, Palm Coast, Florida
“DC BLOX is proud and honored to expand the foundational digital infrastructure that is vital to Florida’s growing economy. Supporting Google’s Sol cable, along with the capacity for additional cables, the new Palm Coast Cable Landing Station campus enhances Florida’s position as a nexus for global communications.” – Chris Gatch, Chief Revenue Officer, DC BLOX
“Digital connectivity is today a strategic factor for the economic and social development of any city. At Santander City Council, we are committed to facilitating this type of infrastructure because its implementation attracts investment, talent, and technology companies, in addition to generating employment and positioning us as a city prepared for remote work, advanced services, and the Smart City model. That’s why we appreciate the interest that Google has shown in our coast and for choosing Santander as a transatlantic connection point!” – City Council Mayor Gema Igual, Santander, Spain
“Telxius is delighted to support Google in bringing the Sol cable to Spain, a key milestone that will boost transatlantic connectivity with enhanced capacity, reliability, and resilience. Through our submarine landing services, we accelerate project implementation and help expand global networks for our customers.” – Mario Martín, CEO, Telxius
“The Government of the Azores welcomes this important investment in the region with great enthusiasm and congratulates Google on its visionary, pioneering, and strategic approach, which fully aligns with our own vision: that the Azores are an important and strategic hub for digital connectivity in Europe and the North Atlantic.” – Artur Lima, Vice President, Government of the Azores
“Promoting the expansion and modernization of submarine cable infrastructure, with a view to strengthening the country’s international connectivity and digital competitiveness, is an essential goal for the government to the development of our country. The project for Google’s new “Sol” cable, with 16 fiber optic cable pairs and the potential to provide redundancy to the “Nuvem” cable, also from that company, will increase resilience and respond to the growing demand for digital infrastructure in Portugal and Europe.”– Miguel Pinto Luz, Minister of Infrastructure and Housing, Portugal
“Bermuda welcomes Google’s continued investment with the announcement of a second cable ‘Sol,’ further establishing Bermuda as a digital hub in the Atlantic. Bermuda greatly values its positive relationship with Google. This second cable highlights the benefits of this Government’s strategy to develop Bermuda as a progressive world leading jurisdiction for technology companies.” – Alexa Lightbourne, Minister of Home Affairs, Bermuda
Sol will add capacity, increase reliability, and decrease latency for Google users and Google Cloud customers around the world. Alongside cable systems like Nuvem, Firmina, Equiano, and Grace Hopper, Sol further establishes key locations across the Atlantic as connectivity hubs, strengthening local economies and bringing AI’s benefits to people and businesses around the world.
There’s a buzz of excitement here at Tobacco Dock as we welcome our customers and partners to the Google Cloud Summit London. Together, we’re exploring the essential role Google Cloud is playing in driving AI innovation and adoption across the UK. Today is about shining a spotlight on our customers and partners, focusing on the real-world stories of how they are accelerating innovation, realising value, and reshaping entire industries with the powerful combination of AI agents and robust cloud infrastructure.
The opportunity ahead is immense. AI is poised to inject more than £400 billion into the UK economy by 2030, and this year’s summit reaffirms our long-term commitment to helping the UK realise its potential and shape its AI future.
Investing in Britain’s digital future
To lead in the AI era, a nation needs a strong digital foundation. That’s why we’re proud to announce a series of strategic commitments to the UK’s digital future.
First, we are launching a new partnership with the UK Government to help modernise public services. This collaboration marks a significant shift, exploring the use of advanced tech and AI to move beyond decades-old legacy contracts that can leave essential services vulnerable. It supports the UK Government’s “blueprint for a modern digital government” by aiming to streamline services, reduce costs, and lay the digital foundations for AI success.
Equally important, we’re investing in a world-class, AI-ready workforce. Today, we are announcing a landmark proposal to provide free cloud and AI skills training for up to 100,000 UK public sector workers. This initiative is designed to accelerate the digital transformation of public services and empower the workforce with the talent to lead in the AI era.
These commitments are backed by our continued investment in critical infrastructure. We are close to reaching a major milestone with our data center in Waltham Cross, which will be fully operational by the end of the year. Delivering on our promise, this facility will provide British businesses with the high-performance, low-latency cloud they need to compete globally.
From AI vision to real-world value
Agentic AI is emerging as a major force in enterprise technology, representing a foundational shift in how work gets done. We’re helping British businesses go beyond experimentation and embrace a new era of intelligent agents. With advanced models like Gemini on our Vertex AI platform, we’re giving businesses the toolkits to drive real results and realise value faster.
At the Summit, we’re showcasing how leading UK companies are already leveraging these capabilities to drive impact:
Imperial War Museums is the first UK museum to use AI at scale to digitise and transcribe oral histories. Implemented by our partner Capgemini, the Gemini-engineered system will make more than 20,000 histories from 1945 to the early 2000s searchable and accessible.
LUSH is taking advantage of Vertex AI and Cloud Storage to reduce packaging waste through AI-powered product recognition at checkout in its beauty shops.
Morrisons’ new in-app Product Finder—built with BigQuery and Gemini—helps customers quickly locate products in its grocery stores. It was used 50,000 times a day during the most recent Easter period, proving its immense practical value.
Starling Bank launched “Spending Intelligence,” an industry-leading AI tool powered by Gemini models. Customers can now understand their spending habits simply by asking natural language questions and getting instant analysis.
Toolstation now offers customers highly accurate product search results online and in-store, powered by Vertex AI Search for Commerce, which is boosting customer satisfaction and sales.
Extending our impact through a trusted ecosystem
Collaboration is fundamental to progress. The achievements we celebrate today wouldn’t be possible without our incredible network of partners. Our work with Capgemini is helping British-born companies like Additive Catchments, a majority non-profit-owned company, to deploy AI-powered water stewardship solutions. Leveraging Google Cloud, they provide real-time insights to improve water quality and safeguard ecosystems across the UK—a perfect example of technology for good.
This kind of innovation isn’t happening in isolation. Since 2023, more than 60% of UK generative AI startups have chosen Google Cloud due to our deep commitment to the ecosystem. From our ‘Gemini for UK’ initiative, which offers up to £280,000 in cloud credits and access to expert mentorship and training, to our recent partnership with Tech London Advocates, our aim is to help founders scale faster.
We’ve also secured spaces for top UK AI startups on the Europe-wide Google for Startups accelerator. These include Martee ai, which is on a mission to improve grab-and-go food for the benefit of people and the planet, and Dyad AI, which is helping healthcare providers improve care quality and automate administrative workflows. They will join 15 top AI startups from across Europe in receiving expert help from Google so they can take their startups to the next level.
Setting a secure foundation for a bold future
A big part of our mission is to be the most trusted cloud partner for businesses, exemplified through our leading security, data sovereignty, and responsible AI practices. At the summit, Lloyds Banking Group and Vodafone are sharing how they are using Google Cloud’s security tools to help protect their operations and innovate securely for millions of customers.
This deep commitment to security is why Google Cloud announced plans earlier this year to acquire Wiz. By integrating Wiz’s best-in-class, multi-cloud visibility with Mandiant’s unmatched threat intelligence, we will offer UK businesses proactive, comprehensive protection across their entire cloud estate.
For UK organizations, including the public sector, we’re also building trust in generative AI by ensuring data sovereignty and compliance, so they can use this technology while keeping their data completely secure. Today we are happy to share that Google Cloud is strengthening its UK data residency commitment, empowering organizations with the choice to keep Gemini 2.5 Flash processing entirely within the UK. This expansion opens up even more use cases for businesses operating here.
This combination of world-class security and a commitment to openness is the foundation upon which UK organizations can innovate with confidence. Our leadership in open-source technologies like Kubernetes enables them to build and run applications anywhere, knowing their data is protected by world-class security. Whether it’s a bank securing its services, a retailer reimagining the customer experience, or a startup building the next big thing, our goal is to provide the tools and the trust to build boldly.
With that mission in mind, UK businesses have a real opportunity to lead on the global stage. Together, with our customers, partners, and the public sector, we’re building much more than a cloud platform; we’re helping UK organisations not only stay ahead but to move boldly into the future and shape the AI era.
Today, we’re making it even easier to achieve breakthrough performance for your AI/ML workloads: Google Cloud Managed Lustre is now GA, available in four performance tiers that deliver 125 MB/s, 250 MB/s, 500 MB/s, or 1,000 MB/s of throughput per TiB of capacity, with the ability to scale up to 8 PB of storage. Managed Lustre is powered by DDN’s EXAScaler, combining DDN’s decades of leadership in high-performance storage with Google Cloud’s expertise in cloud infrastructure.
Managed Lustre provides a POSIX-compliant, parallel file system that delivers consistently high throughput and low latency, essential for:
High-throughput inference: For applications that require near-real-time inference on large datasets, Lustre provides high parallel throughput and sub-millisecond read latency.
Large-scale model training: Accelerate the training cycles of deep learning models by providing rapid access to petabyte-sized datasets. Lustre’s parallel architecture ensures GPUs and TPUs are continuously fed with data, minimizing idle time.
Checkpointing and restarting large models: Save and restore the state of large models during training faster, improving goodput and allowing for more efficient experimentation.
Data preprocessing and feature engineering: Process raw data, extract features, and prepare datasets for training, reducing the time spent on data pipelines.
Scientific simulations and research: Beyond AI/ML, Lustre excels in traditional HPC scenarios like computational fluid dynamics, genomic sequencing, and climate modeling, where massive datasets and high-concurrency access are critical.
Lustre is designed for the highly parallel and random I/O that characterizes many AI/ML training and inference tasks. This parallel processing capability across multiple clients ensures your compute resources are never starved for data.
Performance tiers and pricing
Managed Lustre offers flexible pricing and performance tiers designed to meet the diverse needs of your workloads, whether you’re focused on capacity or highest throughput density.
Irrespective of the aggregate throughput, all tiers come with sub-millisecond read latency, high single-stream throughput, and are perfect for parallel access to many small files.
Driving innovation together: partnering with DDN
Google Cloud’s Managed Lustre is powered by DDN’s EXAScaler, bringing together two industry leaders in high-performance computing and elastic cloud infrastructure. This partnership represents a joint commitment to simplifying the deployment and management of large-scale AI and HPC workloads in the cloud, thanks to:
Trusted leaders: By combining DDN’s decades of expertise in high-performance Lustre with Google Cloud’s global infrastructure and AI ecosystem, we are delivering a foundational capability that removes storage bottlenecks and helps our customers solve their most complex challenges in AI and HPC.
Fully managed and supported solution: Enjoy the benefits of a fully managed service from Google, with comprehensive support from both Google and DDN, for seamless operations and peace of mind.
Global availability and ecosystem integration: Managed Lustre is now globally accessible in multiple Google Cloud regions and integrates with the broader Google Cloud ecosystem, including Google Kubernetes Engine (GKE) and TPUs.
These benefits caught the attention of one of our largest partners, NVIDIA, which looks forward to making Managed Lustre part of its NVIDIA AI platform.
“Enterprises today demand AI infrastructure that combines accelerated computing with high-performance storage solutions to deliver uncompromising speed, seamless scalability and cost efficiency at scale. Google and DDN’s collaboration on Google Cloud Managed Lustre creates a better-together solution uniquely suited to meet these needs. By integrating DDN’s enterprise-grade data platforms and Google’s global cloud capabilities, organizations can readily access vast amounts of data and unlock the full potential of AI with the NVIDIA AI platform (or NVIDIA accelerated computing platform) on Google Cloud — reducing time-to-insight, maximizing GPU utilization, and lowering total cost of ownership.” – Dave Salvator, Director of Accelerated Computing Products, NVIDIA
Get started today!
Ready to supercharge your AI/ML and HPC workloads? Getting started with Managed Lustre is simple:
Don’t miss the opportunity to learn more about the strategic partnership between Google Cloud and DDN, and the unique capabilities of Managed Lustre. Read the official DDN press release here.
Watch the fireside chat with Sameet Agarwal, VP/GM Storage and Sven Oehme, CTO of DDN, here.
Today, we are thrilled to announce the expansion of the Z3 Storage Optimized VM family with the general availability of nine new Z3 virtual machines that offer Local SSD capacity ranging from 3 TiB to 18 TiB per VM, complementing the existing Z3 VMs, which offer 36 TiB of Local SSD per VM. We are also very pleased to launch a Z3 bare metal instance, which includes up to 72 TiB of Local SSD. Z3 VMs enable customers like Shopify, Tenderly, and ScyllaDB to achieve impressive performance improvements for their high-performance storage workloads, reducing I/O access latency by up to 35% compared to VM instances using previous-generation Local SSDs.
Z3 VMs are designed to run I/O-intensive workloads that require large local storage capacity and high storage performance, including SQL, NoSQL, and vector databases, data analytics, semantic data search and retrieval, and distributed file systems. The Z3 bare metal instance provides direct access to the physical server CPUs and is ideal for workloads that require low-level system access like private and hybrid cloud platforms, custom hypervisors, container platforms, or applications with specialized performance or licensing needs.
Both Z3 VMs and the bare metal instance are based on Titanium SSDs, which offload local storage processing from the CPU to deliver real-time data processing, low-latency and high-throughput storage performance, and enhanced storage security. Z3 VMs with Titanium SSD offer up to 36 GiB/s of read throughput and up to 9M IOPS, increasing write storage performance by up to 25% compared to previous-generation Local SSDs.1
Based on the 4th Gen Intel Xeon Scalable processor, Z3 VMs come with up to 176 vCPUs, 1,408 GiB of memory, and 36 TiB of local storage across 11 virtual machine shapes. The Z3 bare metal instance offers 192 vCPUs, 1,536 GiB of memory, and 72 TiB of local storage. Z3 VMs and the bare metal instance deliver the connectivity and storage performance that enterprise workloads need, with up to 100 Gbps of standard bandwidth and up to 200 Gbps with Tier1 networking for high-traffic applications.
The expanded Z3 virtual machine portfolio lets you rightsize your infrastructure and scale your clusters to meet workload requirements by providing larger total local SSD capacity and higher local SSD capacity per vCPU. Z3 offers two VM types: standardlssd and highlssd. The standardlssd VM types include five VM shapes that offer about 200 GiB of local SSD per vCPU; they are optimized for data analytics (OLAP) and SQL databases such as MySQL and Postgres. The highlssd VM types include six VM shapes plus the Z3 bare metal instance; they offer about 400 GiB of local SSD per vCPU and are optimized for distributed databases, data streaming, large parallel file systems, and data search.
What our customers and partners are saying
“We are thrilled to announce Nutanix Cloud Clusters coming to Google Cloud at the end of CY25 as part of Nutanix’s commitment to delivering flexible, hybrid cloud solutions. Google Cloud’s Z3 instance types represent a perfect foundation for Nutanix to enable performance and resilience for enterprise applications. We’re excited about our partnership with Google Cloud in empowering our joint customers with greater choice and simplicity in their cloud journey.” – Saveen Pakala, Vice President of Product Management, Nutanix
“OP Labs contributes to the Optimism protocol, which enables orders of magnitude of improved performance and scalability for Ethereum. Z3 reduces p99 block insertion tail latencies by 30-50% for our most I/O-demanding blockchain nodes compared to N2. By migrating our solution to Z3, we will be able to scale our blockchain nodes to handle L2 state growth in a more performant and cost-effective way.” – Zach Howard Senior Staff Engineer, OP Labs
“The launch of Google Cloud’s Z3 storage optimized instances with smaller VM shapes represents a leap forward in performance for high-traffic NoSQL environments. In internal tests and customer projects, ScyllaDB has impressively leveraged the advantages of Z3 including extremely low latencies under high read and write loads, high IOPS capacity enabling the processing of massive amounts of data and excellent cost-performance ratio for large-scale production systems. We are very excited to offer Z3 family servers in ScyllaDB Cloud, including Bring Your Own Account (BYOA).” – Avi Kivity, Co-founder and CTO, ScyllaDB
“Shopify has found Z3s to be an excellent platform to build our most performance sensitive storage systems on. We experienced a critical need for both large data volumes while remaining sensitive to latency and throughput on the storage side. While Google has a lot of options, local SSD was really the best fit, and Z3s allowed us to achieve the best price/performance along with enhanced stability appropriate for a source of truth Storage workload. Right now we see these storage optimized VMs as our platform of choice for the future.” – Mattie Toia, VP Infrastructure, Shopify
“Tenderly is built to be your go-to for Web3 production and development, bringing all the necessary infrastructure into one place. This allows teams to operate with speed and confidence, making blockchain technology accessible. We’ve seen impressive results running blockchain workloads on Z3 instances, with a 40% improvement on read latency compared to N2 and N2D instances.” – Ilija Petrovic, SRE Lead, Tenderly
“The VAST AI Operating System gives organizations a unified platform to reason over all of their data – structured, unstructured, and streaming through a global namespace that spans cloud and on-prem environments – enabling intelligent agents and applications to operate with full context and real-time speed. For customers running on Google Cloud, Z3 VMs complement this vision by providing the ideal storage infrastructure to accelerate these workloads, ensuring AI pipelines run fast and scale effortlessly in the cloud.” – Renen Hallak, Founder & CEO, VAST Data
Z3 VMs are also the physical foundation of AlloyDB, our flagship PostgreSQL-compatible database service, delivering sophisticated multi-level caching. AlloyDB uses Z3’s expansive local SSDs as an ultra-fast cache, holding datasets up to 25x larger than can be stored in memory. Database queries can access these large, cached datasets at latencies that closely approach in-memory performance, particularly when factoring in overall end-to-end application response times. This is a significant advantage for very large databases, including real-time analytical workloads, as AlloyDB’s high-performance columnar engine operates entirely within this massive cache. AlloyDB on Z3 VMs will soon be available in preview, delivering up to 3x better performance than N-series VMs for transactional workloads, particularly for large datasets.
Enhanced maintenance experience
Z3 instances make it easier for you to plan ahead and schedule maintenance operations at a time of your choosing by providing notice from the system several days in advance of a required maintenance. The new Z3 VMs further enhance the maintenance experience by allowing you to live-migrate an instance during maintenance events for VMs with 18 TiB or less of local SSD storage. For Z3 VMs with 36 TiB of local SSD and for Z3 bare metal instances, you’ll also receive in-place upgrades that preserve your data through the planned maintenance events.
Support for Hyperdisk
Z3 VMs support Hyperdisk, Google Cloud’s workload-optimized block storage that lets you optimize the performance for each workload by independently tuning the storage performance and capacity for each instance.
Z3 VMs are compatible with Hyperdisk Balanced, Hyperdisk Throughput, and Hyperdisk Extreme storage for scalable, high-performance network-attached storage, supporting up to 512 TiB of capacity per instance. For general-purpose workloads, Hyperdisk Balanced, with up to 160K IOPS per instance, offers a mix of performance and cost-efficiency. Hyperdisk Extreme delivers ultra-low latency and supports up to 350K IOPS and 5,000 MiB/s of throughput per Z3 VM instance, and up to 500K IOPS and 10,000 MiB/s of throughput for the Z3 bare metal instance, making it well-suited for demanding workloads like databases. Using Hyperdisk for persistent storage and Z3 Local SSD for caching creates an optimal storage architecture for high-end databases and mission-critical workloads.
Get started with Z3 today
Z3 VMs and bare metal instances are available today in most regions worldwide. To start using Z3 instances, select Z3 under the new Storage-Optimized machine family when creating a new VM or GKE node pool in the Google Cloud console. Learn more at the Z3 machine series page. Contact your Google Cloud sales representative for more information on regional availability.
1. Results are based on Google Cloud’s internal benchmarking
Developers are racing to productize agents, but a common limitation is the absence of memory. Without memory, agents treat each interaction as the first, asking repetitive questions and failing to recall user preferences. This lack of contextual awareness makes it difficult for an agent to personalize its assistance, and it leaves developers frustrated.
How we normally mitigate memory problems: So far, a common approach to this problem has been to leverage the LLM’s context window. However, directly inserting entire session dialogues into an LLM’s context window is both expensive and computationally inefficient, leading to higher inference costs and slower response times. Also, as the amount of information fed into an LLM grows, especially with irrelevant or misleading details, the quality of the model’s output significantly declines, leading to issues like “lost in the middle” and “context rot”.
How we can solve it now: Today, we’re excited to announce the public preview of Memory Bank, the newest managed service in Vertex AI Agent Engine, which helps you build highly personalized conversational agents that facilitate more natural, contextual, and continuous engagements. Memory Bank helps address memory problems in four ways:
Personalize interactions: Go beyond generic scripts. Remember user preferences, key events, and past choices to tailor every response.
Maintain continuity: Pick up conversations seamlessly where they left off across multiple sessions, even if days or weeks have passed.
Provide better context: Arm your agent with the necessary background on a user, leading to more relevant, insightful, and helpful responses.
Improve user experience: Eliminate the frustration of repeating information and create more natural, efficient, and engaging conversations.
Here’s how Memory Bank works:
It understands and extracts memories from interactions: Using Gemini models, Memory Bank can analyze a user’s conversation history with the agent (stored in Agent Engine Sessions) to extract key facts, preferences, and context and generate new memories. This happens asynchronously in the background, without you needing to build complex extraction pipelines.
It stores and updates memories intelligently: Key information—like “My preferred temperature is 71 degrees,” or “I prefer aisle seats on flights” — is stored persistently and organized by your defined scope, such as user ID. When new information arises, Memory Bank (using Gemini) can consolidate it with existing memories, resolving contradictions and keeping the memories up to date.
It recalls relevant information: When a user starts a new conversation (session), the agent can retrieve these stored memories. This can be a simple retrieval of all facts or a more advanced similarity search (using embeddings) to find the memories most relevant to the current topic, ensuring the agent is always equipped with the right context.
A diagram illustrating how an AI agent uses conversation history from Agent Engine Sessions to generate and retrieve persistent memories about the user from Memory Bank.
Let’s take an example. Imagine you’re a retailer in the beauty industry. You have a personal beauty companion equipped with memory that recommends products and skincare routines.
As shown in the illustration, the agent is able to remember the user’s skin type (maintaining context) even as it evolves over time, and to make personalized recommendations. This is the power of an agent with long-term memory.
Get started today with Memory Bank
You can integrate Memory Bank into your agent in two primary ways:
Develop an agent with Google Agent Development Kit (ADK) for an out-of-the-box experience
Develop an agent that orchestrates API calls to Memory Bank if you are building your agent with any other framework.
To get started, please refer to the official user guide and the developer blog. For hands-on examples, the Google Cloud Generative AI repository on GitHub offers a variety of sample notebooks, including integration with ADK and deployment to the Agent Engine runtime. For those wishing to try Memory Bank with third-party frameworks, we also provide notebook samples for LangGraph and CrewAI.
If you’re a developer using Agent Development Kit (ADK) but have never used Google Cloud before, you can still start by using our new express mode registration for Agent Engine Sessions and Memory Bank. Here’s how it works:
Sign up with your Gmail account to receive an API key
Use the key to access Agent Engine Sessions and Memory Bank
Build and test your agent within the free tier usage quotas
Seamlessly upgrade to a full Google Cloud project when you are ready for production
If you want to know more about Memory Bank, join the Vertex AI Google Cloud community to share your experiences, ask questions, and collaborate on new projects.
Google has been recognized as a Leader in the 2025 Gartner® Magic Quadrant™ for Search and Product Discovery. We believe this recognition affirms Google’s commitment to delivering AI solutions tuned to unique industry challenges, empowering businesses to transform the digital commerce experience and deliver high-ROI product discovery through the use of AI in relevance, ranking, and personalization.
Built upon Google’s deep expertise and knowledge of the way users search for, interact with, and purchase products across a broad landscape of commerce domains, Vertex AI Search for commerce is a fully managed, AI-first solution tailored for e-commerce.
Redefining product discovery with generative AI
As digital channels continue to increase in traffic and adoption, businesses in e-commerce need to navigate the challenges involved in providing great digital experiences. Customers expect relevance, convenience, and personalization.
Vertex AI Search for commerce uses the best of Google AI to drive personalized product discovery at scale, optimized for revenue per visitor. This enables businesses not only to understand user intent, but to proactively surface the right products and content to shoppers through a variety of multimodal capabilities that leverage the latest advancements in generative AI.
Search and browse: Powering digital commerce sites and applications with Google-quality capabilities highly tuned for e-commerce and revenue maximization.
Conversational search: Enabling real-time, back-and-forth conversation, powered by Gemini models, to guide users through their shopping journeys while optimizing for conversion.
Recommendations: Deliver highly personalized recommendations at scale.
Semantic image search: Users can search by image (“find this blouse”) or by a combination of text and image (“find the shoes that would look great with this blouse”).
In particular, our customers have noted our leadership in conversational search through one of our latest offerings, Conversational Commerce, which helps retailers provide assisted shopping on any digital channel, to engage with customers in a more natural and human-like conversation. This includes helping customers find their desired products online or helping a store associate answer questions using data from multiple sources to increase buying confidence for customers worldwide.
Intuitive search at the speed of intent, tuned for ecommerce
Search is changing as consumers are leveraging longer queries, seeking more comprehensive answers to questions.
Over the past decade, we’ve made significant strides in understanding user intent and refining our proprietary algorithms, now enhanced with gen AI. This has become increasingly important as user search behavior and the shopping journey have evolved, transforming the retail landscape. Unlike traditional models that are not AI-native, Vertex AI Search for commerce nearly eliminates the need for manual overriding to re-rank results and compensate for suboptimal search quality. Our solution is developed to help businesses, from digitally native e-commerce companies to traditional retailers, adapt to this new era of buying and selling by leveraging AI to maximize revenue.
Access to cutting-edge AI/ML models: Vertex AI Search for commerce redefines e-commerce search using powerful AI/ML models like Gemini for unparalleled accuracy and relevance.
Leverage Google’s proprietary intelligence: Vertex AI Search for commerce utilizes Google Shopping’s vast query and click datasets, along with our most advanced knowledge graphs, to train our industry-leading AI relevance models.
Advanced intent detection: Interprets complex queries and nuances in user intent beyond simple keyword matching, focusing on semantic meaning for intuitive results.
AI-based catalog optimization: Retail catalog search results are enhanced by combining Google Shopping’s effective web crawlers with Google’s extensive understanding of web content and our unique AI models.
Deep and scalable personalization: Analyzes individual shopper behavior, preferences, and historical data to deliver tailored product recommendations and search rankings, boosting satisfaction, conversions, and loyalty.
Managed deployment: Offers a fully managed solution for seamless integration and launch, minimizing engineering overhead.
Flexible customization and control: Businesses can customize the search experience with specific business rules and optimization objectives, aligning with unique KPI goals and other unique metrics.
The future of e-commerce is personalized and simplified
We believe being positioned as a Leader in the Gartner® Magic Quadrant™ for Search and Product Discovery underscores Google’s proven ability to deliver real business value. Vertex AI Search for commerce provides a comprehensive, AI-first solution that guides your customers through the buying journey, ensuring they find exactly what they need, every time.
To download the full 2025 Gartner Magic Quadrant for Search and Product Discovery report, click here, and for more information on Vertex AI Search for commerce, see here or register for our upcoming product roadmap webinar.
Gartner, Magic Quadrant for Search and Product Discovery – Mike Lowndes, Noam Dorros, Sandy Shen, Aditya Vasudevan, June 24, 2025
Disclaimer: Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose. This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Google.
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally, and MAGIC QUADRANT is a registered trademark of Gartner, Inc. and/or its affiliates and are used herein with permission. All rights reserved.
For decades, institutions like Caltech have been at the forefront of large-scale artificial intelligence (AI) research. As high-performance computing (HPC) clusters continue to evolve, researchers across disciplines have been increasingly equipped to process massive datasets, run complex simulations, and train generative models with billions of parameters. With this powerful combination, researchers have accelerated scientific discovery across diverse use cases including genomic analysis, drug discovery, weather forecasting, and beyond.
Modern research workloads, driven by AI and HPC, demand processing of structured and unstructured data at an unprecedented scale, while maintaining sub-millisecond storage latency, enterprise-level security and compliance, and reproducibility despite varying hardware and software configurations. This presents significant technical challenges for both researchers and supporting departments.
To accelerate scientific discovery in the AI era, Google Public Sector has announced it will support AI-optimized HPC infrastructure for researchers at Caltech. This new initiative furthers Caltech’s mission to expand human knowledge and benefit society through research integrated with education.
This support provides Caltech researchers with four key resources:
A range of processor types, including Cloud GPUs, Google’s custom-designed Arm-based processors (Axion), and Cloud Tensor Processing Units (TPUs), for AI acceleration and intensive workloads.
A fully-managed, unified AI development platform, Vertex AI Platform, which includes Vertex AI Agent Builder and 200+ first-party (Gemini, Imagen 3), third-party, and open (Gemma) foundation models in Model Garden.
Dedicated campus training and workshops for students, researchers, and supporting teams, enabling them to increase AI literacy and adoption.
Google Public Sector and Caltech will integrate this AI-optimized infrastructure with Caltech’s existing HPC research environments to provide researchers instant access while maintaining their existing data and workloads.
One of the first initiatives that will be powered by this new AI infrastructure will be led by Dr. Babak Hassibi, Mose and Lillian S. Bohn Professor of Electrical Engineering and Computing and Mathematical Sciences at Caltech. Dr. Hassibi’s research focuses on making AI models more efficient, which is crucial for the advancement of the field. Current large language models (LLMs) can have billions or trillions of parameters. In an effort to make models even more efficient and useful, Dr. Hassibi’s proposal uses Vertex AI and has the potential to make AI more accessible and sustainable.
“We will be using Vertex AI to develop training methods on TPUs that incorporate pruning, quantization, and distillation, as well as considerations regarding resilience to attacks, into the training process itself. The former has the potential to significantly reduce inference time costs of the trained models, making AI much more accessible and sustainable. The latter can markedly improve the safety of systems that employ AI. Both will allow AI to move to the edge. In addition to the practical benefits, the work will inform the theoretical studies of AI models, in particular, their generalization performance and the limits of their compressibility, which is a major focus of my research group,” said Dr. Babak Hassibi.
By providing access to advanced AI and planet-scale infrastructure, this support enables Caltech researchers to continue to push scientific boundaries, investigate complex problems, and develop innovative solutions.
“We are living in a time when we need to answer bigger questions faster, and do more with less. Google Public Sector is excited to support Caltech to build an AI-optimized infrastructure that will lead scientific discoveries across all domains for the best of all our constituents,” said Reymund Dumlao, Director of State & Local Government and Education at Google Public Sector.
At Google Public Sector, we’re passionate about applying the latest cloud, AI and security innovations to help you meet your mission. Subscribe to our Google Public Sector Newsletter to stay informed and stay ahead with the latest updates, announcements, events and more.
In the high-speed world of global motorsport, operational efficiency and technological innovation are as critical off the track as they are on it. And when it comes to innovating in the field, Formula E, with its focus on the future of mobility and sustainability, regularly takes the checkered flag.
Formula E orchestrates thrilling races with high-powered, all-electric cars in cities worldwide. A central part of this experience is the management and distribution of immense volumes of media content. Over the course of a decade of high-octane racing, Formula E has accumulated a vast media archive, comprising billions of files and petabytes of data.
This valuable collection of content, previously housed in AWS S3 storage, necessitated a more robust, high-performance, and globally accessible solution to fully harness its potential for remote production and content delivery. These needs became especially acute as AI offers more opportunities to leverage archives and historic data.
At the same time, the organization faced internal operational challenges, including common issues like disconnected systems, which limited collaboration, alongside escalating costs from multiple software subscriptions. Formula E sought a unified solution to foster greater integration and prepare for the future of work.
This led to the widespread adoption of Google Cloud’s ecosystem, including Google Workspace, to address both its large-scale data storage needs and internal collaboration and productivity workflows.
A decade of data and disconnected operations
Formula E’s media archive represented a significant challenge due to its sheer scale. With 10 years of racing history captured, the archive contained billions of files and petabytes of data, making any migration a formidable task.
The demands of its global operations, which include taking production to challenging race locations across the globe, require seamless connectivity and high throughput. The organization had previously encountered difficulties connecting to its S3 storage from remote geographical locations. These experiences prompted successful experiments with Google Cloud, underscoring the critical need for a cloud solution that could deliver consistent performance, even in areas with less reliable internet infrastructure.
Internally, Formula E’s use of disparate systems, including Office 365 and multiple communication tools, resulted in disjointed workflows and limited collaborative capabilities. The reliance on multiple software subscriptions for communication and collaboration was also driving up operational costs. The team needed a more integrated and cost-effective environment that could streamline processes and foster a more collaborative culture, setting the stage for future advancements in workplace technology.
A pitstop in the cloud drives better performance
To address these multifaceted challenges, Formula E embarked on a strategic, two-pronged migration. This involved moving its massive media archive to Google Cloud Storage and transitioning its entire staff to Google Workspace.
For the monumental task of migrating its petabyte-scale archive, Formula E collaborated closely with Google Cloud and Mimir, its media management system provider. Following meticulous planning with Mimir and technical support from Google Cloud, the team chose the Google Cloud Storage Transfer Service (STS).
STS is a fully managed service engineered for rapid data transfer, which crucially required no effort from Formula E’s team during the migration itself. A pivotal element that ensured a smooth transition was Cloud Storage’s comprehensive support for S3 APIs. This compatibility enabled Formula E to switch cloud providers without any interruption to its business, guaranteeing continuity for its critical media operations.
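For readers planning a similar move, the following is a minimal sketch of what creating such a transfer job looks like with the Storage Transfer Service Python client. It is illustrative only, not Formula E’s actual configuration; the project ID, bucket names, and AWS credential handling are hypothetical placeholders.

# Minimal sketch: create and run a one-time S3 -> Cloud Storage transfer job.
# Project ID, bucket names, and credential handling are illustrative placeholders.
from google.cloud import storage_transfer


def create_s3_to_gcs_transfer(project_id: str, s3_bucket: str, gcs_bucket: str,
                              aws_access_key_id: str, aws_secret_access_key: str):
    client = storage_transfer.StorageTransferServiceClient()

    transfer_job = {
        "project_id": project_id,
        "description": "One-time archive migration from S3 to Cloud Storage",
        "status": storage_transfer.TransferJob.Status.ENABLED,
        "transfer_spec": {
            "aws_s3_data_source": {
                "bucket_name": s3_bucket,
                "aws_access_key": {
                    "access_key_id": aws_access_key_id,
                    "secret_access_key": aws_secret_access_key,
                },
            },
            "gcs_data_sink": {"bucket_name": gcs_bucket},
        },
    }

    job = client.create_transfer_job({"transfer_job": transfer_job})
    # Kick the job off immediately rather than waiting for a schedule.
    client.run_transfer_job({"job_name": job.name, "project_id": project_id})
    return job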
To revolutionize its internal operations, Formula E partnered with Cobry, a Google Workspace migration specialist, for a two-phased migration process. The initial phase focused on migrating Google Drive content, which was soon followed by the full migration to Workspace. Cobry not only helped with the technical migration, the company also provided support for change management and training.
To ensure a smooth transition and encourage strong user adoption, Google champions from across the business received in-person training at Google’s offices. The project successfully migrated 220 staff members along with 9 million emails and calendar invites.
The need for speed, collaboration, and savings
The migration of the media archive to Google Cloud proved as successful as a clever hairpin maneuver at the track.
The Storage Transfer Service performed beyond expectations, moving data out of S3 at an impressive rate of 240 gigabits per second — which translates to approximately 100 terabytes per hour. The entire sprawling archive was transferred in less than a day, a level of throughput that impressed even Google Cloud’s internal teams and confirmed that STS could deliver results at major scale. Such efficiency meant that Formula E experienced no business interruption and maintained continuous access to its valuable media assets.
Beyond the rapid migration, Formula E now benefits significantly from Google Cloud’s global network. By leveraging this infrastructure, the organization enjoys lower latency and higher throughput, which are critical for its remote production studios when operating worldwide.
A core technology behind this enhanced performance is Google Cloud’s “Cold Potato Routing.” This strategy keeps data on Google’s private, internal backbone network for as long as possible, using the public internet for only the shortest final leg of the journey. This approach guarantees improved throughput and latency, effectively resolving the connectivity challenges Formula E previously faced in remote race locations.
Google Cloud’s commitment to an open cloud ecosystem, demonstrated by its full support for S3 APIs, was instrumental in facilitating a smooth transition without vendor lock-in.
The transition to Google Workspace, meanwhile, has transformed Formula E’s internal operations, fostering a more integrated and collaborative work environment. Developed with collaboration at its core, Google Workspace has seamlessly integrated Gemini into the team’s workflow. Initially adopted by office-based teams for automating repetitive tasks, formatting, and summarizing data, Gemini is now accessible across the entire business, empowering staff to work more intelligently.
“Moving from Office 365 to Google Workspace has been a great step for our team in productivity,” said Hayley Mann, chief people officer at Formula E. “The enhanced collaboration features have been really beneficial, and we’re excited about how Google’s integrated AI capabilities will empower and enable our people to work smarter and even more efficiently every day.”
The migration is also delivering financial benefits, as Formula E was able to replace its usage of Slack and Zoom with Chat and Meet in Google Workspace. Further savings are anticipated as other existing software contracts expire.
Racing toward an open ecosystem
Looking to the future, Formula E is also positioned to realize significant cost savings by using Cloud Storage’s Autoclass feature, which intelligently manages storage classes based on access patterns to optimize costs.
With its petabyte-scale media archive now residing on Google Cloud, Formula E is well positioned to continue innovating in media management, ensuring its content is high-performance, cost-effective, and globally accessible for years to come. This includes leveraging new AI capabilities emerging across the Google ecosystem.
And by embracing integrated AI through Gemini and fostering a truly collaborative environment with Google Workspace, Formula E is accelerating its journey towards peak operational efficiency — mirroring the innovation it champions on the racetrack.
In 2024, retail sales for consumer packaged goods were worth $7.5 trillion globally. Their sheer variety — from cosmetics to clothing, frozen vegetables to vitamins — is hard to fathom. And distribution channels have multiplied: Think big box stores in the brick-and-mortar world and mega ecommerce sites online. Most importantly, jury-rigged digital tools can no longer keep pace with the ever-growing web of regulations designed to protect consumers and the environment.
SmarterX uses AI to untangle that web. Our Smarter1 AI model aggregates and triangulates publicly available datapoints — hundreds of millions of UPCs and SKUs, as well as product composition and safety information — from across the internet. By matching specific products to applicable regulatory information and continuously updating our models for a particular industry or client, SmarterX helps retailers make fully compliant decisions about selling, shipping, and disposing of regulated consumer packaged goods.
And just like our clients, we needed to accelerate and expand our capabilities to keep pace with that data deluge and build better AI models faster. Migrating to Google Cloud and BigQuery gave us the power, speed, and agility to do so.
Embracing BigQuery: a flexible, easy-to-use, AI-enabled data platform
Because we deal with data from so many sources, we needed a cloud-based enterprise data platform to handle multiple formats and schemas natively. That’s exactly what BigQuery gives us. Since data is the foundation of our company and products, we began by migrating all our data — including the data housed in Snowflake — to BigQuery.
With other data platforms, the data has to be massaged before you can work with it: a time-consuming, often manual process. BigQuery is built to quickly ingest and aggregate data in many different formats, and its query engine allows us to work with data right away in whatever format it lands. Coupled with Looker, we can create easy-to-understand visualizations of the complex data in BigQuery without ever leaving Google Cloud.
In addition, because Gemini Code Assist is integrated with BigQuery, even our less-technical team members can do computational and analytical work.
An integrated tech stack unleashes productivity and creativity
After 10 years in business, SmarterX was also suffering from system sprawl.
Just as migrating data between platforms is inefficient, developers become less efficient when they have to bounce around different tools. Even with the increase in AI agents to help with coding and development, the tools struggle, too: When hopping among multiple systems, they pick up noise along the way. And governing identity and access management (IAM) individually for all those systems was time-consuming and left us vulnerable to potential security risks caused by misapplied access privileges.
Google Cloud provides a fully integrated tech stack that consolidates our databases, data ingestion and processing pipelines, models, compute power — even our email, documents, and office apps — in a single, unified ecosystem. And its LLMs are integrated throughout that ecosystem, from the Chrome browser to the SQL variants themselves. This obviates building custom pipelines for most new data sets and allows us to work more efficiently and coherently:
We’re now releasing new products 10 times faster than we were prior to migrating to Google Cloud.
We onboard new customers in less than a week instead of six months.
Our data pipelines handle 100 times the data they did previously, so we can maintain hundreds of customer deployments with a relatively small staff.
Consolidating on Google Cloud also lowered our overhead by 50% because we deprecated several other SaaS platforms and teams can easily engage with Google’s tools without specialized expertise. Our entire team now lives in Google Cloud: Not an hour goes by that we aren’t using some form of the platform’s technology.
Eliminating system sprawl also means we no longer need to maintain security protocols for separate platforms. Permissioning and identity and access management are handled centrally, and Google Cloud makes it easy to stay current on compliance requirements like SOC-2.
A vision for AI in tune with our own: Gemini
The value SmarterX provides our customers relies heavily on our platform’s AI-driven capabilities. Finding the right AI model development platform and AI models was therefore one of the driving forces behind our choice of a new data platform. And when it comes to creating AI models, philosophy matters.
Google’s philosophy dovetails with our own because they’ve always been at the forefront of understanding how people want to access information. The company’s expertise in making web data searchable at enterprise scale means its Gemini models are beautifully tuned to do what SmarterX needs them to do. Before switching to Vertex AI and Gemini, it took us months to release a new model; we can now do the same work in a matter of weeks.
When SmarterX hires new team members, we look for creative thinkers, not speakers of a specific coding language. And we want to give our developers the brainspace to focus on complex problem-solving rather than puzzling over syntax for SQL coding. Gemini Code Assist in BigQuery is easy to learn and can accurately handle the syntax for them. That leaves our developers more time for finding innovative solutions.
A smooth migration by a team that knows its stuff
We couldn’t have completed our migration without the support of the Google Technical Onboarding Center. They really know their way around their technology and had spelunking tools at the ready for tricky scenarios we encountered along the way.
In less than a month, we migrated terabytes of data from Snowflake to BigQuery: more than 80 databases and thousands of tables from 21 different sources. We used a two-pronged approach that leveraged external tables for rapid access to data and native tables for optimized query performance.
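As a rough illustration of that two-pronged approach (a sketch under assumed names, not SmarterX’s actual pipeline), the BigQuery Python client can define an external table over exported Parquet files for immediate querying and then materialize a native table for optimized performance. The project, dataset, table, and Cloud Storage URIs below are hypothetical.

# Illustrative sketch of the "external table first, native table second" pattern.
# Project, dataset, table names, and GCS URIs are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

# 1. External table: query exported Parquet files in place, no load step required.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://example-export-bucket/orders/*.parquet"]

external_table = bigquery.Table("example-project.staging.orders_external")
external_table.external_data_configuration = external_config
client.create_table(external_table, exists_ok=True)

# 2. Native table: materialize the same data for optimized query performance.
materialize_job = client.query(
    """
    CREATE OR REPLACE TABLE `example-project.analytics.orders`
    AS SELECT * FROM `example-project.staging.orders_external`
    """
)
materialize_job.result()  # Wait for the materialization job to finish.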
Prior to the migration, Google provided foundational training for managing and operating the Google Cloud Platform. They also took the time to understand SmarterX technology. So instead of being constrained by a cookie-cutter migration plan, the Google team helped us to design and schedule a migration — with minimal disruptions or downtime — in the way that made the most sense for SmarterX and our customers. Google’s expertise in best practices for security and identity and access management further enhanced the security of our new cloud environment.
Even though we’re not a huge customer pumping petabytes of data through Google Cloud daily, the team treated us as if we were on par with larger organizations. When you’re literally moving the foundation of your entire business, it feels good to know that Google has your back.
Snowflake felt like a traditional enterprise data warehouse grafted onto the cloud, completely uninfluenced by the AI revolution, with a database that forced us to work in a specific, predetermined way. With BigQuery, we have a real information production system: a computing cloud with a built-in SQL-friendly data platform, a wide-ranging toolset, embedded AI and model development, and a single user interface for developing products our own way.
Unlimited imagination, not roadmaps
Many people are surprised when I tell them that SmarterX doesn’t have roadmaps — we make bets. We’re wagering that companies want AI to solve whatever real-world use cases arise. Rather than telling you what to do, AI has the ability to understand things that were previously impossible to understand, and to help people express ideas that were previously impossible to express.
SmarterX works with over 2,000 brands. Ultimately, what they’re purchasing is the speed at which we can help them solve their business challenges with artificial intelligence. In much the same way, Google Cloud is solving our own technology challenges, sometimes before we even know we have them, so we can deliver top-notch products to our customers.
Instead of doing battle with a growing sprawl of outdated technology, we’re using BigQuery and the rest of the integrated Google Cloud toolset to relentlessly reinvent ourselves. Not a week goes by when I don’t hear someone say, “Oh, wow, we can do that with Google Cloud too?”
Company description
SmarterX helps retailers, manufacturers, and logistics companies minimize regulatory risk, maximize sales, and protect consumers and the environment by giving them AI-driven tools to safely and compliantly sell, ship, store, and dispose of their products. Its clients include global brands that are household names around the world.
The landscape of business intelligence is evolving rapidly, with users expecting greater self-service and natural language capabilities, powered by AI. Looker’s Conversational Analytics empowers everyone in your organization to access the wealth of information within your data. Select the data you wish to explore, ask questions in natural language, as you would a colleague, and quickly receive insightful answers and visualizations that are grounded in truth, thanks to Looker’s semantic layer. This intuitive approach lowers the technical barriers to data analysis and fosters a more data-driven culture across your teams.
How does this work? At its core, Conversational Analytics understands the intent behind your questions. Enhanced by Gemini models, the process involves interpreting your natural language query, generating the appropriate data retrieval logic, and presenting the results in an easy-to-understand format, often as a visualization. This process benefits from Looker’s semantic model, which simplifies complex data with predefined business metrics, so that Gemini’s AI is grounded in a reliable and consistent understanding of your data.
Prioritizing privacy in Gemini
The rise of powerful generative AI models like Gemini brings incredible opportunities for innovation and efficiency. But you need a responsible and secure approach to data. When users use AI tools, questions about data privacy and security are top of mind. How are prompts and data used? Are they stored? Are they used to train the model?
At Google Cloud, the privacy of your data is a fundamental priority when you use Gemini models, and we designed our data governance practices to give you control and peace of mind. Specifically:
Your prompts and outputs are safe
Google Cloud does not train models on your prompts or your company’s data.
Conversational Analytics only uses your data to answer your business questions — making data queries, creating charts and summarizations, and providing answers. We store agent metadata, such as special instructions, to improve the quality of the agent’s answers, and so you can use the same agent in multiple chat sessions. We also store chat conversations so you can pick up where you left off. Both are protected via IAM and not shared with anyone outside your organization without permission.
Your data isn’t used to train our models
The data processing workflow in Conversational Analytics involves multiple steps:
The agent sees the user’s question, identifies the specific context needed to answer it, and uses tools to retrieve useful context like sample data and column descriptions.
Using business and data context and the user question, the agent generates and executes a query to retrieve the data. The data is returned, and the resulting data table is generated.
Previously gathered information can then be used to create visualizations, text explanations, or suggested follow-up questions. Through this process, the system keeps track of the user’s question, the data samples, and the query results, to help formulate a final answer.
When the user asks a new question, such as a follow-up, the previous context of the conversation helps the agent understand the user’s new intent.
Enhancing trust and security in Conversational Analytics
To give you the confidence to rely on Conversational Analytics, we follow a comprehensive set of best practices within Google Cloud:
Leverage Looker’s semantic layer: By grounding Conversational Analytics in Looker’s semantic model, we ensure that the AI model operates on a trusted and consistent understanding of your business metrics. This not only improves the accuracy of insights but also leverages Looker’s established governance framework.
Secure data connectivity: Conversational Analytics connects to Google Cloud services like BigQuery, which have their own robust security measures and access controls. This helps ensure that your underlying data remains protected.
Use data encryption: Data transmitted to Gemini for processing is encrypted in-transit, safeguarding it from unauthorized access. Agent metadata and conversation history are also encrypted.
Continuously monitor and improve: Our teams continuously monitor the performance and security of Conversational Analytics and Gemini in Google Cloud.
Role-based access controls
In addition, Looker provides a robust role-based access control (RBAC) framework that Conversational Analytics leverages to offer granular control over who can interact with specific data. When a Looker user initiates a chat with data, Looker Conversational Analytics respects their established Looker permissions. This means they can only converse with Looker Explores to which they already have access. For instance, while the user might have permission to view two Looker Explores, an administrator could restrict conversational capabilities to only one. As conversational agents become more prevalent, the user will only be able to use those agents to which they have been granted access. Agent creators also have the ability to configure the capabilities of Conversational Analytics agents, for example limiting the user to viewing charts while restricting advanced functionalities like forecasting.
Innovate with confidence
We designed Gemini to be a powerful partner for your business, helping you create, analyze, and automate with Google’s most capable AI models. We’re committed to providing you this capability without compromising your data’s security or privacy, or training on your prompts or data. By not storing your prompts, data, and model outputs or using them for training, you can leverage the full potential of generative AI with confidence, knowing your data remains under your control.
By following these security principles and leveraging Google Cloud’s robust infrastructure, Conversational Analytics offers a powerful, insightful experience that is also secure and trustworthy. By making data insights accessible to everyone securely, you can unlock new levels of innovation and productivity in your organization. Enable Conversational Analytics in Looker today, and start chatting with your data with confidence.
As adversaries grow faster, stealthier, and more destructive, traditional recovery strategies are increasingly insufficient. Mandiant’s M-Trends 2025 report reinforces this shift, highlighting that ransomware operators now routinely target not just production systems but also backups. This evolution demands that organizations re-evaluate their resilience posture. One approach gaining traction is the implementation of an isolated recovery environment (IRE)—a secure, logically separated environment built to enable reliable recovery even when an organization’s primary network has been compromised.
This blog post outlines why IREs matter, how they differ from conventional disaster recovery strategies, and what practical steps enterprises can take to implement them effectively.
The Backup Blind Spot
Most organizations assume that regular backups equal resilience; however, that assumption doesn’t hold up against today’s threat landscape. Ransomware actors and state-sponsored adversaries are increasingly targeting backup infrastructure directly, encrypting, deleting, or corrupting it to prevent recovery and increase leverage.
The M-Trends 2025 report reveals that in nearly half of ransomware intrusions, adversaries used legitimate remote management tools to disable security controls and gain persistence. In these scenarios, the compromise often extends to backup systems, especially those accessible from the main domain.
In short: your backup isn’t safe if it’s reachable from your production network. During an active incident, that makes it irrelevant.
What Is an Isolated Recovery Environment?
An isolated recovery environment (IRE) is a secure enclave designed to store immutable copies of backups and provide a secure space to validate restored workloads and rebuild in parallel while incident responders carry out the forensic investigation. Unlike traditional disaster recovery solutions, which often rely on replication between live environments, an IRE is logically and physically separated from production.
At its core, an IRE is about assuming a breach has occurred and planning for the moment when your primary environment is lost, ensuring you have a clean fallback that hasn’t been touched by the adversary.
Key Characteristics of an IRE
Separation of infrastructure and access: The IRE must be isolated from the corporate environment. No shared authentication, no shared tooling, no shared infrastructure or services, no persistent network links or direct TCP/IP connections between the production environment and the IRE.
Restricted administrative workflows: Day-to-day access is disallowed. Only break-glass, documented processes exist for access during validation or recovery.
Known-good, validated artifacts: Data entering the IRE must be scanned, verified, and stored with cryptographic integrity checks, all while maintaining the isolation controls.
Validation environment and tools: The IRE must also include a secured network environment, which can be used by security teams to validate restored workloads and remove any identified attacker remnants.
Recovery-ready templates: Rather than restoring single machines, the IRE should support the rapid rebuild of critical systems in isolation with predefined procedures.
Implementation Strategy
Successfully implementing an IRE is not a checkbox exercise. It requires coordination between security, infrastructure, identity management, and business continuity teams. The following breaks down the major building blocks and considerations.
Infrastructure Segmentation and Physical Isolation
The foundational principle behind an IRE is separation. The environment must not share any critical infrastructure, identity, network, hypervisors, storage, or other services with the production environment. In most cases, this means:
Dedicated platforms (on-premises or cloud based) and tightly controlled virtualization platforms
No routable paths from production to the IRE network
Physical air-gaps or highly restricted one-way replication mechanisms
Independent DNS, DHCP, and identity services
Figure 2 illustrates the permitted flows into and within the IRE.
Figure 2: Typical IRE architecture
Identity and Access Control
Identity is the primary attack vector in many intrusions. An IRE must establish its own identity plane:
No trust relationships to production Active Directory
No shared local or domain accounts
All administrative access must require phishing resistant multi-factor authentication (MFA)
All administrative access should be via hardened Privileged Access Workstations (PAW) from within the IRE
Where possible, implement just-in-time (JIT) access with full audit logging
Accounts used to manage the IRE should not have any ties to the production environment; this includes being used from a device belonging to the production domain. These accounts must be used from a dedicated PAW.
Secure Administration Flows
Administrative access is often the weak link that attackers exploit. That’s why an IRE must be designed with tight control over how it’s managed, especially during a crisis.
In the following model, all administrative access is performed from a dedicated PAW. This workstation sits inside an isolated management zone and is the only system permitted to access the IRE’s core components.
Here’s how it works:
No production systems, including IT admin workstations, are allowed to directly reach the IRE. These paths are completely blocked.
The PAW manages the IRE’s:
Isolated Data Vault, where validated backups are stored.
Management Plane, which includes IRE services such as Active Directory, DNS, PAM, backup, and recovery systems.
Green VLAN, which hosts rebuilt Tier-0 and Tier-1 infrastructure.
Any restored services go first into a yellow staging VLAN, a controlled quarantine zone with no east-west traffic. Systems must be verified clean before moving into the production-ready green VLAN. Remote access to machines in the yellow VLAN is restricted to console-only access (hypervisor or iLO consoles) from the PAW. No direct RDP/SSH is permitted.
This design ensures that even during a compromise of the production environment, attackers can’t pivot into the recovery environment. All privileged actions are audited, isolated, and console-restricted, giving defenders a clean space to rebuild from.
Figure 3: Permitted administration paths
One-Way Replication and Immutable Storage
How data enters the IRE is just as important as how it’s managed. Backups that are copied into the data transfer zone must be treated as potentially hostile until proven otherwise.
To mitigate risk:
Data must flow in only one direction, from production to IRE, never the other way around.*
This is typically achieved using data diodes or time-gated software replication that enforces unidirectional movement and session expiry.
Ingested data lands in a staging zone where it undergoes:
Hash verification against expected values.
Malware scanning, using both signature and behavioural analysis.
Cross-checks against known-good backup baselines (e.g., file structure, size, time delta).
Once validated, data is committed to immutable storage, often in the form of Write Once, Read Many (WORM) volumes or cloud object storage with compliance-mode object locking. Keys for encryption and retention are not shared with production and must be managed via an isolated KMS or HSM.
The goal is to ensure that even if an attacker compromises your primary backup system, they cannot alter or delete what’s been stored in the IRE.
*Depending on overall recovery strategies, it’s possible that restored workloads may need to move from the IRE back to a rebuilt production environment.
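If the vault is backed by cloud object storage, the ingestion step can be sketched as follows. This is a minimal illustration assuming a Cloud Storage vault bucket with a locked retention policy; the bucket name, file path, retention period, and expected-hash source are hypothetical placeholders rather than a prescribed implementation.

# Minimal sketch: validate a backup's hash, then commit it to an immutable vault bucket.
# Bucket name, file path, retention period, and expected hash are hypothetical.
import hashlib
from google.cloud import storage

RETENTION_SECONDS = 90 * 24 * 3600  # 90-day retention; adjust to policy


def sha256_of(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def commit_to_vault(local_path: str, expected_sha256: str, bucket_name: str = "ire-data-vault"):
    # 1. Hash verification against the expected value from the backup catalog.
    if sha256_of(local_path) != expected_sha256:
        raise ValueError(f"Hash mismatch for {local_path}: refusing to ingest")

    client = storage.Client()
    bucket = client.get_bucket(bucket_name)

    # 2. Enforce and lock a retention policy (WORM-style behavior).
    if not bucket.retention_policy_locked:
        bucket.retention_period = RETENTION_SECONDS
        bucket.patch()
        bucket.lock_retention_policy()  # Irreversible: objects cannot be deleted or overwritten early.

    # 3. Upload the validated artifact into the vault.
    blob = bucket.blob(f"backups/{local_path.rsplit('/', 1)[-1]}")
    blob.upload_from_filename(local_path)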
Recovery Workflows and Drills
An IRE is only useful if it enables recovery under pressure. That means planning and testing full restoration of core services. Effective IRE implementations include:
Templates for rebuilding domain controllers, authentication services, and core applications
Automated provisioning of VMs or containers within the IRE
Access to disaster recovery runbooks that can be followed by incident responders
Scheduled tabletop and full-scale recovery exercises (e.g., quarterly or bi-annually)
Many organizations discover during their first exercise that their documentation is out of date or their backups are incomplete. Recovery drills allow these issues to surface before a real incident forces them into view.
Hash Chaining and Log Integrity
If you’re relying on the IRE for forensic investigation as well as recovery, it’s essential to ensure the integrity of system logs and metadata. This is where hash chaining becomes important.
Implement hash chaining on logs stored in the IRE to detect tampering.
Apply digital signatures from trusted, offline keys.
Regularly verify the chain against trusted checkpoints.
This ensures that during an incident, you can prove not only what happened but also that your evidence hasn’t been modified, either by an attacker or by accident.
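A minimal sketch of the idea (illustrative only, not a specific Mandiant tool): each record stores the hash of the previous record, so altering any historical entry breaks every hash that follows and the tampering is detectable on verification.

# Minimal hash-chaining sketch: each entry binds itself to the previous entry's hash,
# so tampering with any record invalidates the rest of the chain.
import hashlib
import json

GENESIS = "0" * 64  # well-known starting value for an empty chain


def append_entry(chain: list[dict], message: str) -> None:
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    record = {"message": message, "prev_hash": prev_hash}
    record["hash"] = hashlib.sha256(
        (record["prev_hash"] + record["message"]).encode()
    ).hexdigest()
    chain.append(record)


def verify_chain(chain: list[dict]) -> bool:
    prev_hash = GENESIS
    for record in chain:
        expected = hashlib.sha256((prev_hash + record["message"]).encode()).hexdigest()
        if record["prev_hash"] != prev_hash or record["hash"] != expected:
            return False  # tampering detected at this record
        prev_hash = record["hash"]
    return True


# Example usage
log_chain: list[dict] = []
append_entry(log_chain, "backup ingested: sales-db-2025-06-01")
append_entry(log_chain, "malware scan passed: sales-db-2025-06-01")
print(json.dumps(log_chain, indent=2))
print("chain intact:", verify_chain(log_chain))

In practice, the chain head would also be signed with an offline key and periodically compared against trusted checkpoints, as described above.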
Choosing the Right IRE Deployment Model
The right model depends on your environment, compliance obligations, and team maturity.
On-Premises: Advantages: full control; better for air-gapped environments. Challenges: higher CapEx, longer provisioning time, less flexibility.
Cloud: Advantages: faster provisioning, built-in automation, easier to test. Challenges: requires strong cloud security maturity and IAM separation.
Hybrid: Advantages: local speed plus cloud resilience; ideal for large organizations with critical workloads. Challenges: more complex design; requires a secure identity split and replication paths.
Common Pitfalls
Over-engineering for normal operations: The IRE is not a sandbox. Avoid mission creep.
Using the IRE beyond cyber recovery: The IRE is not for DR testing, HA, or daily operations. Any non-incident use risks breaking its isolation and trust model.
Assuming cloud equals isolation: Isolation requires deliberate configuration. Cloud tenancy is not enough.
Neglecting insider threats: The IRE must defend against sabotage from inside the organization, not just ransomware.
Closing Thoughts
As attackers accelerate and the blast radius of intrusions expands, the need for trusted, tamper-proof recovery options becomes clear. An isolated recovery environment is not just a backup strategy, it is a resilience strategy.
It assumes breach. It accepts that visibility may be lost during a crisis. And it gives defenders a place to regroup, investigate, and rebuild.
The Mandiant M-Trends 2025 report makes it clear: the cost of ransomware isn’t just the ransom paid, but days or weeks of downtime, regulatory penalties, and reputational damage. The cost of building an IRE is less than the cost of a breach, and the peace of mind it offers is far greater.
For deeper technical guidance on building secure recovery workflows or assessing your current recovery posture, Mandiant Consulting offers strategic workshops and assessment services.
Acknowledgment
A special thanks to Glenn Staniforth for their contributions.
For organizations with stringent sovereignty and regulatory requirements, Google Distributed Cloud (GDC) air-gapped delivers a fully-managed experience with critical advanced networking capabilities. But operating in a completely isolated environment presents some unique networking challenges. Routine tasks become significantly more complex and manual, demanding more planning and bespoke solutions than on a connected network.
Today, we’re helping to solve these challenges with three major advancements in networking for GDC air-gapped: native IP address management (IPAM), multi-zone load balancing, and workload-level firewall policies — all powerful new capabilities designed to give you more control over your air-gapped environment.
Let’s take a look at these new capabilities.
Streamlined IP management for GDC
With GDC IP address management, you can now plan, track, and monitor IP addresses for all your workloads and infrastructure. This matters because many air-gapped deployments consume IP addresses from your organization’s existing private IP address space, which is finite and can be difficult to manage, scale, and secure. IPAM for GDC provides the following capabilities:
Scalable IP management: Expand your network for Day-2 IP growth, free from duplicate IP address conflicts, and with support for non-contiguous subnets.
Enhanced security and compliance: Strengthen your posture and meet strict compliance requirements with robust IPAM controls, including subnet delegation and private IPs for zonal infrastructure.
Optimized IP resource utilization: Reduce IP sprawl and maximize your finite IP resources.
IPAM for GDC provides the intelligent automation and centralized oversight essential for managing your complete IP lifecycle in secure, air-gapped environments, helping to ensure both operational excellence and adherence to critical regulations.
High availability with multi-zone load balancers
For critical applications, downtime is not an option. Now, you can help your workloads remain resilient and accessible, even in the event of a zone failure.
Our new multi-zone load balancing capability allows you to distribute traffic across multiple availability zones within your GDC environment. Both internal and external load balancers now support this multi-zone functionality, simplifying operations while maximizing uptime. This provides:
Continuous availability: Applications remain accessible even during a complete zone failure.
Operational simplification: There’s a single Anycast IP address for the application (regardless of where backends are located).
Optimized performance: Traffic is routed to the nearest available instance based on network topology and routing metrics.
The load balancing system operates by creating load balancer (LB) objects, which are then handled by new LB API controllers. These controllers manage object conditions, including cross-references and virtual IP address (VIP) auto-reservations, and create Kubernetes services across all clusters.
Workload-level network firewall policies
To secure an environment, you need to control traffic not just at the edge, but between every component inside. That’s why we’re launching workload-level firewall policies as part of the GDC air-gapped product. This feature provides fine-grained control over communication between individual workloads, such as VMs and pods, within a project. This feature helps:
Strengthen your security posture: Isolate workloads and limit communication between them.
Easily apply policies: Define and apply policies to specific workloads or groups of workloads.
Meet regulatory standards: Help adhere to regulatory requirements and internal standards.
GDC air-gapped implements default base network policies to create a secure architecture. To allow intra-project or cross-project traffic at the workload level, you can update these default policies as needed. Policies are multi-zone by default, meaning they affect all zones where your labeled workloads are present. You can enforce policies at the workload level using labels and workload selectors.
A new era of network control
These new capabilities — GDC IPAM, multi-zone load balancing, and workload-level firewall policies — represent a significant step forward in providing a robust, resilient, and secure networking experience for the air-gapped cloud. They work together to simplify your operations, strengthen your security posture, and empower you to run your most sensitive applications with confidence.
To learn more about these features, please refer to our documentation or contact your Google Cloud account team.
Editor’s Note: Today, we’re sharing insights from IDC Research Director, Devin Pratt, as he offers his analysis of recent research on Cloud SQL. In this post, you’ll see how Cloud SQL’s highly flexible, fully managed database service for MySQL, PostgreSQL, and SQL Server workloads can boost performance and cut costs, ultimately freeing your team to focus on core tasks. If you’re interested in exploring Google Cloud’s full range of database services, you can find more at our Google Cloud Databases homepage.
In today’s data-driven landscape, effectively managing databases requires solutions that tackle performance, scalability, and integration challenges. With years of experience analyzing database management systems (DBMS), I have witnessed the industry’s evolution in response to increasing demands for efficiency and innovation. This transformation is notably highlighted in IDC’s recent comprehensive Business Value White Paper, The Business Value of Cloud SQL: Google Cloud’s Relational Database Service for MySQL, PostgreSQL, and SQL Server.
The study examines the experiences of organizations that transitioned from self-managed database servers in their data centers or cloud environments to Cloud SQL. Through my analysis of the database market, I have observed how these transitions can significantly enhance an organization’s operational efficiency and reshape its cost structure. The findings align with my observations, revealing benefits such as reduced operational costs and access to advanced automation and expertise.
These results underscore the evolving nature of the database market and present valuable opportunities for businesses to optimize their operations through the strategic adoption of cloud solutions.
The challenges of modern database management
As a professional in database management, I’ve observed several key challenges facing organizations today:
Performance demands: Applications require faster read/write speeds to maintain responsiveness under heavy workloads.
Downtime issues: Maintenance tasks often disrupt operations, leading to costly interruptions.
Scaling limitations: Technical constraints can hinder database growth and adaptability.
AI integration complexity: Incorporating AI typically requires external tools, adding layers of intricacy.
Resource-intensive management: A DBMS requires expertise and significant investment in maintenance, upgrades, and system resources, often straining IT budgets.
Addressing these issues is crucial for innovation and cost-efficiency in our increasingly data-driven world.
IDC’s Business Value White Paper found that organizations using Cloud SQL have achieved an impressive average three-year ROI of 246%, with a rapid 11-month payback period. Study participants attributed this high return to several factors, including:
Increased operational efficiency: Database administrators and infrastructure teams can focus on strategic and innovative tasks rather than routine maintenance.
Cost reduction: Organizations benefit from lower overall database operational costs, including reduced infrastructure and database expenses.
Enhanced agility: Faster deployment and scaling of database resources enable businesses to better support development activities and adapt to changing needs.
Business growth: Organizations are winning more business by delivering faster, higher-quality products and services, improving application performance, and enhancing user experiences.
Further advancements in database management
Since the publication of the IDC study, Google Cloud has enhanced Cloud SQL in two key areas: price performance and generative AI capabilities.
First, the Enterprise Plus edition now provides businesses with a highly available and reliable database option in addition to the core service. This includes increased read throughput and improved write latency, enhanced scalability with tenfold expanded table support, greater efficiency through near-zero planned downtime with rolling updates for both scaling up and down, and improved disaster recovery capabilities via enhanced failover processes and testing.
Second, Cloud SQL provides a comprehensive set of generative AI tools and capabilities. This includes pgvector support in PostgreSQL and native vector support in MySQL for efficient vector similarity search, alongside streamlined connectivity to Vertex AI, LangChain, and various foundation models through extensions. This enables direct AI application development within the database.
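For readers new to these capabilities, the snippet below sketches basic pgvector usage against a PostgreSQL database such as Cloud SQL. The SQL shown is standard pgvector syntax; the connection parameters, table definition, and embedding dimensions are hypothetical placeholders.

# Illustrative pgvector example against a PostgreSQL database such as Cloud SQL.
# Connection parameters, table, and embedding values are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(host="10.0.0.3", dbname="appdb", user="app", password="...")
conn.autocommit = True
cur = conn.cursor()

# Enable the extension and store embeddings alongside application data.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id BIGSERIAL PRIMARY KEY,
        content TEXT,
        embedding vector(3)  -- real embeddings are typically hundreds of dimensions
    );
""")
cur.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
    ("hello world", "[0.1, 0.2, 0.3]"),
)

# Vector similarity search: '<->' is pgvector's Euclidean distance operator.
cur.execute(
    "SELECT content FROM documents ORDER BY embedding <-> %s::vector LIMIT 5",
    ("[0.1, 0.2, 0.25]",),
)
print(cur.fetchall())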
Conclusion
The IDC Business Value White Paper on Cloud SQL provides data that aligns with many of my observations regarding cloud-based database solutions, highlighting several key areas of improvement:
44% increase in DBA efficiency
28% lower three-year cost of operations
96% faster creation and deployment of new databases
An average annual revenue increase of $21.75 million per organization
These results suggest that managed database services like Cloud SQL may offer significant benefits in operational efficiency, cost reduction, and potential revenue growth.
For those interested in a more comprehensive analysis of these findings and their potential implications, I recommend reviewing the full IDC Business Value White Paper, “The Business Value of Cloud SQL: Google Cloud’s Relational Database Service for MySQL, PostgreSQL, and SQL Server,” sponsored by Google Cloud.
Agents are top of mind for enterprises, but we often find customers building one “super” agent, a jack of all trades, instead of creating multiple agents that can specialize and work together. Monolithic agents often crumble under their own weight because of instruction overload, inaccurate outputs, and brittle systems that are impossible to scale.
The good news: A team of specialized AI agents, each an expert in its domain, can deliver higher fidelity, better control, and true scalability.
The challenge: Building robust multi-agent workflows is complex. This is where Google’s Agent Development Kit (ADK) becomes essential. The ADK provides the framework to design, build, and orchestrate these sophisticated agentic systems, leveraging the power of Gemini. In this post, we’ll show you how you can build a multi-agentic workflow using ADK.
Step 1: Create specialized agents
Instead of one monolithic agent trying to do everything and getting confused, we’ll break the problem down. We’re building a team of focused specialist agents, each with clear instructions for a single job. In this case, we’ll take a travel example:
FlightAgent: Knows only about flights.
HotelAgent: An expert in accommodation.
SightseeingAgent: A dedicated tour guide.
from google.adk.agents import LlmAgent

# Flight Agent: Specializes in flight booking and information
flight_agent = LlmAgent(
    model='gemini-2.0-flash',
    name="FlightAgent",
    description="Flight booking agent",
    instruction=f"""You are a flight booking agent... You always return a valid JSON...""")

# Hotel Agent: Specializes in hotel booking and information
hotel_agent = LlmAgent(
    model='gemini-2.0-flash',
    name="HotelAgent",
    description="Hotel booking agent",
    instruction=f"""You are a hotel booking agent... You always return a valid JSON...""")

# Sightseeing Agent: Specializes in providing sightseeing recommendations
sightseeing_agent = LlmAgent(
    model='gemini-2.0-flash',
    name="SightseeingAgent",
    description="Sightseeing information agent",
    instruction=f"""You are a sightseeing information agent... You always return a valid JSON...""")
To manage these specialists, build a coordinator workflow. Then, create a TripPlanner root agent whose only job is to understand a user’s request and route it to the correct specialist.
# Root agent acting as a Trip Planner coordinator
root_agent = LlmAgent(
    model='gemini-2.0-flash',
    name="TripPlanner",
    instruction=f"""
    Acts as a comprehensive trip planner.
    - Use the FlightAgent to find and book flights
    - Use the HotelAgent to find and book accommodation
    - Use the SightSeeingAgent to find information on places to visit
    ...
    """,
    sub_agents=[flight_agent, hotel_agent, sightseeing_agent]  # The coordinator manages these sub-agents
)
While this works beautifully for simple queries (e.g., “Find me a flight to Paris” is immediately dispatched to the FlightAgent), a new problem quickly becomes apparent. When asked, “Book a flight to Paris and then find a hotel,” the coordinator calls the FlightAgent and stops. It has done its job of routing the initial request, but it cannot orchestrate a multi-step workflow. The manager is a great receptionist but a poor project manager.
This limitation stems from how the system handles sub-agents. When the Root Agent calls the Flight Agent as a sub-agent, the responsibility for answering the user is completely transferred to the Flight Agent. The Root Agent is effectively out of the loop. All subsequent user input will be handled solely by the Flight Agent. This often leads to incomplete or irrelevant answers because the broader context of the initial multi-step request is lost, directly reflecting why the manager struggles as a “project manager” in these scenarios.
Step 2: Give your coordinator tools
The coordinator needed an upgrade. It shouldn’t just forward a request; it needed the ability to use its specialists to complete a bigger project. This led to the next evolution: the Dispatcher Agent with Agent Tools.
Instead of treating the specialists as destinations, we will treat them as tools in the root agent’s toolbox. The root agent could then reason about a complex query and decide to use multiple tools to get the job done.
Using the ADK, the specialized agents are converted into AgentTools.
from google.adk.tools import agent_tool

# Convert specialized agents into AgentTools
flight_tool = agent_tool.AgentTool(agent=flight_agent)
hotel_tool = agent_tool.AgentTool(agent=hotel_agent)
sightseeing_tool = agent_tool.AgentTool(agent=sightseeing_agent)

# Root agent now uses these agents as tools
root_agent = LlmAgent(
    model='gemini-2.0-flash',
    name="TripPlanner",
    instruction=f"""Acts as a comprehensive trip planner...
    Based on the user request, sequentially invoke the tools to gather all necessary trip details...""",
    tools=[flight_tool, hotel_tool, sightseeing_tool]  # The root agent can use these tools
)
This is a game-changer. When the complex query “Book a flight to Paris and then find a hotel” is run, the root agent understands the full request: it intelligently calls the flight_tool, gets the result, and then calls the hotel_tool. It can also suggest two top places to visit using the sightseeing_tool. This to-and-fro communication between the root agent and its specialist tools enables a true multi-step workflow.
However, as the system worked, an inefficiency became noticeable. It found the flight, then it found the hotel. These two tasks are independent. Why couldn’t they be done at the same time?
Step 3: Implement parallel execution
The system is smart, but it’s not as fast as it could be. Tasks that don’t depend on each other can be run concurrently to save time.
The ADK provides a ParallelAgent for this. We use it to fetch flight and hotel details simultaneously. A SequentialAgent then orchestrates the entire workflow: it first gets the sightseeing info, then “fans out” to the parallel agent for flights and hotels, and finally “gathers” all the results with a TripSummaryAgent.
from google.adk.agents import SequentialAgent, ParallelAgent

# 1. Create a parallel agent for concurrent tasks
plan_parallel = ParallelAgent(
    name="ParallelTripPlanner",
    sub_agents=[flight_agent, hotel_agent],  # These run in parallel
)

# 2. Create a summary agent to gather results
trip_summary = LlmAgent(
    name="TripSummaryAgent",
    instruction="Summarize the trip details from the flight, hotel, and sightseeing agents...",
    output_key="trip_summary")

# 3. Create a sequential agent to orchestrate the full workflow
root_agent = SequentialAgent(
    name="PlanTripWorkflow",
    # Run tasks in a specific order, including the parallel step
    sub_agents=[sightseeing_agent, plan_parallel, trip_summary])
We now have an optimized workflow: the system not only handles complex queries, it does so efficiently. It is close to the finish line, but one final question remains: is the final summary any good, and does it always meet quality standards?
Step 4: Create feedback loops
A feedback loop is needed for the system to review its own work.
The idea is to add two more agents to the sequence:
TripSummaryReviewer: An agent whose only job is to evaluate the summary generated by the TripSummaryAgent. It checks for completeness and structure, outputting a simple “pass” or “fail.”
ValidateTripSummaryAgent: A custom agent that checks the reviewer’s status and provides the final, validated output or an error message.
This pattern works by having agents communicate through a shared state. The TripSummaryAgent writes its output to the trip_summary key, and the TripSummaryReviewer reads from that same key to perform its critique.
from typing import AsyncGenerator

from google.adk.agents import BaseAgent
from google.adk.agents.invocation_context import InvocationContext
from google.adk.events import Event
from google.genai.types import Content, Part

# Agent to check if the trip summary meets quality standards
trip_summary_reviewer = LlmAgent(
    name="TripSummaryReviewer",
    instruction=f"""Review the trip summary in {{trip_summary}}.
    If the summary meets quality standards, output 'pass'. If not, output 'fail'.""",
    output_key="review_status",  # Writes its verdict to a new key
)

# Custom agent to check the review status and provide feedback
class ValidateTripSummary(BaseAgent):
    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event, None]:
        status = ctx.session.state.get("review_status", "fail")
        review = ctx.session.state.get("trip_summary", None)
        if status == "pass":
            yield Event(author=self.name,
                        content=Content(parts=[Part(text=f"Trip summary review passed: {review}")]))
        else:
            yield Event(author=self.name,
                        content=Content(parts=[Part(text="Trip summary review failed. Please provide valid requirements.")]))

ValidateTripSummaryAgent = ValidateTripSummary(
    name="ValidateTripSummary",
    description="Validates the trip summary review status and provides feedback based on the review outcome.",
)

# The final, self-regulating workflow
root_agent = SequentialAgent(
    name="PlanTripWorkflow",
    sub_agents=[
        sightseeing_agent,
        plan_parallel,
        trip_summary,
        trip_summary_reviewer,
        ValidateTripSummaryAgent,  # The final validation step
    ])
With this final piece in place, our AI system is no longer a single, confused genius but a highly efficient, self-regulating team of specialists. It can handle complex, multi-step queries with parallel execution for speed and a final review process for quality assurance.
Get started
Ready to build your own multi-agent workflows? Here’s how to get started:
The evolution of AI agents has led to powerful, specialized models capable of complex tasks. The Google Agent Development Kit (ADK) – a toolkit designed to simplify the construction and management of language model-based applications – makes it easy for developers to build agents, usually equipped with tools via the Model Context Protocol (MCP) for tasks like web scraping. However, to unlock their full potential, these agents must be able to collaborate. The Agent-to-Agent (A2A) framework – a standardized communication protocol that allows disparate agents to discover each other, understand their capabilities, and interact securely – provides the standard for this interoperability.
This guide provides a step-by-step process for converting a standalone ADK agent that uses an MCP tool into a fully A2A-compatible component, ready to participate in a larger, multi-agent ecosystem. We will use a MultiURLBrowser agent, designed to scrape web content, as a practical example.
Step 1: Define the core agent and its MCP tool (agent.py)
The foundation of your agent remains its core logic. The key is to properly initialize the ADK LlmAgent and configure its MCPToolset to connect with its external tool.
In agent.py, the _build_agent method is where you specify the LLM and its tools. The MCPToolset is configured to launch the firecrawl-mcp tool, passing the required API key through its environment variables.
code_block
<ListValue: [StructValue([(‘code’, ‘# agents/search_agent/agent.pyrnimport osrnfrom adk.agent import LlmAgentrnfrom adk.mcp import MCPToolsetrnfrom adk.mcp.servers import StdioServerParametersrn# … other importsrnrnclass MultiURLBrowser:rn def _build_agent(self) -> LlmAgent:rn firecrawl_api_key = os.getenv(“FIRECRAWL_API_KEY”)rn if not firecrawl_api_key:rn raise ValueError(“FIRECRAWL_API_KEY environment variable not set.”)rnrn return LlmAgent(rn model=”gemini-1.5-pro-preview-0514″,rn name=”MultiURLBrowserAgent”,rn description=”Assists users by intelligently crawling and extracting information from multiple specified URLs.”,rn instruction=”You are an expert web crawler…”,rn tools=[rn MCPToolset(rn connection_params=StdioServerParameters(rn command=’npx’,rn args=[“-y”, “firecrawl-mcp”],rn env={“FIRECRAWL_API_KEY”: firecrawl_api_key}rn )rn )rn ]rn )rn # …’), (‘language’, ‘lang-py’), (‘caption’, <wagtail.rich_text.RichText object at 0x3eb626384400>)])]>
Step 2: Establish a public identity (__main__.py)
For other agents to discover and understand your agent, it needs a public identity. This is achieved through the AgentSkill and AgentCard in the __main__.py file, which also serves as the entry point for the A2A server.
1. Define AgentSkill: This object acts as a declaration of the agent’s capabilities. It includes a unique ID, a human-readable name, a description, and examples.
code_block
<ListValue: [StructValue([(‘code’, ‘# agents/search_agent/__main__.pyrnfrom a2a.skills.skill_declarations import AgentSkillrnrnskill = AgentSkill(rn id=”MultiURLBrowser”,rn name=”MultiURLBrowser_Agent”,rn description=”Agent to scrape content from the URLs specified by the user.”,rn tags=[“multi-url”, “browser”, “scraper”, “web”],rn examples=[rn “Scrape the URL: https://example.com/page1”,rn “Extract data from: https://example.com/page1 and https://example.com/page2″rn ]rn)’), (‘language’, ‘lang-py’), (‘caption’, <wagtail.rich_text.RichText object at 0x3eb626384460>)])]>
2. Define AgentCard: This is the agent’s primary metadata for discovery. It includes the agent’s name, URL, version, and, crucially, the list of skills it possesses.
code_block
<ListValue: [StructValue([(‘code’, ‘# agents/search_agent/__main__.pyrnfrom a2a.cards.agent_card import AgentCard, AgentCapabilitiesrnrnagent_card = AgentCard(rn name=”MultiURLBrowser”,rn description=”Agent designed to efficiently scrape content from URLs.”,rn url=f”http://{host}:{port}/”,rn version=”1.0.0″,rn defaultInputModes=[‘text’],rn defaultOutputModes=[‘text’],rn capabilities=AgentCapabilities(streaming=True),rn skills=[skill],rn supportsAuthenticatedExtendedCard=True,rn)’), (‘language’, ‘lang-py’), (‘caption’, <wagtail.rich_text.RichText object at 0x3eb6263844c0>)])]>
Step 3: Implement the A2A task manager (task_manager.py)
The AgentTaskManager is the bridge between the A2A framework and your agent’s logic. It implements the AgentExecutor interface, which requires execute and cancel methods.
The execute method is triggered by the A2A server upon receiving a request. It manages the task’s lifecycle, invokes the agent, and streams status updates and results back to the server via an EventQueue and TaskUpdater.
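The original post does not show the task manager’s code, so the following is only a minimal sketch of the pattern described above. The import paths and TaskUpdater method names are assumptions based on the A2A Python SDK and may differ in the version you use.

```python
# agents/search_agent/task_manager.py — illustrative sketch only; import paths
# and TaskUpdater method names are assumptions and may differ by SDK version.
from a2a.server.agent_execution import AgentExecutor, RequestContext
from a2a.server.events import EventQueue
from a2a.server.tasks import TaskUpdater
from a2a.types import TaskState
from a2a.utils import new_agent_text_message


class AgentTaskManager(AgentExecutor):
    """Bridges the A2A server and the MultiURLBrowser agent's logic."""

    def __init__(self, agent):
        self.agent = agent  # wraps the ADK agent and exposes an invoke() generator

    async def execute(self, context: RequestContext, event_queue: EventQueue) -> None:
        # TaskUpdater streams lifecycle events back through the A2A server.
        updater = TaskUpdater(event_queue, context.task_id, context.context_id)
        query = context.get_user_input()

        async for item in self.agent.invoke(query, context.context_id):
            if item["is_task_complete"]:
                # Final result: mark the task complete with the agent's answer.
                await updater.complete(message=new_agent_text_message(item["content"]))
            else:
                # Intermediate progress: keep the task in the "working" state.
                await updater.update_status(
                    TaskState.working,
                    message=new_agent_text_message(item["updates"]),
                )

    async def cancel(self, context: RequestContext, event_queue: EventQueue) -> None:
        # Cancellation is out of scope for this sketch.
        raise NotImplementedError("Cancellation is not supported by this agent.")
```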
Step 4: Create the agent’s invoke method (agent.py)
The invoke method is the entry point into the agent’s core ADK logic. It is called by the AgentTaskManager and is responsible for running the ADK Runner. As the runner processes the query, this asynchronous generator yields events, allowing for streaming of progress updates and the final response.
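The post does not include the invoke code either; a minimal sketch of the pattern is below. The run_async call and is_final_response() check follow the ADK Runner API, while attributes such as self._runner and self._user_id are assumed to be initialized elsewhere in the MultiURLBrowser class.

```python
# agents/search_agent/agent.py (continued) — illustrative sketch only.
from google.genai import types


class MultiURLBrowser:
    # ... __init__ and _build_agent as shown earlier; self._runner is assumed
    # to be an ADK Runner built around the LlmAgent, and self._user_id a fixed ID.

    async def invoke(self, query: str, session_id: str):
        # Wrap the user's request in the content format the ADK Runner expects.
        content = types.Content(role="user", parts=[types.Part(text=query)])

        # run_async yields events as the agent reasons and calls its MCP tool.
        async for event in self._runner.run_async(
            user_id=self._user_id, session_id=session_id, new_message=content
        ):
            if event.is_final_response():
                yield {"is_task_complete": True,
                       "content": event.content.parts[0].text}
            else:
                yield {"is_task_complete": False,
                       "updates": "Scraping in progress..."}
```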
With all components correctly configured, the MultiURLBrowser agent is now a fully operational A2A agent. When a client sends it a request to scrape content, it processes the task and returns the final result. The terminal output below shows a successful interaction, where the agent has received a mission and provided the extracted information as its final response.
Once you have A2A-compatible agents, you can create an “Orchestrator Agent” that delegates sub-tasks to them. This allows for the completion of complex, multi-step workflows.
Step 1: Discover available agents
An orchestrator must first know what other agents are available. This can be achieved by querying a known registry endpoint that lists the AgentCard for all registered agents.
code_block
<ListValue: [StructValue([(‘code’, ‘# Scrap_Translate/agent.pyrnimport httpxrnrnAGENT_REGISTRY_BASE_URL = “http://localhost:10000″rnrnasync with httpx.AsyncClient() as httpx_client:rn base_url = AGENT_REGISTRY_BASE_URL.rstrip(“/”)rn resolver = A2ACardResolver(rn httpx_client=httpx_client,rn base_url=base_url,rn # agent_card_path and extended_agent_card_path use defaults if not specifiedrn )rn final_agent_card_to_use: AgentCard | None = Nonernrn try:rn # Fetches the AgentCard from the standard public path.rn public_card = await resolver.get_agent_card()rn final_agent_card_to_use = public_cardrn except Exception as e:rn # Handle exceptions as needed for your specific use case.rn # For a blog post, you might simplify or omit detailed error handlingrn # if the focus is purely on the successful path.rn print(f”An error occurred: {e}”)’), (‘language’, ‘lang-py’), (‘caption’, <wagtail.rich_text.RichText object at 0x3eb6263845e0>)])]>
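The orchestrator configured in Step 3 passes a list_agents function to FunctionTool, which the post does not show. A minimal sketch, assuming the resolver pattern above and a hypothetical list of known agent endpoints, could look like this:

```python
# Scrap_Translate/agent.py — illustrative sketch; AGENT_URLS is a hypothetical
# registry of known agent endpoints, not part of the original post.
import httpx
from a2a.client import A2ACardResolver

AGENT_URLS = ["http://localhost:10000"]


async def list_agents() -> list[dict]:
    """Fetch the public AgentCard from each known endpoint."""
    cards = []
    async with httpx.AsyncClient() as httpx_client:
        for url in AGENT_URLS:
            resolver = A2ACardResolver(
                httpx_client=httpx_client, base_url=url.rstrip("/"))
            card = await resolver.get_agent_card()
            cards.append(card.model_dump(mode="json", exclude_none=True))
    return cards
```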
Step 2: Call other agents as tools
The orchestrator interacts with other agents using the a2a.client. The call_agent function demonstrates how to construct a SendMessageRequest and dispatch it to a target agent.
code_block
<ListValue: [StructValue([(‘code’, “# Scrap_Translate/agent.pyrnfrom a2a.client import A2AClientrnfrom a2a.client.protocols import SendMessageRequest, MessageSendParamsrnfrom uuid import uuid4rnrnasync def call_agent(agent_name: str, message: str) -> str:rn # In a real implementation, you would resolve the agent’s URL firstrn # using its card from list_agents().rn client = A2AClient(httpx_client=httpx.AsyncClient(timeout=300), agent_card=cards)rnrn payload = {rn ‘message’: {rn ‘role’: ‘user’,rn ‘parts’: [{‘kind’: ‘text’, ‘text’: message}],rn ‘messageId’: uuid4().hex,rn },rn }rn request = SendMessageRequest(id=str(uuid4()), params=MessageSendParams(**payload))rnrn response_record = await client.send_message(request)rn # Extract the text content from the response recordrn response_model = response_record.model_dump(mode=’json’, exclude_none=True)rn return response_model[‘result’][‘status’][‘message’][‘parts’][0][‘text’]”), (‘language’, ‘lang-py’), (‘caption’, <wagtail.rich_text.RichText object at 0x3eb626384640>)])]>
Step 3: Configure the orchestrator’s LLM
Finally, configure the orchestrator’s LlmAgent to use the discovery and delegation functions as tools. Provide a system instruction that guides the LLM on how to use these tools to break down user requests and coordinate with other agents.
code_block
<ListValue: [StructValue([(‘code’, ‘# Scrap_Translate/agent.pyrnfrom adk.agent import LlmAgentrnfrom adk.tools import FunctionToolrnrnsystem_instr = (rn “You are a root orchestrator agent. You have two tools:\n”rn “1) list_agents() → Use this tool to see available agents.\n”rn “2) call_agent(agent_name: str, message: str) → Use this tool to send a task to another agent.\n”rn “Fulfill user requests by discovering and interacting with other agents.”rn)rnrnroot_agent = LlmAgent(rn model=”gemini-1.5-pro-preview-0514″,rn name=”root_orchestrator”,rn instruction=system_instr,rn tools=[rn FunctionTool(list_agents),rn FunctionTool(call_agent),rn ],rn)’), (‘language’, ‘lang-py’), (‘caption’, <wagtail.rich_text.RichText object at 0x3eb6263846a0>)])]>
By following these steps, you can create both specialized, A2A-compatible agents and powerful orchestrators that leverage them, forming a robust and collaborative multi-agent system.
The true power of this architecture becomes visible when the orchestrator agent is run. Guided by its instructions, the LLM correctly interprets a user’s complex request and uses its specialized tools to coordinate with other agents. The screenshot below from a debugging UI shows the orchestrator in action: it first calls list_agents to discover available capabilities and then proceeds to call_agent to delegate the web-scraping task, perfectly illustrating the multi-agent workflow we set out to build.
Get started
This guide details the conversion of a standalone ADK/MCP agent into an A2A-compatible component and demonstrates how to build an orchestrator to manage such agents. The complete source code for all examples, along with official documentation, is available at the links below.
For over two decades, Google has been a pioneer in AI, laying groundwork that has shaped the industry. Concurrently, in the Web3 space, Google focuses on empowering the developer community by providing public goods resources like BigQuery blockchain datasets and testnet faucets, as well as the cloud infrastructure builders will need to bring their decentralized applications to life.
AI x Web3 Landscape
AI for Web3 encompasses the practical ways AI can be applied as a tool to improve the efficiency and effectiveness of Web3 companies and projects – from analytics to market research to chatbots. But one of the most powerful synergies is Web3 AI agents. These autonomous agents leverage AI’s intelligence to act within the Web3 ecosystem, and they rely on Web3’s principles of decentralization and provenance to operate in a trustworthy manner, for use cases ranging from cross-border payments to trust and provenance.
AI agents – autonomous software systems, often powered by Large Language Models (LLMs) – are set to revolutionize Web3 interactions. They can execute complex tasks, manage DeFi portfolios, enhance gaming, analyze data, and interact with blockchains or even other agents without direct human intervention. Imagine agents, equipped with crypto wallets, engaging in transactions with each other using the A2A protocol and facilitating economic activity with stablecoins, simplifying complex transactions.
Key applications of AI for Web3
Sophisticated libraries now equip developers with the tools to build and deploy these agents. These libraries often come with ready-to-use “skills” or “tools” that grant agents immediate capabilities, such as executing swaps on a DEX, posting to decentralized social media, or fetching and interpreting on-chain data. A key innovation is the ability to understand natural language instructions and act on them: for example, an agent can “swap 1 ETH for USDC on the most liquid exchange” without manual intervention. To function, these agents must be provisioned with access to essential Web3 components: RPC nodes to read and write to the blockchain, indexed datasets for efficient querying, and dedicated crypto wallets to hold and transact with digital assets.
How to build Web3 AI Agents with Google Cloud
Google Cloud provides a flexible, end-to-end suite of tools for building Web3 AI Agents, allowing you to start simple and scale to highly complex, customized solutions:
1. For rapid prototyping and no-code development: Vertex AI Agent Builder. Conversational Agents allows for rapid prototyping and deployment of agents through a user-friendly interface, making agent building accessible even to non-technical users (refer to the Agent Builder codelab for a quick start). To keep things simple and fast, the platform provides a focused set of foundational tools: agents can be easily augmented with standard capabilities like leveraging datastores, performing Google searches, or accessing websites and files. However, for more advanced functionality—such as integrating crypto wallets, ensuring MCP compatibility, or implementing custom models and orchestration—custom development is the recommended path.
2. For full control and custom agent architecture: open-source frameworks on Vertex AI. For highly customized needs, developers can build their own agent architecture using open-source frameworks (Agent Development Kit, LangGraph, CrewAI) powered by state-of-the-art LLMs available through Vertex AI, such as Gemini (including Gemini 2.5 Pro, which leads the Chatbot Arena at the time of publication) and Claude. A typical Web3 agent architecture (shown below) involves a user interface, an agent runtime orchestrating tasks, an LLM for reasoning, memory for state management, and various tools/plugins (blockchain connectors, wallet managers, search, etc.) connected via adapters.
Example of a Web3 agent architecture
Some of the key features when using Agent Development Kit are as follows:
Easily define and orchestrate workflows across many agents and tools – for example, you can use sub-agents that each handle part of the logic. In the crypto agent example above, one agent can find trending projects or tokens on Twitter/X, another can research those projects via Google Search, and a third can take actions on the user’s behalf using the crypto wallet (a rough sketch of this pattern follows this list).
Model agnostic – you can use any model from Google or other providers and switch between them easily
Intuitive local development for fast iteration – you can visualize the agent topology and trace an agent’s actions easily. Just run the ADK agent locally and start testing by chatting with the agent.
Screenshot of ADK Dev UI used for testing and developing agents
Supports MCP and A2A (agent-to-agent standard) out of the box: allow your agents to communicate with other services and other agents seamlessly using standardized protocols
Deployment agnostic: Agents can be containerized and deployed on Agent Engine, Cloud Run or GKE easily. Vertex AI Agent Engine offers a managed runtime environment, where Google Cloud handles scaling, security, infrastructure management, as well as providing easy tools for evaluating and testing the agents. This abstracts away deployment and scaling complexities, letting developers focus on agent functionality.
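As a rough sketch of the sub-agent pattern described in the first bullet above, the snippet below composes three ADK agents in sequence. The agent names, the model string, and the execute_swap wallet function are hypothetical placeholders, not a production Web3 integration.

```python
# Illustrative sketch of a multi-agent crypto workflow with the ADK.
from google.adk.agents import LlmAgent, SequentialAgent
from google.adk.tools import google_search


def execute_swap(token_in: str, token_out: str, amount: float) -> str:
    """Hypothetical wallet action; a real implementation would sign and submit
    the transaction through the project's wallet manager or an RPC provider."""
    return f"Swapped {amount} {token_in} for {token_out} (simulated)."


trend_agent = LlmAgent(
    model="gemini-2.5-pro", name="trend_finder",
    instruction="Identify trending Web3 projects or tokens for the user's request.",
    output_key="trending_tokens")

research_agent = LlmAgent(
    model="gemini-2.5-pro", name="project_researcher",
    instruction="Research the projects in {trending_tokens} and summarize risks.",
    tools=[google_search],  # built-in ADK search tool
    output_key="research_notes")

action_agent = LlmAgent(
    model="gemini-2.5-pro", name="wallet_agent",
    instruction="Based on {research_notes}, propose and execute approved swaps.",
    tools=[execute_swap])  # plain Python callables are wrapped as function tools

root_agent = SequentialAgent(
    name="crypto_workflow",
    sub_agents=[trend_agent, research_agent, action_agent])
```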
Get started
We are always looking for Web3 companies to build with us. If this is an area you want to explore, please express your interest here.
For more details on how Web3 customers are leveraging Google Cloud, refer to this webinar on the Intersection of AI and Web3.
Thank you to Pranav Mehrotra, Web3 Strategic Pursuit Lead, for his help writing and reviewing this article.
aside_block
<ListValue: [StructValue([(‘title’, ‘$300 in free credit to try Google Cloud AI and ML’), (‘body’, <wagtail.rich_text.RichText object at 0x3e173bf8a370>), (‘btn_text’, ‘Start building for free’), (‘href’, ‘http://console.cloud.google.com/freetrial?redirectPath=/vertex-ai/’), (‘image’, None)])]>
Welcome to the second Cloud CISO Perspectives for June 2025. Today, Thiébaut Meyer and Bhavana Bhinder from Google Cloud’s Office of the CISO discuss our work to help defend European healthcare against cyberattacks.
As with all Cloud CISO Perspectives, the contents of this newsletter are posted to the Google Cloud blog. If you’re reading this on the website and you’d like to receive the email version, you can subscribe here.
aside_block
<ListValue: [StructValue([(‘title’, ‘Get vital board insights with Google Cloud’), (‘body’, <wagtail.rich_text.RichText object at 0x3e4d7ea57130>), (‘btn_text’, ‘Visit the hub’), (‘href’, ‘https://cloud.google.com/solutions/security/board-of-directors?utm_source=cloud_sfdc&utm_medium=email&utm_campaign=FY24-Q2-global-PROD941-physicalevent-er-CEG_Boardroom_Summit&utm_content=-&utm_term=-‘), (‘image’, <GAEImage: GCAT-replacement-logo-A>)])]>
The global threats facing European hospitals and health organizations
By Thiébaut Meyer, director, Office of the CISO, and Bhavana Bhinder, European healthcare and life sciences lead, Office of the CISO
Thiébaut Meyer, director, Office of the CISO
As the global threat landscape continues to evolve, hospitals and healthcare organizations remain primary targets for cyber threat actors. To help healthcare organizations defend themselves so they can continue to provide critical, life-saving patient care — even while facing cyberattacks — the European Commission has initiated the European Health Security Action Plan to improve the cybersecurity of hospitals and healthcare providers.
There are two imperative steps that would both support Europe’s plan and bolster resilience in our broader societal fabric: Prioritizing healthcare as a critical domain for cybersecurity investment, and emphasizing collaboration with the private sector. This approach, acknowledging the multifaceted nature of cyber threats and the interconnectedness of healthcare systems, is precisely what is required to secure public health in an increasingly digitized world. It’s great to see the European Commission has recently announced funding to improve cybersecurity, including for European healthcare entities.
Bhavana Bhinder, European healthcare and life sciences lead, Office of the CISO
At Google, we have cultivated extensive industry partnerships across the European Union to help healthcare organizations of all levels of digital sophistication and capability be more resilient in the face of cyberattacks.
Collaboration across healthcare organizations, regulators, information sharing bodies and technology providers like Google is essential to get and stay ahead of these attacks.
Cyberattacks targeting the healthcare domain, especially those that leverage ransomware, can take over healthcare systems – completely upending their operations, preventing life-saving medical procedures, disrupting critical scheduling and payment activities, halting the delivery of critical supplies like blood and tissue donations, and even rendering care facilities physically unsafe. In some cases, these cyberattacks have contributed to patient mortality. The statistics paint a grim picture:
Ransomware attacks accounted for 54% of analyzed cybersecurity incidents in the EU health sector between 2021 and 2023, with 83% financially motivated.
71% of ransomware attacks impacted patient care and were often coupled with patient data breaches, according to a 2024 European Commission report.
Healthcare’s share of posts on data leak sites has doubled over the past three years, even as the number of data leak sites tracked by Google Threat Intelligence Group increased by nearly 50% in 2024. In one example, a malicious actor targeting European organizations said that they were willing to pay 2% to 5% more for hospitals — particularly ones with emergency services.
In-hospital mortality shoots up 35% to 41% among patients already admitted to a hospital when a ransomware attack takes place.
The U.K.’s National Health Service (NHS) has confirmed that a major cyberattack harmed 170 patients in 2024.
“Achieving resilience necessitates a holistic and adaptive approach, encompassing proactive prevention that uses modern, secure-by-design technologies paired with robust detection and incident response, stringent supply chain management, comprehensive human factor mitigation, strategic utilization of artificial intelligence, and targeted investment in securing unique healthcare vulnerabilities,” said Google Cloud’s Taylor Lehmann, director, Healthcare and Life Sciences, Office of the CISO. “Collaboration across healthcare organizations, regulators, information sharing bodies and technology providers like Google is essential to get and stay ahead of these attacks.”
Bold action is needed to combat this scourge, and that action should include helping healthcare providers migrate to modern technology that has been built securely by design and stays secure in use. We believe security must be embedded from the outset — not as an afterthought — and continuously thereafter. Google’s secure-by-design products and services have helped support hospitals and health organizations across Europe in addressing the pervasive risks posed by cyberattacks, including ransomware.
Secure-by-design is a proactive approach that ensures core technologies like Google Cloud, Google Workspace, Chrome, and ChromeOS are built with inherent protections, such as:
Encrypting Google Cloud customer data at rest by default and data in transit across its physical boundaries, offering multiple options for encryption key management and key access justification.
Building security and compliance into ChromeOS, which powers Chromebooks, to help protect against ransomware attacks. ChromeOS boasts a record of no reported ransomware attacks. Its architecture includes capabilities such as Verified Boot, sandboxing, blocked executables, and user space isolation, along with automatic, seamless updates that proactively patch vulnerabilities.
Providing health systems with a secure alternative through Chrome Enterprise Browser and ChromeOS for accessing internet-based and internal IT resources crucial for patient care.
Committing explicitly in our contracts to implementing and maintaining robust technical, organizational, and physical security measures, and supporting NIS2 compliance efforts for Google Cloud and Workspace customers.
Our products and services are already helping modernize and secure European healthcare organizations, including:
In Germany, healthcare startup Hypros has been collaborating with Google Cloud to help hospitals detect health incidents without compromising patient privacy. Hypros’ innovative patient monitoring system uses our AI and cloud computing capabilities to detect and alert staff to in-hospital patient emergencies, such as out-of-bed falls, delirium onset, and pressure ulcers. They’ve tested the technology in real-world trials at leading institutions including the University Hospital Schleswig-Holstein, one of the largest medical care centers in Europe.
In Portugal, CUF, the country’s largest healthcare provider with 19 hospitals and clinics, has embraced Google Chrome and cloud applications to enhance energy efficiency and streamline IT operations. ChromeOS is noted in the industry for its efficiency, enabling operations on machines that consume less energy and simplifying IT management by reducing the need for on-site hardware maintenance.
In Spain, the Canary Islands 112 Emergency and Safety Coordination Center is migrating to Google Cloud. Led by the public company Gestión de Servicios para la Salud y Seguridad en Canary Islands (GCS) and developed in conjunction with Google Cloud, this migration is one of the first in which a public emergency services administration has moved to the public cloud. They’re also using Google Cloud’s sovereign cloud solutions to help securely share critical information, such as call recordings and personal data, with law enforcement and judicial bodies.
We believe that information sharing must extend beyond threat intelligence to encompass data-supported conclusions regarding effective practices, counter-measures, and successes. Reducing barriers to sophisticated and rapid intelligence-sharing, coupled with verifiable responses, can be the decisive factor between a successful defense and a vulnerable one.
Our engagement with organizations including the international Health-ISAC and ENISA underscores our commitment to building trust across many communities, a concept highly pertinent to the EU’s objective of supporting the European Health ISAC and the U.S.-based Health-ISAC’s EU operations.
Protecting European health data with Sovereign Cloud and Confidential Computing
We’re committed to digital sovereignty for the EU and to helping healthcare organizations take advantage of the transformative potential of cloud and AI without compromising on security or patient privacy.
We’ve embedded our secure-by-design principles in our approach to our digital sovereignty solutions. By enabling granular control over data location, processing, and access, European healthcare providers can confidently adopt scalable cloud infrastructure and deploy advanced AI solutions, secure in the knowledge that their sensitive patient data remains protected and compliant with European regulations like GDPR, the European Health Data Space (EHDS), and the Network and Information Systems Directive.
Additionally, Confidential Computing, a technology that we helped pioneer, has helped close a critical security gap by protecting data in use.
Google Cloud customers such as AiGenomix leverage Confidential Computing to deliver infectious disease surveillance and early cancer detection. Confidential Computing helps them ensure privacy and security for genomic and related health data assets, and also align with the EHDS’s vision for data-driven improvements in healthcare delivery and outcomes.
Building trust in global healthcare resilience
We believe that these insights and capabilities offered by Google can significantly contribute to the successful implementation of the European Health Security Action Plan. We are committed to continued collaboration with the European Commission, EU member states, and all stakeholders to build a more secure and resilient digital future for healthcare.
To learn more about Google’s efforts to secure and support healthcare organizations around the world, contact our Office of the CISO.
aside_block
<ListValue: [StructValue([(‘title’, ‘Join the Google Cloud CISO Community’), (‘body’, <wagtail.rich_text.RichText object at 0x3e4d7ea57af0>), (‘btn_text’, ‘Learn more’), (‘href’, ‘https://rsvp.withgoogle.com/events/ciso-community-interest?utm_source=cgc-blog&utm_medium=blog&utm_campaign=2024-cloud-ciso-newsletter-events-ref&utm_content=-&utm_term=-‘), (‘image’, <GAEImage: GCAT-replacement-logo-A>)])]>
In case you missed it
Here are the latest updates, products, services, and resources from our security teams so far this month:
Securing open-source credentials at scale: We’ve developed a powerful tool to scan open-source package and image files by default for leaked Google Cloud credentials. Here’s how to use it. Read more.
Audit smarter: Introducing our Recommended AI Controls framework: How can we make AI audits more effective? We’ve developed an improved approach that’s scalable and evidence-based: the Recommended AI Controls framework. Read more.
Google named a Strong Performer in The Forrester Wave for security analytics platforms: Google has been named a Strong Performer in The Forrester Wave™: Security Analytics Platforms, Q2 2025, in our first year of participation. Read more.
Mitigating prompt injection attacks with a layered defense strategy: Our prompt injection security strategy is comprehensive, and strengthens the overall security framework for Gemini. We found that model training with adversarial data significantly enhanced our defenses against indirect prompt injection attacks in Gemini 2.5 models. Read more.
Just say no: Build defense in depth with IAM Deny and Org Policies: IAM Deny and Org Policies provide a vital, scalable layer of security. Here’s how to use them to boost your IAM security. Read more.
Please visit the Google Cloud blog for more security stories published this month.
What’s in an ASP? Creative phishing attack on prominent academics and critics of Russia: We detail two distinct threat actor campaigns based on research from Google Threat Intelligence Group (GTIG) and external partners, who observed a Russia state-sponsored cyber threat actor targeting prominent academics and critics of Russia and impersonating the U.S. Department of State. The threat actor often used extensive rapport building and tailored lures to convince the target to set up application-specific passwords (ASPs). Read more.
Remote Code Execution on Aviatrix Controller: A Mandiant Red Team case study simulated an “Initial Access Brokerage” approach and discovered two vulnerabilities on Aviatrix Controller, a software-defined networking utility that allows for the creation of links between different cloud vendors and regions. Read more.
Please visit the Google Cloud blog for more threat intelligence stories published this month.
Now hear this: Podcasts from Google Cloud
AI red team surprises, strategies, and lessons: Daniel Fabian joins hosts Anton Chuvakin and Tim Peacock to talk about lessons learned from two years of AI red teaming at Google. Listen here.
Practical detection-as-code in the enterprise: Is detection-as-code just another meme phrase? Google Cloud’s David French, staff adoption engineer, talks with Anton and Tim about how detection-as-code can help security teams. Listen here.
Cyber-Savvy Boardroom: What Phil Venables hears on the street: Phil Venables, strategic security adviser for Google Cloud, joins Office of the CISO’s Alicja Cade and David Homovich to discuss what he’s hearing directly from boards and executives about the latest in cybersecurity, digital transformation, and beyond. Listen here.
Beyond the Binary: Attributing North Korean cyber threats: Who names the world’s most notorious APTs? Google reverse engineer Greg Sinclair shares with host Josh Stroschein how he hunts down and names malware and threat actors, including Lazarus Group, the North Korean APT. Listen here.
To have our Cloud CISO Perspectives post delivered twice a month to your inbox, sign up for our newsletter. We’ll be back in a few weeks with more security-related updates from Google Cloud.
Written by: Seemant Bisht, Chris Sistrunk, Shishir Gupta, Anthony Candarini, Glen Chason, Camille Felx Leduc
Introduction — Why Securing Protection Relays Matters More Than Ever
Substations are critical nexus points in the power grid, transforming high-voltage electricity to ensure its safe and efficient delivery from power plants to millions of end-users. At the core of a modern substation lies the protection relay: an intelligent electronic device (IED) that plays a critical role in maintaining the stability of the power grid by continuously monitoring voltage, current, frequency, and phase angle. Upon detecting a fault, it instantly isolates the affected zone by tripping circuit breakers, thus preventing equipment damage, fire hazards, and cascading power outages.
As substations become more digitized, incorporating IEC 61850, Ethernet, USB, and remote interfaces, relays are no longer isolated devices, but networked elements in a broader SCADA network. While this enhances visibility and control, it also exposes relays to digital manipulation and cyber threats. If compromised, a relay can be used to issue false trip commands, alter breaker logic, and disable fault zones. Attackers can stealthily modify vendor-specific logic, embed persistent changes, and even erase logs to avoid detection. A coordinated attack against multiple critical relays can lead to a cascading failure across the grid, potentially causing a large-scale blackout.
This threat is not theoretical. State-sponsored adversaries have repeatedly demonstrated their capability to cause widespread blackouts, as seen in the INDUSTROYER (2016), INDUSTROYER.V2 (2022), and novel living-off-the-land technique (2022) attacks in Ukraine, where they issued unauthorized commands over standard grid protocols. The attack surface extends beyond operational protocols to the very tools engineers rely on, as shown when Claroty’s Team82 revealed a denial-of-service vulnerability in the Siemens DIGSI 4 configuration software. Furthermore, the discovery of malware toolkits like INCONTROLLER shows attackers are developing specialized capabilities to map, manipulate, and disable protection schemes across multiple vendors.
Recent events have further underscored the reality of these threats, with heightened risks of Iranian cyberattacks targeting vital networks in the wake of geopolitical tensions. Iran-nexus threat actors such as UNC5691 (aka CyberAv3ngers) have a history of targeting operational technology, in some cases including U.S. water facilities. Similarly, persistent threats from China, such as UNC5135, which at least partially overlaps with publicly reported Volt Typhoon activity, demonstrate a strategic effort to embed within U.S. critical infrastructure for potential future disruptive or destructive cyberattacks. The tactics of these adversaries, which range from exploiting weak credentials to manipulating the very logic of protection devices, make the security of protection relays a paramount concern.
These public incidents mirror the findings from our own Operational Technology (OT) Red Team simulations, which consistently reveal accessible remote pathways into local substation networks and underscore the potential for adversaries to manipulate protection relays within national power grids.
Protection relays are high-value devices, and prime targets for cyber-physical attacks targeting substation automation systems and grid management systems. Securing protection relays is no longer just a best practice; it’s absolutely essential for ensuring the resilience of both transmission and distribution power grids.
Inside a Substation — Components and Connectivity
To fully grasp the role of protection relays within the substation, it’s important to understand the broader ecosystem they operate in. Modern substations are no longer purely electrical domains. They are cyber-physical environments where IEDs, deterministic networking, and real-time data exchange work in concert to deliver grid reliability, protection, and control.
Core Components
Protection & Control Relays (IEDs): Devices such as the SEL-451, ABB REL670, GE D60, and Siemens 7SJ85 serve as the brains of both protection and control. They monitor current, voltage, frequency, and phase angle, and execute protection schemes like:
Overcurrent (ANSI 50/51)
Distance protection (ANSI 21)
Differential protection (ANSI 87)
Under/over-frequency (ANSI 81)
Synch-check (ANSI 25)
Auto-reclose (ANSI 79)
Breaker failure protection (ANSI 50BF)
Logic-based automation and lockout (e.g., ANSI 94)
(Note: These ANSI function numbers follow the IEEE Standard C37.2 and are universally used across vendors to denote protective functions.)
Circuit Breakers & Disconnectors: High-voltage switching devices operated by relays to interrupt fault current or reconfigure line sections. Disconnectors provide mechanical isolation and are often interlocked with breaker status to prevent unsafe operation.
Current Transformers (CTs) & Potential Transformers (PTs): Instrument transformers that step down high voltage and current for safe and precise measurement. These form the primary sensing inputs for protection and metering functions.
Station Human-Machine Interfaces (HMIs): Provide local visualization and control for operators. HMIs typically connect to relay networks via the station bus, offering override, acknowledgment, and command functions without needing SCADA intervention.
Remote Terminal Units (RTUs) or Gateway Devices: In legacy or hybrid substations, RTUs aggregate telemetry from field devices and forward it to control centers. In fully digital substations, this function may be handled by SCADA gateways or station-level IEDs that natively support IEC 61850 or legacy protocol bridging.
Time Synchronization Devices: GPS clocks or PTP servers are deployed to maintain time alignment across relays, sampled value streams, and event logs. This is essential for fault location, waveform analysis, and sequence of events (SoE) correlation.
Network Architecture
Modern digital substations are engineered with highly segmented network architectures to ensure deterministic protection, resilient automation, and secure remote access. These systems rely on fiber-based Ethernet communication and time-synchronized messaging to connect physical devices, intelligent electronics, SCADA systems, and engineering tools across three foundational layers.
Figure 1: Substation Network Architecture
Network Topologies: Substations employ redundant Ethernet designs to achieve high availability and zero-packet-loss communication, especially for protection-critical traffic.
Common topologies include:
RSTP (Rapid Spanning Tree Protocol) – Basic redundancy by blocking loops in switched networks
PRP (Parallel Redundancy Protocol) – Simultaneous frame delivery over two independent paths
HSR (High-availability Seamless Redundancy) – Ring-based protocol that allows seamless failover for protection traffic
Communication Layers: Zones and Roles
Modern substations are structured into distinct functional network layers, each responsible for different operations, timing profiles, and security domains. Understanding this layered architecture is critical to both operational design and cyber risk modeling.
Process Bus / Bay Level Communication
This is the most time-sensitive layer in the substation. It handles deterministic, peer-to-peer communication between IEDs (Intelligent Electronic Devices), Merging Units (MUs), and digital I/O modules that directly interact with primary field equipment.
Includes:
Protection and Control IEDs – Relay logic for fault detection and breaker actuation
MUs – Convert CT/PT analog inputs into digitized Sampled Values (SV)
IED I/O Modules – Digitally interface with trip coils and status contacts on breakers
Circuit Breakers, CTs, and PTs – Primary electrical equipment connected through MUs and I/O
Master clock or time source – Ensures time-aligned SV and event data using PTP (IEEE 1588) or IRIG-B
Key protocols:
IEC 61850-9-2 (SV) – Real-time sampled analog measurements
Time Sync (PTP/IRIG-B) – Sub-millisecond alignment across protection systems
Station Bus / Substation Automation LAN (Supervisory and Control Layer)
The Station Bus connects IEDs, local operator systems, SCADA gateways, and the Substation Automation System (SAS). It is responsible for coordination, data aggregation, event recording, and forwarding data to control centers.
Includes:
SAS – Central event and logic manager
HMIs – Local operator access
Engineering Workstation (EWS) – Access point for authorized relay configuration and diagnostics
RTUs / SCADA Gateways – Bridge to EMS/SCADA networks
Managed Ethernet Switches (PRP/HSR) – Provide reliable communication paths
Key protocols:
IEC 60870-5-104 / DNP3 – Upstream telemetry to control center
Modbus (legacy) – Field device communication
SNMP (secured) – Network health monitoring
Engineering Access (Role-Based, Cross-Layer): Engineering access is not a stand-alone communication layer but a privileged access path used by protection engineers and field technicians to perform maintenance, configuration, and diagnostics.
Access Components:
EWS – Direct relay interface via MMS or console
Jump Servers / VPNs – Controlled access to remote or critical segments
Terminal/ Serial Consoles – Used for maintenance and troubleshooting purposes
What Protection Relays Really Do
In modern digital substations, protection relays—more accurately referred to as IEDs—have evolved far beyond basic trip-and-alarm functions. These devices now serve as cyber-physical control points, responsible not only for detecting faults in real time but also for executing programmable logic, recording event data, and acting as communication intermediaries between digital and legacy systems.
At their core, IEDs monitor electrical parameters, such as voltage, current, frequency, and phase angle, and respond to conditions like overcurrent, ground faults, and frequency deviations. Upon fault detection, they issue trip commands to circuit breakers—typically within one power cycle (e.g., 4–20 ms)—to safely isolate the affected zone and prevent equipment damage or cascading outages.
Beyond traditional protection: Modern IEDs provide a rich set of capabilities that make them indispensable in fully digitized substations.
Trip Logic Processing: Integrated logic engines (e.g., SELogic, FlexLogic, CFC) evaluate multiple real-time conditions to determine if, when, and how to trip, block, or permit operations.
Event Recording and Fault Forensics: Devices maintain Sequence of Events (SER) logs and capture high-resolution oscillography (waveform data), supporting post-event diagnostics and root-cause analysis.
Local Automation Capabilities: IEDs can autonomously execute transfer schemes, reclose sequences, interlocking, and alarm signaling often without intervention from SCADA or a centralized controller.
Protocol Bridging and Communication Integration: Most modern relays support and translate between multiple protocols, including IEC 61850, DNP3, Modbus, and IEC 60870-5-104, enabling them to function as data gateways or edge translators in hybrid communication environments.
Application across the grid: These devices ensure rapid fault isolation, coordinated protection, and reliable operation across transmission, distribution, and industrial networks.
Transmission and distribution lines (e.g., SIPROTEC)
Power Transformers (e.g., ABB RET615)
Feeders, Motors and industrial loads (e.g., GE D60)
How Attackers Can Recon and Target Protection Relays
As substations evolve into digital control hubs, their critical components, particularly protection relays, are no longer isolated devices. These IEDs are now network-connected through Ethernet, serial-to-IP converters, USB interfaces, and in rare cases, tightly controlled wireless links used for diagnostics or field tools.
While this connectivity improves maintainability, remote engineering access, and real-time visibility, it also expands the cyberattack surface, exposing relays to risks of unauthorized logic modification, protocol exploitation, or lateral movement from compromised engineering assets.
Reconnaissance From the Internet
Attackers often begin with open-source intelligence (OSINT), building a map of the organization’s digital and operational footprint. They aren’t initially looking for IEDs or substations; they’re identifying the humans who manage them.
Social Recon: Using LinkedIn, engineering forums, or vendor webinars, attackers look for job titles like “Substation Automation Engineer,” “Relay Protection Specialist,” or “SCADA Administrator.”
OSINT Targeting: Public resumes and RFI documents may reference software like DIGSI, PCM600, or AcSELerator. Even PDF metadata from utility engineering documents can reveal usernames, workstation names, or VPN domains.
Infrastructure Scanning: Tools like Shodan or Censys help identify exposed VPNs, engineering portals, and remote access gateways. If these systems support weak authentication or use outdated firmware, they become initial entry points.
Exploitation of Weak Vendor Access: Many utilities still use stand-alone VPN credentials for contractors and OEM vendors. These accounts often bypass centralized identity systems, lack 2FA, and are reused across projects.
Reconnaissance in IT — Mapping the Path to OT
Once an attacker gains a foothold within the IT network—typically through phishing, credential theft, or exploiting externally exposed services—their next objective shifts toward internal reconnaissance. The target is not just domain dominance, but lateral movement toward OT-connected assets such as substations or Energy Management Systems (EMS).
Domain Enumeration: Using tools like BloodHound, attackers map Active Directory for accounts, shares, and systems tagged with OT context (e.g., usernames like scada_substation_admin, and groups like scada_project and scada_communication).
This phase allows the attacker to pinpoint high-value users and their associated devices, building a shortlist of engineering staff, contractors, or control center personnel who likely interface with OT assets.
Workstation & Server Access: Armed with domain privileges and OT-centric intelligence, the attacker pivots to target the workstations or terminal servers used by the identified engineers. These endpoints are rich in substation-relevant data, such as:
Relay configuration files (.cfg, .prj, .set)
VPN credentials or profiles for IDMZ access
Passwords embedded in automation scripts or connection managers
Access logs or RDP histories indicating commonly used jump hosts
At this stage, the attacker is no longer scanning blindly; they’re executing highly contextual moves to identify paths from IT into OT.
IDMZ Penetration — Crossing the Last Boundary
Using gathered VPN credentials, hard-coded SSH keys, or jump host details, the attacker attempts to cross into the DMZ. This zone typically mediates communication between IT and OT, and may be accessed via:
Engineering jump hosts (dual-homed systems, often less monitored)
Poorly segmented RDP gateways with reused credentials
Exposed management ports on firewalls or remote access servers
Once in the IDMZ, attackers map accessible subnets and identify potential pathways into live substations.
Substation Discovery and Technical Enumeration
Once an attacker successfully pivots into the substation network, often via compromised VPN credentials, engineering jump hosts, or dual-homed assets bridging corporate and OT domains, the next step is to quietly enumerate the substation landscape. At this point, they are no longer scanning broadly but conducting targeted reconnaissance to identify and isolate high-value assets, particularly protection relays.
Rather than using noisy tools like nmap with full port sweeps, attackers rely on stealthier techniques tailored for industrial networks. These include passive traffic sniffing and protocol-specific probing to avoid triggering intrusion detection systems or log correlation engines. For example, using custom Python or Scapy scripts, the attacker might issue minimal handshake packets for protocols such as IEC 61850 MMS, DNP3, or Modbus, observing how devices respond to crafted requests. This helps fingerprint device types and capabilities without sending bulk probes.
Simultaneously, MAC address analysis plays a crucial role in identifying vendors. Many industrial devices use identifiable prefixes unique to specific power control system manufacturers. Attackers often leverage this to differentiate protection relays from HMIs, RTUs, or gateways with a high degree of accuracy.
Additionally, by observing mirrored traffic on span ports or through passive sniffing on switch trunks, attackers can detect GOOSE messages, Sampled Values (SV), or heartbeat signals indicative of live relay communication. These traffic patterns confirm the presence of active IEDs, and in some cases, help infer the device’s operational role or logical zone.
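As an illustration of this kind of passive fingerprinting, and of how defenders can build the same visibility for their own asset inventories, the short sketch below watches a mirrored port for GOOSE and Sampled Values frames by EtherType and records the source MAC prefixes. The interface name is an assumption.

```python
# Passive identification of GOOSE / Sampled Values traffic on a span port.
# Illustrative sketch; "eth0" is a placeholder interface name.
from scapy.all import Ether, sniff

GOOSE_ETHERTYPE = 0x88B8   # IEC 61850-8-1 GOOSE frames
SV_ETHERTYPE = 0x88BA      # IEC 61850-9-2 Sampled Values frames


def classify(pkt):
    if Ether in pkt and pkt[Ether].type in (GOOSE_ETHERTYPE, SV_ETHERTYPE):
        kind = "GOOSE" if pkt[Ether].type == GOOSE_ETHERTYPE else "SV"
        oui = pkt[Ether].src.upper()[:8]  # first three octets identify the vendor
        print(f"{kind} frame from {pkt[Ether].src} (OUI {oui})")


# Purely passive capture: no packets are transmitted onto the network.
sniff(iface="eth0", prn=classify, store=False)
```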
Once relays, protocol gateways, and engineering HMIs have been identified, the attacker begins deeper technical enumeration. At this stage, they analyze which services are exposed on each device, such as Telnet, HTTP, FTP, or MMS, and gather banner information or port responses that reveal firmware versions, relay models, or serial numbers. Devices with weak authentication or legacy configurations are prioritized for exploitation.
The attacker may next attempt to log in using factory-set or default credentials, which are often easily obtainable from device manuals. Alarmingly, these credentials are often still active in many substations due to lax commissioning processes. If login is successful, the attacker escalates from passive enumeration to active control—gaining the ability to view or modify protection settings, trip logic, and relay event logs.
If the relays are hardened with proper credentials or access controls, attackers might try other methods, such as accessing rear-panel serial ports via local connections or probing serial-over-IP bridges linked to terminal servers. Some adversaries have even used vendor software (e.g., DIGSI, AcSELerator, PCM600) found on compromised engineering workstations to open relay configuration projects, review programmable logic (e.g., SELogic or FlexLogic), and make changes through trusted interfaces.
Another critical risk in substation environments is the presence of undocumented or hidden device functionality. As highlighted in CISA advisory ICSA-24-095-02, SEL 700-series protection relays were found to contain undocumented capabilities accessible to privileged users.
Separately, some relays may expose backdoor Telnet access through hard-coded or vendor diagnostic accounts. These interfaces are often enabled by default and left undocumented, giving attackers an opportunity to inject firmware, wipe configurations, or issue commands that can directly trip or disable breakers.
By the end of this quiet but highly effective reconnaissance phase, the attacker has mapped out the protection relay landscape, assessed device exposure, and identified access paths. They now shift from understanding the network to understanding what each relay actually controls, entering the next phase: process-aware enumeration.
Process-Aware Enumeration
Once attackers have quietly mapped out the substation network (identifying protection relays, protocol gateways, engineering HMIs, and confirming which devices expose insecure services) their focus shifts from surface-level reconnaissance to gaining operational context. Discovery alone isn’t enough. For any compromise to deliver strategic impact, adversaries must understand how these devices interact with the physical power system.
This is where process-aware enumeration begins. The attacker is no longer interested in just controlling any relay they want to control the right relay. That means understanding what each device protects, how it’s wired into the breaker scheme, and what its role is within the substation topology.
Armed with access to engineering workstations or backup file shares, the attacker reviews substation single-line diagrams (SLDs), often from SCADA HMI screens or documentation from project folders. These diagrams reveal the electrical architecture—transformers, feeders, busbars—and show exactly where each relay fits. Identifiers like “BUS-TIE PROT” or “LINE A1 RELAY” are matched against configuration files to determine their protection zone.
By correlating relay names with breaker control logic and protection settings, the attacker maps out zone hierarchies: primary and backup relays, redundancy groups, and dependencies between devices. They identify which relays are linked to auto-reclose logic, which ones have synch-check interlocks, and which outputs are shared across multiple feeders.
This insight enables precise targeting. For example, instead of blindly disabling protection across the board, which would raise immediate alarms, the attacker may suppress tripping on a backup relay while leaving the primary untouched. Or, they might modify logic in such a way that a fault won’t be cleared until the disturbance propagates, creating the conditions for a wider outage.
At this stage, the attacker is not just exploiting the relay as a networked device; they’re treating it as a control surface for the substation itself. With deep process context in hand, they move from reconnaissance to exploitation: manipulating logic, altering protection thresholds, injecting malicious firmware, or spoofing breaker commands. Because their changes are aligned with the system topology, they maximize impact while minimizing detection.
Practical Examples of Exploiting Protection Relays
The fusion of network awareness and electrical process understanding is what makes modern substation attacks particularly dangerous, and why protection relays, when compromised, represent one of the highest-value cyber-physical targets in the grid.
To illustrate how such knowledge is operationalized by attackers, let’s examine a practical example involving the SEL-311C relay, a device widely deployed across substations. Note: While this example focuses on SEL, the tactics described here apply broadly to comparable relays from other major OEM vendors such as ABB, GE, Siemens, and Schneider Electric. In addition, the information presented in this section does not constitute any unknown vulnerabilities or proprietary information, but instead demonstrates the potential for an attacker to use built-in device features to achieve adversarial objectives.
Figure 2: Attack Vectors for a SEL-311C Protection Relay
Physical Access
If an attacker gains physical access to a protection relay, either through the front panel or by opening the enclosure, they can trigger a hardware override by toggling the internal access jumper, typically located on the relay’s main board. This bypasses all software-based authentication, granting unrestricted command-level access without requiring a login. Once inside, the attacker can modify protection settings, reset passwords, disable alarms, or issue direct breaker commands, effectively assuming full control of the relay.
However, such intrusions can be detected if the right safeguards are in place. Most modern substations incorporate electronic access control systems (EACS) and SCADA-integrated door alarms. If a cabinet door is opened without an authorized user logged as onsite (via badge entry or operator check-in), alerts can be escalated to dispatch field response teams or security personnel.
Relays themselves provide telemetry for physical access events. For instance, SEL relays pulse the ALARM contact output upon use of the 2ACCESS command, even when the correct password is entered. Failed authentication attempts assert the BADPASS logic bit, while SETCHG flags unauthorized setting modifications. These SEL WORDs can be continuously monitored through SCADA or security detection systems for evidence of tampering.
Toggling the jumper to bypass relay authentication typically requires power-cycling the device, a disruptive action that can itself trigger alarms or be flagged during operational review.
To further harden the environment, utilities increasingly deploy centralized relay management suites (e.g., SEL Grid Configurator, GE Cyber Asset Protection, or vendor-neutral tools like Subnet PowerSystem Center) that track firmware integrity, control logic uploads, and enforce version control tied to access control mechanisms.
In high-assurance deployments, relay configuration files are often encrypted, access-restricted, and protected by multi-factor authentication, reducing the risk of rollback attacks or lateral movement even if the device is physically compromised.
Command Interfaces and Targets
With access established, whether through credential abuse, exposed network services, or direct hardware bypass, the attacker is now in a position to issue live commands to the relay. At this stage, the focus shifts from reconnaissance to manipulation, leveraging built-in interfaces to override protection logic and directly influence power system behavior.
Here’s how these attacks unfold in a real-world scenario:
Manual Breaker Operation: An attacker can directly issue control commands to the relay to simulate faults or disrupt operations.
Example commands include:
==>PUL OUT101 5; Pulse output for 5 seconds to trip breaker
=>CLO; Force close breaker
=>OPE; Force open breaker
These commands bypass traditional protection logic, allowing relays to open or close breakers on demand. This can isolate critical feeders, create artificial faults, or induce overload conditions—all without triggering standard fault detection sequences.
Programmable Trip Logic Manipulation
Modern protection relays such as those from SEL (SELogic), GE (FlexLogic), ABB (CAP tools), and Siemens (CFC), support customizable trip logic through embedded control languages. These programmable logic engines enable utilities to tailor protection schemes to site-specific requirements. However, this powerful feature also introduces a critical attack surface. If an adversary gains privileged access, they can manipulate core logic equations to suppress legitimate trips, trigger false operations, or embed stealthy backdoors that evade normal protection behavior.
One of the most critical targets in this logic chain is the Trip Request (TR) output, the internal control signal that determines whether the relay sends a trip command to the circuit breaker.
The TR control equation specifies the fault conditions under which the relay should initiate a trip. Each element represents a protection function or status input, such as a zone distance element, an overcurrent element, or breaker position, and collectively they form the basis of a coordinated relay response.
In the relay operation chain, the TR equation is at the core of the protection logic.
Figure 3: TR Logic Evaluation within the Protection Relay Operation Chain
In SEL devices, for example, this TR logic is typically defined using a SELogic control equation. A representative version might look like this:
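For illustration only, a TR equation built from the elements listed in Table 1 could take the following form; Z1GT is used here as a stand-in for the Zone 1 ground distance trip element, and exact Relay Word bit names and default equations vary by relay model and firmware:
TR = Z1GT + M2PT + Z2GT + 51GT + 51QT + 50P1 * SH0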
Zone 1 Ground distance trip element (written as Z1GT in the sketch above): trips on ground faults within Zone 1
M2PT: Phase distance element from Channel M2, Phase Trip (could be Zone 2)
Z2GT: Zone 2 Ground distance Trip element, for ground faults in Zone 2
51GT: Time-overcurrent element for ground faults (ANSI 51G)
51QT: Time-overcurrent element for negative-sequence current (unbalanced faults)
50P1: Instantaneous phase overcurrent element (ANSI 50P) for Zone 1
SH0: Breaker status input, logic 1 when breaker is closed
Table 1: Elements of TR
In the control equation, the + operator means logical OR, and * means logical AND. Therefore, the logic asserts TR if:
Any of the listed fault elements (distance, overcurrent) are active, or
An instantaneous overcurrent occurs while the breaker is closed.
In effect, the breaker is tripped:
If a phase or ground fault is detected in Zone 1 or Zone 2
If a time-overcurrent condition develops
Or if there’s an instantaneous spike while the breaker is in service
How Attackers Can Abuse the TR Logic
With editing access, attackers can rewrite this logic to suppress protection, force false trips, or inject stealthy backdoors.
Table 2 shows common logic manipulation variants.
Disable All Trips (TR = 0): Relay never trips, even during major faults. Allows sustained short circuits, potentially leading to fires or equipment failure.
Force Constant Tripping (TR = 1, TRQUAL = 0): Relay constantly asserts trip, disrupting power regardless of fault status.
Impossible Condition (TR = 50P1 * !SH0): Breaker only trips when already open, a condition that never occurs.
Remove Ground Fault Detection (TR = M1P + M2PT + 50P1 * SH0): Relay ignores ground faults entirely, a dangerous and hard-to-detect attack.
Hidden Logic Backdoor (TR = original + RB15): Attacker can trigger a trip remotely via RB15 (a Remote Bit), even without a real fault.
Table 2: TR logic bombs
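For context on the hidden backdoor variant, remote bits on SEL platforms are typically operated through the Access Level 2 CONTROL command. The session below is a representative sketch; exact prompts and syntax vary by model:
=>>CON 15 ; select Remote Bit RB15 at Access Level 2
SRB 15 ; set RB15, which now satisfies the backdoored TR equation and trips the breaker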
Disable Trip Unlatching (ULTR)
ULTR = 0
Impact: Prevents the relay from resetting after a trip. The breaker stays open until manually reset, which delays recovery and increases outage durations.
Reclose Logic Abuse
79RI = 1 ; Reclose immediately
79STL = 0 ; Skip supervision logic
Impact: Forces breaker to reclose repeatedly, even into sustained faults. Can damage transformer windings, burn breaker contacts, or create oscillatory failures.
LED Spoofing
LED12 = !TRIP
Impact: Relay front panel shows a “healthy” status even while tripped. Misleads field technicians during visual inspections.
Event Report Tampering
=>EVE; View latest event
=>TRI; Manually trigger report
=>SER C; Clear Sequential Event Recorder
Impact: Covers attacker footprints by erasing evidence. Removes Sequential Event Recorder (SER) logs and trip history. Obstructs post-event forensics.
Change Distance Protection Settings
In the relay protection sequence, distance protection operates earlier in the decision chain, evaluating fault conditions based on impedance before the trip logic is executed to issue breaker commands.
Figure 4: Distance protection settings in a Relay Operation Chain
Impact: Distance protection relies on accurately configured impedance reach (Z1MAG) and impedance angle (Z1ANG) to detect faults within a predefined section of a transmission line (typically 80–100% of line length for Zone 1). Manipulating these values can have the following consequences:
Under-Reaching: Reducing Z1MAG (for example, to 0.3 times the line impedance) causes the relay to detect faults only within roughly 30% of the line length, making it blind to faults in the remaining 70% of the protected zone. This can result in missed trips, delayed fault clearance, and cascading failures if the backup protection does not act in time (hypothetical values are sketched in the settings fragment after this list).
Impedance Angle Misalignment: Changing Z1ANG affects the directional sensitivity and fault classification. If the angle deviates from system characteristics, the relay may misclassify faults or fail to identify high-resistance faults, particularly on complex line configurations like underground cables or series-compensated lines.
False Trips: In certain conditions, especially with heavy load or load encroachment, a misconfigured distance zone may interpret normal load flow as a fault, resulting in nuisance tripping and unnecessary outages.
Compromised Selectivity & Coordination: The distance element’s coordination with other relays (e.g., Zone 2 or remote end Zone 1) becomes unreliable, leading to overlapping zones or gaps in coverage, defeating the core principle of selective protection.
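To make the under-reaching case concrete, the fragment below shows hypothetical before-and-after values for the Zone 1 settings discussed above; the numbers are illustrative and assume a line impedance of about 1.0 ohm secondary at an 84-degree line angle:
; original (illustrative) settings
Z1MAG = 0.80 ; Zone 1 reach covering ~80% of the line
Z1ANG = 84.00 ; impedance angle matched to the line characteristic
; after tampering
Z1MAG = 0.30 ; reach cut to ~30% of the line, so Zone 1 under-reaches
Z1ANG = 45.00 ; angle pulled off the line characteristic, degrading directional sensitivity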
Restore Factory Defaults
=>>R_S
Impact: Wipes all hardened settings, password protections, and customized logic. Resets the relay to an insecure factory state.
Password Modification for Persistence
=>>PAS 1 <newpass>
Impact: Locks out legitimate users. Maintains long-term attacker access. Prevents operators from reversing changes quickly during incident response.
What Most Environments Still Get Wrong
Despite increasing awareness, training, and incident response playbooks, many substations and critical infrastructure sites continue to exhibit foundational security weaknesses. These are not simply oversights—they’re systemic, shaped by the realities of substation lifecycle management, legacy system inertia, and the operational constraints of critical grid infrastructure.
Modernizing substation cybersecurity is not as simple as issuing new policies or buying next-generation tools. Substations typically undergo major upgrades on decade-long cycles, often limited to component replacement rather than full network redesigns. Integrating modern security features like encrypted protocols, central access control, or firmware validation frequently requires adding computers, increasing bandwidth, and introducing centralized key management systems. These changes are non-trivial in bandwidth-constrained environments built for deterministic, low-latency communication—not IT-grade flexibility.
Further complicating matters, vendor product cycles move faster than infrastructure refresh cycles. It’s not uncommon for new protection relays or firmware platforms to be deprecated or reworked before they’re fully deployed across even one utility’s fleet, let alone hundreds of substations.
The result? A patchwork of legacy protocols, brittle configurations, and incomplete upgrades that adversaries continue to exploit. The sections below examine the most common and dangerous gaps observed in real-world environments, why they persist, and what can realistically be done to address them.
Legacy Protocols Left Enabled
Relays often come with older communication protocols such as:
Telnet (unencrypted remote access)
FTP (insecure file transfer)
Modbus RTU/TCP (lacks authentication or encryption)
These are frequently left enabled by default, exposing relays to:
Credential sniffing
Packet manipulation
Unauthorized control commands
Recommendation: Where possible, disable legacy services and transition to secure alternatives (e.g., SSH, SFTP, or IEC 62351 for secured GOOSE/MMS). If older services must be retained, tightly restrict access via VLANs, firewalls, and role-based control.
IT/OT Network Convergence Without Isolation
Modern substations may share network infrastructure with enterprise IT environments:
VPN access to substation networks
Shared switches or VLANs between SCADA systems and relay networks
Lack of firewalls or access control lists (ACLs)
This exposes protection relays to malware propagation, ransomware, or lateral movement from compromised IT assets.
Recommendation: Establish strict network segmentation using firewalls, ACLs, and dedicated protection zones. All remote access should be routed through Privileged Access Management (PAM) platforms with MFA, session recording, and Just-In-Time access control.
Default or Weak Relay Passwords
In red team and audit exercises, default credentials are still found in the field, sometimes printed on the relay chassis itself.
Factory-level passwords like LEVEL2, ADMIN, or OPERATOR remain unchanged.
Passwords are physically labeled on devices.
Password sharing among field teams compromises accountability.
These practices persist due to operational convenience, lack of centralized credential management, and difficulty updating devices in the field.
Recommendation: Mandate site-specific, role-based credentials with regular rotation, enforced via centralized relay management tools. Ensure audit logging of all access attempts and password changes.
Built-in Security Features Left Unused
OEM vendors already provide a suite of built-in security features, yet these are rarely configured in production environments. Security features such as role-based access control (RBAC), secure protocol enforcement (e.g., HTTPS, SSH), user-level audit trails, password retry lockouts, and alert triggers (e.g., BADPASS or SETCHG bits) are typically disabled or ignored during commissioning. In many cases, these features are not even evaluated due to time constraints, lack of policy enforcement, or insufficient familiarity among field engineers.
These oversight patterns are particularly common in environments that inherit legacy commissioning templates, where security features are left in their default or least-restrictive state for the sake of expediency or compatibility.
Recommendation: Security configurations must be explicitly reviewed during commissioning and validated periodically. At a minimum:
Enable RBAC and enforce user-level permission tiers.
Configure BADPASS, ALARM, SETCHG, and similar relay logic bits to generate real-time telemetry.
Use secure protocols (HTTPS, SSH, IEC 62351) where supported.
Integrate security bit changes and access logs into central SIEM or NMS platforms for correlation and alerting.
Engineering Laptops with Stale Firmware Tools
OEM vendors also release firmware updates to fix known security vulnerabilities and bugs. However:
Engineering laptops often use outdated configuration software
Old firmware loaders may upload legacy or vulnerable versions
Security patches are missed entirely
Recommendation: Maintain hardened engineering baselines with validated firmware signing, trusted toolchains, and controlled USB/media usage. Track firmware versions across the fleet for vulnerability exposure.
No Alerting on Configuration or Logic Changes
Protection relays support advanced logic and automation features like SELogic and FlexLogic, but in many environments no alerting is configured for changes. This makes it easy for attackers (or even insider threats) to silently:
Modify protection logic
Switch setting groups
Suppress alarms or trips
Recommendation: Enable relay-side event-based alerting for changes to settings, logic, or outputs. Forward logs to a central SIEM or security operations platform capable of detecting unauthorized logic uploads or suspicious relay behavior.
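As one hedged example on SEL-class relays, the SETCHG and BADPASS bits can be added to a Sequential Event Recorder trigger list so that setting and logic changes produce timestamped events; setting names and list capacities vary by model:
SER1 = SETCHG, BADPASS, ALARM ; timestamp setting changes, failed logins, and alarm pulses in the SER
Entries captured this way can then be retrieved over SCADA or engineering access and correlated in the central SIEM described above.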
Relays Not Included in Security Audits or Patch Cycles
Relays are often excluded from regular security practices:
Not scanned for vulnerabilities
Not included in patch management systems
No configuration integrity monitoring or version tracking
This blind spot leaves highly critical assets unmanaged and potentially exploitable.
Recommendation: Bring protection relays into the fold of cybersecurity governance, with scheduled audits, patch planning, and configuration monitoring. Use tools that can validate settings integrity and detect tampering, whether via vendor platforms or third-party relay management suites.
Physical Tamper Detection Features Not Monitored
Many modern protection relays include hardware-based tamper detection features designed to alert operators when the device enclosure is opened or physically manipulated. These may include:
Chassis tamper switches that trigger digital inputs or internal flags when the case is opened.
Access jumper position monitoring, which can be read via relay logic or status bits.
Power cycle detection, especially relevant when jumpers are toggled (e.g., SEL relays require a power reset to apply jumper changes).
Relay watchdog or system fault flags, indicating unexpected reboots or logic resets post-manipulation.
Despite being available, these physical integrity indicators are rarely wired into the SCADA system or included in alarm logic. As a result, an attacker could open a relay, trigger the access jumper, or insert a rogue SD card—and leave no real-time trace unless other controls are in place.
Recommendation: Utilities should enable and monitor all available hardware tamper indicators:
Wire tamper switches or digital input changes into RTUs or SCADA for immediate alerts.
Monitor ALARM, TAMPER, SETCHG, or similar logic bits in relays that support them (e.g., SEL WORD bits).
Configure alert logic to correlate with badge access logs or keycard systems—raising a flag if physical access occurs outside scheduled maintenance windows.
Include physical tamper status as a part of substation security monitoring dashboards or intrusion detection platforms.
From Oversights to Action — A New Baseline for Relay Security
The previously outlined vulnerabilities aren’t limited to isolated cases; they reflect systemic patterns across substations, utilities, and industrial sites worldwide. As the attack surface expands with increased connectivity, and as adversaries become more sophisticated in targeting protection logic, these security oversights can no longer be overlooked.
But securing protection relays doesn’t require reinventing the wheel. It begins with the consistent application of fundamental security practices, drawn from real-world incidents, red-team assessments, and decades of power system engineering wisdom.
While these practices can be retrofitted into existing environments, it’s critical to emphasize that security is most effective when it’s built in by design, not bolted on later. Retrofitting controls in fragile operational environments often introduces more complexity, risk, and room for error. For long-term resilience, security considerations must be embedded into system architecture from the initial design and commissioning stages.
To help asset owners, engineers, and cybersecurity teams establish a defensible and vendor-agnostic baseline, Mandiant has compiled the “Top 10 Security Practices for Substation Relays,” a focused and actionable framework applicable across protocols, vendors, and architectures.
In developing this list, Mandiant has drawn inspiration from the broader ICS security community—particularly initiatives like the “Top 20 Secure PLC Coding Practices” developed by experts in the field of industrial automation and safety instrumentation. While protection relays are not the same as PLCs, they share many characteristics: firmware-driven logic, critical process influence, and limited error tolerance.
The Top 20 Secure PLC Coding Practices have shaped secure programming conversations for logic-bearing control systems, and Mandiant aims for this “Top 10 Security Practices for Substation Relays” list to serve a similar purpose for the protection engineering domain.
Top 10 Security Practices for Substation Relays
1. Authentication & Role Separation: Prevents unauthorized relay access and privilege misuse. Ensure each user has their own account with only the permissions they need (e.g., Operator, Engineer). Remove default or unused credentials.
2. Secure Firmware & Configuration Updates: Prevents unauthorized or malicious software uploads. Only allow firmware/configuration updates using verified, signed images through secure tools or physical access. Keep update logs.
3. Network Service & Protocol Hardening: Reduces exposure from insecure or unneeded services. Disable unused services like HTTP, Telnet, or FTP. Use authenticated communication for SCADA protocols (IEC 61850, DNP3). Whitelist IPs.
4. Time Synchronization & Logging Protection: Ensures forensic accuracy and prevents log tampering or replay attacks. Use authenticated SNTP or IRIG-B for time. Protect event logs (SER, fault records) from unauthorized deletion or overwrite.
5. Custom Logic Integrity Protection: Prevents logic-based sabotage or backdoors in protection schemes. Monitor and restrict changes to programmable logic (trip equations, control rules). Maintain version history and hash verification.
6. Physical Interface Hardening: Blocks unauthorized access via debug ports or jumpers. Disable, seal, or password-protect physical interfaces like USB, serial, or Ethernet service ports. Protect access jumpers.
7. Redundancy and Failover Readiness: Ensures protection continuity during relay failure or communication outage. Test pilot schemes (POTT, DCB, 87L). Configure redundant paths and relays with identical settings and failover behavior.
8. Remote Access Restrictions & Monitoring: Prevents dormant vendor backdoors and insecure remote control. Disable remote services when not needed. Remove unused vendor/service accounts. Alert on all remote access attempts.
9. Command Supervision & Breaker Output Controls: Prevents unauthorized tripping or closing of breakers. Add logic constraints (status checks, delays, dual-conditions) to all trip/close outputs. Log all manual commands.
10. Centralized Log Forwarding & SIEM Integration: Enables detection of attacks and misconfigurations across systems. Relay logs and alerts should be sent to a central monitoring system (SIEM or historian) for correlation, alerts, and audit trails.
Call to Action
In an era of increasing digitization and escalating cyber threats, the integrity of our power infrastructure hinges on the security of its most fundamental guardians: protection relays. The focus of this analysis is to highlight the criticality of enabling existing security controls and incorporating security as a core design principle for every new substation and upgrade. As sophisticated threat actors, including nation-state-sponsored groups from countries like Russia, China, and Iran, actively target critical infrastructure, the need to secure these devices has never been more urgent.
Mandiant recommends that all asset owners prioritize auditing remote access paths to substation automation systems and investigate the feasibility of implementing the “Top 10 Security Practices for Substation Relays” highlighted in this document. Defenders should also consider building a test relay lab or a relay digital twin (a cloud-based replica of the physical system offered by some relay vendors) for robust security and resilience testing in a safe environment. Using real-time data, organizations can exercise these labs or digital twins to, among other things, test essential subsystem interactions and the repercussions of a system transitioning from a secure state to an insecure state, all without disrupting production. To validate these security controls against a realistic adversary, a Mandiant OT Red Team exercise can safely simulate the tactics, techniques, and procedures used in real-world attacks and assess your team’s detection and response capabilities. By taking proactive steps to harden these vital components, we can collectively enhance the resilience of the grid against a determined and evolving threat landscape.