AWS HealthImaging announces rich hierarchical search per the DICOMweb QIDO-RS standard as well as an improved data management experience. With this launch, HealthImaging automatically organizes image sets into DICOM Study and Series resources. Incoming DICOM SOP instances are automatically merged to the same DICOM Series.
Rich DICOMweb QIDO-RS search capabilities make it easier to find and retrieve data, enabling customers to focus more on empowering end users and less on infrastructure management. HealthImaging’s automatic organization of data by DICOM Studies and Series makes it easier for healthcare and life sciences customers to manage their data at scale by eliminating the need for post-import workflows, saving time and reducing complexity. This helps customers more efficiently organize data and better resolve any inconsistencies. This launch also delivers significant reductions in the last byte latency of DICOMweb WADO-RS APIs, and faster import of large instances (such as digital pathology whole slide imaging).
AWS HealthImaging is generally available in the following AWS Regions: US East (N. Virginia), US West (Oregon), Asia Pacific (Sydney), and Europe (Ireland).
AWS HealthImaging is a HIPAA-eligible service that empowers healthcare providers, life sciences researchers, and their software partners to store, analyze, and share medical images at petabyte scale. To learn more, see the AWS HealthImaging Developer Guide.
Today, AWS HealthImaging announces support for retrieving the metadata for all DICOM instances in a series via a single API action. This new feature extends HealthImaging’s support for the DICOMweb standard, simplifying integrations and improving interoperability with existing applications.
This launch significantly reduces the cost and complexity of retrieving series level metadata, especially when DICOM series contain hundreds or even thousands of instances. With this enhancement, it is easier than ever to retrieve instance metadata with consistent low latency, enabling clinical, AI, and research use cases.
AWS HealthImaging is generally available in the following AWS Regions: US East (N. Virginia), US West (Oregon), Asia Pacific (Sydney), and Europe (Ireland).
AWS HealthImaging is a HIPAA-eligible service that empowers healthcare providers, life sciences researchers, and their software partners to store, analyze, and share medical images at petabyte scale. To learn more, see the AWS HealthImaging Developer Guide.
Today, AWS Control Tower introduces a new ‘Enabled controls’ page, helping customers track, filter, and manage their enabled controls across their AWS Control Tower organization. This enhancement significantly improves visibility and streamlines the management of your AWS Control Tower controls, saving valuable time and reducing the complexity of managing enabled controls. For organizations managing hundreds or thousands of AWS accounts, this feature provides a centralized view of control coverage, making it easier to maintain consistent governance at scale.
Previously, to assess the enabled controls coverage, you had to navigate to the organizational unit (OU) or account details page in the console to track the controls deployed per target. With this release, the Enabled controls view centralizes all the enabled controls across your AWS Control Tower environment, giving you a single, unified location to track, filter, and manage enabled controls. With this new feature, you can now more easily identify gaps in your control coverage. For instance, you can quickly search and filter for all enabled preventive controls and verify if they’re applied consistently across critical OUs.
You can drill down by organizational unit, behavior, severity, and implementation to see exactly which controls are enabled, giving you targeted visibility into your governance posture across your environment. Lastly, you can also get a pre-filtered list of enabled controls by behavior from the AWS Control Tower dashboard’s Controls summary page.
To benefit from the new Enabled controls view page, navigate to the Controls section in your AWS Control Tower console. To learn more, visit the AWS Control Tower homepage or see the AWS Control Tower User Guide. For a full list of AWS Regions where AWS Control Tower is available, see the AWS Region Table.
Amazon Relational Database Service (Amazon RDS) Custom for Oracle now supports R7i and M7i instances. These instances are powered by custom 4th Generation Intel Xeon Scalable processors, available only on AWS. R7i and M7i instances are available in sizes up to 48xlarge, 50% larger than the largest previous-generation R6i and M6i instances.
M7i and R7i instances are available for Amazon RDS Custom for Oracle under the Bring Your Own License model for Oracle Database Enterprise Edition (EE) and Oracle Database Standard Edition 2 (SE2). You can modify your existing RDS instance or create a new instance with just a few clicks in the Amazon RDS Management Console, or by using the AWS SDK or CLI. Visit the Amazon RDS Custom pricing page for pricing details and Region availability.
Amazon RDS Custom for Oracle is a managed database service for legacy, custom, and packaged applications that require access to the underlying operating system and database environment. To get started with Amazon RDS Custom for Oracle, refer to the User Guide.
The next generation of Anthropic’s Claude models, Claude Opus 4 and Claude Sonnet 4, are now available in Amazon Bedrock, representing significant advancements in AI capabilities. These models excel at coding, enable AI agents to analyze thousands of data sources, execute long-running tasks, write high-quality content, and perform complex actions. Both Opus 4 and Sonnet 4 are hybrid reasoning models offering two modes: near-instant responses and extended thinking for deeper reasoning.
Claude Opus 4: Opus 4 is Anthropic’s most powerful Claude model to date and Anthropic’s benchmarks show it is the best coding model available, excelling at autonomously managing complex, multi-step tasks with accuracy. It can independently break down abstract projects, plan architectures, and maintain high code quality throughout extended tasks. Opus 4 is ideal for powering agentic AI applications that require uncompromising intelligence for orchestrating cross-functional enterprise workflows or handling a major code migration for a large codebase.
Claude Sonnet 4: Sonnet 4 is a midsize model designed for high-volume use cases and can function effectively as a task-specific sub-agent within broader AI systems. It efficiently handles specific tasks like code generation, search, data analysis, and content synthesis, making it well suited for production AI applications requiring a balance of quality, cost-effectiveness, and responsiveness.
You can now use both Claude 4 models in Amazon Bedrock. To get started, visit the Amazon Bedrock console and integrate them into your applications using the Amazon Bedrock API or SDK. For more information, including Region availability, see the AWS News Blog, Anthropic’s Claude in Amazon Bedrock product page, and the Amazon Bedrock pricing page.
Amazon Managed Service for Prometheus, a fully managed Prometheus-compatible monitoring service, now provides the capability to identify expensive PromQL queries, and limit their execution. This enables customers to monitor and control the types of queries being issued against their Amazon Managed Service for Prometheus workspaces.
Customers have highlighted the need for tighter governance controls over queries, specifically around high-cost queries. You can now monitor queries above a certain Query Samples Processed (QSP) threshold and log those queries to Amazon CloudWatch. The information in the vended logs allows you to identify expensive queries; it includes the PromQL query and metadata about where the query originated, such as Grafana dashboard IDs or alerting rules. In addition, you can now set warning or error thresholds for query execution. To control query cost, you can preempt the execution of expensive queries by providing an error threshold in the HTTP headers to the QueryMetrics API. Alternatively, if you set a warning threshold, the service returns the query results, charges you for the QSP, and returns a warning to the end user that the query is more expensive than the limit set by your workspace administrator.
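To make this concrete, here is a minimal sketch of passing such a threshold when querying a workspace endpoint with SigV4-signed requests. The header names below are illustrative placeholders (the user guide documents the exact ones), and the workspace ID is made up:

```python
# pip install boto3 requests
import boto3
import requests
from urllib.parse import urlencode
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

region = "us-east-1"
workspace_id = "ws-EXAMPLE0-1234-5678-abcd-EXAMPLE00000"  # placeholder
url = (f"https://aps-workspaces.{region}.amazonaws.com"
       f"/workspaces/{workspace_id}/api/v1/query?"
       + urlencode({"query": "sum(rate(http_requests_total[5m]))"}))

# Illustrative header names only -- check the user guide for the real ones.
headers = {
    "x-amzn-query-samples-processed-error-threshold": "100000000",
    "x-amzn-query-samples-processed-warning-threshold": "10000000",
}

# Sign the request with SigV4 for the "aps" service, then send it.
req = AWSRequest(method="GET", url=url, headers=headers)
SigV4Auth(boto3.Session().get_credentials(), "aps", region).add_auth(req)
resp = requests.get(url, headers=dict(req.headers))
print(resp.status_code, resp.json())
```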
This feature is now available in all regions where Amazon Managed Service for Prometheus is generally available.
To learn more about Amazon Managed Service for Prometheus, visit the user guide or product page.
In today’s data-driven world, understanding large datasets often requires numerous, complex non-additive [1] aggregation operations. But as the size of the data becomes massive [2], these types of operations become computationally expensive and time-consuming using traditional methods. That’s where Apache DataSketches comes in. We’re excited to announce the availability of Apache DataSketches functions within BigQuery, providing powerful tools for approximate analytics at scale.
Apache DataSketches is an open-source library of sketches: specialized streaming algorithms that efficiently summarize large datasets. Sketches are small probabilistic data structures that enable accurate estimates of distinct counts, quantiles, histograms, and other statistical measures – all with minimal memory, minimal computational overhead, and a single pass through the data. All but a few of these sketches provide mathematically proven error bounds, i.e., the maximum possible difference between a true value and its estimated or approximated value. These error bounds can be adjusted by the user as a trade-off between the size of the sketch and the size of the error bounds: the larger the configured sketch, the smaller the error bounds.
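To make that trade-off concrete, here is the textbook relative standard error (RSE) for a classic HyperLogLog-style estimator with m registers; this is a general property of the algorithm family, and the library documents the exact bounds for each of its own sketches:

```latex
\mathrm{RSE} \approx \frac{1.04}{\sqrt{m}}, \qquad m = 2^{lg\_k}
```

With lg_k = 12 (m = 4096 registers, a few kilobytes of state), the RSE is roughly 1.04/64, about 1.6%; quadrupling the sketch size halves the error.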
With sketches, you can quickly gain insights from massive datasets, especially when exact computations are impractical or impossible. The sketches themselves can be merged, making them additive and highly parallelizable, so you can combine sketches from multiple datasets for further analysis. This combination of small size and mergeability can translate into orders-of-magnitude improvement in speed of computational workload compared to traditional methods.
Why DataSketches in BigQuery?
BigQuery is known for its ability to process petabytes of data, and DataSketches are a natural fit for this environment. With DataSketches functions, BigQuery lets you:
Perform rapid approximate queries: Get near-instantaneous results for distinct counts, quantile analysis, adaptive histograms and other non-additive aggregate calculations on massive datasets.
Save on resources: Reduce query costs and storage requirements by working with compact sketches instead of raw data.
Move between systems: DataSketches have well-defined stored binary representations that let sketches be transported between systems and interpreted by three major languages: Java, C++, and Python, all without losing any accuracy.
Apache DataSketches comes to BigQuery through custom C++ implementations: the Apache DataSketches C++ core library is compiled to WebAssembly (WASM) libraries, which are then loaded within BigQuery JavaScript user-defined aggregate functions (JS UDAFs).
How BigQuery customers use Apache DataSketches
Yahoo started the Apache DataSketches project in 2011, open-sourced it in 2015, and still uses the library today for approximate results in analytic query operations such as count distinct, quantiles, and most frequent items (a.k.a. heavy hitters). More recently, Yahoo adapted the DataSketches library to the large scale of BigQuery, using the Google-defined JavaScript user-defined aggregate function (UDAF) interface to the BigQuery platform.
“Yahoo has successfully used the Apache DataSketches library to analyze massive data in our internal production processing systems for more than 10 years. Data sketching has allowed us to respond to a wide range of queries summarizing data in seconds, at a fraction of the time and cost of brute-force computation. As an early innovator in developing this powerful technology, we are excited about this fast, accurate, large-scale, open-source technology becoming available to those already working in a Google Cloud BigQuery environment.” – Matthew Sajban, Director of Software Development Engineering, Yahoo
Featured sketches
So, what can you do with Apache DataSketches? Let’s take a look at the sketches integrated with BigQuery.
Cardinality sketches
Hyper Log Log Sketch (HLL): The DataSketches library implements this historically famous sketch algorithm with lots of versatility. It is best suited for straightforward distinct counting (or cardinality) estimation. It can be adapted to a range of sizes from roughly 50 bytes to about 2MB depending on the accuracy requirements. It also comes in three flavors: HLL_4, HLL_6, HLL_8 that enable additional tuning of speed and size.
Theta Sketch: This sketch specializes in set expressions and allows not only normal additive unions but also full set expressions between sketches with set-intersection and set-difference. Because of its algebraic capability, this sketch is one of the most popular sketches. It has a range of sizes from a few hundred bytes to many megabytes, depending on the accuracy requirements.
CPC Sketch: This cardinality sketch takes advantage of recent algorithmic research and achieves a smaller stored size than the classic HLL sketch at the same accuracy. It is targeted at situations where accuracy per stored size is the most critical metric.
Tuple Sketch: This extends Theta Sketch to enable the association of other values with each unique item retained by the sketch. This allows the computation of summaries of attributes like impressions or clicks as well as more complex analysis of customer engagement, etc.
Quantile sketches
KLL Sketch: This sketch is designed for quantile estimation (e.g., median, percentiles), and is ideal for understanding distributions, creating density and histogram plots, and partitioning large datasets. The KLL algorithm used in this sketch has been proven to have statistically optimal quantile approximation accuracy for a given size. The KLL Sketch can be used with any kind of data that is comparable, i.e., has a defined sorting order between items. The accuracy of KLL is insensitive to the input data distribution.
REQ Sketch: This quantile sketch is designed for situations where accuracy at the ends of the rank domain is more important than at the median. In other words, if you’re most interested in accuracy at the 99.99th percentile and not so interested in the accuracy at the 50th percentile, this is the sketch to choose. Like the KLL Sketch, this sketch has mathematically proven error bounds. The REQ sketch can be used with any kind of data that is comparable, i.e., has a defined sorting order between items. By design, the accuracy of REQ is sensitive to how close an item is to the ends of the normalized rank domain (i.e., close to rank 0.0 or rank 1.0), otherwise it is insensitive to the input distribution.
T-Digest Sketch: This is also a quantile sketch, but it’s based on a heuristic algorithm and doesn’t have mathematically proven error properties. It is also limited to strictly numeric data. The accuracy of the T-Digest Sketch can be sensitive to the input data distribution. However, it’s a very good heuristic sketch, fast, has a small footprint, and can provide excellent results in most situations.
Frequency sketches
Frequent Items Sketch: This sketch is also known as a Heavy-Hitter sketch. Given a stream of items, this sketch identifies, in a single pass, the items that occur more frequently than a noise threshold, which is user-configured by the size of the sketch. This is especially useful in real-time situations. For example, what are the most popular items from a web site that are being actively queried, over the past hour, day, or minute? Its output is effectively an ordered list of the most frequently visited items. This list changes dynamically, which means you can query the sketch, say, every hour to help you understand the query dynamics over the course of a day. In static situations, for example, it can be used to discover the largest files in your database in a single pass and with only a modest amount of memory.
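To get a feel for the single-pass model outside of BigQuery, here is a minimal sketch using the Apache DataSketches Python package (pip install datasketches), assuming its frequent_strings_sketch API; the stream and item names are invented for the demo:

```python
# pip install datasketches
import random
from datasketches import frequent_strings_sketch, frequent_items_error_type

fi = frequent_strings_sketch(6)  # lg_max_map_size = 6 -> ~64 tracked items

# Simulate a click stream: two heavy hitters buried in long-tail noise.
for _ in range(100_000):
    r = random.random()
    if r < 0.20:
        fi.update("/home")
    elif r < 0.30:
        fi.update("/checkout")
    else:
        fi.update(f"/page/{random.randint(0, 50_000)}")

# NO_FALSE_POSITIVES: every item returned is guaranteed to be frequent.
for item, estimate, lower, upper in fi.get_frequent_items(
        frequent_items_error_type.NO_FALSE_POSITIVES):
    print(item, estimate, lower, upper)
```

The sketch never stores the full stream; it keeps only the candidate heavy hitters plus error bounds, which is why a single pass and a modest amount of memory suffice.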
How to get started
To leverage the power of DataSketches in BigQuery, you can find the new functions within the bqutil.datasketches dataset (for US multi-region location) or bqutil.datasketches_<bq_region> dataset (for any other regions and locations). For detailed information on available functions and their usage, refer to the DataSketches README. You can also find demo notebooks in our GitHub repo for the KLL Sketch, Theta Sketch, and FI Sketch.
Example: Obtaining estimates of Min, Max, Median, 75th, 95th percentiles and total count using the KLL Quantile Sketch
Suppose you have 1 million comparable [3] records in 100 different partitions or groups. You would like to understand how the records are distributed by their percentile or rank, without having to bring them all together in memory or even sort them.
SQL:
```sql
## Creating sample data with 1 million records split into 100 groups of nearly equal size

CREATE TEMP TABLE sample_data AS
SELECT
  CONCAT("group_key_", CAST(RAND() * 100 AS INT64)) AS group_key,
  RAND() AS x
FROM
  UNNEST(GENERATE_ARRAY(1, 1000000));

## Creating KLL merge sketches for a group key

CREATE TEMP TABLE agg_sample_data AS
SELECT
  group_key,
  COUNT(*) AS total_count,
  bqutil.datasketches.kll_sketch_float_build_k(x, 250) AS kll_sketch
FROM sample_data
GROUP BY group_key;

## Merge group-based sketches into a single sketch, then get approximate quantiles

WITH agg_data AS (
  SELECT
    bqutil.datasketches.kll_sketch_float_merge_k(kll_sketch, 250) AS merged_kll_sketch,
    SUM(total_count) AS total_count
  FROM agg_sample_data
)
SELECT
  bqutil.datasketches.kll_sketch_float_get_quantile(merged_kll_sketch, 0.0, true) AS minimum,
  bqutil.datasketches.kll_sketch_float_get_quantile(merged_kll_sketch, 0.5, true) AS p50,
  bqutil.datasketches.kll_sketch_float_get_quantile(merged_kll_sketch, 0.75, true) AS p75,
  bqutil.datasketches.kll_sketch_float_get_quantile(merged_kll_sketch, 0.95, true) AS p95,
  bqutil.datasketches.kll_sketch_float_get_quantile(merged_kll_sketch, 1.0, true) AS maximum,
  total_count
FROM agg_data;
```
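If you prefer to drive this from Python, here is a minimal sketch using the google-cloud-bigquery client, assuming default credentials and a default project are configured; it runs a condensed version of the query above:

```python
# pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

# Condensed version of the example above: build one KLL sketch over 100k
# values and read the approximate median straight back out.
sql = """
SELECT
  bqutil.datasketches.kll_sketch_float_get_quantile(
    bqutil.datasketches.kll_sketch_float_build_k(CAST(x AS FLOAT64), 250),
    0.5, true) AS p50
FROM UNNEST(GENERATE_ARRAY(1, 100000)) AS x
"""
for row in client.query(sql).result():
    print(f"approximate median: {row.p50:,.1f}")  # expect ~50,000
```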
The DataSketches Tuple Sketch is a powerful tool to analyze properties that have a natural association with unique identifiers.
For example, imagine you have a large-scale web application that records user identifiers and their clicks on various elements. You would like to analyze this massive dataset efficiently to obtain approximate metrics for clicks per unique user. The Tuple Sketch computes the number of unique users and allows you to track additional properties that are naturally associated with the unique identifiers as well.
SQL:
```sql
## Creating sample data with 100M records (1 through 100M) split into nearly equal sized groups

CREATE TEMP TABLE sample_data_100M AS
SELECT
  CONCAT("group_key_", CAST(RAND() * 10 AS INT64)) AS group_key,
  1000000 * x2 + x1 AS user_id,
  x2 AS clicks
FROM UNNEST(GENERATE_ARRAY(1, 1000000)) AS x1,
  UNNEST(GENERATE_ARRAY(0, 99)) AS x2;

## Creating Tuple sketches for a group key (the group key can be any dimension,
## for example date, product, location, etc.)

CREATE TEMP TABLE agg_sample_data_100M AS
SELECT
  group_key,
  COUNT(DISTINCT user_id) AS exact_uniq_users_ct,
  SUM(clicks) AS exact_clicks_ct,
  bqutil.datasketches.tuple_sketch_int64_agg_int64(user_id, clicks) AS tuple_sketch
FROM sample_data_100M
GROUP BY group_key;

## Merge group-based sketches into a single sketch, then extract metrics such as the
## distinct count estimate and the estimate of the sum of clicks with its bounds.

WITH agg_data AS (
  SELECT
    bqutil.datasketches.tuple_sketch_int64_agg_union(tuple_sketch) AS merged_tuple_sketch,
    SUM(exact_uniq_users_ct) AS total_uniq_users_ct
  FROM agg_sample_data_100M
)
SELECT
  total_uniq_users_ct,
  bqutil.datasketches.tuple_sketch_int64_get_estimate(merged_tuple_sketch) AS distinct_count_estimate,
  bqutil.datasketches.tuple_sketch_int64_get_sum_estimate_and_bounds(merged_tuple_sketch, 2) AS sum_estimate_and_bounds
FROM agg_data;

## The average clicks per unique user can be obtained by simple division.
## Note: the number of digits of precision in the estimates above is due to the
## returned values being floating point.
```
In short, DataSketches in BigQuery unlocks a new dimension of approximate analytics, helping you gain valuable insights from massive datasets quickly and efficiently. Whether you’re tracking website traffic, analyzing user behavior, or performing any other large-scale data analysis, DataSketches are your go-to tools for fast, accurate estimations.
To start using DataSketches in BigQuery, refer to the DataSketches-BigQuery repository README for building, installing and testing the DataSketches-BigQuery library in your own environment. In each sketch folder there is a README that details the specific function specifications available for that sketch.
If you are working in a BigQuery environment, the DataSketches-BigQuery library is already available for you to use in all regional public BigQuery datasets.
[1] Examples include distinct counting, quantiles, topN, K-means, density estimation, graph analysis, etc. The results from one parallel partition cannot simply be "added" to the results of another partition, thus the term non-additive (a.k.a. non-linear operations).
[2] Massive: typically, much larger than what can conveniently be kept in random-access memory.
[3] Any two items can be compared to establish their order, i.e., if A < B, then A precedes B.
Want to save some money on large AI training? For a typical PyTorch LLM training workload that spans thousands of accelerators for several weeks, a 1% improvement in ML Goodput can translate to more than a million dollars in cost savings [1]. Therefore, improving ML Goodput is an important goal for model training — both from an efficiency perspective, as well as for model iteration velocity.
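As a sanity check on that claim, here is the back-of-the-envelope arithmetic under the footnote's assumptions, with a placeholder GPU-hour price (not actual pricing):

```python
# Back-of-the-envelope check of the "1% Goodput ~ $1M" claim.
# Assumed inputs (illustrative, not actual cloud pricing):
gpus = 20_000              # accelerators in the job (from footnote 1)
weeks = 8                  # training duration (from footnote 1)
price_per_gpu_hour = 5.00  # assumed $/GPU-hour; substitute real pricing

hours = weeks * 7 * 24                          # 1,344 hours
total_cost = gpus * hours * price_per_gpu_hour  # ~$134.4M for the full run
savings_per_point = total_cost * 0.01           # value of 1% more Goodput

print(f"total compute cost: ${total_cost:,.0f}")
print(f"1% Goodput improvement: ${savings_per_point:,.0f}")
# -> $1,344,000: more than a million dollars per percentage point
```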
However, there are several challenges to improving ML Goodput today: frequent interruptions that necessitate restarts from the latest checkpoint, slow inline checkpointing that interrupts training, and limited observability that makes it difficult to detect failures. These issues contribute to a significant increase in the time-to-market (TTM) and cost-to-train. There have been several industry publications articulating these issues, e.g., this Arxiv paper.
Improving ML Goodput
In order to improve ML Goodput, you need to minimize the impact of disruptive events on the progress of the training workload. To resume a job quickly, you can automatically scale down the job, or replace failed resources from spare capacity. At Google Cloud, we call this elastic training. Further, you can reduce workload interruptions during checkpointing and speed up checkpoint loads after failures by restoring from the nearest available storage location. We call these capabilities asynchronous checkpointing and multi-tier checkpointing.
The following picture illustrates how these techniques provide an end-to-end remediation workflow to improve ML Goodput for training. An example workload of nine nodes is depicted with three-way data parallelism (DP) and three-way pipeline parallelism (PP), with various remediation actions shown based on the failures and spare capacity.
You can customize the remediation policy for your specific workload. For example, you can choose between a hotswap and a scaling-down remediation strategy, or to configure checkpointing frequency, etc. A supervisor process receives failure, degradation, and straggler signals from a diagnostic service. The supervisor uses the policy to manage these events. In case of correctable errors, the supervisor might request an in-job restart, potentially restoring from a local checkpoint. For uncorrectable hardware failures, a hot swap can replace the faulty node, potentially restoring from a peer checkpoint. If no spare resources are available, the system can scale down. These mechanisms ensure training is more resilient and adaptable to resource changes. When a replacement node is available, training scales up automatically to maximize GPU utilization. During scale down and scale up, user-defined callbacks help adjust hyperparameters such as learning rate and batch size. You can set remediation policies using a Python script.
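As an illustration of what such a policy script might express, here is a hypothetical sketch; the keys, values, and callback signature are invented for illustration and are not the recipe's actual API:

```python
# Hypothetical remediation policy -- names and structure are illustrative,
# not the actual API shipped with the ML Goodput optimization recipe.
REMEDIATION_POLICY = {
    "on_correctable_error": "in_job_restart",           # restore from local checkpoint
    "on_hardware_failure": ["hot_swap", "scale_down"],  # try spare nodes first
    "scale_up_when_capacity_returns": True,
    "checkpointing": {"mode": "async_multi_tier", "interval_minutes": 30},
}

def on_resize(new_dp_degree: int, base_lr: float, base_global_batch: int) -> dict:
    """User-defined callback: rescale hyperparameters when the data-parallel
    degree changes (linear scaling shown; assumes 3-way DP at launch)."""
    scale = new_dp_degree / 3
    return {
        "learning_rate": base_lr * scale,
        "global_batch_size": int(base_global_batch * scale),
    }
```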
Let’s take a deeper look at the key techniques you can use when optimizing ML Goodput.
Elastic training
Elastic training enhances the resiliency of LLM training by enabling failure sensing and mitigation capabilities for workloads. This allows jobs to automatically continue with remediation strategies including GPU reset, node hot swap, and scaling down the data-parallel dimension of a workload to avoid using faulty nodes, thereby reducing job interruption time and improving ML Goodput. Furthermore, elastic training enables automatic scaling up of data-parallel replicas when replacement nodes become available, maximizing training throughput.
Watch this short video to see elastic training techniques in action:
Optimized checkpointing
Sub-optimal checkpointing can lead to unnecessary overhead during training and significant loss of training productivity when interruptions occur and previous checkpoints are restored. You can substantially reduce these impacts by defining a dedicated asynchronous checkpointing process and optimizing it to quickly offload the training state from GPU high-bandwidth memory to host memory. Tuning the checkpoint frequency — based on factors such as the job interruption rate and the asynchronous overhead — is vital, as the best interval may range from several hours to mere minutes, depending on the workload and cluster size. An optimal checkpoint frequency minimizes both checkpoint overhead during normal training operation and computational loss during unexpected interruptions.
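A common starting point for that tuning is the Young/Daly approximation, which balances checkpoint overhead against the expected recomputation after a failure. A minimal sketch, with illustrative numbers:

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young/Daly first-order approximation of the checkpoint interval that
    minimizes expected lost work plus checkpoint overhead: sqrt(2 * C * MTBF)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# Example: 30 s of checkpoint overhead on the training loop, and a
# cluster-wide interruption every 8 hours on average.
interval = optimal_checkpoint_interval(checkpoint_cost_s=30, mtbf_s=8 * 3600)
print(f"checkpoint every ~{interval / 60:.0f} minutes")  # ~22 minutes
```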
A robust way to meet the demands of frequent checkpointing is to leverage three levels of storage: local node storage, e.g., local SSD; peer node storage in the same cluster; and Google Cloud Storage. This multi-tiered checkpointing approach automatically replicates data across these storage tiers during save and restore operations via the host network interface or NCCL (the NVIDIA Collective Communications Library), allowing the system to use the fastest accessible storage option. By combining asynchronous checkpointing with a multi-tier storage strategy, you can achieve quicker recovery times and more resilient training workflows while maintaining high productivity and minimizing the loss of computational progress.
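The restore side of this strategy can be pictured as a simple tier-preference rule: use the fastest tier that actually holds a complete checkpoint. This is an illustrative sketch of the idea, not the actual implementation:

```python
from typing import Dict, Optional

TIERS = ["local_ssd", "peer_node", "gcs"]  # fastest to slowest

def pick_restore_tier(available: Dict[str, bool]) -> Optional[str]:
    """available maps tier name -> whether a complete checkpoint exists there."""
    for tier in TIERS:
        if available.get(tier):
            return tier
    return None  # no checkpoint anywhere: restart from step 0

print(pick_restore_tier({"local_ssd": False, "peer_node": True, "gcs": True}))
# -> "peer_node": a node that lost its local copy restores from a replica
```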
Watch this short video to see optimized checkpointing techniques in action:
These ML Goodput improvement techniques leverage NVIDIA Resiliency Extension, which provides failure signaling and in-job restart capabilities, as well as recent improvements to PyTorch’s distributed checkpointing, which support several of the previously mentioned checkpoint-related optimizations. Further, these capabilities are integrated with Google Kubernetes Engine (GKE) and the NVIDIA NeMo training framework, pre-packaged into a container image and available with an ML Goodput optimization recipe for easy deployment.
Elastic training in action
In a recent internal case study with 1,024 A3 Mega GPU-accelerated instances (built on NVIDIA Hopper), workload ML Goodput improved from 80%+ to 90%+ using a combination of these techniques. While every workload may not benefit in the same way, this table shows the specific metric improvements and ML Goodput contribution of each of the techniques.
Example: Case study experiment used an A3 Mega cluster with 1024 GPUs running ~40hr jobs with ~5 simulated interruptions per day
Conclusion
In summary, elastic training and optimized checkpointing, along with easy deployment options, are key strategies to maximize ML Goodput for large PyTorch training workloads. As seen from the case study above, they can contribute to meaningful ML Goodput improvements and provide significant efficiency savings. These capabilities are customizable and composable through a Python script. If you’re running PyTorch GPU training workloads on Google Cloud today, we encourage you to try out our ML Goodput optimization recipe, which provides a starting point with recommended configurations for elastic training and checkpointing. We hope you have fun building and share your feedback!
Various teams and individuals within Google Cloud contributed to this effort. Special thanks to – Jingxin Ye, Nicolas Grande, Gerson Kroiz, and Slava Kovalevskyi, as well as our collaborative partners – Jarek Kazmierczak, David Soto, Dmitry Kakurin, Matthew Cary, Nilay Goyal and Parmita Mehta for their immense contributions to developing all of the components that made this project a success.
[1] Assuming A3 Ultra pricing for 20,000 GPUs with jobs spanning 8 weeks or longer
Confidential Computing has redefined how organizations can securely process their sensitive workloads in the cloud. The growth in our hardware ecosystem is fueling a new wave of adoption, enabling customers to use Confidential Computing to support cutting-edge uses such as building privacy-preserving AI and securing multi-party data analytics.
We are thrilled to share our latest Confidential Computing innovations, highlighting the creative ways our customers are using Confidential Computing to protect their most sensitive workloads including AI workloads.
Building on our foundational work last year, we’ve seen remarkable progress through our deep collaborations with industry leaders including Intel, AMD, and NVIDIA. Together, we’ve significantly expanded the reach of Confidential Computing, embedding critical security features across the latest generations of CPUs, and also extending them to high-performance GPUs.
Confidential VMs and GKE Nodes with NVIDIA H100 GPUs for AI workloads, in preview
An ongoing, top goal for Confidential Computing is to expand our capabilities for secure computation.
Last year, we unveiled Confidential Virtual Machines on the accelerator-optimized A3 machine series with NVIDIA H100 GPUs, extending hardware-based data protection from the CPU to GPUs. Confidential VMs can help ensure the confidentiality and integrity of artificial intelligence, machine learning, and scientific simulation workloads using protected GPUs while the data is in use.
“AI and Agentic workflows are accelerating and transforming every aspect of business. As these technologies are integrated into the fabric of everyday operations — data security and protection of intellectual property are key considerations for businesses, researchers and governments,” said Daniel Rohrer, vice president, software product security, NVIDIA. “Putting data and model owners in direct control of their data’s journey — NVIDIA’s Confidential Computing brings advanced hardware-backed security for accelerated computing providing more confidence when creating and adopting innovative AI solutions and services.”
Confidential Vertex AI Workbench, in preview
We are expanding Confidential Computing support on Vertex AI. Confidential Computing for Vertex AI Workbench is now in preview, letting customers enhance the privacy and confidentiality of their data with just a few clicks.
How to enable Confidential VMs in Vertex AI Workbench instances.
Confidential Space with Intel TDX (generally available) and NVIDIA H100 GPUs, in preview
We are excited to announce that Confidential Space is now generally available on the general-purpose C3 machine series with Intel® Trust Domain Extensions (Intel® TDX) technology, and coming soon in preview on the accelerator-optimized A3 machine series with NVIDIA H100 GPUs.
Built on our Confidential Computing portfolio, Confidential Space provides a secure enclave, also known as a Trusted Execution Environment (TEE), that Google Cloud customers can use for privacy-focused applications such as joint data analysis, joint machine learning (ML) model training or secure sharing of proprietary ML models.
Importantly, Confidential Space is designed to protect data from all parties involved — including removing the operator of the environment from the trust boundary along with hardened protection against cloud service provider access. These properties can help organizations harden their products from insider threats, and ultimately provide stronger data privacy guarantees to their own customers.
Confidential Space enables secure collaboration.
Confidential GKE Nodes on C3 machines with Intel TDX and built-in acceleration, generally available
Confidential GKE Nodes are now generally available with Intel TDX. These nodes are powered by the general purpose C3 machine series, which run on the 4th generation Intel Xeon Scalable processors (code-named Sapphire Rapids) and have the Intel Advanced Matrix Extensions (Intel AMX) built in and on by default.
Confidential GKE Nodes with Intel TDX offer an additional isolation layer from the host and hypervisor, protecting nodes against a broad range of software and hardware attacks.
“Intel Xeon processors deliver outstanding performance and value for many machine learning and AI inference workloads, especially with Intel AMX acceleration,” said Anand Pashupathy, vice president and general manager, Security Software and Services, Intel. “Google Cloud’s C3 machine series will not only impress with their performance on AI and other workloads, but also protect the confidentiality of the user’s data.”
How to enable Confidential GKE Nodes with Intel TDX.
Confidential GKE Nodes on N2D machines with AMD SEV-SNP, generally available
Confidential GKE Nodes are also now generally available with AMD Secure Encrypted Virtualization-Secure Nested Paging (AMD SEV-SNP) technology. These nodes use the general purpose N2D machine series and run on the 3rd generation AMD EPYC™ (code-named Milan) processors. Confidential GKE Nodes with AMD SEV-SNP provide security for cloud workloads through the assurance that workloads run encrypted on secured hardware.
Confidential VMs on C4D machines with AMD SEV, in Preview
The C4D machine series are powered by the 5th generation AMD EPYC™ (code-named Turin) processors and designed to deliver optimal, reliable, and consistent performance with Google’s Titanium hardware.
Today, we offer global availability of Confidential Computing on AMD machine families such as N2D, C2D, and C3D. We’re happy to share that Confidential VMs on the general purpose C4D machine series with AMD Secure Encrypted Virtualization (AMD SEV) technology are in preview today, and will be generally available soon.
Unlocking new use cases with Confidential Computing
We’re seeing impact across all major verticals where organizations are using Confidential Computing to unlock business innovations.
AiGenomix
AiGenomix is leveraging Google Cloud Confidential Computing to deliver highly differentiated infectious disease surveillance, early detection of cancer, and therapeutics intelligence with a global ecosystem of collaborators in the public and private sector.
“Our customers are dealing with extremely sensitive data about pathogens. Adding relevant data sets like patient information and personalized therapeutics further adds to the complexity of compliance. Preserving privacy and security of pathogens, patients’ genomic and related health data assets is a requirement for our customers and partners,” said Dr. Jonathan Monk, head of bioinformatics, AiGenomix.
“Our Trusted AI for Healthcare solutions leveraging Google Cloud Confidential Computing overcome the barriers to accelerated global adoption by making sure that our assets and processes are secure and compliant. With this, we are able to contribute towards the mitigation of the ever-growing risk emerging from infectious diseases and drug resistance resulting in loss of lives and livelihood,” said Dr. Harsh Sharma, chief AI strategist, AiGenomix.
Google Ads
Google Ads has introduced confidential matching to securely connect customers’ first-party data for their marketing. This marks the first use of Confidential Computing in Google Ads products, and there are plans to bring this privacy-enhancing technology to more products over time.
“Confidential matching is now the default for any data connections made for Customer Match including Google Ads Data Manager — with no action required from you. For advertisers with very strict data policies, it also means the ability to encrypt the data yourself before it ever leaves your servers,” said Kamal Janardhan, senior director, Product Management, Measurement, Google Ads.
Google Ads plans to further integrate Confidential Computing across more services, such as the new Google tag gateway for advertisers. This update will give marketers conversion tag data encrypted in the browser, by default, and at no extra cost. The Google tag gateway for advertisers can help drive performance improvements and strengthen the resilience of advertisers’ measurement signals, while also boosting security and increasing transparency on how data is collected and processed.
Swift
Swift is using Confidential Computing to ensure that sensitive data from some of the largest banks remains completely private while powering a money laundering detection model.
“We are exploring how to leverage the latest technologies to build a global anomaly detection model that is trained on the historic fraud data of an entire community of institutions in a secure and scalable way. With a community of banks we are exploring an architecture which leverages Google Cloud Confidential Computing and verifiable attestation, so participants can ensure that their data is secure even during computation as they locally train the global model and rely on verifiable attestation to ensure the security posture of every environment in the architecture,” said Rachel Levi, head of artificial intelligence, Swift.
Expedite your Confidential Compute journey with Gemini Cloud Assist, in preview
To make it easy for you to use Confidential Computing, we’re providing AI-powered assistance directly in existing configuration workflows by integrating Gemini Cloud Assist across Confidential Computing, now in preview.
Through natural language chat, Google Cloud administrators can get tailored explanations, recommendations, and step-by-step guidance for many security and compliance tasks. One such example is Confidential Space, where Gemini Cloud Assist can guide you through the journey of setting up the environment as a Workload Author, Workloads Operator, or a Data Collaborator. This significantly reduces the complexity and the time to set up such an environment for organizations.
Gemini Cloud Assist for Confidential Space
Next steps
By continuously innovating and collaborating, we’re committed to making Confidential Computing the cornerstone of a secure and thriving cloud ecosystem.
Our latest video covers several creative ways organizations are using Confidential Computing to move their AI journeys forward. You can watch it here.
Amazon Aurora Global Database now supports adding up to 10 secondary Regions to your global cluster, further enhancing scalability and availability for globally distributed applications.
With Global Database, a single Aurora cluster can span multiple AWS Regions, providing disaster recovery from Region-wide outages and enabling fast local reads for globally distributed applications. This launch raises the number of secondary Regions that can be added to a global cluster from the previously supported limit of 5 to 10, providing a larger global footprint for operating your applications. See the documentation to learn more about Global Database.
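For example, adding one more secondary Region to an existing global cluster might look like the following with boto3; the identifiers and engine version below are placeholders:

```python
# Minimal sketch: attach a new secondary Region to an existing Aurora
# global cluster. Identifiers and versions are placeholders.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")  # the new secondary Region
rds.create_db_cluster(
    DBClusterIdentifier="my-app-secondary-euw1",
    GlobalClusterIdentifier="my-app-global",  # existing global cluster
    Engine="aurora-postgresql",
    EngineVersion="15.4",                     # must match the primary cluster
)
# Then add reader instances to the new cluster with create_db_instance.
```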
Amazon Aurora combines the performance and availability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. To get started with Amazon Aurora, take a look at our getting started page.
Amazon Elastic Kubernetes Service (EKS) announces the general availability of EKS Dashboard, a new feature that provides centralized visibility into Kubernetes infrastructure across multiple AWS Regions and accounts. EKS Dashboard provides comprehensive insights into your Kubernetes clusters, enabling operational planning and governance. You can access the Dashboard in EKS console through AWS Organizations’ management and delegated administrator accounts.
As you expand your Kubernetes footprint to address operational and strategic objectives, such as improving availability, ensuring business continuity, isolating workloads, and scaling infrastructure, the EKS Dashboard provides centralized visibility across your Kubernetes infrastructure. You can now visualize your entire Kubernetes infrastructure without switching between AWS Regions or accounts, gaining aggregated insights into clusters, managed node groups, and EKS add-ons. This includes clusters running specific Kubernetes versions, support status, upcoming end of life auto-upgrades, managed node group AMI versions, EKS add-on versions, and more. This centralized approach supports more effective oversight, auditability, and operational planning for your Kubernetes infrastructure.
The EKS Dashboard can be accessed in the us-east-1 AWS Region, aggregating EKS cluster metadata from all commercial AWS Regions. To get started, see the EKS user guide.
Starting today, customers can now configure System Integrity Protection (SIP) settings on their EC2 Mac instances, providing greater flexibility and control over their development environments. SIP is a critical macOS security feature that helps prevent unauthorized code execution and system-level modifications. This enhancement enables developers to temporarily disable SIP for development and testing purposes, install and validate system extensions and DriverKit drivers, optimize testing performance through selective program management, and maintain security compliance while meeting development requirements.
The new SIP configuration capability is available across all EC2 Mac instance families, including both Intel (x86) and Apple silicon platforms. Customers can access this feature in all AWS regions where EC2 Mac instances are currently supported. To learn more about this feature, please visit the documentation here and our launch blog here. To learn more about EC2 Mac instances, click here.
AWS Database Migration Service (AWS DMS) now supports Data Resync, a new feature that automatically corrects data inconsistencies identified during validation between source and target databases.
Data Resync integrates with your existing DMS migration tasks and supports both Full Load and Change Data Capture (CDC) phases. It uses your current task settings—including connection configurations, table mappings, and transformations—to apply corrections automatically, helping ensure accurate and reliable migrations without manual intervention. With Data Resync, AWS DMS can detect and resolve common data issues, such as missing records, duplicate entries, or mismatched values, based on validation results.
Data Resync is available starting with AWS DMS replication engine version 3.6.1, and currently supports migration paths from Oracle and SQL Server to PostgreSQL. For detailed information on how Data Resync enhances migration accuracy, please refer to the AWS DMS Technical Documentation.
AWS Cost Anomaly Detection now integrates with AWS User Notifications (via Amazon EventBridge), enabling customers to create enhanced alerting capabilities in the AWS User Notifications console. This integration lets customers configure sophisticated alert rules based on service, account, or other cost dimensions to identify and respond to unexpected spending changes faster. Using AWS User Notifications, customers can receive immediate or aggregated alerts through multiple channels including email, AWS Chatbot, and the AWS Console Mobile Application, while maintaining a centralized history of alert notifications.
This new capability allows customers to customize their cost monitoring by creating alert rules in AWS User Notifications. Customers can now configure rules with higher thresholds for machine learning services that naturally experience cost spikes during training, while setting lower thresholds for stable services like databases, where small changes might indicate configuration issues. Customers also benefit from verified contact management, ensuring alerts reach the right teams through validated delivery channels that can be reused across multiple alert configurations.
These enhancements are available in all AWS Regions, except the AWS GovCloud (US) Regions and the China Regions. To learn more about setting up alerts in AWS User Notifications and getting started, visit the AWS Cost Anomaly Detection product page and documentation.
Starting today, AWS Deadline Cloud supports the latest version of Foundry Nuke, a powerful compositing tool widely used for visual effects and post-production workflows. AWS Deadline Cloud is a fully managed service that simplifies render management for teams creating computer-generated graphics and visual effects for films, television and broadcasting, web content, and design.
With support for Nuke version 16, you can access the latest improvements for Nuke while leveraging AWS Deadline Cloud’s managed infrastructure for your rendering pipelines, giving you the ability to create high-quality content using cutting-edge compositing features.
This new version is now available in all AWS regions where AWS Deadline Cloud is currently offered. To learn more about AWS Deadline Cloud and how to leverage Nuke version 16 in your workflows, visit the AWS Deadline Cloud documentation.
Amazon RDS announces a new capability that helps you view engine lifecycle support dates for your databases. This new feature provides a centralized and convenient place to access engine support dates, offering greater control over your database lifecycle management.
You can view start and end dates for RDS Standard Support and RDS Extended Support for RDS and Aurora major engine versions through the RDS API or AWS CLI. If RDS Extended Support is available for an engine version, both RDS Standard and Extended Support dates are shown. If RDS Extended Support is not available for an engine version, the response includes only RDS Standard Support dates.
With this feature you can view lifecycle support dates for RDS MySQL, RDS MariaDB, RDS PostgreSQL, Aurora MySQL, and Aurora PostgreSQL engines. To learn more, visit Amazon RDS User Guide and Amazon Aurora User Guide.
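A minimal sketch with boto3, assuming the DescribeDBMajorEngineVersions API introduced with this launch; the response field names may differ slightly, so check the RDS API reference:

```python
# Sketch: list Standard/Extended Support windows per major engine version.
import boto3

rds = boto3.client("rds")
resp = rds.describe_db_major_engine_versions(Engine="postgres")
for version in resp["DBMajorEngineVersions"]:
    for lifecycle in version.get("SupportedEngineLifecycles", []):
        print(version["MajorEngineVersion"],
              lifecycle["LifecycleSupportName"],       # standard vs. extended
              lifecycle["LifecycleSupportStartDate"],
              lifecycle["LifecycleSupportEndDate"])
```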
Amazon RDS makes it simple to set up, operate, and scale database deployments in the cloud. Create or update a fully managed Amazon RDS database in the Amazon RDS Management Console.
Amazon Aurora is designed for unparalleled high performance and availability at global scale with full MySQL and PostgreSQL compatibility. To get started with Amazon Aurora, take a look at our getting started page.
AWS Marketplace now supports partial disbursements for AWS Marketplace invoices transacted through the AWS Inc., Europe, Middle East and Africa (EMEA), Australia (AU), and Japan (JP) Marketplace Operators (MPOs), allowing sellers to receive funds as buyers make partial payments on AWS Marketplace invoices. AWS Marketplace now automatically processes partial disbursements based on the invoice amount paid by the buyer, aligned with the seller’s disbursement schedule configured in the AWS Marketplace Management Portal (AMMP). Previously, sellers had to wait for complete invoice payments by buyers before receiving disbursements for invoices.
Sellers can now access funds faster through disbursement of partial payments without waiting for buyers to pay invoices in full. Enhancements have also been made to AWS Marketplace Seller reporting to provide better visibility into partially disbursed invoices. For more details on the AWS Marketplace Seller reporting experience, visit the billed revenue dashboard and collections and disbursement dashboard guides.
Partial disbursements are available to AWS Marketplace sellers who transact through the AWS Inc., EMEA, AU, and JP MPOs.
For more information about partial disbursements for AWS Marketplace invoices and updates to seller dashboards, access the partial disbursements documentation.
AWS Transfer Family now supports ML-KEM (FIPS-203), a post-quantum algorithm standardized by the National Institute of Standards and Technology (NIST), for SFTP file transfers. Quantum-resistant public-key exchange helps protect transfers of data files that require long-term confidentiality against "harvest now, decrypt later" threats, in which an adversary records present-day traffic in order to decrypt it once cryptanalytically relevant quantum computers become available.
AWS Transfer Family offers fully managed support for the transfer of files over SFTP, AS2, FTPS, FTP, and web browser-based transfers directly into and out of AWS storage services. With this launch, you can now use post-quantum (PQ) hybrid security policies that combine classical Elliptic Curve Diffie-Hellman with quantum-resistant ML-KEM key exchanges between your AWS Transfer Family SFTP endpoints and clients like OpenSSH, Putty, and JSch that support PQ algorithms. When using a PQ hybrid policy, your Transfer Family SFTP server preserves the standard connection options supported by most clients today, while leveraging the most secure PQ connection options with clients that support quantum-resistant key exchange.
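Switching an existing SFTP server to a PQ hybrid policy is a one-call change. A minimal sketch with boto3; the server ID and the policy name below are placeholders, so list the current PQ policy names in the Transfer Family documentation before using this:

```python
# Sketch: apply a post-quantum hybrid security policy to an SFTP server.
import boto3

transfer = boto3.client("transfer")
transfer.update_server(
    ServerId="s-1234567890abcdef0",                              # placeholder
    SecurityPolicyName="TransferSecurityPolicy-PQ-SSH-2025-05",  # placeholder name
)
```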
ML-KEM quantum-resistant key exchange for SFTP file transfers is supported in all AWS Regions where AWS Transfer Family is available. Older PQ key exchange methods based on ML-KEM’s pre-standardized version (Kyber), introduced in AWS Transfer Family in 2023, will be removed from existing policies and are not included in the new PQ policies. To learn more about using PQ security policies to enable quantum-resistant key exchange, visit our documentation.
Welcome to the first Cloud CISO Perspectives for May 2025. Today, Iain Mulholland, senior director, Security Engineering, pulls back the curtain on how Google Cloud approaches security engineering and how we take secure by design from mindset to production.
As with all Cloud CISO Perspectives, the contents of this newsletter are posted to the Google Cloud blog. If you’re reading this on the website and you’d like to receive the email version, you can subscribe here.
How Google Cloud’s security team helps engineers build securely
By Iain Mulholland, senior director, Security Engineering
Velocity is a chief concern in every executive office, but it falls to CISOs to balance the tension between keeping the business secure and ensuring the business keeps up. At Google, we’re constantly thinking about how to enable both resilience and innovation.
For decades, we’ve been taking a holistic approach to how security decision-making can work better. We believe that the success we’ve seen with our security teams is achievable at many organizations, and can help lead to better security and business outcomes.
My team is responsible for ensuring Google Cloud is the most secure cloud, and we approach security as an engineering function. This is a different lens than the traditional IT or compliance views through which security priorities are often set, and it results in better decision-making and security outcomes.
We’re still seeing too many organizations rely on defenses that were designed for the desktop era — despite successful efforts to convince business leaders to invest in more modern security tools, as Phil Venables and Andy Wen noted last year.
“To be truly resilient in today’s security landscape, organizations must consider an IT overhaul and rethink their strategy toward solutions with modern, secure-by-design architectures that nullify classes of vulnerabilities and attack vectors,” they said.
To turn this core security philosophy into reality, we’ve used it to guide how we build our teams. Cloud security engineers are embedded with product teams to help the entire organization “shift left” and take an engineering-centered approach to security. Our Office of the CISO security engineering team partners with product team software engineers at all stages of the software development lifecycle (SDLC) to find paths to ship secure software — all while maintaining product-release velocity and adhering to secure-by-design principles.
You can see this in action with our threat modelling practice. Security engineers and software development teams work closely to analyze potential threats to the product and to identify actions and product capabilities that can mitigate risks. Because this happens in the design phase, the team can eliminate these threats early in the SDLC, ensuring our products are secure by design.
With engineering as our security foundation, we can build capabilities at breadth, at depth, and in clear relationship to each other, so that our total power exceeds the sum of these parts.
Protecting against threats is a great example of the impact of this approach. We characterize the vast cloud threat landscape in three specific areas: outbound network attacks (such as DDoS, outbound intrusion attempts, and vulnerability scans); resource misuse (such as cryptocurrency mining, illegal video streaming, and bots); and content-based threats (such as phishing and malware).
Across that landscape, threat actors often use similar techniques and exploit similar vulnerabilities. To combat these tactics, the team generates intelligence to prevent, detect, and mitigate risk in Google Cloud offerings before they become problems to our customers.
We “shift left” on threats, too: Identifying this systemic risk feeds into the lifecycle of software and product development. Once we identify a threat vector, we work closely with our security and product engineers to harden product defenses to help eliminate threats before they can take root.
We use AI, advanced data science, and analytics solutions to protect Google Cloud and our customers from future threats by focusing on three key capabilities: predicting future user behavior, proactively identifying risky security patterns, and improving the efficiency and measurability of threats and security operations.
It’s vital to our mission that we find attack paths before attackers do, reducing unknown security risks by finding vulnerabilities in our products and services before they are made available to customers. In addition to simulating risk, we push our researchers to consider the whole cloud as an attack surface. They chain vulnerabilities in novel ways to improve our overall security architecture.
Responding to threats is a critical third element of our engineering environment’s interlocking capabilities. Our security response operations assess and implement remediation strategies that come from external parties, and we frequently participate in comprehensive, industry-wide responses. Regular collaboration with Google Cloud’s Vulnerability Rewards Program has been a major driver of our success in this area.
Across all of these areas, there is incredible complexity, but the philosophy that guides the work is simple: by baking security into engineering processes, you can secure systems better and earlier than by bolting security on at the end. Investing in a deep engineering bench, coupled with embedding security personnel, processes, and procedures as early as possible in the development lifecycle, can strengthen decision-making confidence and business resilience across the organization.
You can learn more about how you can incorporate security best practices into your organization’s engineering environment from our Office of the CISO.
In case you missed it
Here are the latest updates, products, services, and resources from our security teams so far this month:
How boards can boost resiliency with the updated U.K. cyber code: Here’s how Google Cloud can help your organization and board of directors adapt to the newly updated U.K. cyber code. Read more.
What’s new in IAM, Access Risk, and Cloud Governance: A core part of our mission is to help you meet your policy, compliance, and business objectives. Here’s what’s new for IAM, Access Risk, and Cloud Governance. Read more.
3 new ways to use AI as your security sidekick: Generative AI is already providing clear and impactful security results. Here are three decisive examples that organizations can adopt right now. Read more.
Expanding our Risk Protection Program with new insurance partners and AI coverage: At Next ‘25, we unveiled major updates to our Risk Protection Program, an industry-first collaboration between Google and cyber insurers. Here’s what’s new. Read more.
From insight to action: M-Trends, agentic AI, and how we’re boosting defenders at RSAC 2025: From the latest M-Trends report to updates across Google Unified Security, our product portfolio, and our AI capabilities, here’s what’s new from us at RSAC. Read more.
The dawn of agentic AI in security operations: Agentic AI promises a fundamental, tectonic shift for security teams, where intelligent agents work alongside human analysts. Here’s our vision for the agentic future. Read more.
What’s new in Android security and privacy in 2025: We’re announcing new features and enhancements that build on our industry-leading protections to help keep you safe from scams, fraud, and theft on Android. Read more.
Please visit the Google Cloud blog for more security stories published this month.
COLDRIVER using new malware to steal data from Western targets and NGOs: Google Threat Intelligence Group (GTIG) has attributed new malware to the Russian government-backed threat group COLDRIVER (also known as UNC4057, Star Blizzard, and Callisto) that has been used to steal data from Western governments and militaries, as well as journalists, think tanks, and NGOs. Read more.
Cybercrime hardening guidance from the frontlines: The U.S. retail sector is currently being targeted in ransomware operations that GTIG suspects are linked to UNC3944, also known as Scattered Spider. UNC3944 is a financially motivated threat actor characterized by its persistent use of social engineering and brazen communications with victims. Here are our latest proactive hardening recommendations to combat their threat activities. Read more.
Please visit the Google Cloud blog for more threat intelligence stories published this month.
Now hear this: Podcasts from Google Cloud
How cyber-savvy is your board: We’ve long extolled the importance of bringing boards of directors up to speed on cybersecurity challenges both foundational and cutting-edge, which is why we’ve launched “Cyber Savvy Boardroom,” a new monthly podcast from our Office of the CISO’s David Homovich, Alicja Cade, and Nick Godfrey. Our first three episodes feature security and business leaders known for their intuition, expertise, and guidance, including Karenann Terrell, Christian Karam, and Don Callahan. Listen here.
From AI agents to provenance in MLSecOps: What is MLSecOps, and what should CISOs know about it? Diana Kelley, CSO, Protect AI, goes deep on machine-learning model security with hosts Anton Chuvakin and Tim Peacock. Listen here.
What we learned at RSAC 2025: Anton and Tim discuss their RSA Conference experiences this year. How did the show floor hold up to the complicated reality of today’s information security landscape? Listen here.
Deconstructing this year’s M-Trends: Kirstie Failey, GTIG, and Scott Runnels, Mandiant Incident Response, chat with Anton and Tim about the challenges of turning standard incident reports into the bigger-picture review found in this year’s M-Trends. Listen here.
Defender’s Advantage: How UNC5221 targeted Ivanti Connect Secure VPNs: Mandiant’s Matt Lin and Ivanti’s Daniel Spicer join host Luke McNamara as they dive into the research and response of UNC5221’s campaigns against Ivanti. Listen here.
To have our Cloud CISO Perspectives post delivered twice a month to your inbox, sign up for our newsletter. We’ll be back in a few weeks with more security-related updates from Google Cloud.
The telecommunications industry is undergoing a profound transformation, with AI and generative AI emerging as key catalysts. Communication service providers (CSPs) are increasingly recognizing that these technologies are not merely incremental improvements but fundamental drivers for achieving strategic business and operational objectives. This includes enabling digital transformation, fostering service innovation, optimizing monetization strategies, and enhancing customer retention.
To provide a comprehensive and data-driven analysis of this evolving landscape, Google Cloud partnered with Analysys Mason to conduct an in-depth study, “Gen AI in the network: CSP progress in adopting gen AI for network operations.” This research examines CSPs’ progress, priorities, challenges, and best practices in leveraging gen AI to reshape their networks, offering quantifiable insights into this critical transformation.
Key findings: A data-driven roadmap
The Analysys Mason study offers valuable insights into the current state of gen AI adoption in telecom, providing a data-driven roadmap for CSPs seeking to navigate this transformative journey:
1. Widespread gen AI adoption and future intentions
Demonstrating the strong momentum behind gen AI, 82% of CSPs surveyed are currently trialing or using it in at least one network operations area, and this adoption is set to expand further, with an additional 9% planning to implement it within the next 2 years.
2. Strategic importance of gen AI
Gen AI empowers CSPs to achieve strategic goals within the network: 57% of those surveyed see it as a key enabler of autonomous, cloud-based network transformation initiatives, and 52% see it as supporting the transition to new business models such as NetCo/ServCo and more digitally driven organizations, all with the aim of enhancing customer experience and driving broader transformation.
3. Key drivers for gen AI investment
CSPs are strategically prioritizing gen AI investments to achieve a range of network objectives, including optimizing network performance and reliability, enhancing application quality of experience (QoE), and improving network resource utilization. They recognize gen AI’s potential to move beyond a productivity tool and become a cornerstone of future network operations and automation.
4. Challenges in achieving model accuracy
While gen AI offers significant potential, the study found that 80% of CSPs face challenges in achieving the expected accuracy from gen AI models, a hurdle that impacts use case scaling and ROI. These accuracy issues are linked to data-related problems, which many CSPs across different maturity levels are still working to resolve, and the complexity of customizing models for specific network operations.
5. Addressing the skills gap
With over 50% of CSPs citing it as a key concern, employee skillsets represent a major challenge, highlighting the urgent imperative for CSPs to invest in upskilling and reskilling initiatives to cultivate in-house expertise in AI, gen AI, and related data science fields.
6. Gen AI implementation strategies
Many CSPs begin their gen AI implementation by utilizing vendor-provided applications with embedded gen AI capabilities, the most common approach. To fully address their diverse network needs, however, the study emphasizes that CSPs also seek to customize models using techniques such as fine-tuning and prompt engineering (a minimal sketch of the latter follows these findings). This customization is heavily reliant on a strong data strategy to overcome challenges such as data silos and data quality issues, which significantly impact the accuracy and effectiveness of the resulting gen AI solutions.
7. Deployment preferences
Hybrid cloud environments are the predominant deployment choice for gen AI platforms in network operations, indicated by 51% of CSPs and reflecting the need for flexibility and control. At the same time, a significant 39% of CSPs prefer private cloud-only deployments for their data platforms, driven by the critical importance of data security and control, while public cloud deployments are preferred for AI models.
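As one hedged example of the prompt-engineering path mentioned in finding 6, the sketch below grounds a general-purpose model with network-operations context using the Vertex AI Python SDK. The project ID, model name, alarm text, and prompt are illustrative assumptions, not a recommended production configuration.

```python
# A minimal sketch of prompt engineering for network operations using the
# Vertex AI SDK. Project ID, model name, and alarm text are illustrative.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-csp-project", location="us-central1")  # hypothetical project

alarm = "BGP session flap on edge router pe-07; 3 occurrences in 10 minutes."
prompt = (
    "You are a network operations assistant. Summarize the alarm below, "
    "rate its likely customer impact (low/medium/high), and suggest the "
    "first diagnostic step.\n\nAlarm: " + alarm
)

model = GenerativeModel("gemini-1.5-pro")  # model availability varies by region
print(model.generate_content(prompt).text)
```

In practice, the quality of such prompts, and of any fine-tuned variant, depends directly on the data strategy the study highlights: clean, well-governed operational data is what makes the model’s context trustworthy.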
Recommendations for CSPs
In summary, to secure a competitive edge, CSPs will need to: prioritize gen AI use cases with clear ROI by adopting early wins while developing a long-term strategy; transform their organizational structure and invest in upskilling initiatives; develop and implement a robust data strategy to support all AI initiatives; and cultivate strong partnerships with expert vendors to accelerate their gen AI journey.
Google Cloud: Your partner for network transformation
Google Cloud empowers CSPs’ data-driven transformation by providing expertise in operating planetary-scale networks, a unified data platform, AI model optimization, professional services for gen AI, hybrid cloud solutions, and a rich partner ecosystem. This is further strengthened by Google Cloud’s proven success in driving network transformation for major telcos, leveraging infrastructure, platforms, and tools that deliver the required near real-time processing and scale.