For years, BigQuery has been synonymous with fully managed, fast, petabyte-scale analytics. Its columnar architecture and decoupled storage and compute have made it the go-to data warehouse for deriving insights from massive datasets.
But what about the moments between the big analyses? What if you need to:
- Modify a handful of customer records across huge tables without consuming all your slots or running for minutes on end?
- Track exactly how some data has evolved row by row?
- Act immediately on incoming streaming data, updating records on the fly?
Historically, these types of “transactional” needs might have sent you searching for a database solution or required you to build complex ETL/ELT pipelines around BigQuery. The thinking was clear: BigQuery was for analysis, and you used something else for dynamic data manipulation.
That’s changing. At Google Cloud, we’ve been steadily evolving BigQuery, adding powerful capabilities that blur these lines and bring near-real-time, transactional-style operations directly into your data warehouse. This isn’t about turning BigQuery into a traditional OLTP database; rather, it’s about empowering you to handle common data management tasks more efficiently within the BigQuery ecosystem.
This shift means fewer complex workarounds, faster reactions to changing data, and the ability to build more dynamic and responsive applications right where your core data lives.
Today, we’ll explore three game-changing features that are enabling this evolution:
- Efficient fine-grained DML mutations: Forget costly table rewrites for small modifications. Discover how BigQuery now handles targeted UPDATEs, DELETEs, and MERGEs with significantly improved performance and resource efficiency.
- Change history support for updates and deletes: Go beyond simple snapshots. See how BigQuery can now capture the granular history of UPDATEs and DELETEs, providing a detailed audit trail of data within your tables.
- Real-time updates with DML over streaming data: Don't wait for data to settle. Learn how you can apply UPDATE, DELETE, and MERGE operations directly to data as it streams into BigQuery, enabling immediate data correction, enrichment, or state management.
Ready to see how these capabilities can simplify your workflows and unlock new possibilities within BigQuery? Let’s dive in and see them in action!
1. Efficient fine-grained DML mutations
BigQuery has supported Data Manipulation Language (DML) statements like UPDATE, DELETE, and MERGE for years, allowing you to modify data without recreating entire tables. However, historically, performing these operations — especially small, targeted changes on very large tables — was less efficient than you might have hoped. The challenge? Write amplification.
When you executed a DML mutation, BigQuery needed to rewrite entire underlying storage blocks (think of them as internal file groups) containing the rows you modified. Even if your statement only affected a few rows within a block, the whole block might have needed to be rewritten. This phenomenon, sometimes called “write amplification,” could lead to significant slot consumption and longer execution times, particularly for sparse mutations (changes scattered across many different blocks in a large table). This sometimes made operations like implementing GDPR’s “right to be forgotten” by deleting specific user records slow or costly.
To address this, we introduced fine-grained DML in BigQuery, a set of performance enhancements that optimize sparse DML mutation operations.
When enabled, instead of always rewriting large storage blocks, BigQuery fine-grained DML can pinpoint and modify data with much finer granularity. It leverages optimized metadata indexes to rewrite only the necessary mutated data, drastically reducing the processing, I/O, and consequently, the slot time consumed for sparse DML. The result? Faster, more cost-effective DML, making BigQuery much more practical for workloads involving frequent, targeted data changes.
Grupo Catalana Occidente, a leading global insurance provider, is excited about fine-grained DML's ability to help them integrate changes to their data in real time:
“In our integration project between Google BigQuery, SAP, and MicroStrategy, we saw an 83% improvement in DML query runtime when we enabled BigQuery fine-grained DML. Fine-grained DML allows us to achieve adequate performance and reduces the time of handling large volumes of data. This is an essential functionality for implementing the various data initiatives we have in our pipeline.” – Mayker Oviedo, Chief Data Officer, Grupo Catalana Occidente
Let's quantify this improvement ourselves. To really see the difference, we need a large table where updates are likely to be sparse. We'll use a copy of the bigquery-public-data.wikipedia.pageviews_2024 table, which contains approximately 58.7 billion rows and weighs in at ~2.4 TB.
(Important Note: Running the following queries involves copying a large dataset and processing significant amounts of data. This will incur BigQuery storage and compute costs based on your pricing model. Proceed with awareness if you choose to replicate this experiment.)
Step 1: Create the Table Copy
First, let’s copy the public dataset into our own project. We’ll also enable change history, which we’ll use later on.
```sql
-- Make a copy of the public 2024 Wikipedia page views table
CREATE OR REPLACE TABLE `my_dataset.wikipedia_pageviews_copy`
COPY `bigquery-public-data.wikipedia.pageviews_2024`;

-- Enable change history on your new table. We'll use this later.
ALTER TABLE `my_dataset.wikipedia_pageviews_copy`
SET OPTIONS(
  enable_change_history = TRUE
);
```
Step 2: Run Baseline UPDATE (without optimization)
Now, let’s perform a sparse update, modifying about 0.1% of the rows scattered across the table.
```sql
-- Baseline UPDATE: Modify ~0.1% of rows
UPDATE `my_dataset.wikipedia_pageviews_copy`
SET views = views + 1000
WHERE title LIKE '%Goo%'
  AND datehour IS NOT NULL;
```
Result: This update modified approximately 61.2 million records. In our test environment, without the optimization enabled, it took roughly 10 minutes and 49 seconds to complete and consumed ~787.3 million slot milliseconds.
Step 3: Enable fine-grained mutations
Next, we'll enable the optimization using a simple ALTER TABLE statement.
```sql
-- Enable fine-grained mutations
ALTER TABLE `my_dataset.wikipedia_pageviews_copy`
SET OPTIONS(
  enable_fine_grained_mutations = TRUE
);
```
Step 4: Run the optimized UPDATE
Let’s run a similar update, again modifying roughly 0.1% of the data.
```sql
-- Optimized UPDATE: Modify the same number of rows
UPDATE `my_dataset.wikipedia_pageviews_copy`
SET views = views - 999  -- Change the value slightly for a distinct operation
WHERE title LIKE '%Goo%'
  AND datehour IS NOT NULL;
```
Result: This time, the update (again affecting ~61.2 million sparse records) completed dramatically faster. It took only 44 seconds and consumed ~51.8 million slot milliseconds.
Now let’s compare the results:
| Metric | Baseline (no optimization) | Optimized with fine-grained DML | Improvement factor |
|---|---|---|---|
| Query execution time | 10 min 49 sec | 44 sec | ~14.8x faster |
| Slot milliseconds | ~787.3 million | ~51.8 million | ~15.2x less |
Wow! Enabling fine-grained mutations made the update roughly 14.8x faster and cut slot consumption by about 15.2x. This illustrates how the optimization makes targeted, sparse DML operations significantly more performant and cost-effective on large BigQuery tables.
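If you want to reproduce these measurements in your own project, one option is to pull elapsed time and slot milliseconds for the two UPDATE jobs from the INFORMATION_SCHEMA jobs view. The following is a minimal sketch, assuming your jobs ran in the US multi-region within the last day; adjust the region qualifier and filters to match your environment.

```sql
-- Compare elapsed time and slot milliseconds for recent UPDATE jobs.
SELECT
  job_id,
  query,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS elapsed_seconds,
  total_slot_ms
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND statement_type = 'UPDATE'
ORDER BY
  creation_time DESC;
```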
2. Tracking row-level history with the CHANGES TVF
Understanding how data evolves row by row is crucial for auditing, debugging unexpected data states, and building downstream processes that react to specific modifications. While BigQuery's time travel feature lets you query historical snapshots of a table, it doesn't easily provide a granular log of individual UPDATE, DELETE, and INSERT operations. Another feature, the APPENDS Table-Valued Function (TVF), only tracks additions, not modifications or deletions.
Enter the BigQuery change history function, the CHANGES TVF, which provides access to a detailed, row-level history of appends and modifications made to a BigQuery table. It allows you to see not just what data exists now, but how it got there — including the sequence of insertions, updates, and deletions.
It's important to note that you must enable change history tracking on the table before the changes you want to query occur. BigQuery retains this detailed change history for a table's configured time travel duration, which is 7 days by default. Also, the CHANGES function can't query the last ten minutes of a table's history, so the end_timestamp argument value must be at least ten minutes prior to the current time.
To explore this further, let’s look at the changes we made to our Wikipedia pageviews table earlier. We’ll look for changes made to the Google Wikipedia article from January 1st, 2024.
```sql
-- Query the same Wikipedia pageviews table described above. Keep in mind this
-- must run at least 10 minutes after the DML updates above, and
-- enable_change_history must already be set to TRUE.
SELECT
  *
FROM
  CHANGES(TABLE `my_dataset.wikipedia_pageviews_copy`, NULL, TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 601 SECOND))
WHERE
  title LIKE "Google"
  AND wiki = "en"
  AND datehour = "2024-01-01"
ORDER BY _CHANGE_TIMESTAMP ASC
```
As you can see from the query results, there are two new pseudo columns within our table, _CHANGE_TYPE and _CHANGE_TIMESTAMP. The _CHANGE_TYPE column refers to the type of change that produced the row, while the _CHANGE_TIMESTAMP column indicates the commit time of the transaction that made the change.
Thus, parsing the changes made to the table, you can see:

- Our table initially received an INSERT with this record's views totaling 288. This resulted from the initial copy of the Wikipedia pageviews public dataset.
- The table then simultaneously recorded an UPDATE and a DELETE operation from our first DML statement, which added 1,000 views to the record. This reflects our original event of 288 views being deleted and replaced with an event showing 1,288 views.
- Finally, our table again simultaneously recorded an UPDATE and a DELETE operation for our second DML statement. The delete was for the record with 1,288 views, and the update was for the final event, showing 289 views.
This detailed, row-level change tracking provided by the CHANGES TVF is incredibly powerful for building robust audit trails, debugging data anomalies by tracing their history, and even for building disaster recovery pipelines that replicate BigQuery changes to other systems in near real-time.
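To make the audit-trail idea concrete, here's a minimal sketch that summarizes change volume by type and hour for the same table used above. It assumes change history is still enabled on the table and keeps the end timestamp at least ten minutes in the past.

```sql
-- Summarize row-level change volume by type and hour over the queryable window.
SELECT
  _CHANGE_TYPE AS change_type,
  TIMESTAMP_TRUNC(_CHANGE_TIMESTAMP, HOUR) AS change_hour,
  COUNT(*) AS changed_rows
FROM
  CHANGES(TABLE `my_dataset.wikipedia_pageviews_copy`,
          NULL,
          TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 601 SECOND))
GROUP BY
  change_type, change_hour
ORDER BY
  change_hour, change_type;
```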
3. Real-time mutations: DML on freshly streamed data
BigQuery’s Storage Write API provides a high-throughput, low-latency way to stream data into your tables, making it immediately available for querying. This is fantastic for powering real-time dashboards and immediate analysis.
While the Storage Write API lets you instantly query this freshly streamed data, historically, you couldn't immediately modify it using DML statements like UPDATE, DELETE, or MERGE. The incoming data first lands in a temporary, write-optimized storage (WOS) buffer, designed for efficient data ingestion. Before DML could target these rows, they needed to be automatically flushed and organized into BigQuery's main columnar, read-optimized storage (ROS) by a background process. This optimization step, while essential for query performance, meant there was often a delay (potentially varying from minutes up to ~30 minutes or more) before you could apply corrections or updates via DML to the newest data.
That waiting period is no longer a hard requirement! BigQuery now supports executing UPDATE, DELETE, and MERGE statements that can directly target rows residing in write-optimized storage, before they are flushed to columnar storage.
Why does this matter? This is a significant enhancement for real-time data architectures built on BigQuery. It eliminates the delay between data arrival and the ability to manipulate it within the warehouse itself. You can now react instantly to incoming events, correct errors on the fly, or enrich data as it lands, without waiting for background processes to complete or implementing complex pre-ingestion logic outside of BigQuery.
This capability unlocks powerful scenarios directly within your data warehouse, like:

- Immediate data correction: Did a sensor stream an obviously invalid reading? Or did an event arrive with incorrect formatting? Run an UPDATE or DELETE immediately after ingestion to fix or remove the bad record before it impacts real-time dashboards or downstream consumers.
- Real-time enrichment: As events stream in, UPDATE them instantly with contextual information looked up from other dimension tables within BigQuery (e.g., adding user details to a clickstream event).
- On-the-fly filtering/flagging: Implement real-time quality checks. If incoming data fails validation, immediately DELETE it or UPDATE it with a 'quarantine' flag, as sketched below.
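Here's a minimal sketch of that last quarantine scenario. The table and column names (my_dataset.sensor_events, reading, status, ingest_time) are hypothetical placeholders; the point is that the UPDATE can target rows that arrived via the Storage Write API moments earlier, even while they still sit in the write-optimized buffer.

```sql
-- Hypothetical example: quarantine obviously invalid sensor readings
-- shortly after they stream in via the Storage Write API.
UPDATE `my_dataset.sensor_events`
SET status = 'quarantined'
WHERE reading < 0  -- a physically impossible value for this sensor
  AND ingest_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE);
```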
By enabling DML operations directly on data in the streaming buffer, BigQuery significantly shortens the cycle time for acting on real-time data, simplifying workflows and allowing for faster, more accurate data-driven responses.
BigQuery for dynamic data management
As we’ve explored, we’ve significantly expanded BigQuery’s capabilities beyond its traditional analytical strengths. Features like fine-grained DML, change history support for updates and deletes, and the ability to run DML directly on freshly streamed data represent a major leap forward.
While we’re not aiming to replace your specialized OLTP databases with BigQuery for high-volume, low-latency transactions, it’s undeniably becoming a far more versatile platform. These enhancements mean data practitioners can increasingly:
- Perform targeted UPDATEs and DELETEs efficiently, even on massive tables
- Track the precise history of data modifications for auditing and debugging
- React to and modify streaming data in near real-time
All of this happens within the familiar, scalable, and powerful BigQuery environment you already use for analytics. This convergence simplifies data architectures, reduces the need for complex external pipelines, and enables faster, more direct action on your data.
Customers like Statsig, a leading product development company that enables its customers to build faster and make smarter decisions, can now use BigQuery for new use cases:
“BigQuery adding new features like fine-grained DML allows us to use BigQuery for more transactional use cases here at Statsig.” – Pablo Beltran, Staff Software Engineer, Statsig
So, the next time your project requires a blend of deep analysis and more dynamic data management, remember these powerful tools in your BigQuery toolkit.
Ready to learn more? Explore the official Google Cloud documentation for the details.