GCP – BigQuery under the hood: Enhanced vectorization in the advanced runtime
Under the hood, there’s a lot of technology and expertise that goes into delivering the performance you get from BigQuery, Google Cloud’s data to AI platform. Separating storage and compute provides unique resource allocation flexibility and enables petabyte-scale analysis, while features like compressed storage, compute autoscaling, and flexible pricing contribute to its efficiency. Then there’s the infrastructure — technologies like Borg, Colossus, Jupiter, and Dremel, as we discussed in a previous post.
BigQuery is continually pushing the limits of query price/performance. Google infrastructure innovations such as L4 in Colossus, userspace host networking, optimized BigQuery storage formats, and a cutting-edge data center network have allowed us to do a complete modernization of BigQuery’s core data warehousing technology. We do this while adhering to core principles of self-tuning and zero user intervention, to guarantee the best possible price/performance for all queries. Collectively, we group these improvements into BigQuery’s advanced runtime. In this blog post, we introduce you to one of these improvements, enhanced vectorization, now in preview. Then, stay tuned for future blog posts where we’ll go deep on other technologies and techniques in the advanced runtime family.
Enhanced vectorization: next-level query execution
Before diving into enhanced vectorization, let’s talk about vectorized query execution. In vectorized query execution, columnar data is processed in cache-sized blocks using Single Instruction, Multiple Data (SIMD) instructions, an approach that is now the de facto industry standard for efficient query processing. BigQuery’s enhanced vectorization expands on vectorized query execution by applying it to key aspects of query processing, such as filter evaluation in BigQuery storage, parallel execution of query algorithms, and specialized data encodings and optimization techniques. Let’s take a closer look.
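The block-at-a-time model can be illustrated with a small sketch (plain Python, so a list slice stands in for a cache-sized block and the tight inner loop stands in for SIMD; the names and block size here are illustrative, not BigQuery internals):

```python
# Sketch of block-at-a-time (vectorized) execution: instead of evaluating
# a predicate one row at a time, the engine evaluates it over a whole
# cache-sized block of column values per call. Real engines compile this
# inner loop down to SIMD instructions.

BLOCK_SIZE = 1024  # illustrative stand-in for a cache-sized block

def scan_filter(column, predicate):
    """Yield matching values, processing the column block by block."""
    for start in range(0, len(column), BLOCK_SIZE):
        block = column[start:start + BLOCK_SIZE]
        # One pass evaluates the predicate over the whole block; the
        # per-block selection vector marks the rows that pass.
        selection = [predicate(v) for v in block]
        for value, keep in zip(block, selection):
            if keep:
                yield value

values = list(range(10_000))
evens = list(scan_filter(values, lambda v: v % 2 == 0))
print(len(evens))  # 5000
```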
Data encodings
Modern columnar storage formats use space-efficient data encodings such as dictionary and run-length encodings. For instance, if a column has a million rows but only 10 unique values, dictionary encoding stores those 10 values once and assigns a smaller integer ID to each row rather than repeating the full value. Enhanced vectorization can directly process this encoded data, eliminating redundant computations and significantly boosting query performance. The smaller memory footprint of this encoded data also improves cache locality, creating more opportunities for vectorization.
Figure 1: Dictionary and run-length encodings
For example, as figure 1 demonstrates, “Sedan”, “Wagon” and “SUV” string values are encoded in the dictionary, replacing the repeated string literals with integers that represent indices in the dictionary built from those string values. Subsequently, the repeated integer values can be further represented with run-length encoding. Both types of encodings can offer substantial space and processing savings.
Expression folding and common subexpression elimination
Enhanced vectorization integrates native support for dictionary and run-length encoded data directly into its algorithms. This, combined with optimization techniques such as expression folding, folding propagation, and common subexpression elimination, allows it to intelligently reshape query execution plans. The result can be a significant reduction, or indeed complete removal, of unnecessary data processing.
Consider a scenario where REGEXP_CONTAINS(id, '[0-9]{2}$') AS shard receives dictionary-encoded input. The REGEXP_CONTAINS calculation is performed only once for each unique dictionary value, and the resulting expression is also dictionary-encoded, reducing the number of evaluations significantly and leading to performance improvements.
Figure 2: Dictionary folding
Here, the calculation is applied to the input dictionary-encoded data directly, producing output of dictionary-encoded data and skipping the dictionary expansion.
With enhanced vectorization, we take expression folding optimization even further by, in some cases, converting an expression into a constant. Consider this query:
```sql
SELECT SUM(number) FROM table
WHERE REGEXP_CONTAINS(id, '^.*[0-9]{2}');
```
If the id column in the Capacitor file for this table is dictionary-encoded, the system’s expression folding will evaluate all dictionary values and, because none of them contains two consecutive digits, determine that the REGEXP_CONTAINS condition is always false, replacing the WHERE clause with a constant false. As a result, BigQuery completely skips scanning the Capacitor file for this table, significantly boosting performance. Of course, these optimizations apply across a broad range of scenarios, not just to the query used in this example.
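A sketch of that folding step over a dictionary-encoded column (Python; this is our reconstruction of the idea, not BigQuery’s code, and the sample dictionary values are invented):

```python
import re

# Expression folding over a dictionary: evaluate the predicate once per
# unique dictionary value instead of once per row. If every dictionary
# entry evaluates to False, the whole filter folds to the constant False
# and the file scan can be skipped entirely.
def fold_filter(dictionary, pattern):
    per_value = [re.search(pattern, v) is not None for v in dictionary]
    if not any(per_value):
        return False       # constant false: skip the scan
    if all(per_value):
        return True        # constant true: no filtering needed
    return per_value       # dictionary-encoded filter result

# No value in this file's dictionary contains two consecutive digits,
# so the WHERE clause folds to a constant False for this file.
dictionary = ["a1", "b2", "c3"]
print(fold_filter(dictionary, r"[0-9]{2}"))  # False
print(fold_filter(dictionary, r"[0-9]"))     # True
```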
Data-encoding-enabled optimizations
Our state-of-the-art join algorithm tries to preserve dictionary and run-length-encoded data wherever possible and makes runtime decisions that take data encoding into account. For example, if the join key on the probe side is dictionary-encoded, we can use that knowledge to avoid repeated hash-table lookups. Also, during aggregation, we can skip building a hashmap if data is already dictionary-encoded and its cardinality is known.
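As a sketch of the aggregation case (Python; illustrative, not the actual algorithm): when the group-by key arrives dictionary-encoded, the dictionary ids themselves index the aggregation state, so no hashmap is needed.

```python
# Aggregation over dictionary-encoded input: every row's key is already
# an integer id into a dictionary of known cardinality, so the aggregate
# states can live in a flat array indexed by id -- no hashmap probe per
# row, unlike aggregation over raw values.
def sum_by_dictionary_key(dictionary, key_ids, values):
    sums = [0] * len(dictionary)       # one slot per unique key
    for key_id, value in zip(key_ids, values):
        sums[key_id] += value          # direct array index, no hashing
    return dict(zip(dictionary, sums))

dictionary = ["Sedan", "Wagon", "SUV"]
key_ids = [0, 0, 1, 2, 1, 0]
values = [10, 20, 5, 7, 5, 30]
print(sum_by_dictionary_key(dictionary, key_ids, values))
# {'Sedan': 60, 'Wagon': 10, 'SUV': 7}
```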
Parallelizable join and aggregation algorithms
Enhanced vectorization harnesses sophisticated parallelizable algorithms for efficient joins and aggregations. When parallel execution is enabled in a Dremel leaf node for certain query-execution modes, the join algorithm can build and probe the right-hand side hash table in parallel using multiple threads. Similarly, aggregation algorithms can perform both local and global aggregations across multiple threads simultaneously. This parallel execution of join and aggregation algorithms leads to a substantial acceleration of query execution.
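The local-then-global pattern can be sketched like this (Python threads; real execution happens inside a Dremel leaf node, and the thread pool and worker count here are only illustrative):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Parallel aggregation in two phases: each worker builds a local
# aggregate over its slice of the input, then the local results are
# merged into a single global aggregate.
def parallel_count(values, workers=4):
    chunk = (len(values) + workers - 1) // workers
    slices = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_counts = list(pool.map(Counter, slices))  # local aggregation
    total = Counter()
    for c in local_counts:                              # global merge
        total.update(c)
    return total

values = ["a", "b", "a", "c", "a", "b"] * 100
counts = parallel_count(values)
print(counts["a"], counts["b"], counts["c"])  # 300 200 100
```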
Tighter integration with Capacitor
We re-engineered Capacitor for the enhanced vectorization runtime, making it smarter and more efficient. This updated version now natively supports semi-structured and JSON data, using sophisticated operators to rebuild JSON data efficiently. Capacitor lets the enhanced vectorization runtime directly access dictionary and run-length-encoded data and apply various optimizations based on the data’s characteristics. It intelligently applies the folding-to-a-constant optimization when an entire column has the same value. And it can prune expressions in NULL-handling functions such as IFNULL and COALESCE when a column is confirmed to be NULL-free.
Filter pushdown in Capacitor
Capacitor leverages the same vectorized engine as enhanced vectorization to efficiently push down filters and computations. This allows for tailored optimizations based on specific file characteristics and the expressions used. When combined with dictionary and run-length-encoded data, this approach delivers exceptionally fast and efficient data scans, enabling further optimizations like expression folding.
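A simplified picture of pushdown against a dictionary-encoded column (Python; the function and its interface are invented for illustration):

```python
# Filter pushdown: the storage layer evaluates the predicate against the
# dictionary first, then scans only the encoded ids, emitting row
# numbers whose id maps to a matching dictionary value. Rows never have
# to be decoded into full values just to be discarded by a filter
# upstream.
def scan_with_pushdown(dictionary, ids, predicate):
    matching_ids = {i for i, v in enumerate(dictionary) if predicate(v)}
    if not matching_ids:       # folds to constant false: skip the scan
        return []
    return [row for row, i in enumerate(ids) if i in matching_ids]

dictionary = ["Sedan", "Wagon", "SUV"]
ids = [0, 0, 1, 2, 1, 0]
rows = scan_with_pushdown(dictionary, ids, lambda v: v.startswith("S"))
print(rows)  # [0, 1, 3, 5]
```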
Enhanced vectorization in action
Let’s illustrate the power of these techniques with a concrete example. Enhanced vectorization accelerated one query by 21 times, slashing execution time from over one minute (61 seconds) down to 2.9 seconds.
The query that achieved this dramatic speedup was:
```sql
SELECT
  ANY_VALUE(id) AS id,
  hash_id
FROM (
  SELECT
    CAST(source_id AS STRING) AS id,
    TO_HEX(SHA1(CAST(source_id AS STRING))) AS hash_id
  FROM `source_data`)
WHERE
  hash_id IS NOT NULL
GROUP BY
  hash_id
```
This query ran against a table with over 13 billion logical rows spread across 167 partitions, stored in Capacitor columnar storage format and optimized with dictionary and run-length-encoded data.
Without enhanced vectorization
Executing this query with a regular query engine would involve several steps:
- Reading all data for each partition, fully expanding the dictionary and run-length-encoded columnar data.
- Computing CAST(source_id AS STRING) and TO_HEX(SHA1(CAST(source_id AS STRING))) for every single columnar data value.
- Building a hashmap from all the non-NULL hash_id values.
With enhanced vectorization
When enhanced vectorization processed the same query over the same dataset, it automatically applied these crucial optimizations:
- It directly scanned the columnar data in the Capacitor file while preserving its dictionary-encoded data.
- It detected and eliminated duplicate computations of CAST(source_id AS STRING) by identifying them as common subexpressions.
- It folded the TO_HEX(SHA1(CAST(source_id AS STRING))) computation, propagating the resulting dictionary-encoded data directly to the aggregation step.
- The aggregation step recognized the data was already dictionary-encoded, allowing it to completely skip building a hashmap for aggregation.
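Put together, the optimized path amounts to something like this miniature sketch (Python, with hashlib’s sha1 standing in for TO_HEX(SHA1(...)); the dictionary values and row ids are invented for illustration):

```python
import hashlib

# The optimized plan in miniature: the expensive expression is evaluated
# once per unique dictionary value (expression folding plus common
# subexpression elimination), and the GROUP BY then falls out of the
# dictionary itself -- each entry is already a distinct group, so no
# hashmap is built.
def hash_id(value):
    # Stand-in for TO_HEX(SHA1(CAST(source_id AS STRING)))
    return hashlib.sha1(str(value).encode()).hexdigest()

source_dict = [101, 202, 303]    # unique source_id dictionary values
ids = [0, 1, 0, 2, 1, 0, 0, 2]   # dictionary-encoded rows (unused below:
                                 # grouping needs only the dictionary)

# Fold: one hash per dictionary entry, not one per row.
hashed_dict = [hash_id(v) for v in source_dict]

# GROUP BY hash_id with ANY_VALUE(id): one output row per dictionary
# entry, produced without hashing any of the 8 input rows.
result = [(str(source_dict[i]), hashed_dict[i])
          for i in range(len(source_dict))]
print(len(result))  # 3
```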
This 21-times query speedup vividly demonstrates how tight integration between the enhanced vectorization runtime and Capacitor, together with various optimization techniques, can lead to substantial query performance improvements.
What’s next
BigQuery’s enhanced vectorization significantly improves query price/performance. Internally, we’ve seen a substantial reduction in query latency with comparable or even lower slot utilization with enhanced vectorization runtime, though individual query results can differ. This performance gain comes from innovations in both enhanced vectorization and BigQuery’s storage formats.
We’re dedicated to continuously improving both, applying even more advanced optimizations alongside Google’s infrastructure advancements in storage, compute, and networking to further boost query efficiency and expand the range of queries that the advanced runtime can handle. Over the coming months, enhanced vectorization in BigQuery’s advanced runtime will be enabled for all customers by default, but you can enable it earlier for your project today. Next up: we’ll offer BigQuery enhanced vectorization for Parquet files and Iceberg tables!