GCP – Beyond GROUP BY: Introducing advanced aggregation functions in BigQuery
When it comes to large-scale data analysis, BigQuery is a powerhouse, with fundamental aggregate functions like SUM, AVG, and COUNT allowing you to extract meaningful insights for all types of workloads. And today, we’re excited to take your data analytics to the next level with a suite of advanced aggregation features that unlock more complex and sophisticated use cases. These aggregate functions fall into three categories:
-
Group by extension (grouping sets/cube, group by struct, array, group by all)
-
User-defined aggregate functions (Javascript/SQL UDAFs)
-
Approximate aggregate functions (KLL quantiles, Apache DataSketches)
We built these functions as they were top feature requests from our customer council group. Here is what New York Times had to say:
“I just want to say thank you to BigQuery’s team who launched GROUP BY ROLLUP. We had a daily query taking more than 2 hours to run that now takes 10 minutes using this instead, and many other teams that want to use it now because of it. We saw slot consumption drop by about 96%!” – Edward Podojil, consultant, The New York Times
Let’s take a deeper look at these new aggregate functions, and how to use them in your data analytics workflows.
Group-by extensions
Grouping sets provide flexibility in calculating aggregations in multiple dimensions in one single statement without having to use UNION ALL. Group by struct, array allows you to easily group by commonly used data types in BigQuery. Group by lets you group by all non-aggregate columns in the Select statement without having to repeat each column twice.
- aside_block
- <ListValue: [StructValue([(‘title’, ‘$300 in free credit to try Google Cloud data analytics’), (‘body’, <wagtail.rich_text.RichText object at 0x3e0675914d90>), (‘btn_text’, ‘Start building for free’), (‘href’, ‘http://console.cloud.google.com/freetrial?redirectPath=/bigquery/’), (‘image’, None)])]>
GROUP BY GROUPING SETS, CUBE (GA)
Users often need to slice and dice their data in multiple dimensions. Currently you have to rely on repeated UNION ALL/CROSS Join to group your data, which can make queries cumbersome and hard to understand. GROUP BY GROUPING SETS allow you to group different dimensions in a single statement. For example, the following query give you the sum of amount grouped by different combinations:
-
Date: total sales per day
-
Region: total sales per region
-
Product: total sales per product
- code_block
- <ListValue: [StructValue([(‘code’, ‘SELECT Date, Region, Product, SUM(Amount) AS Total_AmountrnFROM salesrnGROUP BY GROUPING SETS ( (Date), (Region), (Product), () )rnORDER BY Date, Region, Product;’), (‘language’, ‘lang-sql’), (‘caption’, <wagtail.rich_text.RichText object at 0x3e0675914fa0>)])]>
GROUP BY CUBE(x, y) is a shorthand syntax for GROUP BY GROUPING SETS ((x,y), x, y, ()), so you can GROUP BY all combinations of different dimensions.
GROUP BY STRUCT, ARRAY (GA)
STRUCT and ARRAY are among the most commonly used data types in BigQuery today. Working with STRUCT and ARRAY data in BigQuery just got easier! You can now use GROUP BY and SELECT DISTINCT for these semi-structured data types directly. This means no more time-consuming workarounds like converting STRUCT/ARRAY to JSON strings. This will simplify your queries and boost performance, making complex analysis more efficient (documentation).
GROUP BY ALL (GA)
GROUP BY ALL deduces non-aggregated columns from the SELECT clause, eliminating the need to list the same columns twice in SELECT and GROUP BY. It’s often used in queries with many dimensions and few aggregations. Listing all columns twice would be long and tedious. For example, you can group Chicago taxi trips by company, payment_type and taxi_id with a simple GROUP BY ALL query.
- code_block
- <ListValue: [StructValue([(‘code’, ‘SELECT company, payment_type, tax_id, SUM(trip_total)rnFROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`rnGROUP BY ALL’), (‘language’, ‘lang-sql’), (‘caption’, <wagtail.rich_text.RichText object at 0x3e0675914a00>)])]>
User-defined aggregate functions (Javascript/SQL UDAF)
User-defined aggregate functions, or UDAFs, let you define custom aggregations once, and reuse them across projects and teams. You no longer need to write repeated logic again and again to unlock advanced functionality like weighted average, merging JSON data, or even simulating geospatial functions.
Javascript UDAF (GA)
JavaScript user-defined aggregate functions (JS UDAFs) let you create custom aggregation logic beyond built-in functions, so you can calculate metrics tailored precisely to your needs. For instance, you can craft UDAFs to compute weighted averages, specialized statistics, or even construct data sketches. Here’s an example of a JavaScript UDAF that simulates the mode() function, returning the most frequent value within a group.
- code_block
- <ListValue: [StructValue([(‘code’, ‘REATE OR REPLACE AGGREGATE FUNCTION udaf.mode(x INT64)rnRETURNS STRUCT<itemSk INT64, itemCount INT64>rnLANGUAGE js ASrn”””rn export function initialState() {rn return {frequencyMap: new Map()};rn }rn export function aggregate(state, value) {rn var frequencyMap = state.frequencyMap;rn if (frequencyMap.has(value)) {rn frequencyMap.set(value, frequencyMap.get(value) + 1);rn } else {rn frequencyMap.set(value, 1);rn }rn }rn export function merge(state, partial_state) {rn var frequencyMap = state.frequencyMap;rn for (let [key, count] of partial_state.frequencyMap) {rn if (frequencyMap.has(key)) {rn frequencyMap.set(key, frequencyMap.get(key) + count);rn } else {rn frequencyMap.set(key, count);rn }rn }rn }rn export function finalize(state) {rn var maxCount = 0;rn var maxKey = 0;rn for (let [key, count] of state.frequencyMap) {rn if (count > maxCount) {rn maxKey = key;rn maxCount = count;rn }rn }rn return {itemSk: maxKey, itemCount: maxCount};rn }rn”””;’), (‘language’, ‘lang-py’), (‘caption’, <wagtail.rich_text.RichText object at 0x3e0675914700>)])]>
SQL UDAF (GA)
SQL user-defined aggregate functions (SQL UDAFs) allow users to encapsulate complex aggregate expressions into a UDF, for composability and reusability without having to write the same code again and again. For example, you can wrap multiple aggregate function calls into a UDAF through a struct constructor.
- code_block
- <ListValue: [StructValue([(‘code’, ‘CREATE OR REPLACE FUNCTION percentiles_struct(column_name)rnRETURNS STRUCTrnASrn(rn COUNT(column) AS n_samples,rn APPROX_QUANTILES (column_name, 100)[offset(50)] AS percentile_50,rn APPROX_QUANTILES(column_name, 100)[offset(95)] AS percentile_95rn)’), (‘language’, ‘lang-sql’), (‘caption’, <wagtail.rich_text.RichText object at 0x3e0675914820>)])]>
Approximate aggregate functions
Companies across ad-tech, retail, fin-tech usually have massive amounts of multi-dimensional data. However, extracting data insights from millions or billions of rows of user-behavior data can be very expensive and time-consuming. Many companies are willing to calculate approximate aggregated results within defined error bounds, to get faster responses. Sketches enable approximate estimates of distinct counts, quantiles, histograms, and other statistical measures — all with minimal memory and computational overhead, and with a single pass through the data at scale.
KLL quantile functions (Preview)
BigQuery supports quantile calculations using native KLL quantile functions. For example, you could estimate the median trip duration and trip distance for all taxi rides on a given day or month.
First, create daily KLL quantile sketches over trip_seconds and trip_miles from the `chicago_taxi_trips` data table.
- code_block
- <ListValue: [StructValue([(‘code’, ‘CREATE TABLE sketch_table ASrnSELECT rn DATE(trip_start_timestamp, “UTC”) AS day,rn KLL_QUANTILES.INIT_INT64(trip_seconds) AS trip_seconds_sktech,rn KLL_QUANTILES.INIT_FLOAT64(trip_miles) AS trip_miles_sktech,rnFROM bigquery-public-data.chicago_taxi_trips.taxi_tripsrnGROUP BY day;’), (‘language’, ‘lang-sql’), (‘caption’, <wagtail.rich_text.RichText object at 0x3e0675914af0>)])]>
Then, with EXTRACT_POINT_INT64/FLOAT64 functions, you can get the median trip_seconds and trip_miles for all trips on a specific day.
- code_block
- <ListValue: [StructValue([(‘code’, “SELECT rn KLL_QUANTILES.EXTRACT_POINT_INT64(trip_seconds_sktech, 0.5) AS trip_seconds_median,rn KLL_QUANTILES.EXTRACT_POINT_FLOAT64(trip_miles_sktech, 0.5) AS trip_miles_medianrnFROM sketch_tablernWHERE day = ‘2023-12-10’;”), (‘language’, ‘lang-sql’), (‘caption’, <wagtail.rich_text.RichText object at 0x3e0675914d00>)])]>
You can also leverage the MERGE_POINT_INT64/FLOAT64 functions to estimate the median trip_seconds and trip_miles for all trips in a month. The query merges daily sketches at first, and then calculates quantile values over a month.
- code_block
- <ListValue: [StructValue([(‘code’, “SELECT rn KLL_QUANTILES.MERGE_POINT_INT64(trip_seconds_sktech, 0.5) AS trip_seconds_median,rn KLL_QUANTILES.MERGE_POINT_FLOAT64(trip_miles_sktech, 0.5) AS trip_miles_medianrnFROM sketch_tablernWHERE day >= ‘2023-12-01’ AND day <= ‘2023-12-31’;”), (‘language’, ‘lang-sql’), (‘caption’, <wagtail.rich_text.RichText object at 0x3e06759148b0>)])]>
Apache DataSketches (GA)
Besides native support for sketches, BigQuery also supports open-source Apache DataSketches, a high-performance library of stochastic streaming algorithms. Initially developed by Yahoo, you can directly use them though the public UDF bigquery-utils repo powered by JS UDAFs. Here are some examples, just name a few (see more details in this blog post):
-
Theta Sketch: designed for cardinality estimation and set operations (union, intersection, difference)
-
KLL Sketch: designed for quantile estimation
-
Tuple Sketch: an extension of the Theta Sketch that supports associating values with the estimated unique items
The following examples use Theta Sketch over a public dataset, Chicago taxi trips, to demonstrate the power of the UDAF sketches library.
First, create a daily Theta Sketch to estimate the unique number of taxis running per day, using theta_sketch_agg_string.
- code_block
- <ListValue: [StructValue([(‘code’, ‘CREATE OR REPLACE TEMP TABLE sketch_table ASrnSELECT rn DATE(trip_start_timestamp, “UTC”) AS day,rn bqutil.datasketches.theta_sketch_agg_string(taxi_id) AS taxi_id_sktech,rnFROM bigquery-public-data.chicago_taxi_trips.taxi_tripsrnGROUP BY day;’), (‘language’, ‘lang-sql’), (‘caption’, <wagtail.rich_text.RichText object at 0x3e0675914940>)])]>
You can estimate the unique number of taxis that run on both days, using theta_sketch_intersection.
- code_block
- <ListValue: [StructValue([(‘code’, ‘SELECT bqutil.datasketches.theta_sketch_get_estimate(rn bqutil.datasketches.theta_sketch_intersection(rn (SELECT taxi_id_sketch FROM sketch_table WHERE day = “2023-12-10”),rn (SELECT taxi_id_sketch FROM sketch_table WHERE day = “2023-12-11”)rn )rn);’), (‘language’, ‘lang-sql’), (‘caption’, <wagtail.rich_text.RichText object at 0x3e0675914a30>)])]>
You can estimate the unique number of taxis that run on either of these two days, using theta_sketch_union.
- code_block
- <ListValue: [StructValue([(‘code’, ‘SELECT bqutil.datasketches.theta_sketch_get_estimate(rn bqutil.datasketches.theta_sketch_union(rn (SELECT taxi_id_sketch FROM sketch_table WHERE day = “2023-12-10”),rn (SELECT taxi_id_sketch FROM sketch_table WHERE day = “2023-12-11”)rn )rn);’), (‘language’, ‘lang-sql’), (‘caption’, <wagtail.rich_text.RichText object at 0x3e0675914dc0>)])]>
Do more with advanced aggregation
Using BigQuery’s advanced aggregation capabilities allows you to perform more complex analysis efficiently. Grouping sets provide flexibility in calculating aggregations at multiple dimensions. UDAFs empower you to define custom aggregations. Approximate aggregate functions provide speed, scalability and performance when approximate results are acceptable. By leveraging these features, you can unlock deeper insights from your data and help your team make timely business decisions. We can’t wait to hear your use cases and how you plan to use them in your day to day analysis!
Read More for the details.