Understand why your metrics moved with contribution analysis in BigQuery ML, now GA
The key to effective data-driven decision making is quickly processing and extracting insights from large amounts of data. However, doing this efficiently and at scale is a challenge.
Imagine a retail scenario where you’re trying to identify the highest performing promotions by analyzing sales data across products, stores, locations, and a large customer base, in conjunction with other marketing events and discount sales. Pinpointing these highest performing promotions is essential to making targeted decisions and achieving better results. With current tooling, you’d need to manually inspect the data and use a trial and error approach to visualization and querying to uncover interesting insights. This task becomes difficult due to the volume of data and the number of dimensions involved, as this could create a large amount of possible combinations to explore.
BigQuery ML contribution analysis, now generally available, performs this type of analysis at scale, automating insight generation and outputting key change drivers from multidimensional data so you can quickly pinpoint the areas where you need to take action.
Originally announced in September 2024, the preview of contribution analysis generates key drivers for a metric of interest across test and control datasets. With GA, contribution analysis now includes several new features to help you identify the most important insights faster:
1. Automated support tuning with top-k insights by apriori support
In preview, contribution analysis provided an option for apriori support pruning, which removes small data segments to decrease query execution time. For GA, as an alternative to specifying your own apriori support pruning threshold, you can now let the model set the threshold for you: simply provide the number of insights you'd like returned, and the model automatically retrieves the top insights by data segment size, which we refer to as apriori support. Both the `min_apriori_support` and `top_k_insights_by_apriori_support` options reduce query latency compared to returning all possible insights, which can run to millions of rows.
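For example, the GA option can be set directly in the CREATE MODEL statement. The sketch below is a minimal illustration, where `bqml_tutorial.sample_sales_data` and its `sales`, `store_name`, `city`, and `is_test` columns are placeholder names rather than part of the tutorial dataset:

```sql
-- Minimal sketch with placeholder table and column names.
CREATE OR REPLACE MODEL bqml_tutorial.sample_top_k_model
OPTIONS(
  MODEL_TYPE='CONTRIBUTION_ANALYSIS',
  CONTRIBUTION_METRIC='SUM(sales)',
  DIMENSION_ID_COLS=['store_name', 'city'],
  IS_TEST_COL='is_test',
  -- GA alternative: let the model pick the threshold and return only the
  -- insights for the largest data segments.
  TOP_K_INSIGHTS_BY_APRIORI_SUPPORT=20
  -- Preview-style alternative: set your own threshold instead, e.g.
  -- MIN_APRIORI_SUPPORT=0.05
) AS SELECT * FROM bqml_tutorial.sample_sales_data;
```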
2. Improved insight readability with redundant insight pruning
Insight redundancy occurs when multiple related insights share the same output metrics, which is especially prevalent in correlated data. For example, consider the table below with sales data that contains city and store name dimensions. Because all sales from the General Store also occurred in Iowa City, the underlying data for rows 1 and 2 is the same, which leads to identical metric values for both rows.
| contributors | metric_test | metric_control |
|---|---|---|
| [city='Iowa City', store_name='General Store / Iowa City'] | 640047.08 | 214317.46 |
| [store_name='General Store / Iowa City'] | 640047.08 | 214317.46 |
With the new `pruning_method` option, you can choose to prune redundant insights so that the output table contains only unique insights, each described by its most descriptive segment. For the table above, you'd see only the first row, which contains both the city and store name, rather than two rows with the same metrics.
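In the CREATE MODEL statement this is one additional option. Here is a minimal sketch, again with placeholder table and column names:

```sql
-- Minimal sketch with placeholder table and column names.
CREATE OR REPLACE MODEL bqml_tutorial.sample_pruned_model
OPTIONS(
  MODEL_TYPE='CONTRIBUTION_ANALYSIS',
  CONTRIBUTION_METRIC='SUM(sales)',
  DIMENSION_ID_COLS=['city', 'store_name'],
  IS_TEST_COL='is_test',
  -- Collapse insights that describe the same underlying segment, keeping the
  -- most descriptive one (the row listing both city and store_name above).
  PRUNING_METHOD='PRUNE_REDUNDANT_INSIGHTS'
) AS SELECT * FROM bqml_tutorial.sample_sales_data;
```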
3. Expanded metric support with the summable by category metric
In preview, contribution analysis offered two contribution metric types: a summable metric and a summable ratio metric. Specifying a summable metric lets you aggregate a single measure of interest, while the summable ratio metric lets you aggregate a ratio across two measures.
For GA, in addition to the summable and summable ratio metrics, you can now analyze the sum of a metric of interest normalized by the unique values of a categorical variable with the new summable by category metric. This lets you analyze metrics like sales per customer or site visits per day.
The summable by category metric can also help adjust for outliers in your data. Take the example in the table below, where the test set contains an outlier in row 1 with a high number of active minutes on the website product. If you were using the summable metric on active minutes, you'd see that active minutes on the website in the test set were 2.5 times higher than active minutes on the app. But if you adjust for the number of users on each product with the summable by category metric, active minutes per user on the website in the test set were only 1.6 times higher than active minutes per user on the app.
| Active Minutes | User Id | Product | Is Test |
|---|---|---|---|
| 100 | user_a | Website | TRUE |
| 23 | user_b | Website | TRUE |
| 15 | user_c | Website | TRUE |
| 31 | user_b | App | TRUE |
| 24 | user_c | App | TRUE |
| 39 | user_b | Website | FALSE |
| … | … | … | … |
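As a rough illustration of how the two metric types aggregate the test rows above (only a subset of the data is shown), you could compute both by hand with a query like the following, where `bqml_tutorial.active_minutes_data` is a hypothetical table with the columns from the table above:

```sql
SELECT
  product,
  -- Summable metric: total active minutes per product.
  SUM(active_minutes) AS total_minutes,
  -- Summable by category metric: active minutes normalized by distinct users.
  SUM(active_minutes) / COUNT(DISTINCT user_id) AS minutes_per_user
FROM bqml_tutorial.active_minutes_data
WHERE is_test
GROUP BY product;
```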
Summable by category metrics are also helpful when you have different numbers of test and control rows. Say you want to compare revenue data from 2024 to 2025, but you only have five months of data from 2025 compared to twelve months of data from 2024. With the summable by category metric, you can analyze the difference in revenue per month to get a more direct comparison between the two groups.
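For that comparison, the metric could be specified as revenue normalized by a month column, along the lines of the sketch below, where `bqml_tutorial.revenue_data` and its `revenue`, `month`, `region`, and `is_test` columns are hypothetical placeholders:

```sql
-- Minimal sketch with hypothetical table and column names.
CREATE OR REPLACE MODEL bqml_tutorial.revenue_per_month_model
OPTIONS(
  MODEL_TYPE='CONTRIBUTION_ANALYSIS',
  -- Normalize revenue by the number of distinct months in each group, so the
  -- 5-month test set and 12-month control set are compared on equal footing.
  CONTRIBUTION_METRIC='SUM(revenue)/COUNT(DISTINCT month)',
  DIMENSION_ID_COLS=['region'],
  IS_TEST_COL='is_test',
  TOP_K_INSIGHTS_BY_APRIORI_SUPPORT=10
) AS SELECT * FROM bqml_tutorial.revenue_data;
```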
Contribution analysis in action
To get started, let’s walk through a retail sales example with the new model options. Say you want to find the key contributors to why sales dropped between 2020 and 2021 for apparel products on the Google Merchandise Store public e-commerce dataset.
To start, create an input data table that contains the sales metric to aggregate over, the categorical user_id column to normalize by, the dimension columns to slice by, and the boolean is_test value to determine whether the data row is in the test or control subset.
```sql
CREATE OR REPLACE TABLE bqml_tutorial.ecommerce_data AS
(SELECT
  event_value_in_usd AS sales,
  user_pseudo_id AS user_id,
  device.category AS device_category,
  geo.country AS country,
  traffic_source.medium AS site_traffic_source,
  FALSE AS is_test
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE event_name IN ('in_app_purchase', 'purchase')
  AND traffic_source.medium IN ('referral', 'organic', 'cpc', '<Other>')
  AND _TABLE_SUFFIX BETWEEN '20200101' AND '20201231')
UNION ALL
(SELECT
  event_value_in_usd AS sales,
  user_pseudo_id AS user_id,
  device.category AS device_category,
  geo.country AS country,
  traffic_source.medium AS site_traffic_source,
  TRUE AS is_test
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE event_name IN ('in_app_purchase', 'purchase')
  AND traffic_source.medium IN ('referral', 'organic', 'cpc', '<Other>')
  AND _TABLE_SUFFIX BETWEEN '20210101' AND '20211231');
```
Then, create a contribution analysis model with a summable by category metric. Using the `top_k_insights_by_apriori_support` option, you get the 15 insights with the largest apriori support, and using the `prune_redundant_insights` pruning method, you ensure that only unique insights are returned.
```sql
CREATE OR REPLACE MODEL bqml_tutorial.ecommerce_summable_category_model
OPTIONS(
  MODEL_TYPE='CONTRIBUTION_ANALYSIS',
  DIMENSION_ID_COLS=['device_category', 'country', 'site_traffic_source'],
  IS_TEST_COL='is_test',
  CONTRIBUTION_METRIC='SUM(sales)/COUNT(DISTINCT user_id)',
  TOP_K_INSIGHTS_BY_APRIORI_SUPPORT=15,
  PRUNING_METHOD='PRUNE_REDUNDANT_INSIGHTS'
) AS SELECT * FROM bqml_tutorial.ecommerce_data;
```
And finally, you get the insights from the model.
```sql
SELECT
  contributors,
  metric_test,
  metric_control,
  difference,
  relative_difference,
  apriori_support,
  contribution
FROM ML.GET_INSIGHTS(
  MODEL bqml_tutorial.ecommerce_summable_category_model);
```
The output is automatically ordered by the `contribution` column, which is the absolute value of `difference`, in descending order. Drilling down to specific results, in row 1 you can see that sales per user in the United States that were referred from another website decreased from $101.65 in 2020 to $58.37 in 2021, a drop of $43.27 as shown in the `difference` column, or 42.5% as shown in the `relative_difference` column. In row 3, you can see that sales per user in the United States overall decreased by 34.6%, a segment that makes up 47.3% of all data as shown in the `apriori_support` column. This is invaluable information to help inform your business strategy!
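Because `ML.GET_INSIGHTS` returns an ordinary result set, you can also slice it further with standard SQL, for example to focus on the larger segments. The threshold below is arbitrary and assumes `apriori_support` is reported as a fraction:

```sql
SELECT
  contributors,
  relative_difference,
  apriori_support
FROM ML.GET_INSIGHTS(
  MODEL bqml_tutorial.ecommerce_summable_category_model)
WHERE apriori_support > 0.1   -- keep segments covering more than 10% of the data
ORDER BY ABS(relative_difference) DESC;
```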
Get started today
To give contribution analysis a try on your own data, please see the tutorial and documentation.