GCP – Do you detect a pattern? BigQuery’s new MATCH_RECOGNIZE function can!
Identifying patterns and sequences within your data is crucial for gaining deeper insights. Whether you’re tracking user behavior, analyzing financial transactions, or monitoring sensor data, the ability to recognize specific sequences of events can unlock a wealth of information and actionable insights.
Imagine you’re a marketer at an e-commerce company trying to identify your most valuable customers by their purchasing trajectory. You know that customers who start with small orders and progress to mid-range purchases will usually end up becoming high-value purchasers and your most loyal segment. Having to figure out the complex SQL to aggregate and join this data could be quite the challenging task.
That’s why we’re excited to introduce MATCH_RECOGNIZE, a new feature in BigQuery that allows you to perform complex pattern matching on your data directly within your SQL queries!
What is MATCH_RECOGNIZE?
At its core, MATCH_RECOGNIZE is a tool built directly into GoogleSQL for identifying sequences of rows that match a specified pattern. It’s similar to using regular expressions, but instead of matching patterns in a string of text, you’re matching patterns in a sequence of rows within your tables. This capability is especially powerful for analyzing time-series data or any dataset where the order of rows is important.
With MATCH_RECOGNIZE, you can express complex patterns and define custom logic to analyze them, all within a single SQL clause. This reduces the need for cumbersome self-joins or complex procedural logic. It also lessens your reliance on Python to process data and will look familiar to users who have experience with Teradata’s nPath or other external MATCH_RECOGNIZE workloads (like Snowflake, Azure, Flink, etc.).
How it works
The MATCH_RECOGNIZE clause is highly structured and consists of several key components that work together to define your pattern-matching logic:
-
PARTITION BY: This clause divides your data into independent partitions, allowing you to perform pattern matching within each partition separately.
-
ORDER BY: Within each partition, ORDER BY sorts the rows to establish the sequence in which the pattern will be evaluated.
-
MEASURES: Here, you can define the columns that will be included in the output, often using aggregate functions to summarize the matched data.
-
PATTERN: This is the heart of the MATCH_RECOGNIZE clause, where you define the sequence of symbols that constitutes a match. You can use quantifiers like *, +, ?, and more to specify the number of occurrences for each symbol.
-
DEFINE: In this clause, you define the conditions that a row must meet to be classified as a particular symbol in your pattern.
Let’s look at a simple example. From our fictional scenario above, imagine you have a table of sales data, and as a marketing analyst, you want to identify customer purchase patterns where their spending starts low, increases to a mid-range, and then reaches a high level. With MATCH_RECOGNIZE, you could write a query like this:
- code_block
- <ListValue: [StructValue([(‘code’, ‘SELECT *rnFROMrn Example_Project.Example_Dataset.SalesrnMATCH_RECOGNIZE (rn PARTITION BY customerrn ORDER BY sale_datern MEASURESrn MATCH_NUMBER() AS match_number,rn ARRAY_AGG(STRUCT(MATCH_ROW_NUMBER() AS row, CLASSIFIER() AS symbol, rn product_category)) AS salesrn PATTERN (low+ mid+ high+)rn DEFINErn low AS amount < 50,rn mid AS amount BETWEEN 50 AND 100,rn high AS amount > 100rn);’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x7f692c19cb50>)])]>
In this example, we’re partitioning the data by customer and ordering it by sale_date. The PATTERN clause specifies that we’re looking for one or more “low” sales events, followed by one or more “mid” sales events, followed by one or more “high” sales events. The DEFINE clause then specifies the conditions for a sale to be considered “low”, “mid”, or “high”. The MEASURES clause decides how to summarize each match; here with match_number we are indexing each match starting from 1 and creating a ‘sales’ array that will track every match in order.
Below are example matched customers:
|
customer |
match_number |
sales.row |
sales.symbol |
sales.product_category |
|
|
|
|
|
|
|
|
|
|
||
|
|
|
|
||
|
|
|
|
||
|
|
|
|
||
|
|
|
|
|
|
|
|
|
|
||
|
|
|
|
This data highlights some sales trends and could offer insights for a market analyst to strategize conversion of lower-spending customers to higher-value sales based on these trends.
Use cases for MATCH_RECOGNIZE
The possibilities with MATCH_RECOGNIZE are vast. Here are just a few examples of how you can use this powerful feature:
-
Funnel analysis: Track user journeys on your website or app to identify common paths and drop-off points. For example, you could define a pattern for a successful conversion funnel (e.g., view_product -> add_to_cart -> purchase) and analyze how many users complete it.
-
Fraud detection: Identify suspicious patterns of transactions that might indicate fraudulent activity. For example, you could look for a pattern of multiple small transactions followed by a large one from a new account.
-
Financial analysis: Analyze stock market data to identify trends and patterns, such as a “W” or “V” shaped recovery.
-
Log analysis: Sift through application logs to find specific sequences of events that might indicate an error or a security threat.
-
Churn analysis: Identify patterns in your data that lead to customer churn and find actionable insights to reduce churn and improve customer sentiment.
-
Network monitoring: Identify a series of failed login attempts to track issues or potential threats.
-
Supply chain monitoring: Flag delays in a sequence of shipment events.
-
Sports analytics: Identify streaks or changes in output for different players / teams over games, such as winning or losing streaks, changes in starting lineups, etc.
Get started today
Ready to start using MATCH_RECOGNIZE in your own queries? The feature is now available to all BigQuery users! To learn more and dive deeper into the syntax and advanced capabilities, check out the official documentation and tutorial available on Colab, BigQuery, and GitHub.
MATCH_RECOGNIZE opens up a whole new world of possibilities for sequential analysis in BigQuery, and we can’t wait to see how you’ll use it to unlock deeper insights from your data.
Read More for the details.
