GCP – Better together: BigQuery and Spanner expand operational insights with external datasets
Data analysts have traditionally struggled to analyze data across different databases. Because of data silos, they need to copy data from transactional databases into analytical data stores using ETL processes. BigQuery made the problem a little easier to solve if you have data in Spanner and BigQuery. Using the EXTERNAL_QUERY table-valued function (TVF), you could issue federated queries to encapsulate your Spanner query and combine the results set with BigQuery using a TVF. This approach, while functional, introduced complexity such as requiring the analyst to craft complex SQL when joining data across both sources, and had limitations, such as limited query monitoring and query tuning insights.
Today, we’re excited to announce a major leap forward through the public preview of BigQuery external datasets for Spanner. This productivity-boosting feature lets you connect Spanner schema to BigQuery datasets, thus allowing data analysts to browse, explore and query Spanner tables as if they were native BigQuery tables. You can build analytics pipelines and dashboards with familiar GoogleSQL, combining BigQuery and Spanner tables without extra data movement or complex ETL processes.
Operational insights with Spanner external datasets
Spanner external datasets make it easy to gather operational insights that you could not without having to move data before.
Operational dashboards: A service provider maintains real-time transaction data in Spanner with historical analytics in BigQuery. This lets them create comprehensive real-time dashboards that help frontline staff perform day-to-day service tasks while getting direct insights into the critical business metrics measuring the organization’s efficiency.
Customer 360: A retail organization provides a holistic view of premier customers to call center staff by joining in-store transaction data with rich analytical insights on customer loyalty, as found in purchase history in their data lake.
Threat intelligence: Security Operations (SecOps) staff in information security organizations need to analyze real-time streaming data coming into their operations data store, while leveraging AI models trained on long-term data maintained in their analytical data store. SecOps personnel need to be able to use familiar SQL to query real-time and historical data through a single interface to compare incoming threats with previously established threat patterns.
Attain, a leading commerce data SaaS provider, is an early user of BigQuery external datasets, and reports that it has boosted productivity for its data analysts.
“We are huge fans of BigQuery and Spanner at Attain and are thrilled with the launch of BigQuery external datasets support for Spanner. This is a huge productivity and usability boost for our data analysts, who can now combine real-time transaction data in Spanner with analytical information in BigQuery with zero ETL required. The ability to access Spanner tables directly from BigQuery empowers our marketing team to make faster, data-driven decisions. This seamless integration between Spanner and BigQuery significantly enhances our marketers abilities to understand consumer behavior and drive faster and better business outcomes for their customers.” – Dmitri Rabinowitz, Director of Engineering, Attain, a leading commerce data SaaS provider
Benefits of Spanner external datasets
Together, Spanner and BigQuery provide data analysts looking for operational insights on their transactions and analytical data with the following benefits:
Simplified query writing: Work directly with data in Spanner as if it were already in BigQuery, eliminating the need for cumbersome federated queries.
Unified transaction analytics: Build integrated dashboards and reports by joining data between Spanner and BigQuery.
Real-time insights: BigQuery continually queries Spanner for the latest data, providing consistent, up-to-date insights without complex synchronization processes and without impacting production Spanner workloads.
Low-latency performance: BigQuery leverages parallelism and Spanner Data Boost capabilities to accelerate queries against Spanner, delivering results faster.
How it works
Imagine you have a Spanner database containing fresh ecommerce transactions and you want to use it in your BigQuery queries.
In BigQuery, you store all historical transactions and your analytical dashboards are built on top of them. However, sometimes you also need to see the joined view of historical and very fresh transactions. That’s when you can create an external dataset in BigQuery that reflects your Spanner database.
Let’s say that in Spanner you have a project named “myproject” with an instance “myinstance” and a database “ecommerce”, where you store current transactions from your ecommerce site. You can create an external dataset in BigQuery just like any regular dataset, only with an additional option “Link to an external database”:
Alternatively, you can use the following command line command:
<ListValue: [StructValue([(‘code’, ‘bq mk –external_source google-cloudspanner:/projects/myproject/instances/myinstance/databases/ecommerce current_orders’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e7814145fa0>)])]>
Or an SQL command:
<ListValue: [StructValue([(‘code’, ‘CREATE EXTERNAL SCHEMA current_ordersrnOPTIONS (rn external_source = “google-cloudspanner:/projects/myproject/instances/myinstance/databases/ecommerce”rn location = “us-central1″rn);’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e78141454f0>)])]>
Browse a Spanner external dataset
You can also see a selected Spanner database as an external dataset in the BigQuery Studio in the Google Cloud console. Select this dataset to expand it and you will see all your Spanner tables:
Sample queries
Now you can execute any query you want against tables from your external dataset (from your Spanner database in fact).
For example, let’s find today’s transactions with customer segments that are calculated and stored in BigQuery:
<ListValue: [StructValue([(‘code’, ‘SELECT o.id, o.customer_id, o.total_value, s.segment_namernFROM rn current_transactions.ecommerce_order orn left join crm_dataset.customer_segments s on o.customer_id=s.customer_idrnWHERErn o.order_date = ‘2024-09-01’’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e7814145dc0>)])]>
Notice that the crm_dataset is a regular BigQuery dataset and current_transactions is an external dataset that points to a Spanner database.
Another example is a unified view for all of a customer’s transactions, current and historical:
<ListValue: [StructValue([(‘code’, ‘SELECT id, customer_id, total_valuernFROM rn current_transactions.ecommerce_order orn union transactions_history th’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e7814145ee0>)])]>
And again, current_transactions is an external dataset and transactions_history lives in BigQuery.
Notice, that data is fetched live from Spanner and you don’t have to copy them manually with any ETL processes!
Once the query is over, you can examine the query plan. If you go to the EXECUTION GRAPH tab you should be able to see how the ecommerce_order table was used in a query and how many records were read from a Spanner database.
As you can see, tables under this Spanner external dataset are treated similar to any other table. You can query them in any way you want. This also includes aggregations or joins with BigQuery tables or even with the results of EXTERNAL_QUERY TVF.
Try it out by logging into BigQuery Studio and adding a new external dataset from Spanner!
Read More for the details.