GCP – Speeding up small queries in BigQuery with BI Engine
A quick and easy way to speed up small queries in BigQuery (such as to populate interactive applications or dashboards) is to use BI Engine. The New York Times, for example,uses the SQL interface to BI Engine to speed up their Data Reporting Engine.
The queries
To Illustrate, I’ll use three representative queries on tables between 100 MB and 3 GB — tables that are typically considered smallish by BigQuery standards. Because BigQuery is a columnar database, I’m reporting only the size of the columns that the query needs to read.
The first involves processing about 400 MB data to find the average cost of a Medicare claim:
The second processes a larger table (3GB) to find the average tip for a Chicago taxi based on payment type:
The third processes a smaller table (100 MB) to find the most polluted sites:
Measuring query performance
To measure the query performance, I’ll run any given query 5 times, taking care to turn off the cache:
The code prints out the total resources consumed (slot-milliseconds) and the total time taken by the query on the server.
If you are on a flat-rate pricing (most enterprise customers are), the slot-milliseconds reflects how long your slots are getting used. So, this is a reflection of the cost of the query to you.
The time elapsed is the time spent in computing the results. Note that the server_time
is the time taken to process the request (I don’t measure the network roundtrip time because it’s going to be the same whether or not you use BI Engine).
I first ran the three queries without BI Engine (see my notebook in GitHub).
Turning on BI Engine
I then went to the BigQuery web console and created a 10 GB BI Engine reservation (monthly cost: $300):
Why did I pick 10 GB?
One of the key ways that BI Engine speeds up queries on small tables is that it caches the tables in memory. It also does other speedups, but this is the main one. So, you want to provide it enough memory to comfortably hold the tables you will be querying. BI Engine will automatically manage the memory for you.
In my case, the Chicago taxicab query involves 3GB, so I used 10 GB. For the other two queries, 1 GB would have been enough. You can go up to 100 GB, something I’d recommend if you have many concurrent queries on small tables. You might be able to get away with less because BI Engine stores data compressed, but it’s better to provide some extra room.
You can turn on BI Engine from a script using the Reservations API. Note, however, that it takes a few minutes for the memory to become available, so this is something you should consider doing for a few hours at least, not on a per-query basis.
Turning on BI Engine
Here’s how the 3 queries compared with and without BI Engine:
As you can see, I got cost improvements of 8x to 20x and timing improvements of about 2x.
Note that some BI Engine capacity is included at no extra cost when you purchase a flat rate reservation–for example, you get 50 GB of BI Engine included for free with a 1000-slot annual commitment. At $30-$300/month, additional BI Engine capacity is a very cost-effective way to make your BigQuery reservations go further, and get a speedup too.
Enjoy!
Next steps
- Try it out.My code is on GitHub.
- For more about BI Engine and how to enroll for the preview, see this blog post.
- You can get the full speedup in your applications provided you are using the
query()
method in the BigQuery Client API [theinsertJob()
methods are also accelerated, but not as much]. So, check your code. - To learn more about BigQuery,read my book.
Note: The SQL interface to BI Engine is in preview at the time of writing (April 2021). Use this enrollment form to turn this on in your account. Thanks to Mosha Pasumansky and the BI Engine team for helpful discussions.
Read More for the details.