GCP – Achieve higher performance and lower query cost for BigQuery integer or timestamp lookups
The first incarnation of search indexes in BigQuery focused on fast and efficient lookups on STRING data elements, either in standalone STRING scalar columns, or within an ARRAY, STRUCT, or JSON column. Our previous blog posts showcased the orders-of-magnitude performance gains achievable when utilizing indexes with the SEARCH function and other functions and operators.
Today, we are announcing the public preview of numeric search indexes, which enables optimized lookups on INT64 and TIMESTAMP data types. With this change, the EQUAL(=) and IN operations on these data types can utilize search indexes to reduce byte scans for improved performance. So now your lookups for account IDs or transactions IDs or log timestamps can get faster and cheaper.
In this blog, we demonstrate the gains on real data, showcasing index creation and queries on a 100TB log table called log_table that contains Google Cloud Logging data for an internal Google test project.
The base table details are as follows:
The table has the following columns of interest:
jsonPayload: type JSON
This jsonPayload has a leaf field named threadId of type JSON number.
sourceLocation: type RECORD (or STRUCT) with two sub-fields of interest:
file: type STRING, containing the name of the file producing the log entry
line: type INT64, containing the line number in the file where the log entry was produced.
Using search indexes
By default, a search index is created for the STRING data only. If you want to index INT64 or TIMESTAMP, you need to provide them in the index option called data_types. In the following example, all data of type STRING and INT64 in the log_table table will be indexed.
<ListValue: [StructValue([(‘code’, “CREATE SEARCH INDEX index_with_string_int64rnON log_table (ALL COLUMNS)rnOPTIONS (data_types = [‘STRING’, ‘INT64’])”), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e579ab8adf0>)])]>
JSON field search
In this first example, we want to search for log entries that have the thread ID 12104 in the JSON payload.
<ListValue: [StructValue([(‘code’, ‘SELECT timestamp, jsonPayloadrnFROM log_tablernWHERE INT64(jsonPayload.threadId) = 12104;’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e579a180700>)])]>
We compare between having a search index and having no index. Given that log entries with this thread ID are very rare, the results show dramatic improvements on all three metrics:
Metrics
Without Index
With Index
Improvement
Execution Time (ms)
48,790
4,664
10x
Processed Bytes
2,174,758,158,336
774,897,664
2,806x
Slot Usage (ms)
25,735,222
7,300
3,525x
STRUCT nested field search
In the second example, we count how many log entries are produced from a certain line of code (line 813 in the file borg/borgletlib/borgletlib.cc).
<ListValue: [StructValue([(‘code’, “SELECT count(*)rnFROM log_tablernWHERE sourceLocation.file = ‘borg/borgletlib/borgletlib.cc’rn AND sourceLocation.line = 813″), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e579a180b20>)])]>
Note that sourceLocation.file is a STRING field. A search index on only STRING data type can already help improve the query performance as shown below. However, with indexing on also the INT64 data type, the performance can be further improved.
Metrics
No Index
With IndexSTRING-only
With IndexSTRING & INT64
Execution Time (ms)
57,169
11,571
4.9x
7,982
7.1x
Processed Bytes
1,703,843,725,312
976,230,547,456
1.7x
682,560,061,440
2.4x
Slot Usage (ms)
38,947,660
25,595,348
1.5x
8,256,218
4.7x
Do I need indexing if I use partitioning/clustering?
While partitioning and clustering can optimize filtering and lookups, they have certain limitations. For instance, partitioning can only be done on a single column, and clustering allows up to four columns per table. However, clustering is most effective when filtering on the first clustering column, as subsequent columns often provide minimal pruning power. Furthermore, both partitioning and clustering are limited to top-level columns.
Search indexes on INT64/TIMESTAMP complement these BigQuery features by enabling lookup optimizations on any number of columns. In addition, as demonstrated above, they cover struct nested fields, array elements, and JSON leaf fields.
This feature is currently in preview. For more information, refer to Optimize with numeric predicates.
Read More for the details.