GCP – Introducing ScaNN in BigQuery vector search for large query batches
We continue to add more capabilities to BigQuery to make it the AI-ready data platform for the Gemini era. Earlier this year, we introduced vector search, which enables vector similarity search on BigQuery data. Since then, we also added several functionalities such as stored columns and pre-filtering. Already, the scale, performance, and ease-of-use offered through BigQuery’s vector search and AI capabilities is empowering customers to build pipelines and applications, ranging from semantic search to LLM-based retrieval-augmented generation (RAG).
Today, we are announcing the preview of the TreeAH vector index, which brings core pieces from Google’s research and innovation in approximate nearest neighbor algorithms to BigQuery. This new index type uses the same underlying technology that powers some of Google’s most popular services and delivers significant latency and cost reductions in certain situations compared to the first index we implemented in BigQuery, the inverted file index (IVF). How significant? Read on to learn about architectural differences between the two, performance results, and when and how to use TreeAH rather than IVF.
IVF vs. TreeAH indexes: a comparison
Using a vector index allows BigQuery to optimize the lookups and distance computations required to identify closely matching embeddings. Both IVF and TreeAH indexes allow BigQuery to perform approximate nearest neighbor (ANN) search instead of exact nearest neighbor search, which trades some accuracy for lower query latency and cost.
BigQuery’s first vector index, IVF, uses a scalable k-means clustering algorithm to partition the vector data into clusters. When you use the VECTOR_SEARCH function to search the vector data, it finds the clusters that are closest to the query vector and only ranks the vector data from those clusters; this reduces the number of distance calculations by a large factor.
The new TreeAH index is based on Google’s ScaNN algorithm, which is used in a multitude of Google services for similarity search. The main difference with the IVF index is the use of asymmetric hashing (the “AH” in the TreeAH), which uses product quantization to compress embeddings. Coupled with a CPU-optimized distance computation algorithm, vector search using TreeAH can be orders of magnitude faster and more cost-efficient than IVF. Index generation can also be 10x faster and cheaper and have a smaller memory footprint, as only the compressed embeddings are stored.
TreeAH performance
Our engineering team conducted benchmarks across various table configurations and query batch sizes to compare TreeAH with IVF. Here are the results:
Fig 1. Latency and cost for vector search queries
Fig 2. Vector index training latency and cost
Key results:
For small query batches, the IVF index performs as well as TreeAH, and sometimes better, due to the increased overhead introduced by the TreeAH index.
For large query batches, TreeAH significantly outperforms IVF due to its optimized distance-calculation algorithm.
TreeAH index training is also significantly faster and cheaper compared to IVF in most cases.
(*): The query using the IVF index benefited from block pruning optimization.
When to use TreeAH and current limitations
As shown in the benchmarks, the TreeAH index already greatly outperforms the IVF index when the query batch size is large. Index training is also faster and cheaper.
We’re still actively developing the TreeAH index and new features and performance improvements will be added over time. Currently, the following limitations apply:
The base table can have a maximum of 200 million rows.
Stored columns and pre-filtering are not supported for the TreeAH index.
Getting started with TreeAH
The TreeAH index type is in public preview and can be used today:
<ListValue: [StructValue([(‘code’, “CREATE OR REPLACE VECTOR INDEX `<index_name>`rnON `<my_table>`(<embedding_column>)rnOPTIONS(distance_type=’COSINE’, index_type=’TREE_AH’)”), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e79fe196fd0>)])]>
For more details, please refer to the documentation.
We encourage you to try the TreeAH index and experience the performance and cost benefits it can bring to your vector search workloads. Please reach out to us (bq-vector-search@google.com) if you have any questions or need assistance in choosing the best index type for your workload.
Read More for the details.