GCP – BigQuery vector search now GA, setting the stage for a new class of AI-powered analytics
Artificial intelligence (AI) has given organizations brand new ways to represent data so that it can be analyzed and understood at scale. AI models are trained to interpret the semantic meaning of words and other data by encoding them as vector embeddings. These embeddings represent the relative position in the vector space of the data that is encoded. Semantically similar embeddings are proximate in vector space, whereas semantically dissimilar embeddings are… not proximate. And together, codifying the semantic understanding of data, applying that understanding across all types of data, and giving organizations simple tools to analyze it all, unblocks brand new approaches for data analytics.
That is why we are excited today to announce the general availability (GA) of BigQuery vector search, enabling vector similarity search on BigQuery data. This functionality, also commonly referred to as approximate nearest-neighbor search, is the key to empowering numerous new data and AI use cases such as semantic search, similarity detection, and retrieval-augmented generation (RAG) with a large language model (LLM).
Originally announced in February, BigQuery vector search integrates generation, management, and search of embeddings within the data platform, to provide a serverless and integrated vector-analytics solution for use cases such as anomaly detection, multi-modal search, product recommendations, drug discovery, and more.
In addition, the inverted file index (IVF) index for BigQuery vector search is now also generally available. This index uses a k-means algorithm to cluster the vector data and combines it with an inverted row locator in a two-piece index in order to efficiently search similar embedding representations of your data. IVF contains several new enhancements since announcing in preview:
Improved scalability: You can now index 10 billion embeddings, enabling applications with massive scale.
Managed index with guaranteed correctness: When the underlying data changes, vector indexes are automatically updated using the existing k-means model. Vector search always returns correct results based on the latest mutations of the data, even before the system has finished re-indexing the modified data.
Stored columns: You can now store frequently used columns in the index to avoid expensive joins when retrieving additional data in the search result. This optimization yields the most noticeable performance improvements in scenarios with high result-set cardinality, for example, when your query data contains a large batch of embeddings, or when you need a high top_k. For example, for a table with 1 billion 96-dimensional embeddings, returning the 1,000 most similar candidates for an embedding is ~4x faster with ~200x less slots using vector indexes with stored columns than without stored columns.
Pre-filters: Combined with stored columns, vector search results can be pre-filtered by rewriting the base table statement into a query with filters. Compared with post-filtering, where the WHERE clauses are added after the VECTOR_SEARCH() function, pre-filtering improves query performance, enhances search quality, and minimizes the risk of missing results.
We’ve seen customers like Palo Alto Networks use BigQuery vector search to find similar common queries to accelerate time to insight.
“We’re leveraging BigQuery’s vector search in our copilot to suggest relevant query examples, significantly enhancing the customer experience. We really like how easy it was to set up and enable vector search in BigQuery and are impressed with the performance and speed of the similarity searches.” – Sameer Merchant, Senior Distinguished Engineer, Palo Alto Networks
Additionally, prototyping with BigQuery vector search and pushing to production is simple, even for the massive-scale workloads like drug discovery that Vilya has been working on. Further, on-demand pricing and budget assessment tools have made scaling to capacity-based billing models seamless.
“We were really impressed with BigQuery’s ability to scale to the size of biological data needed to search for novel macrocycles to precisely target disease biology. BigQuery’s vector search ease-of-use and scalability enables us to rapidly search through billions of such macrocycles.” – Patrick Salveson, Co-Founder & CTO, Vilya
Building with an example
New to BigQuery vector search? Here’s a real-world example to get you started.
Imagine you want to post a question to an internal Q&A forum, but before you do, you would like to find out if there are any existing questions that are semantically similar to yours. To demonstrate, we assume that we have generated embeddings for the questions and stored them in the <my_posts_questions> table. Once that’s done, you can create a vector index, and store the frequently used columns, e.g., title, content, tags, in the index for better performance.
<ListValue: [StructValue([(‘code’, “CREATE OR REPLACE VECTOR INDEX `<index_name>`rnON `<my_posts_questions>`(embedding)rnSTORING (title, content, tags)rnOPTIONS(distance_type=’COSINE’, index_type=’IVF’)”), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e0b99e9b9d0>)])]>
Although VECTOR_SEARCH() works even if you don’t have a vector index, creating an index often results in better query performance. Once ready, you can use VECTOR_SEARCH() combined with ML.GENERATE_EMBEDDING to search similar questions to “Android app using RSS crashing” for example. To better refine the results, you can use a pre-filter on the “tags” column to restrict the search space.
<ListValue: [StructValue([(‘code’, “SELECT query.query, base.title, base.content, distancernFROM VECTOR_SEARCH(rn ( SELECT * FROM <my_posts_questions> WHERE SEARCH(tags, ‘android’) ), rn ’embedding’,rn (rn SELECT ml_generate_embedding_result AS embedding, content AS queryrn FROM ML.GENERATE_EMBEDDING(rn MODEL <LLM_embedding_model>,rn (SELECT ‘Android app using RSS crashing’ AS content)rn )rn ))”), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e0b99e9b3a0>)])]>
We also recently announced a new index type based on Google-developed ScaNN in preview that can further improve search performance even more. As BigQuery vector search evolves, it becomes a key component in a multi-modal Retrieval Augmentation Generation (RAG) solution that uses state-of-the-art Gemini models on top of a full-fledged BigQuery knowledge base comprised of structured, unstructured, and multimodal data.
Get started today
The combination of vector embeddings and machine learning stands to revolutionize what you can do with the data stored in your BigQuery enterprise data warehouses — starting with searching on those embeddings quickly and cost-effectively. To get started with BigQuery vector search, take a look at the following resources:
Getting started with vector search
Learn more about vector indexes, prefilters, and stored columns
Product Recommendations Example
Generate and search multi-modal embeddings documentation
Read More for the details.