GCP – BigQuery under the hood: How Google brought embeddings to analytics
Embeddings are a crucial component at the intersection of data and AI. As data structures, they encode the inherent meaning of the data they represent, and their significance becomes apparent when they are compared to one another. Vector search is a technique that uncovers the relative meaning of those embeddings by evaluating the distances between them within a shared space.
In early 2024, we launched vector search in the BigQuery data platform, making its powerful capabilities accessible to all BigQuery users. This effectively eliminated the need for specialized databases or complex AI workflows. Our ongoing efforts to democratize vector search have resulted in a unique approach that provides the scale, simplicity, and cost performance that BigQuery users expect. In this article, we reflect on the past two years, sharing insights gained from product development and customer interactions.
In the before-times: Building vector search the hard way
Before we added native support for vector search in BigQuery, building a scalable vector search solution was a complex, multi-step process. Data professionals had to:
- Extract data from their data warehouse
- Generate embeddings using specialized machine learning infrastructure
- Load the embeddings into a dedicated vector database
- Maintain this additional infrastructure, including server provisioning, scaling, and index management
- Develop custom pipelines to join vector search results back to their core business data
- Deal with downtime during index rebuilds, a critical pain point for production systems
This disjointed, expensive, and high-maintenance architecture was a barrier to entry for many teams.
In the beginning: Focus on simplicity
We kicked off BigQuery vector search with one goal: to make the simplest vector database on the market. We built it to meet some core design requirements:
- It needs to be fully serverless: We knew early on that the best way to bring vector search to all BigQuery customers was to make it serverless. We first built the IVF index, combining the best of clustering and indexing, all within BigQuery. As a result, you don't need to provision any servers to use vector search in BigQuery, and you don't have to manage any underlying infrastructure for your vector database, freeing up your team to focus on what matters most: your data. BigQuery handles scaling, maintenance, and reliability automatically, and it can scale to billions of embeddings, so your solution can grow with your business.
- Index maintenance should be as simple as possible: BigQuery's vector indexes are a key part of this simplicity. You create an index with a simple CREATE VECTOR INDEX SQL statement, and BigQuery handles the rest. As new data is ingested, the index automatically and asynchronously refreshes to reflect the changes. And if the ingested data shifts the dataset's data distribution and, in turn, degrades search accuracy, it's no problem: you can use the Model Rebuild feature to completely rebuild your index, without any index downtime, with just one SQL statement.
- It should be integrated with GoogleSQL and Python: You can perform vector searches directly within your existing SQL workflows using a simple VECTOR_SEARCH function, which makes it easy to combine semantic search with traditional queries and joins (see the sketch after this list). For data scientists, the integration with Python and tools like LangChain and BigQuery DataFrames makes it a natural fit for building advanced machine learning applications.
- Consistency needs to be guaranteed: New data is searchable via the VECTOR_SEARCH function immediately after ingestion, ensuring accurate and consistent search results.
- You only pay for what you use: The BigQuery vector search pricing model is designed for flexibility. This "pay as you go" model is great for both ad-hoc analyses and highly price-performant batch queries, and it makes the feature easy to try without a significant upfront investment.
- Security is a given: BigQuery's security infrastructure offers robust data-access control through row-level security (RLS) and column-level security (CLS). This multi-layered approach ensures that users can only access authorized data, bolstering protection and compliance.
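To make this concrete, here is a minimal sketch of the end-to-end flow in GoogleSQL. The dataset, table, and column names (my_dataset.products, embedding, query_embeddings) are hypothetical placeholders, not part of the product:

```sql
-- Create a serverless IVF vector index; BigQuery builds and refreshes it
-- asynchronously in the background, with no servers to manage.
CREATE VECTOR INDEX products_idx
ON my_dataset.products(embedding)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE');

-- Find the 10 most similar products for a single query embedding.
SELECT base.product_id, distance
FROM VECTOR_SEARCH(
  TABLE my_dataset.products,
  'embedding',
  (SELECT embedding FROM my_dataset.query_embeddings LIMIT 1),
  top_k => 10,
  distance_type => 'COSINE'
);
```

Note that the same VECTOR_SEARCH call works with or without an index; the index simply makes the search faster and cheaper at scale.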
The early days: Growing with our customers
As customers found success with early projects and moved more data into BigQuery, they told us about many data science workflows that they were “updating” to use new embedding-based approaches. Here are a few examples of the various applications that vector search can enhance:
- LLM applications with retrieval-augmented generation (RAG): By providing relevant business data as context, vector search helps ensure accurate and grounded responses from large language models (see the sketch after this list).
- Semantic search on business data: Enable powerful, natural-language search capabilities for both internal and external users. For instance, a marketing team could search for "customers who have a similar purchasing history to Jane" and receive a list of semantically similar customer profiles.
- Customer 360 and deduplication: Use embeddings to identify similar customer records, even if details like names or addresses differ slightly. This is an effective way to cleanse and consolidate data for a more accurate, single view of your customer.
- Log analytics and anomaly detection: Ingest log data as embeddings and use vector search to quickly find similar log entries, even if the exact text doesn't match. This helps security teams identify potential threats and anomalies much faster.
- Enhance product recommendations: Suggest visually or textually similar items (e.g., clothing) or semantically related complementary products.
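As a sketch of the RAG retrieval step, assuming a remote embedding model has been registered through BigQuery ML as my_dataset.embedding_model (a hypothetical name, as are the table and columns), a single query can embed the question and fetch grounding documents:

```sql
-- Embed the user's question, then retrieve the closest support documents
-- to ground the LLM's answer.
SELECT base.doc_id, base.doc_text, distance
FROM VECTOR_SEARCH(
  TABLE my_dataset.support_docs,
  'embedding',
  (
    SELECT ml_generate_embedding_result AS embedding
    FROM ML.GENERATE_EMBEDDING(
      MODEL my_dataset.embedding_model,  -- hypothetical remote model
      (SELECT 'How do I reset my router?' AS content)
    )
  ),
  top_k => 5
);
```

The retrieved rows can then be passed as context to an LLM, for example via ML.GENERATE_TEXT or from application code.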
Where we are now: Improving scale and cost performance
As customer usage grew, we enhanced our offering, observing significant demand for batch processing beyond RAG and generative AI workloads. Unlike traditional vector databases, batch vector search in BigQuery excels at high-throughput, analytical similarity searches on massive datasets. This lets data scientists analyze billions of records at once within their existing data environment, enabling previously prohibitive tasks such as the following (see the sketch after this list):
- Large-scale clustering: Grouping every customer in a database based on their behavioral embeddings
- Comprehensive anomaly detection: Finding the most unusual transaction for every single account in a financial ledger
- Bulk item categorization: Classifying millions of text documents or product images simultaneously
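For example, a batch similarity search can pass an entire table as the query set rather than a single vector. A minimal sketch, with hypothetical table and column names:

```sql
-- Batch similarity search: for every customer, find the 5 nearest
-- neighbors by behavioral embedding in one set-based query.
SELECT
  query.customer_id,
  base.customer_id AS similar_customer_id,
  distance
FROM VECTOR_SEARCH(
  TABLE my_dataset.customer_embeddings,  -- base table, potentially billions of rows
  'embedding',
  TABLE my_dataset.customer_embeddings,  -- the whole table as the query set
  top_k => 5,
  distance_type => 'COSINE'
);
```

When a table is searched against itself like this, each row's nearest match is itself at distance zero, so you would typically filter out self-matches downstream.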
In the second phase of development, we launched many new features to further improve the vector search experience:
- TreeAH, built using the ScaNN index, provides significant product differentiation in price/performance. Our customers' data science teams were moving more of their recommendation, clustering, and data pipelines to use vector search, and we saw great improvements using TreeAH (a SQL sketch covering TreeAH, stored columns, and index rebuilds follows this list).
- Various internal improvements to increase training and indexing performance and usability. For example, we added asynchronous index training, which improves usability and scalability by moving massive index training jobs into the background. We also performed various internal optimizations to improve indexing performance and reduce indexing latency, without incurring additional costs for users.
- Stored columns to help improve vector search performance:
  - Users can apply prefilters on the stored columns in the vector search query to greatly optimize search performance without sacrificing search accuracy.
  - If users only query stored columns in the vector search query, search performance can be further improved by avoiding expensive joins with the base table.
- Partitioned indexes to dramatically reduce I/O costs and accelerate query performance by skipping irrelevant partitions. This is especially powerful for customers who frequently filter on partitioning columns, such as a date or region.
- Index model rebuilds to help ensure that vector search results remain accurate and relevant over time. As your base data evolves, you can now proactively correct for model drift, maintaining the high performance of your vector search applications without index downtime.
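Here is a rough sketch of how these features fit together in SQL, using hypothetical table and column names; exact syntax details may vary by release:

```sql
-- TreeAH index (built on ScaNN), with stored columns for prefiltering.
CREATE VECTOR INDEX products_idx
ON my_dataset.products(embedding)
STORING (category, price)
OPTIONS (index_type = 'TREE_AH', distance_type = 'COSINE');

-- Prefilter on a stored column: the filter is applied during the index
-- scan, and selecting only stored columns avoids joining the base table.
SELECT base.category, base.price, distance
FROM VECTOR_SEARCH(
  (SELECT * FROM my_dataset.products WHERE category = 'outdoor'),
  'embedding',
  (SELECT embedding FROM my_dataset.query_embeddings LIMIT 1),
  top_k => 10,
  distance_type => 'COSINE'
);

-- Rebuild the index model after significant data drift, with no downtime.
ALTER VECTOR INDEX products_idx ON my_dataset.products REBUILD;
```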
Looking ahead: Indexing all the things
As businesses look to agentic AI, the data platform has never been more important. We imagine a world in which every business has its own AI mode for productivity, with retrieval of relevant data at its heart: intelligent indexing of all relevant enterprise data, structured or unstructured, to automate AI and analytics. Indexing and search are core to Google, and we look forward to sharing relevant technology innovations with you!
