Introducing BigQuery ObjectRef: Supercharge your multimodal data and AI processing
Traditional data warehouses simply can’t keep up with today’s analytics workloads. That’s because most data generated today is unstructured and multimodal: documents, audio files, images, and videos. Given the complexity of cleaning and transforming unstructured data, organizations have historically had to maintain siloed data pipelines for unstructured and structured data, and for analytics and AI/ML use cases. Between these fragmented data platforms, data access restrictions, slow consumption, and outdated information, enterprises struggle to unlock the full potential of their data. The same issues hinder AI initiatives.
Today we’re introducing a new data type, ObjectRef, now in preview in BigQuery, that represents a reference to any object in Cloud Storage with a URI and additional metadata. ObjectRef complements Object Tables (read-only tables over unstructured data objects in Cloud Storage) to integrate unstructured data like images and audio into existing BigQuery tables. The ObjectRef data type removes fragmentation in data processing and access control, providing a unified, multimodal, and governed way to process all modalities of data. You can process unstructured data with large language models (LLMs), ML models, and open-source Python libraries using the same SQL or Python scripts that process tabular data. You can also store structured and unstructured data in the same row throughout different data engineering stages (extract, load, transform, a.k.a. ELT), and govern it using a similar access control model.
For example, to answer the question “of the customers who complained about performance issues during interactions last month, show me the top 10 by revenue,” you need to perform natural language processing (NLP) on audio calls, emails, and online chat transcripts to normalize the data, identify whether the interaction discussed “performance issues,” and detect whether the customer complained. For each of these steps, you need to decide how to build a pipeline over data in Cloud Storage, run AI/ML models on the data, and host the models (e.g., on Compute Engine, Google Kubernetes Engine, or Vertex AI). The normalized and extracted data would then need to be saved in structured format (e.g., in a BigQuery table) and joined with each customer’s revenue data.
With the launch of ObjectRef, you can now answer this question with a simple SQL query. Suppose you’ve combined call center audio files and agent chat text into one BigQuery table customer_interactions using two columns: (1) audio_ref of type ObjectRef, and (2) chat of type STRING. Filtering for customers who complained about performance issues is as easy as adding one more condition in the WHERE clause.
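Here is a minimal sketch of what such a query could look like, using AI.GENERATE_BOOL to flag performance complaints in both the audio and the chat. The customer_revenue table, the interaction_date column, and the connection name are hypothetical placeholders:

```sql
-- Sketch only: flag complaints across audio + chat, then rank by revenue.
-- customer_revenue, interaction_date, and the connection name are hypothetical.
SELECT i.customer_id, r.revenue
FROM customer_interactions AS i
JOIN customer_revenue AS r USING (customer_id)
WHERE i.interaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH)
  AND AI.GENERATE_BOOL(
        prompt => ("Did this customer complain about performance issues?",
                   OBJ.GET_ACCESS_URL(i.audio_ref, "r"), i.chat),
        connection_id => "analysis.US.gemini-connection",
        endpoint => "gemini-2.0-flash").result
ORDER BY r.revenue DESC
LIMIT 10;
```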
BigQuery with ObjectRef unlocks unique platform capabilities across data and AI:
- Multimodality: Natively handle structured (tabular) data, unstructured data, and a combination of the two in a single table via ObjectRef. Now you can build multimodal ELT data pipelines to process both structured and unstructured data.
- Full SQL and Python support: Use your favorite language without worrying about interoperability. If it works in SQL, it works in Python (via BigQuery DataFrames), and vice versa. Object transformations, saving transformed objects back to Cloud Storage, and any other aggregations or filtering can all be done in one SQL or Python script.
- Gen-AI-ready, serverless, and auto-scaled data processing: Spend more time building your data pipelines, not managing infrastructure. Process unstructured data with LLMs, or use serverless Python UDFs with your favorite open-source library. Create embeddings, generate summaries using a prompt, use a BigQuery table as an input to Vertex AI jobs, and much more.
- Unified governance and access control: Use familiar BigQuery governance features such as fine-grained access control, data masking, and connection-delegated access on unstructured data. There is no need to manage siloed governance models for structured versus unstructured data.
ObjectRef in action
Let’s take a closer look at how to use the ObjectRef data type.
What is an ObjectRef?
First, it’s good to understand ObjectRef under the hood. Simply put, ObjectRef is a STRUCT containing object storage and access control metadata. With this launch, when you create an Object Table, it is populated with a new ObjectRef column named ref:
```
struct {
  uri string,
  authorizer string,
  version string,
  details json {
    gcs_metadata json
  }
}
```
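As a quick orientation, you can query these fields directly on any Object Table. A minimal sketch, using the ingestion.audios Object Table from the example below:

```sql
-- Peek at the ObjectRef struct on an Object Table
SELECT
  ref.uri,         -- Cloud Storage URI of the object
  ref.authorizer,  -- connection used for delegated access
  ref.version      -- object version, pinned for reproducibility
FROM ingestion.audios
LIMIT 10;
```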
Create a BigQuery table with ObjectRefs
Imagine a call center that stores structured information in a standard BigQuery table, ingestion.sessions, and call audio in a Cloud Storage bucket, with a BigQuery Object Table ingestion.audios created on the Cloud Storage bucket. While this example is based on audio, ObjectRefs can also represent images, documents, and videos.
In the following diagrams, ObjectRefs are highlighted in red.
With ObjectRef, you can join these two tables on the sessions.RecordingID and audios.Ref.uri columns to create a single BigQuery table. The new table contains an Audio column of type ObjectRef, using the Ref column from the ingestion.audios table.
```sql
CREATE OR REPLACE TABLE analysis.sessions
AS
SELECT sessions.session_id, sessions.date, sessions.customer_id, object_table.ref AS audio
FROM ingestion.sessions INNER JOIN ingestion.audios object_table
ON object_table.uri = sessions.recording_id;
```
Capturing the object version allows BigQuery zero-copy snapshots and clones of analysis.sessions to be reproducible and consistent across structured and unstructured data. This enables reproducibility in downstream applications such as ML training and LLM fine-tuning.
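For example, a standard zero-copy snapshot of the combined table now covers both modalities (the snapshot name here is arbitrary):

```sql
-- Zero-copy snapshot: ObjectRef versions pin the referenced objects,
-- so the snapshot stays consistent across structured and unstructured data
CREATE SNAPSHOT TABLE analysis.sessions_snapshot
CLONE analysis.sessions;
```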
Being a STRUCT, ObjectRef also supports nesting in an ARRAY. The main audio file represented by Audio can be chunked (for example, into segments per agent ID), and the resulting objects represented in a new column Chunked of type ARRAY<ObjectRef>. This preserves the order of chunks and stores them alongside the main audio file in the same row. This data transformation lets you report the number of agent handoffs per call and further analyze each call segment separately.
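A minimal sketch of building such a column, assuming the chunk objects already exist in Cloud Storage and are listed in a hypothetical analysis.audio_chunks table with columns (session_id, chunk_uri, chunk_index):

```sql
-- Assemble an ordered ARRAY<ObjectRef> of chunks next to the main audio
CREATE OR REPLACE TABLE analysis.sessions_chunked AS
SELECT
  s.session_id,
  ANY_VALUE(s.audio) AS audio,
  ARRAY_AGG(OBJ.MAKE_REF(c.chunk_uri, s.audio.authorizer)
            ORDER BY c.chunk_index) AS chunked
FROM analysis.sessions AS s
JOIN analysis.audio_chunks AS c USING (session_id)
GROUP BY s.session_id;
```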
Process using serverless Python
With Python UDF integration, you can bring your favorite open-source Python library to BigQuery as a user-defined function (UDF). Easily derive structured data and unstructured data from the source ObjectRef, and store them in the same row.
The new function OBJ.GET_ACCESS_URL(ref ObjectRef, mode STRING) -> ObjectRefRuntime enables delegated access to the object in Cloud Storage. ObjectRefRuntime provides signed URLs to read and write data, allowing you to manage governance and access control entirely in BigQuery and removing the need for separate Cloud Storage access control.
Serverless Python use case 1: Multimodal data to structured data
For example, imagine you want to get the duration of every audio file in the analysis.sessions table. Assume that a Python UDF analysis.GET_DURATION(object_ref_runtime_json STRING) -> INT has already been registered in BigQuery. GET_DURATION uses signed URLs from ObjectRefRuntime to read Cloud Storage bytes.
```sql
-- Object is passed to Python UDF using read-only signed URLs
SELECT analysis.GET_DURATION(TO_JSON_STRING(OBJ.GET_ACCESS_URL(audio, "R"))) AS duration
FROM analysis.sessions
WHERE audio IS NOT NULL
```
- code_block
- <ListValue: [StructValue([(‘code’, ‘import bigframes.pandas as bpdrndf = bpd.read_gbq(“analysis.sessions”)rnfunc = bpd.read_gbq_function(“analysis.get_duration”)rn# Object is passed to Python UDF using read-only signed URLsrndf[“duration”] = df[“audio”].blob.get_runtime_json_str(mode=”R”).apply(func).cache() # cache to execute’), (‘language’, ‘lang-py’), (‘caption’, <wagtail.rich_text.RichText object at 0x3ed44a3cf2e0>)])]>
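For reference, here is a hypothetical sketch of what such a UDF body could look like. The ObjectRefRuntime JSON field names ("access_urls", "read_url") are assumptions about its schema, and the open-source mutagen library stands in for your preferred audio library:

```py
import io
import json
import urllib.request

import mutagen  # open-source audio metadata library


def get_duration(object_ref_runtime_json: str) -> int:
    """Reads object bytes via the signed URL and returns duration in seconds."""
    runtime = json.loads(object_ref_runtime_json)
    # Field names below are assumptions about the ObjectRefRuntime schema
    read_url = runtime["access_urls"]["read_url"]
    with urllib.request.urlopen(read_url) as resp:
        audio = mutagen.File(io.BytesIO(resp.read()))
    return int(audio.info.length)
```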
Serverless Python use case 2: Multimodal data to processed multimodal data
As another example, here’s how to remove noise from every audio file in the analysis.sessions table, assuming that a Python UDF analysis.DENOISE(src_object_ref_runtime_json STRING, dst_object_ref_runtime_json STRING) -> object_ref_runtime_json STRING has already been registered in BigQuery. This function reads from the source audio, writes the new noise-removed audio to Cloud Storage, and returns ObjectRefs for the new audio files. ObjectRefRuntime provides signed URLs for reading and writing object bytes.
```sql
SELECT analysis.DENOISE(
  -- Source is accessed using read-only signed URL
  TO_JSON_STRING(OBJ.GET_ACCESS_URL(audio, "R")),
  -- Destination is written using read-write signed URL with prefix "denoised-"
  TO_JSON_STRING(OBJ.GET_ACCESS_URL(
    OBJ.MAKE_REF(
      CONCAT("denoised-", audio.uri), audio.authorizer),
    "RW")))
FROM analysis.sessions
WHERE audio IS NOT NULL
```
```py
import bigframes.pandas as bpd

df = bpd.read_gbq("analysis.sessions")

df["denoised"] = ("denoised-" + df["audio"].blob.uri()).str.to_blob()
func_df = df[["audio", "denoised"]]

func = bpd.read_gbq_function("analysis.denoise")
# Source is accessed using read-only signed URL
func_df["audio"] = func_df["audio"].blob.get_runtime_json_str("R")
# Destination is written using read-write signed URL with prefix "denoised-"
func_df["denoised"] = func_df["denoised"].blob.get_runtime_json_str("RW")
func_df.apply(func, axis=1).cache()  # cache to execute
```
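And a hypothetical sketch of the DENOISE UDF body, under the same ObjectRefRuntime schema assumptions as above; the denoising step is a placeholder for any open-source filter:

```py
import json
import urllib.request


def _denoise_bytes(audio_bytes: bytes) -> bytes:
    # Placeholder: apply your preferred open-source denoising filter here
    return audio_bytes


def denoise(src_object_ref_runtime_json: str, dst_object_ref_runtime_json: str) -> str:
    src = json.loads(src_object_ref_runtime_json)
    dst = json.loads(dst_object_ref_runtime_json)
    # Read the source object via its read-only signed URL (field names assumed)
    with urllib.request.urlopen(src["access_urls"]["read_url"]) as resp:
        cleaned = _denoise_bytes(resp.read())
    # Write the result via the destination's read-write signed URL (HTTP PUT)
    req = urllib.request.Request(dst["access_urls"]["write_url"],
                                 data=cleaned, method="PUT")
    urllib.request.urlopen(req)
    # Return the runtime JSON so BigQuery can record the new ObjectRef
    return dst_object_ref_runtime_json
```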
Process using Gemini and BigQuery ML
All BigQuery ML generative AI functions, such as AI.GENERATE, ML.GENERATE_TEXT, and ML.GENERATE_EMBEDDING, now support ObjectRefs as first-class citizens. This enables a number of use cases.
BQML use case 1: Multimodal inference using Gemini
You can now pass multiple ObjectRefs in the same Gemini prompt for inference.
Here, you can use Gemini to evaluate noise-removal quality by comparing the original audio file and the noise-removed audio file. This script assumes the noise-removed audio file’s ObjectRef is already stored in the Denoised column.
```sql
SELECT AI.GENERATE(
  prompt => ("Compare original audio file to audio file with noise removed, and output quality of noise removal as either good or bad. Original audio is", OBJ.GET_ACCESS_URL(audio, "r"), "and noise removed audio is", OBJ.GET_ACCESS_URL(denoised, "r")),
  -- BQ connection with permission to call Gemini
  connection_id => "analysis.US.gemini-connection",
  endpoint => "gemini-2.0-flash"
).result
FROM analysis.sessions WHERE audio IS NOT NULL AND denoised IS NOT NULL;
```
```py
import bigframes.pandas as bpd
from bigframes.ml import llm

gemini = llm.GeminiTextGenerator(model_name="gemini-2.0-flash", connection_name="analysis.US.gemini-connection")
df = bpd.read_gbq("analysis.sessions")
result = gemini.predict(df, prompt=["Compare original audio file to audio file with noise removed, and output quality of noise removal as either good or bad. Original audio is", df["audio"], "and denoised audio is", df["denoised"]])
result[["ml_generate_text_llm_result"]]
```
As another example, here’s how to transcribe the Audio file using Gemini.
```sql
SELECT AI.GENERATE(
  prompt => ("Transcribe this audio file", OBJ.GET_ACCESS_URL(audio, "r")),
  -- BQ connection with permission to call Gemini
  connection_id => "analysis.US.gemini-connection",
  endpoint => "gemini-2.0-flash").result AS transcript
FROM analysis.sessions
WHERE audio IS NOT NULL
```
```py
import bigframes.pandas as bpd
from bigframes.ml import llm

gemini = llm.GeminiTextGenerator(model_name="gemini-2.0-flash", connection_name="analysis.US.gemini-connection")
df = bpd.read_gbq("analysis.sessions")
result = gemini.predict(df, prompt=["Transcribe this audio file", df["audio"]])
result[["ml_generate_text_llm_result"]]
```
With BQML + Gemini, you can also generate structured or semi-structured results from multimodal inference. For example, you can perform speaker diarization on the Audio file using Gemini to identify the operator vs. the customer.
```sql
SELECT AI.GENERATE(
  prompt => ("Generate audio diarization for this interview. Use JSON format for the output, with the following keys: speaker, transcription. If you can classify the speaker as customer vs operator, please do. If not, use speaker A, speaker B, etc.", OBJ.GET_ACCESS_URL(audio, "r")),
  -- BQ connection with permission to call Gemini
  connection_id => "analysis.US.gemini-connection",
  endpoint => "gemini-2.0-flash").result AS diarized_json
FROM analysis.sessions
WHERE audio IS NOT NULL;
```
```py
import bigframes.pandas as bpd
from bigframes.ml import llm

gemini = llm.GeminiTextGenerator(model_name="gemini-2.0-flash", connection_name="analysis.US.gemini-connection")
df = bpd.read_gbq("analysis.sessions")
result = gemini.predict(df, prompt=["Generate audio diarization for this interview. Use JSON format for the output, with the following keys: speaker, transcription. If you can classify the speaker as customer vs operator, please do. If not, use speaker A, speaker B, etc.", df["audio"]])
result[["ml_generate_text_llm_result"]]
```
BQML use case 2: Multimodal embeddings using Gemini
With ML.GENERATE_EMBEDDING support, you can use ObjectRefs with text embedding and multimodal embedding models to create vector indices and power RAG workflows to ground LLMs.
Assume we have an Object Table ingestion.images with the ref column containing image ObjectRefs.
```sql
CREATE OR REPLACE MODEL `ingestion.multimodal_embedding_model`
REMOTE WITH CONNECTION 'ingestion.US.gemini-connection'
OPTIONS (ENDPOINT = 'multimodalembedding@001');

SELECT ref, ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `ingestion.multimodal_embedding_model`,
  (
    SELECT OBJ.GET_ACCESS_URL(ref, 'r') AS content, ref
    FROM ingestion.images
  ),
  STRUCT (256 AS output_dimensionality)
);
```
```py
import bigframes.pandas as bpd
from bigframes.ml import llm

embed_model = llm.MultimodalEmbeddingGenerator(model_name="multimodalembedding@001", connection_name="ingestion.US.gemini-connection")
df = bpd.read_gbq("ingestion.images")
embeddings = embed_model.predict(df["ref"])
embeddings
```
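From here, the embeddings can back a standard BigQuery vector search. A sketch, assuming the results above were materialized into a hypothetical ingestion.image_embeddings table with columns (ref, embedding):

```sql
-- Nearest-neighbor lookup over the stored multimodal embeddings
SELECT base.ref.uri AS image_uri, distance
FROM VECTOR_SEARCH(
  TABLE ingestion.image_embeddings, 'embedding',
  (
    SELECT ml_generate_embedding_result AS embedding
    FROM ML.GENERATE_EMBEDDING(
      MODEL `ingestion.multimodal_embedding_model`,
      (SELECT 'product damaged in shipping' AS content),
      STRUCT (256 AS output_dimensionality))
  ),
  top_k => 5);
```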
What’s new
To summarize, here’s a list of all the new capabilities for performing analytics on unstructured and/or multimodal data using BigQuery:
- New types and functions for handling multimodal data (documents, audio files, images, and videos):
  - ObjectRef and ObjectRefRuntime types, along with new functions: OBJ.MAKE_REF, OBJ.GET_ACCESS_URL, and OBJ.FETCH_METADATA
- Object Table enhancements:
  - Scalability: Object Tables now support consistent views of Cloud Storage buckets, scaling 5x from 65M to 300M+ objects per table, and ingesting up to 1M object changes per hour per table
  - Interop with ObjectRef: the new ref column provides pre-constructed ObjectRefs directly from Object Tables
- BQML Gen-AI multimodal capabilities:
  - Support for multimodal inference in TVFs ML.GENERATE_TEXT and AI.GENERATE_TABLE, and in scalar functions such as AI.GENERATE and AI.GENERATE_BOOL, by encapsulating multiple objects in the same prompt for Gemini using ObjectRef. Objects can be sourced from different columns and from complex types such as arrays.
  - Support for embedding ObjectRefs via the ML.GENERATE_EMBEDDING function
- BigQuery DataFrames multimodal dataframe support:
  - An extension to the pandas-like dataframe to include unstructured data (powered by ObjectRef) as just another column
  - Wrangle, process, and filter mixed-modality data with the familiarity of dataframe operations
  - Special transformers for unstructured data, such as chunking, image processing, and transcription, made available through server-side processing functions and BQML
- Python UDF support:
  - Leverage the rich Python library ecosystem for advanced unstructured data manipulation in a fully managed, serverless experience with BigQuery governance
Get started today
ObjectRef is now in preview. Follow these simple steps to get started:
- Watch to learn more: watch live demos from Cloud Next about unifying unstructured and structured data, and generating text with LLMs and performing vector search.
- Learn by doing: try out ObjectRefs with this multimodal data tutorial, using either the SQL or Python version.
- Build your use case: locate the Cloud Storage bucket containing the unstructured data you want to analyze. Create an Object Table or set up automatic Cloud Storage discovery to pull this data into BigQuery. The Object Table will contain a column of ObjectRefs, and now you are ready to start transforming the data.