Analyze images and videos in BigQuery using Gemini 1.0 Pro Vision
With the proliferation of digital devices and platforms, including social media, mobile devices, and IoT sensors, organizations are increasingly generating unstructured data in the form of images, audio files, videos, and documents. Over the last few months, we launched BigQuery integrations with Vertex AI to leverage Gemini 1.0 Pro, PaLM, Vision AI, Speech AI, Doc AI, Natural Language AI, and more to help you interpret and extract meaningful insights from unstructured data.
While Vision AI provides image classification and object recognition capabilities, large language models (LLMs) unlock new visual use cases. To that end, we are expanding BigQuery and Vertex AI integrations to support multimodal generative AI use cases with Gemini 1.0 Pro Vision. Using familiar SQL statements, you can take advantage of Gemini 1.0 Pro Vision directly in BigQuery to analyze both images and videos by combining them with your own text prompts.
A bird's-eye view of Vertex AI integration capabilities for analyzing unstructured data in BigQuery
Within a data warehouse setting, multimodal capabilities can help enhance your unstructured data analysis across a variety of use cases:
Object recognition: Answer questions related to fine-grained identification of the objects in images and videos.
Info seeking: Combine world knowledge with information extracted from the images and videos.
Captioning/description: Generate descriptions of images and videos with varying levels of detail.
Digital content understanding: Answer questions by extracting information from content like infographics, charts, figures, tables, and web pages.
Structured content generation: Generate responses in formats like HTML and JSON based on provided prompt instructions.
Turning unstructured data into structured data
With minimal prompt adjustments, Gemini 1.0 Pro Vision can produce structured responses in convenient formats like HTML or JSON, making them easy to consume in downstream tasks. In a data warehouse such as BigQuery, structured results can be used in SQL operations and combined with other structured datasets for deeper analysis.
For example, imagine you have a large dataset that contains images of cars. You want to understand a few basic details about the car in each image. This is a use case that Gemini 1.0 Pro Vision can help with!
Combining text and image into a prompt for Gemini 1.0 Pro Vision, with a sample response.
Dataset from: Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei, "3D Object Representations for Fine-Grained Categorization," 4th IEEE Workshop on 3D Representation and Recognition (3dRR-13), at ICCV 2013, Sydney, Australia, Dec. 8, 2013.
As you can see, Gemini’s response is very thorough! But while the format and extra information are great if you’re a person, they’re not so great if you’re a data warehouse. Rather than turning unstructured data into more unstructured data, you can make changes to the prompt to direct the model on how to return a structured response.
Adjusting the text portion of the prompt to indicate a structured response from Gemini 1.0 Pro Vision, with a sample result.
You can see how this response would be much more useful in an environment like BigQuery.
Now let’s see how to prompt Gemini 1.0 Pro Vision directly in BigQuery to perform this analysis over thousands of images!
Accessing Gemini 1.0 Pro Vision from BigQuery ML
Gemini 1.0 Pro Vision is integrated with BigQuery through the ML.GENERATE_TEXT() function. To unlock this function in your BigQuery project, you will need to create a remote model that represents a hosted Vertex AI large language model. Fortunately, it’s just a few lines of SQL:
```sql
CREATE MODEL `mydataset.gemini_pro_vision_model`
REMOTE WITH CONNECTION `us.bqml_llm_connection`
OPTIONS (endpoint = 'gemini-pro-vision');
```
Once the model is created, you can combine your data with the ML.GENERATE_TEXT() function in your SQL queries to generate text.
A few notes on the ML.GENERATE_TEXT() function syntax when it points to a gemini-pro-vision model endpoint, as in this example:
TABLE: takes an object table as input, which can contain different types of unstructured objects (e.g., images and videos); a sketch of creating one follows this list.
PROMPT: takes a single text-string prompt that is placed inside the options STRUCT (unlike when using the gemini-pro model) and applies that prompt to each object in the object table, row by row.
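If you don't already have an object table, you can create one over a Cloud Storage bucket with a few lines of SQL. This is a minimal sketch, not part of the original example: the gs:// path is a placeholder for your own bucket, and it assumes the same `us.bqml_llm_connection` connection used earlier.

```sql
-- A minimal sketch: the gs:// URI is a placeholder for your own bucket.
CREATE EXTERNAL TABLE `mydataset.car_images_object_table`
WITH CONNECTION `us.bqml_llm_connection`
OPTIONS (
  object_metadata = 'SIMPLE',            -- 'SIMPLE' makes this an object table
  uris = ['gs://mybucket/car_images/*']  -- unstructured files in Cloud Storage
);
```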
```sql
SELECT
  uri,
  ml_generate_text_llm_result AS brand_model_year
FROM
  ML.GENERATE_TEXT(
    MODEL `mydataset.gemini_pro_vision_model`,
    TABLE `mydataset.car_images_object_table`,
    STRUCT(
      'What is the brand, model, and year of this car? Answer in JSON format with three keys: brand, model, year. brand and model should be string, year should be integer.' AS prompt,
      TRUE AS flatten_json_output));
```
Let’s take a peek at the results.
We can add some SQL to this query to extract each of the values for brand, model, and year into new fields for use downstream.
```sql
WITH raw_json_result AS (
  SELECT
    uri,
    ml_generate_text_llm_result AS brand_model_year
  FROM
    ML.GENERATE_TEXT(
      MODEL `mydataset.gemini_pro_vision_model`,
      TABLE `mydataset.car_images_object_table`,
      STRUCT(
        'What is the brand, model, and year of this car? Answer in JSON format with three keys: brand, model, year. brand and model should be string, year should be integer.' AS prompt,
        TRUE AS flatten_json_output)))
SELECT
  uri,
  JSON_QUERY(RTRIM(LTRIM(raw_json_result.brand_model_year, " ```json"), "```"), "$.brand") AS brand,
  JSON_QUERY(RTRIM(LTRIM(raw_json_result.brand_model_year, " ```json"), "```"), "$.model") AS model,
  JSON_QUERY(RTRIM(LTRIM(raw_json_result.brand_model_year, " ```json"), "```"), "$.year") AS year
FROM raw_json_result
```
Now the responses have been parsed into new, structured columns.
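One caveat worth noting: JSON_QUERY returns JSON-encoded values, so the extracted brand and model still carry their surrounding double quotes, and year remains a string. If you prefer plain scalars, a hedged variant of the final SELECT could use JSON_VALUE and SAFE_CAST instead. This sketch replaces only the last SELECT; the raw_json_result CTE from the query above stays the same:

```sql
-- Sketch: a drop-in replacement for the final SELECT in the query above.
-- JSON_VALUE returns unquoted scalars; SAFE_CAST yields NULL rather than
-- an error if the model returns a non-numeric year.
SELECT
  uri,
  JSON_VALUE(cleaned, "$.brand") AS brand,
  JSON_VALUE(cleaned, "$.model") AS model,
  SAFE_CAST(JSON_VALUE(cleaned, "$.year") AS INT64) AS year
FROM (
  SELECT
    uri,
    RTRIM(LTRIM(brand_model_year, " ```json"), "```") AS cleaned
  FROM raw_json_result
)
```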
And there you have it. We’ve just turned a collection of unlabeled, raw images into structured data, fit for analysis in a data warehouse. Imagine joining this new table with other relevant enterprise data. With a dataset of historical car sales, for example, you could determine the average or median sale price for similar cars in a recent time period. This is just a taste of the possibilities that are uncovered by bringing unstructured data into your data workflows!
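As a concrete illustration, here is a minimal sketch of such a join. It assumes the parsed results above were saved to a hypothetical `mydataset.car_features` table, and that `mydataset.car_sales` (with brand, model, year, sale_date, and sale_price columns) is an assumed enterprise table, not part of the original example:

```sql
-- Hypothetical sketch: both tables and all column names are assumptions.
SELECT
  features.brand,
  features.model,
  features.year,
  AVG(sales.sale_price) AS avg_sale_price
FROM `mydataset.car_features` AS features
JOIN `mydataset.car_sales` AS sales
  ON sales.brand = features.brand
 AND sales.model = features.model
 AND sales.year = features.year
WHERE sales.sale_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)  -- a recent window
GROUP BY features.brand, features.model, features.year;
```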
When getting started with Gemini 1.0 Pro Vision in BigQuery, there are a few important items to note:
You need an Enterprise or Enterprise Plus reservation to run Gemini 1.0 Pro Vision model inference over an object table. For reference, see the BigQuery editions documentation.
Limits apply to functions that use Vertex AI large language models (LLMs) and Cloud AI services, so review the current quota in place for the Gemini 1.0 Pro Vision model.
Next steps
Bringing generative AI directly into BigQuery has enormous benefits. Instead of writing custom Python code and building data pipelines between BigQuery and the generative AI model APIs, you can now just write a few lines of SQL! BigQuery manages the infrastructure and helps you scale from one prompt to thousands. Check out the overview and demo video, and the documentation to see more example queries using ML.GENERATE_TEXT() with Gemini 1.0 Pro Vision.
Coming to Next '24? Check out the session "Power data analytics with generative AI using BigQuery and Gemini," where you can see Gemini 1.0 Pro Vision and BigQuery in action.