GCP – How to use gen AI for better data schema handling, data quality, and data generation
In the realm of data engineering, generative AI models are quietly revolutionizing how we handle, process, and ultimately utilize data. For example, large language models (LLMs) can help with data schema handling, data quality, and even data generation.
Building upon the recently released Gemini in BigQuery data preparation capabilities, this blog showcases areas where gen AI models are making a significant impact in data engineering: automated schema management, data quality automation, and the generation of synthetic and structured data from diverse sources, with practical examples and code snippets.
1. Data schema handling: Integrating new datasets
Data movement and maintenance are ongoing challenges across data engineering teams. Whether it’s moving data between systems with different schemas or integrating new datasets into existing data products, the process can be complex and error-prone. This is often exacerbated when dealing with legacy systems; in fact, 32% of organizations cite migrating data and applications as their biggest challenge, according to Flexera’s 2024 State of the Cloud Report.
Gen AI models offer a powerful solution by assisting in automating schema mapping and transformation on an ongoing basis. Imagine migrating customer data from a legacy CRM system to a new platform, and combining it with additional external datasets in BigQuery. The schemas likely differ significantly, requiring intricate mapping of fields and data types. Gemini, our most capable AI model family to date, can analyze both schemas and generate the necessary transformation logic, significantly reducing manual effort and potential errors.
A common approach to data schema handling that we’ve seen from data engineering teams involves creating a lightweight application that receives messages from Pub/Sub, retrieves relevant dataset information from BigQuery and Cloud Storage, and uses the Vertex AI Gemini API to map source fields to target fields and assign a confidence score. Here is example code showing a FunctionDeclaration to perform the mapping-confidence task:
```python
set_source_field_mapping_confidence_levels = generative_models.FunctionDeclaration(
    name="set_source_field_mapping_confidence_levels",
    description="""Sets the mapping confidence values for each source field for a given target field.

Here is a general example to help you understand how to use the set_source_field_mapping_confidences_tool correctly. This is only an example to show the source and target field structures.

Assuming you had previously decided on the following mapping confidence levels (but it is important that you come up with your own values for mapping confidence level rather than specifically using these values):
a mapping confidence level of 2 for the field with source_field_unique_ref=158
a mapping confidence level of 1 for the field with source_field_unique_ref=159
a mapping confidence level of 1 for the field with source_field_unique_ref=1290
a mapping confidence level of 1 for the field with source_field_unique_ref=579
a mapping confidence level of 1 for the field with source_field_unique_ref=638
a mapping confidence level of 1 for the field with source_field_unique_ref=970
a mapping confidence level of 1 for the field with source_field_unique_ref=3317
a mapping confidence level of 3 for the field with source_field_unique_ref=160
a mapping confidence level of 1 for the field with source_field_unique_ref=1910
a mapping confidence level of 5 for the field with source_field_unique_ref=2280

Then this function would be used to set the mapping confidence levels for each of the source fields, where your input parameter source_field_mapping_confidences would be:
source_field_mapping_confidences = [
    {'source_field_unique_ref': 158, 'mapping_confidence_level': '2'},
    {'source_field_unique_ref': 159, 'mapping_confidence_level': '1'},
    {'source_field_unique_ref': 1290, 'mapping_confidence_level': '1'},
    {'source_field_unique_ref': 579, 'mapping_confidence_level': '1'},
    {'source_field_unique_ref': 638, 'mapping_confidence_level': '1'},
    {'source_field_unique_ref': 970, 'mapping_confidence_level': '1'},
    {'source_field_unique_ref': 3317, 'mapping_confidence_level': '1'},
    {'source_field_unique_ref': 160, 'mapping_confidence_level': '3'},
    {'source_field_unique_ref': 1910, 'mapping_confidence_level': '1'},
    {'source_field_unique_ref': 2280, 'mapping_confidence_level': '5'}
]""",
    parameters={
        "type": "object",
        "properties": {
            "source_field_mapping_confidences": {
                "type": "array",
                "description": (
                    "A list of objects where each object contains the source field's "
                    "source_field_unique_ref, the mapping_confidence_level for that "
                    "source field, and the reason for applying that mapping_confidence_level."
                ),
                "items": {
                    "type": "object",
                    "properties": {
                        "source_field_unique_ref": {
                            "type": "integer",
                            "description": "The reference ID for the source field.",
                        },
                        "mapping_confidence_level": {
                            "type": "string",
                            "enum": ["1", "2", "3", "4", "5"],
                            "description": "The confidence level for the mapping (an integer between 1 and 5).",
                        },
                        "mapping_confidence_level_reason": {
                            "type": "string",
                            "description": "The reason why the source field should have this mapping confidence level value.",
                        },
                    },
                    "required": [
                        "source_field_unique_ref",
                        "mapping_confidence_level",
                        "mapping_confidence_level_reason",
                    ],
                },
            },
        },
        "required": ["source_field_mapping_confidences"],
    },
)
```
As seen in the function declaration above, Gemini assigns confidence levels to each mapping, which are then stored in BigQuery. Once there, the data engineering team can validate high-confidence mappings (and eventually choose to fully automate these if they feel comfortable) and investigate low-confidence mappings. This pipeline of gen AI tasks could be deployed in an event-driven architecture or run on a batch basis. However, a final step is usually required, where a human approves the final output (this too could become fully automated over time, given the rapid release cadence of improvements in gen AI models). Here is an example architecture / workflow:
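To make the human-in-the-loop step concrete, here is a minimal, hypothetical sketch of how the confidence levels Gemini returns could be triaged into an auto-approve set and a manual-review queue. The function name and threshold are illustrative, not part of the pipeline described above; only the 1-5 confidence scale comes from the function declaration.

```python
# Hypothetical triage step: split Gemini's mapping suggestions into
# auto-approved mappings and mappings needing human review.
# Confidence levels follow the 1-5 scale from the function declaration.

def triage_mappings(mapping_confidences, auto_approve_level=4):
    """Partition mapping suggestions by confidence level.

    mapping_confidences: list of dicts like
        {"source_field_unique_ref": 158, "mapping_confidence_level": "5"}
    Returns (auto_approved, needs_review).
    """
    auto_approved, needs_review = [], []
    for mapping in mapping_confidences:
        # Levels are returned as strings ("1".."5"), so cast before comparing.
        if int(mapping["mapping_confidence_level"]) >= auto_approve_level:
            auto_approved.append(mapping)
        else:
            needs_review.append(mapping)
    return auto_approved, needs_review


suggestions = [
    {"source_field_unique_ref": 158, "mapping_confidence_level": "5"},
    {"source_field_unique_ref": 159, "mapping_confidence_level": "2"},
    {"source_field_unique_ref": 160, "mapping_confidence_level": "4"},
]
auto, review = triage_mappings(suggestions)
print(len(auto), len(review))  # 2 mappings auto-approved, 1 flagged for review
```

In practice, both lists would be written back to BigQuery so reviewers only ever see the low-confidence mappings.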
2. Data quality: Enhancing accuracy and consistency
In today’s data-driven world, poor data quality can cost businesses millions. From inaccurate customer insights leading to misguided marketing campaigns, to flawed financial reporting that impacts investment decisions, the consequences of bad data are significant. Gen AI models offer a new approach to data quality, going beyond traditional rule-based systems to identify subtle inconsistencies that can wreak havoc on your data pipelines. For example, imagine a system that can automatically detect and correct errors that would typically require hours of manual review or elaborate regular expressions.
Gemini can augment your existing data quality checks in several ways:
- Deduplication: Consider a scenario where you need to deduplicate customer profiles. Gemini can analyze various fields, such as names, addresses, and phone numbers, to identify potential duplicates, even when there are minor variations in spelling or formatting. For example, Gemini can recognize that “Robert Smith” and “Bob Smith” likely refer to the same individual, or that “123 Main St.” and “123 Main Street” represent the same address. In contrast to traditional methods like fuzzy matching, which are cumbersome to code and don’t always produce ideal results, using an LLM can provide a simpler and more effective solution.
- Standardization: Gemini excels at standardizing data formats. Instead of relying on intricate regular expressions to validate data formats, Gemini can be used with prompt engineering, RAG, or fine-tuning to understand and enforce data quality rules in a more human-readable and maintainable way. This is particularly useful for fields like dates, times, and addresses, where variations in format can hinder analysis.
- Subtle error detection: Gemini can identify subtle inconsistencies that might be missed by traditional methods. These include:
  - Variations in abbreviations (e.g., “St.” vs. “Street”)
  - Different spellings of the same name (e.g., “Catherine” vs. “Katherine”)
  - Use of nicknames (e.g., “Bob” vs. “Robert”)
  - Incorrectly formatted phone numbers (e.g., missing area codes)
  - Inconsistent use of capitalization and punctuation
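As a sketch of the deduplication idea above, the snippet below builds a prompt asking Gemini to judge whether two customer records refer to the same person. The field names, prompt wording, and JSON response shape are all illustrative assumptions, not an official API; the actual model call is omitted since it depends on your Vertex AI setup.

```python
import json


def build_dedup_prompt(record_a, record_b):
    """Build a prompt asking the model whether two customer records
    refer to the same person. Field names are illustrative."""
    return (
        "Compare the two customer records below. Decide whether they refer to "
        "the same individual, treating nicknames (Bob/Robert), abbreviations "
        "(St./Street), and minor spelling variations as equivalent.\n"
        'Respond with JSON: {"is_duplicate": true|false, "reason": "..."}.\n'
        f"Record A: {json.dumps(record_a)}\n"
        f"Record B: {json.dumps(record_b)}"
    )


prompt = build_dedup_prompt(
    {"name": "Robert Smith", "address": "123 Main St.", "phone": "555-0100"},
    {"name": "Bob Smith", "address": "123 Main Street", "phone": "5550100"},
)
# `prompt` would then be sent to Gemini via the Vertex AI API, and the
# returned JSON parsed to decide whether to merge the two profiles.
```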
Let’s illustrate this with a common example of address validation. We have a table named customer_addresses, and we want to check if the address_state column is a valid US state and convert it into the standard two-letter abbreviation.
Looking at the input data, you can easily identify some issues with the address_state column. For example, ‘Pennsylvaniaa’ is misspelled, and ‘Texas’ is written out instead of using the standard two-letter abbreviation. While these errors are obvious to a human, they can be challenging for traditional data quality tools to catch because those tools rely on exact matches or rigid rules, missing these subtle variations.
However, Gemini excels at understanding and interpreting human language, making it well suited for this task. With a simple prompt, Gemini can accurately identify these inconsistencies and standardize the state names into the correct format, going beyond rigid rules and adapting to nuances of the human language.
Here’s how you can use Gemini in BigQuery to perform this task, using the BigQuery ML function ML.GENERATE_TEXT, which lets you perform gen AI tasks on data stored in BigQuery through a remote connection to Gemini hosted in Vertex AI:
```sql
SELECT
  prompt,
  REPLACE(REPLACE(REPLACE(ml_generate_text_llm_result, 'json', ''), '\n', ''), '```', '') AS ml_generate_text_llm_result,
  address_id,
  address_line1,
  address_line2,
  address_city,
  address_state,
  address_zipcode,
  address_country
FROM
  ML.GENERATE_TEXT(
    MODEL `bigquery_demo.gemini-pro`,
    (
      SELECT
        CONCAT('Check if the given address_state field is as per ANSI 2-letter standard. If not, convert it into the recommended format. Also check if the address_state is a valid US state. Return only the output with input, output and is_valid_us_state fields. address_state: ', address_state) AS prompt,
        *
      FROM
        `bigquery_demo.customer_addresses`
    ),
    STRUCT(TRUE AS flatten_json_output));
```
This code sends each address_state value to Gemini with a prompt asking it to validate and standardize the input. Gemini then returns a JSON response with the original input, the standardized output, and a boolean indicating whether the state is valid.
In this instance, Gemini has automated and streamlined our data quality process and reduced the complexity of the code. The ml_generate_text_llm_result column contains the validation output: with a simple prompt, we are able to correctly identify the rows that have an invalid state value and convert the state columns to a standard format. With a more traditional approach, this would have required multiple SQL expressions, external APIs, or a join against a lookup table.
The above example is just a glimpse into how Gemini can improve data quality. Beyond basic validation and standardization, gen AI models also excel at more nuanced tasks. For instance, they can classify data errors by severity (low, medium, high) for prioritized action and effectively handle mixed-language text fields by detecting language discrepancies. For more detailed examples, check out this code repo, which shows how to leverage gen AI models for semantic search in BigQuery that you could use to identify duplicate records.
Important considerations for large datasets:
When working with large datasets, sending individual requests to an LLM like Gemini can become inefficient and may exceed usage quotas. To optimize performance and manage costs, consider batching requests, and make sure your Google Cloud project has sufficient API quota.
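One simple batching approach, sketched below under the assumption that many values can share one prompt (the helper names and batch size are illustrative): group the values into chunks and send each chunk as a single request, rather than one request per row.

```python
def chunk_values(values, batch_size=50):
    """Yield successive batches of values so many rows share one LLM request."""
    for i in range(0, len(values), batch_size):
        yield values[i:i + batch_size]


def build_batch_prompt(states):
    """Build one prompt covering a whole batch of address_state values."""
    listed = "\n".join(f"- {s}" for s in states)
    return (
        "For each address_state value below, return a JSON array of objects "
        "with input, output (ANSI 2-letter code) and is_valid_us_state fields.\n"
        + listed
    )


states = ["Texas", "Pennsylvaniaa", "CA", "NY", "Florida"]
# With batch_size=2 this produces 3 requests instead of 5.
prompts = [build_batch_prompt(batch) for batch in chunk_values(states, batch_size=2)]
```

Because the model is asked for a JSON array, each batched response can still be flattened back to one row per input value downstream.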
3. Data generation: Unlocking insights from unstructured data
Unstructured data like images, videos, and PDFs holds valuable information that has historically been difficult to translate into structured data use cases. Gemini’s industry-leading multimodal context window of up to 2 million tokens allows us to extract structured data from these sources for downstream usage.
However, some gen AI models can be unreliable and prone to hallucinations, posing challenges for consistent data processing. To address this in practice, you can use Gemini’s system instructions, controlled generation, grounding, and Vertex AI evaluation services. System instructions guide the model’s behavior, while controlled generation instructs the model to output in a specific format such as JSON, enforcing structured outputs that adhere to a predefined schema. Evaluation lets you automate selecting the best response and provides associated quality metrics and explanations. Finally, grounding tethers the output to private or up-to-date public data, reducing the likelihood of the model inventing content. The model’s structured output can then be integrated with BigQuery for downstream analysis and used in data pipelines and ML workflows, helping to ensure consistency and reliability in business applications.
Let’s take a look at an example inspired by the YouTube ABCDs, where we use one of the latest Gemini models, Gemini 2.0 Flash, to analyze an ad video on YouTube and check whether it follows YouTube best practices, using the following prompt:
```python
from google import genai
from google.genai import types
import base64


def generate():
    client = genai.Client(
        vertexai=True,
        project="YOUR_PROJECT_ID",
        location="us-central1",
    )

    text1 = types.Part.from_text("""You are a creative expert who analyzes and labels video ads to answer
specific questions about the content in the video and how it adheres to a set of features.
Answer the following questions with either \\"True\\" or \\"False\\" and provide a detailed explanation to
support your answer. The explanation should be thorough and logically sound, incorporating relevant
facts and reasoning. Only base your answers strictly on what information is available in the video
attached. Do not make up any information that is not part of the video.

These are the questions that you have to answer for each feature:
1. does the brand show in the first 5 seconds?
2. is there consistent brand presence throughout the ad video?
3. is there a clear call to action in the ad?""")
    video1 = types.Part.from_uri(
        file_uri="https://www.youtube.com/watch?v=OMVpP-Zam1A",
        mime_type="video/*",
    )

    model = "gemini-2.0-flash-exp"
    contents = [
        types.Content(
            role="user",
            parts=[text1, video1],
        )
    ]
    generate_content_config = types.GenerateContentConfig(
        temperature=1,
        top_p=0.95,
        max_output_tokens=8192,
        response_modalities=["TEXT"],
        safety_settings=[
            types.SafetySetting(category="HARM_CATEGORY_HATE_SPEECH", threshold="OFF"),
            types.SafetySetting(category="HARM_CATEGORY_DANGEROUS_CONTENT", threshold="OFF"),
            types.SafetySetting(category="HARM_CATEGORY_SEXUALLY_EXPLICIT", threshold="OFF"),
            types.SafetySetting(category="HARM_CATEGORY_HARASSMENT", threshold="OFF"),
        ],
        response_mime_type="application/json",
        response_schema={
            "type": "ARRAY",
            "items": {
                "type": "OBJECT",
                "properties": {
                    "id": {"type": "STRING"},
                    "name": {"type": "STRING"},
                    "category": {"type": "STRING"},
                    "criteria": {"type": "STRING"},
                    "detected": {"type": "BOOLEAN"},
                    "llm_explanation": {"type": "STRING"},
                },
                "required": ["id", "name", "category", "criteria", "detected", "llm_explanation"],
            },
        },
    )

    for chunk in client.models.generate_content_stream(
        model=model,
        contents=contents,
        config=generate_content_config,
    ):
        print(chunk, end="")


generate()
```
The resulting output can easily be ingested into BigQuery as structured data for further analytical and reporting uses:
```json
[
  {
    "category": "Brand Presence",
    "criteria": "Does the brand show in the first 5 seconds?",
    "detected": true,
    "id": "brand_first_5_seconds",
    "llm_explanation": "The brand name Gemini shows up within the first 5 seconds of the video ad, clearly visible on the screen along with the text prompt that is shown.",
    "name": "Brand Visibility"
  },
  {
    "category": "Brand Presence",
    "criteria": "Is there consistent brand presence throughout the ad video?",
    "detected": true,
    "id": "consistent_brand_presence",
    "llm_explanation": "The brand name Gemini remains consistently visible in the upper left corner of the screen throughout the duration of the video ad, ensuring brand awareness.",
    "name": "Consistent Branding"
  },
  {
    "category": "Call to Action",
    "criteria": "Is there a clear call to action in the ad?",
    "detected": true,
    "id": "clear_call_to_action",
    "llm_explanation": "The video ad concludes by displaying a clear call to action directing viewers to GoogleStore.com to learn more, providing a direct path for engagement with the brand and product.",
    "name": "Call To Action"
  }
]
```
There are also considerations for choosing the right model for the task at hand. For example, larger videos or other large unstructured content may require the 2M-token context window available with Gemini Pro, whereas other tasks may be fine with the 1M-token context window of Gemini Flash.
You can also use Gemini to generate synthetic data that mimics real-world scenarios, augmenting your datasets and improving model performance. Synthetic data is artificially generated data that statistically mirrors real-world data while preserving privacy by excluding personally identifiable information (PII). This approach enables organizations to develop robust machine learning models and data-driven insights without the limitations and risks associated with using real-world data. The growing interest in synthetic data stems from its ability to address privacy concerns, overcome data scarcity, and facilitate test data generation across various industries. To learn more about synthetic data generation using gen AI, check out our in-depth blog about Generating synthetic data in BigQuery with Gretel.
Going to production: DataOps and the LLM pipeline
Once you’ve successfully implemented LLM-powered data engineering solutions, you’re ready to integrate them into your production environment. Here are a few things you’ll need to address:
- Scheduling and automation: Leverage tools like Cloud Composer or Vertex AI Pipelines to schedule and automate gen AI tasks, helping ensure continuous data processing and analysis.
- Model monitoring and evaluation: Implement an evaluation pipeline to monitor the performance of your gen AI models, allowing you to track accuracy, identify potential biases, and trigger retraining when necessary.
- Version control: Treat Gemini prompts and configurations as code, using version control systems to track changes and ensure reproducibility.
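As one hedged illustration of treating prompts as code (the registry layout and helper below are hypothetical, not a Google Cloud API): keep prompt text in version-controlled files and tag every model call with a content hash, so each output row in BigQuery can be traced back to the exact prompt revision that produced it.

```python
import hashlib

# Hypothetical prompt "registry": in practice these strings would live in
# version-controlled files alongside the pipeline code.
PROMPTS = {
    "address_state_validation": (
        "Check if the given address_state field is as per ANSI 2-letter standard. "
        "If not, convert it into the recommended format. "
        "Also check if the address_state is a valid US state."
    ),
}


def prompt_version(name):
    """Return a short, deterministic hash identifying the prompt revision,
    suitable for logging next to model outputs in BigQuery."""
    text = PROMPTS[name]
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


version = prompt_version("address_state_validation")
# Log `version` alongside every model call; whenever the prompt text changes,
# the hash changes, making output provenance reproducible.
```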
These practices help you integrate gen AI models into your data engineering production pipelines and deliver robust, scalable, and reliable solutions.
Transform your data engineering processes with gen AI
Gen AI is transforming the data engineering landscape, offering powerful capabilities for schema handling, data quality improvement, synthetic data generation, and data generation from unstructured sources. By embracing these advancements and adopting DataOps principles, you can unlock new levels of efficiency, accuracy, and insight from your data. Start experimenting with Gemini in your own data pipelines to unlock greater consistency in data processing, insights from new data sources, and ultimately, better business outcomes.