2025-12-03

GCP – No metadata? No problem, with AI and Dataplex Universal Catalog

If you’ve ever opened a dataset in BigQuery only to find columns with generic names like col1, col2, and value_x, you know the tax that poor documentation can put on analytics. At the heart of this issue is the schema — the blueprint of how your data is structured, named, and related. But when schemas are inconsistent, cryptic, or poorly documented, they create a knowledge gap that slows down discovery, governance, and trust.

This is the reality of “metadata debt.” A column named cust_id might mean “customer identifier” in one dataset and “customs record ID” in another. Multiply this ambiguity across hundreds of tables and thousands of columns, and you have a problem that plagues even the most modern data stacks.

For data engineers, analysts, and governance teams, the challenge of inadequate metadata is familiar:

Manual documentation doesn’t scale. Even with a dedicated data steward, keeping table and column descriptions up to date is a losing battle.
Context is scattered. Some details live in team wikis, others in spreadsheets, and some only in the minds of engineers who have since moved on.
Governance gets bottlenecked. Without clear definitions, policy enforcement and data classification become guesswork.

And now, with the rise of AI agents in analytics workflows, a poorly documented schema isn’t just an inconvenience — it’s a blocker. Simply put, an AI agent can’t query a column it doesn’t understand.

Automation enters the picture

But while AI may suffer from this problem, it can also help fix it, with automated metadata generation, a new capability in the Google Data Cloud that is now generally available. By analyzing profile data (think: data types, value distributions, patterns) alongside schema context, an AI system can draft human-readable descriptions for tables and columns — instantly.

Here’s what that means in practice:

A table named sales_fact_2025 might get a generated description like: “Contains transactional sales data for 2025, including product IDs, regions, quantities, and revenue.”
A column named qty might be described as: “Number of units sold in each transaction.”

It’s not just about filling in blanks. It’s about creating consistent, searchable, and understandable documentation that’s ready the moment a dataset lands in your environment.
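To make "profile data" concrete, here is a minimal sketch of the kind of per-column statistics a profiler might collect. It is not how Dataplex profiles data; the function name profile_column, the dictionary keys, and the sample values are all assumptions made for illustration.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: inferred type, null ratio, distinct count, top values.

    Hypothetical helper for illustration only; not a Dataplex API.
    """
    non_null = [v for v in values if v is not None]
    type_names = {type(v).__name__ for v in non_null}
    if not type_names:
        inferred = "unknown"
    elif len(type_names) == 1:
        inferred = type_names.pop()
    else:
        inferred = "mixed"
    counts = Counter(non_null)
    return {
        "inferred_type": inferred,
        "null_ratio": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct_count": len(counts),
        "top_values": counts.most_common(3),
    }


# Made-up sample values for a column such as qty.
qty_profile = profile_column([1, 2, 2, 5, None, 2, 1, 3])
print(qty_profile)
```

Small positive integers with few nulls and a handful of repeated values are the sort of signal that, combined with the column name qty, points toward a description such as "number of units sold".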

The power of BigQuery + Dataplex

In a Google Cloud environment, you can use Dataplex Universal Catalog to automate metadata creation for your BigQuery datasets, right where you work:

Profiling in Dataplex gathers statistics about your BigQuery tables.
Gemini-powered generation turns those stats into clear, contextual descriptions for tables, columns, and even glossary terms.
Dataplex Universal Catalog stores these descriptions for search, governance, and AI workflows across your environment.
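The three steps above can be sketched as a small pipeline. This is a rough illustration under stated assumptions, not the Dataplex implementation: build_prompt, document_column, and the dictionary used as a catalog are hypothetical, and fake_generate is a placeholder that stands in for the model call so the example runs offline.

```python
def build_prompt(table, column, profile):
    """Combine schema context and profile statistics into one text prompt."""
    stats = ", ".join(f"{key}={value}" for key, value in sorted(profile.items()))
    return f"Describe column '{column}' in table '{table}'. Profile: {stats}."


def document_column(catalog, table, column, profile, generate):
    """Generate a description for a profiled column and store it in the catalog."""
    description = generate(build_prompt(table, column, profile))
    catalog[(table, column)] = description
    return description


def fake_generate(prompt):
    # Placeholder for a model call; returns a canned answer (assumption).
    return "Number of units sold in each transaction."


catalog = {}  # stand-in for a metadata catalog, keyed by (table, column)
document_column(
    catalog,
    "sales_fact_2025",
    "qty",
    {"type": "INTEGER", "min": 1, "max": 40},  # made-up profile statistics
    fake_generate,
)
print(catalog)
```

Passing the generator in as an argument keeps the profiling and storage steps independent of whichever model produces the text.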

You get the benefits of automated metadata generation right away — whether you’re in the BigQuery console searching for datasets, in Dataplex applying governance policies, or in an AI-powered data agent. Benefits include:

1. Time to insights

Instead of spending upfront analysis time figuring out what the data represents, you can jump straight into querying it in BigQuery.

Before (without generated metadata): An analyst encounters a table with a column named c1. The data in the column looks like a series of numbers, but it’s unclear what they represent.
After (with generated metadata): The analyst sees the description for the c1 column: “Estimated annual revenue of the account.” They can now write the correct query from the start: SELECT account_id, c1 AS estimated_annual_revenue FROM accounts WHERE c1 > 1000000;
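The query in the "after" case can be tried locally. The sketch below runs it against an in-memory SQLite table rather than BigQuery; the three sample rows are invented, and an ORDER BY clause is added only to make the output deterministic.

```python
import sqlite3

# In-memory stand-in for the accounts table; rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id TEXT, c1 REAL)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("A-1", 250000.0), ("A-2", 1500000.0), ("A-3", 4200000.0)],
)

# Same query as in the text, with the alias taken from the generated description.
cursor = conn.execute(
    "SELECT account_id, c1 AS estimated_annual_revenue "
    "FROM accounts WHERE c1 > 1000000 ORDER BY account_id"
)
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)
```

The alias carries the meaning of c1 into the result set, so anyone reading the output sees estimated_annual_revenue instead of an opaque column name.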

2. Governance at scale

The Dataplex Universal Catalog can now store AI-generated descriptions, meaning governance rules can be applied more effectively. When every column has a description, it’s easier to spot sensitive data, classify fields, and enforce compliance policies without manual detective work.
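As a rough illustration of why descriptions help classification, the sketch below tags columns by matching keywords in their descriptions. It is a deliberately simple rule-based example, not a Dataplex feature; the labels, the keyword lists, and the c2 column are assumptions.

```python
import re

# Hypothetical label-to-keyword rules; a real policy would be far richer.
SENSITIVE_TERMS = {
    "PII": ("email", "phone", "address", "date of birth"),
    "FINANCIAL": ("revenue", "salary", "card number"),
}


def classify(descriptions):
    """Map each column to the sorted list of labels whose keywords appear in its description."""
    labels = {}
    for column, text in descriptions.items():
        lowered = text.lower()
        labels[column] = sorted(
            label
            for label, terms in SENSITIVE_TERMS.items()
            if any(re.search(rf"\b{re.escape(term)}\b", lowered) for term in terms)
        )
    return labels


labels = classify({
    "c1": "Estimated annual revenue of the account.",
    "c2": "Primary contact email for the account.",  # made-up column
    "qty": "Number of units sold in each transaction.",
})
print(labels)
```

With only the names c1 and c2 to go on, none of these rules could fire. The descriptions are what make the fields classifiable.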

3. Fuel for AI agents

Data agents rely on metadata for grounding. When descriptions are complete and consistent, the AI can map natural language requests to the right datasets with higher accuracy. That means fewer hallucinations, more relevant results, and better trust in conversational analytics.
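A toy version of this grounding step is sketched below: it ranks columns by how many words their descriptions share with a natural language request. Real agents use much stronger retrieval than word overlap; rank_columns, the region column, and its description are assumptions for illustration.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_columns(request, descriptions):
    """Return column names ordered by word overlap with the request, best first."""
    wanted = tokens(request)
    scored = [(len(wanted & tokens(text)), name) for name, text in descriptions.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ordered if score > 0]


descriptions = {
    "c1": "Estimated annual revenue of the account.",
    "qty": "Number of units sold in each transaction.",
    "region": "Sales region where the transaction occurred.",  # made-up column
}
ranked = rank_columns("annual revenue per account", descriptions)
print(ranked)

# With only cryptic names available, the same request matches nothing.
print(rank_columns("annual revenue per account", {"c1": "c1"}))
```

The second call shows the failure mode from the text: without descriptions, the request has nothing to match against.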

A customer perspective: Virgin Media O2

Virgin Media O2 is a leading British telecommunications company formed through the merger of Virgin Media and O2 UK. As one of the largest telecoms operators in the United Kingdom, providing mobile, broadband, TV and landline services, the company continues to innovate in how it manages and leverages data.

“As part of its forward-looking data strategy, Virgin Media O2 is enhancing how metadata is created, understood, and governed across its expansive data estate. With over 20,000 data assets distributed across a federated architecture of business units and data users, the organization is unlocking new opportunities to make data more meaningful, discoverable, and trusted.

To enable this, we implemented a Smart Metadata solution that combines the power of generative AI with the deep domain knowledge of our internal experts. Leveraging BigQuery Data Insights, AI automatically generates rich, contextual metadata by analysing schema, data profiles, and relationships. For example, a column named txn_amt becomes “Transaction Amount (in GBP) — derived from the daily retail sales feed,” making data instantly meaningful to analysts and business users. This metadata is then refined and validated through crowdsourced input from specialists across the organization — ensuring it reflects real-world business context and remains accurate, relevant, and actionable.

By blending automation with human intelligence, Virgin Media O2 built a scalable, governed, and intelligent metadata foundation. This approach enhances data discovery, strengthens data quality, and empowers teams across departments to make confident, data-driven decisions — turning metadata into a strategic enabler of innovation, trust, and enterprise-wide value.”

– Chandu Bhuman, Head of Data Strategy, Cloud & Engineering, Virgin Media O2

Looking ahead

Automated metadata generation doesn’t replace human judgment — you’ll still want to review and refine key business definitions — but it does close the gap between when data is created and when it becomes usable. For data analytics teams running on Google Cloud, that’s not just a productivity boost — it’s the foundation for the next wave of analytics, where humans and AI agents work from the same, clearly defined context. Automated metadata generation is also accessible through an API, making it easy to integrate with existing data engineering pipelines. To get started with automated metadata generation in Google Data Cloud, visit the documentation.
