Toward automated tagging: bringing bulk metadata into Data Catalog
Data Catalog lets you ingest and edit business metadata through an interactive interface, and it includes programmatic interfaces for automating common tasks. Many enterprises need to define and collect metadata in Data Catalog at scale, so we’ll offer some best practices here on how to declare, create, and maintain this metadata in the long run.
In our previous post, we looked at how tag templates can facilitate data discovery, governance, and quality control by describing a vocabulary for categorizing data assets. In this post, we’ll explore how to tag data using tag templates. Tagging refers to creating an instance of a tag template and assigning values to the fields of the template in order to classify a specific data asset. As of this writing, Data Catalog supports three storage back ends: BigQuery, Cloud Storage and Pub/Sub. We’ll focus here on tagging assets that are stored on those back ends, such as tables, columns, files, and message topics.
We’ll describe three usage models that are suitable for tagging data within a data lake and data warehouse environment: provisioning of a new data source, processing derived data, and updating tags and templates. For each scenario, you’ll see our suggested approach for tagging data at scale.
1. Provisioning data sources
Provisioning a data source typically entails several activities: creating tables or files depending on the storage back end, populating them with some initial data, and setting access permissions on those resources. We add one more activity to this list: tagging the newly created resources in Data Catalog. Here’s what that step entails.
Tagging a data source requires a domain expert who understands both the meaning of the tag templates to be used and the semantics of the data in the data source. Based on their knowledge, the domain expert chooses which templates to attach as well as what type of tag to create from those templates. It is important for a human to be in the loop, given that many decisions rely on the accuracy of the tags.
We’ve observed two types of tags based on our work with clients. One type is referred to as static because the field values are known ahead of time and are expected to change only infrequently. The other type is referred to as dynamic because the field values change on a regular basis based on the contents of the underlying data. An example of a static tag is the collection of data governance fields, such as data_domain, data_confidentiality, and data_retention. The values of those fields are determined by an organization’s data usage policies; they are typically known by the time the data source is created and they do not change frequently. An example of a dynamic tag is the collection of data quality fields, such as number_values, unique_values, min_value, and max_value. Those field values are expected to change whenever a new load runs or modifications are made to the data source.
In addition to these differences, static tags also have a cascade property that indicates how their fields should be propagated from source to derivative data. (We’ll expand on this concept in a later section.) By contrast, dynamic tags have a query expression and a refresh property to indicate the query that should be used to calculate the field values and the frequency by which they should be recalculated. An example of a config for a static tag is shown in the first code snippet, and one for a dynamic tag is shown in the second.
YAML-based static tag config
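As an illustration, a static config along these lines might look like the following sketch. The schema keys, the asset path, and the retention value are assumptions made for the example; the template ID and field names come from the data governance example above.

template: dg_template
asset: bigquery/project/my_project/dataset/hr_data/table/employee_details   # hypothetical asset path
fields:
  data_domain:
    value: HR
    cascade: true
  data_confidentiality:
    value: CONFIDENTIAL
    cascade: true
  data_retention:
    value: 90_days
    cascade: false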
YAML-based dynamic tag config
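A dynamic config could follow a similar shape, with a query expression per field and a refresh setting for the whole tag. Again, the keys, the template ID, the table, and the queries below are assumptions for illustration.

template: dq_template                # hypothetical data quality template
asset: bigquery/project/my_project/dataset/hr_data/table/employee_details   # hypothetical asset path
refresh: 24h
fields:
  number_values:
    query: select count(salary) from hr_data.employee_details
  unique_values:
    query: select count(distinct salary) from hr_data.employee_details
  min_value:
    query: select min(salary) from hr_data.employee_details
  max_value:
    query: select max(salary) from hr_data.employee_details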
As mentioned earlier, a domain expert provides the inputs to those configs when setting up the tagging for a data source. More specifically, they first select the templates to attach to the data source. Second, they choose the tag type to use, namely static or dynamic. Third, they input the values of each field and their cascade settings if the type is static, or the query expression and refresh setting if the type is dynamic. These inputs are provided through a UI so that the domain expert doesn’t need to write raw YAML files.
Once the YAML files are generated, a tool parses the configs and creates the actual tags in Data Catalog based on the specifications. The tool also schedules the recalculation of dynamic tags according to the refresh settings. While a domain expert is needed for the initial inputs, the actual tagging tasks can be completely automated. We recommend following this approach so that newly created data sources are not only tagged upon launch, but tags are maintained over time without the need for manual labor.
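To give a feel for what the automated tagging step involves, here is a minimal Python sketch that reads a static config shaped like the one above and creates the corresponding tag through the Data Catalog client library. The config schema, the entry lookup, and the assumption that every field is string-typed are simplifications, not the tool’s actual implementation.

import yaml
from google.cloud import datacatalog_v1

datacatalog = datacatalog_v1.DataCatalogClient()

def apply_static_config(config_path, project_id, location, linked_resource):
    # Parse a static tag config shaped like the YAML sketch above (assumed schema).
    with open(config_path) as f:
        config = yaml.safe_load(f)

    # Resolve the Data Catalog entry for the asset (e.g., a BigQuery table).
    entry = datacatalog.lookup_entry(request={"linked_resource": linked_resource})

    # Build a tag from the template named in the config.
    tag = datacatalog_v1.types.Tag()
    tag.template = datacatalog_v1.DataCatalogClient.tag_template_path(
        project_id, location, config["template"])
    for field_name, spec in config["fields"].items():
        # For brevity, treat every field as a string; enum and bool fields
        # would set enum_value or bool_value instead.
        tag.fields[field_name] = datacatalog_v1.types.TagField()
        tag.fields[field_name].string_value = str(spec["value"])

    return datacatalog.create_tag(parent=entry.name, tag=tag)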
2. Processing derivative data
In addition to tagging data sources, it’s important to be able to tag derivative data at scale. We define derivative data in broad terms, as any piece of data that is created from a transformation of one or more data sources. This type of data is particularly prevalent in data lake and warehousing scenarios where data products are routinely derived from various data sources.
The tags for derivative data should consist of the origin data sources and the transformation types applied to the data. The tag stores the URIs of the origin data sources along with one or more transformation types, such as aggregation, anonymization, or normalization. We recommend baking the tag creation logic into the pipeline that generates the derived data, which is doable with Airflow DAGs and Beam pipelines. For example, if a data pipeline joins two data sources, aggregates the results, and stores them in a table, you can create a tag on the result table with references to the two origin data sources and aggregation:true. You can see a Beam pipeline that creates such a tag in the snippet below:
Beam pipeline with tagging logic
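The following is a rough Python sketch of that idea: two BigQuery sources are joined and aggregated into a derived table, and a tag recording the origin tables and aggregation:true is then attached. The table names, the data_lineage_template ID, and its field names are hypothetical, and the tagging step here simply runs in the same driver once the pipeline finishes.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import datacatalog_v1

PROJECT = "my_project"                                        # hypothetical project
ORIGIN_TABLES = ["my_project.hr.employees",                   # hypothetical origin tables
                 "my_project.hr.salaries"]
DERIVED_TABLE = "my_project.hr_reports.avg_salary_by_dept"    # hypothetical result table


def tag_derived_table():
    # Attach a lineage-style tag to the derived table, recording its origins
    # and the fact that an aggregation was applied.
    datacatalog = datacatalog_v1.DataCatalogClient()
    project, dataset, table = DERIVED_TABLE.split(".")
    resource = ("//bigquery.googleapis.com/projects/{}/datasets/{}/tables/{}"
                .format(project, dataset, table))
    entry = datacatalog.lookup_entry(request={"linked_resource": resource})

    tag = datacatalog_v1.types.Tag()
    tag.template = datacatalog_v1.DataCatalogClient.tag_template_path(
        PROJECT, "us-central1", "data_lineage_template")      # hypothetical template
    tag.fields["origin_data_sources"] = datacatalog_v1.types.TagField()
    tag.fields["origin_data_sources"].string_value = ",".join(ORIGIN_TABLES)
    tag.fields["aggregation"] = datacatalog_v1.types.TagField()
    tag.fields["aggregation"].bool_value = True
    datacatalog.create_tag(parent=entry.name, tag=tag)


def run():
    options = PipelineOptions(project=PROJECT, temp_location="gs://my_bucket/tmp")
    query = """
        SELECT e.dept, AVG(s.salary) AS avg_salary
        FROM `my_project.hr.employees` e
        JOIN `my_project.hr.salaries` s USING (employee_id)
        GROUP BY e.dept
    """  # hypothetical join + aggregation over the two origin tables
    with beam.Pipeline(options=options) as p:
        (p
         | "JoinAndAggregate" >> beam.io.ReadFromBigQuery(query=query, use_standard_sql=True)
         | "WriteDerived" >> beam.io.WriteToBigQuery(
               DERIVED_TABLE.replace(".", ":", 1),
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

    # The with-block waits for the pipeline to finish; tag the result afterwards.
    tag_derived_table()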
Once you’ve tagged derivative data with its origin data sources, you can use this information to propagate the static tags that are attached to those origin data sources. This is where the cascade property comes into play: it indicates which fields should be propagated to derivative data. An example of the cascade property is shown in the first code snippet above, where the data_domain and data_confidentiality fields are both to be propagated, whereas the data_retention field is not. This means that any derived tables in BigQuery will be tagged with data_domain:HR and data_confidentiality:CONFIDENTIAL using the dg_template.
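In practice, this propagation can be a small, mechanical step: given the cascade flags from the static config, copy only the flagged fields from the source entry’s tag onto a new tag on the derived entry. Below is a hedged sketch, assuming string-typed fields and entry names that have already been resolved; it is not the tool’s actual code.

from google.cloud import datacatalog_v1

datacatalog = datacatalog_v1.DataCatalogClient()

def propagate_static_tag(source_entry, derived_entry, template_name, cascade_flags):
    # Find the source entry's tag created from the given template (e.g., dg_template).
    source_tag = next(t for t in datacatalog.list_tags(parent=source_entry)
                      if t.template == template_name)

    derived_tag = datacatalog_v1.types.Tag()
    derived_tag.template = template_name
    for field_name, should_cascade in cascade_flags.items():
        if should_cascade and field_name in source_tag.fields:
            # Copy only the fields flagged with cascade: true in the config.
            derived_tag.fields[field_name] = datacatalog_v1.types.TagField()
            derived_tag.fields[field_name].string_value = (
                source_tag.fields[field_name].string_value)
    datacatalog.create_tag(parent=derived_entry, tag=derived_tag)

# e.g. cascade_flags = {"data_domain": True, "data_confidentiality": True,
#                       "data_retention": False}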
3. Handling updates
There are several scenarios that require update capabilities for both tags and templates. For example, if a business analyst discovers an error in a tag, one or more values need to be corrected. If a new data usage policy gets adopted, new fields may need to be added to a template and existing fields renamed or removed.
We provide configs for tag and template updates, as shown in the snippets below. The tag update config specifies the current and new values for each field that is changing. The tool processes the config and updates the values of the fields in the tag based on the specification. If the updated tag is static, the tool also propagates the changes to the same tags on derivative data.
The template update config specifies the field name, field type, and any enum value changes. The tool processes the update by first determining the nature of the changes. As of this writing, Data Catalog supports field additions and deletions to templates as well as enum value additions, but field renamings or type changes are not yet supported. As a result, the tool modifies the existing template if a simple addition or deletion is requested. Otherwise, it has to recreate the entire template and all of its dependent tags.
YAML-based tag update config
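For illustration, a tag update config in this spirit might look like the following; the schema keys and the retention values are assumptions for the sketch.

template: dg_template
asset: bigquery/project/my_project/dataset/hr_data/table/employee_details   # hypothetical asset path
field_updates:
  - field: data_retention
    current_value: 90_days
    new_value: 365_days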
YAML-based template update config
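A template update config could capture the requested field changes in a similar way; the keys, the new data_sensitivity field, and its enum values are illustrative.

template: dg_template
field_changes:
  - action: add
    field: data_sensitivity          # hypothetical new field
    type: enum
    enum_values: [LOW, MEDIUM, HIGH]
  - action: rename
    field: data_domain
    new_name: business_domain        # renames aren't supported directly, so the tool recreates the template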
We’ve started prototyping these approaches with the goal of releasing an open-source tool that automates many of the tasks involved in creating and maintaining tags in Data Catalog, in accordance with our proposed usage model. Keep an eye out for that. In the meantime, learn more about Data Catalog tagging.