This is part 2 of the Data Governance blog series published in January. This post focuses on the technology used to implement data governance in the cloud.
Along with a corporate governance policy and a dedicated team of people, a successful data governance program requires tooling. Multiple technologies are integrated to manage the data life cycle: securing data, retaining and reporting audits, enabling data discovery, tracking lineage, and automating monitoring and alerts.
Google Cloud offers a comprehensive set of tools that enable organizations to manage their data securely, ensure governance, and drive data democratization. These tools fall into the following categories:
Data Security
Data security covers the entire life of the data: from the point where it is generated or acquired, through transmission and permanent storage, to retirement at the end of its life. Multiple strategies, supported by various tools, are used to secure data and to identify and fix vulnerabilities as it moves through the data pipeline.
Google Cloud’s Security Command Center is a centralized vulnerability and threat reporting service. It is a built-in security management tool for Google Cloud that helps organizations prevent, detect, and remediate vulnerabilities and threats. It can identify security and compliance misconfigurations in your Google Cloud assets and provides actionable recommendations to resolve them.
Data Encryption
All data in Google Cloud is encrypted by default, both in transit and at rest. All VM-to-VM traffic, client connections to BigQuery, serverless Spark, and Cloud Functions, and communication with all other Google Cloud services within a VPC, as well as between peered VPCs, is encrypted by default.
In addition to the default encryption provided out of the box, customers can manage their own encryption keys in Cloud KMS. Client-side encryption, where customers keep full control of the encryption keys at all times, is also available.
Data Masking and Tokenization
While data encryption ensures that data is stored and travels in an encrypted form, end users can still see the sensitive data when they query the database or read a file. Several compliance regulations require de-identifying or tokenizing sensitive data. For example, GDPR recommends data pseudonymization to “reduce the risk on data subjects”. De-identified data reduces the organization’s obligations on data processing and usage.
Tokenization, another data obfuscation method, makes it possible to perform data processing tasks, such as verifying credit card transactions, without knowing the real credit card number. Tokenization replaces the original value of the data with a unique token. The difference between tokenization and encryption is that data encrypted using keys can be deciphered using the same keys, while tokens are mapped to the original data only in the tokenization server. Without access to the token server, a bad actor who obtains a token cannot recover the original value.
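The difference between a token and a ciphertext can be illustrated with a minimal sketch. The TokenVault class below is a hypothetical, in-memory stand-in for a tokenization server, not a Google Cloud API; a real token server would add persistence, authentication, and audit logging.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory stand-in for a tokenization server."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so one value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it carries no information about the value.
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only a caller with access to the vault can recover the original.
        return self._token_to_value[token]


vault = TokenVault()
card = "4111111111111111"  # well-known test card number, used for illustration
token = vault.tokenize(card)
```

Downstream systems can store and join on the token, while only a service with access to the vault can call detokenize to recover the card number.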
Google’s Cloud Data Loss Prevention (DLP) automatically detects, obfuscates, and de-identifies sensitive information in your data using methods like data masking and tokenization. When building data pipelines or migrating data into the cloud, integrate Cloud DLP to automatically detect and de-identify or tokenize sensitive data. Data scientists and users can then build models and reports while minimizing the risk of compliance violations.
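To make the detect-then-mask idea concrete, here is a small sketch that uses plain regular expressions. It is not the Cloud DLP API and its two patterns are simplified assumptions; a managed service recognizes many more data types and handles far more edge cases.

```python
import re

# Simplified, assumed patterns for two kinds of sensitive data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def mask_card(match: re.Match) -> str:
    # Keep only the last four digits and mask the rest.
    digits = re.sub(r"\D", "", match.group())
    return "*" * (len(digits) - 4) + digits[-4:]


def deidentify(text: str) -> str:
    # Replace e-mail addresses with a placeholder, then mask card numbers.
    text = EMAIL_RE.sub("[EMAIL]", text)
    return CARD_RE.sub(mask_card, text)


record = "Contact jane@example.com, card 4111 1111 1111 1111."
masked = deidentify(record)
print(masked)  # Contact [EMAIL], card ************1111.
```

In a pipeline, a step like deidentify would run before data lands in the analytics store, so reports and models only ever see the masked values.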
Fine Grained Access Control
BigQuery supports fine-grained access control for your data in Google Cloud. BigQuery access control policies can be created to limit access at the column and row level. Combining column- and row-level access control with DLP allows you to create datasets that have both a safe (masked or encrypted) version of the data and a clear version. This promotes data democratization: the CDO can trust the guardrails of Google Cloud to grant access correctly according to user identity, with audit logs providing a system of record. Data can be shared across the organization to run analysis and build machine learning models while ensuring that sensitive data remains inaccessible to unauthorized users.
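The effect of combining row-level filtering with column-level masking can be sketched in a few lines of plain Python. The Policy class, the sample rows, and the user names are all hypothetical; this simulates the behavior and does not use BigQuery's own policy syntax.

```python
from dataclasses import dataclass

MASK = "****"


@dataclass
class Policy:
    clear_columns: set    # columns the user may see unmasked
    allowed_regions: set  # row filter: regions the user may read


ROWS = [
    {"customer": "A", "region": "EU", "ssn": "123-45-6789"},
    {"customer": "B", "region": "US", "ssn": "987-65-4321"},
]

POLICIES = {
    "analyst@example.com": Policy({"customer", "region"}, {"EU"}),
    "steward@example.com": Policy({"customer", "region", "ssn"}, {"EU", "US"}),
}


def query(user: str, rows=ROWS) -> list:
    policy = POLICIES.get(user)
    if policy is None:
        return []  # unknown users see nothing
    return [
        # Column-level control: mask every column not cleared for this user.
        {col: (val if col in policy.clear_columns else MASK)
         for col, val in row.items()}
        for row in rows
        # Row-level control: drop rows outside the user's regions.
        if row["region"] in policy.allowed_regions
    ]
```

With these assumed policies, the analyst sees one EU row with the ssn column masked, while the steward sees both rows in the clear.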
Data Discovery, Classification and Data Sharing
The ability to find data easily is crucial to an effective data-driven organization. Data governance programs leverage data catalogs to create an enterprise repository of all metadata. These catalogs allow data stewards and data users to add custom metadata and create business glossaries, and they allow data analysts and scientists to search across the organization for data to analyze. Certain data catalogs also let users request access to data from within the catalog; requests can be approved or denied based on policies created by data stewards.
Google Cloud offers a fully managed and scalable Data Catalog to centralize metadata and support data discovery. Data Catalog adheres to the same access controls the user has on the data, so users cannot search for data they cannot access. Further, Data Catalog is natively integrated into the GCP data fabric, so there is no need to manually register new datasets in the catalog: the same “search” technology that scours the web auto-indexes newly created data.
In addition, Google partners with major data governance platforms, such as Collibra and Informatica, to provide unified support for your on-prem and multi-cloud data ecosystem.
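As a rough illustration of search that honors access controls, the sketch below filters catalog hits by the caller's read permissions. The CATALOG entries, tags, and user names are hypothetical, and this is not the Data Catalog API.

```python
# Hypothetical catalog entries: name, searchable tags, and permitted readers.
CATALOG = [
    {"name": "sales.orders", "tags": {"orders", "revenue"},
     "readers": {"analyst@example.com"}},
    {"name": "hr.salaries", "tags": {"salary", "pii"},
     "readers": {"hr@example.com"}},
]


def search(user: str, term: str) -> list:
    term = term.lower()
    return [
        entry["name"]
        for entry in CATALOG
        # Entries the user cannot read never appear in the results.
        if user in entry["readers"]
        and (term in entry["name"].lower() or term in entry["tags"])
    ]
```

Here an analyst searching for "salary" gets no results because the matching entry is readable only by the HR user.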