A practical guide to synthetic data generation with Gretel and BigQuery DataFrames
In our previous post, we explored how integrating Gretel with BigQuery DataFrames streamlines synthetic data generation while preserving data privacy. To recap, BigQuery DataFrames is a Python client for BigQuery, providing pandas-compatible APIs with computations pushed down to BigQuery. Gretel offers a comprehensive toolbox for synthetic data generation using cutting-edge machine learning techniques, including large language models (LLMs). Together, they enable a streamlined workflow: you can easily move data from BigQuery to Gretel and save the generated results back to BigQuery.
In this guide, we dive into the technical aspects of generating synthetic data to drive AI/ML innovation, while helping to ensure high data quality, privacy protection, and compliance with privacy regulations. We begin by working with a BigQuery patient records table, de-identifying the data in Part 1, and then generating synthetic data and saving it back to BigQuery in Part 2.
Setting the stage: Installation and configuration
You can use BigQuery Studio as the notebook runtime, which comes with BigFrames pre-installed. We assume you have a Google Cloud project set up and are familiar with pandas.
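If you are unsure whether your runtime already ships with BigQuery DataFrames, a quick import check tells you before you install anything. This is a minimal sketch; the version you see will depend on your environment.

# Optional sanity check: confirm bigframes is available in the runtime
import bigframes
print(bigframes.__version__)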
Step 1: Install the Gretel Python client and BigQuery DataFrames:
%%capture
!pip install -Uqq "gretel-client>=0.22.0"

# Install bigframes if it is not already installed
# %%capture
# !pip install bigframes
Step 2: Initialize the Gretel SDK and BigFrames: You’ll need a Gretel API key to access their services. You can obtain one from the Gretel console.
from gretel_client import Gretel
from gretel_client.bigquery import BigFrames

gretel = Gretel(api_key="prompt", validate=True, project_name="bigframes-demo")

# This is the core interface we will use moving forward!
gretel_bigframes = BigFrames(gretel)

import bigframes
import bigframes.pandas as bpd

BIGQUERY_PROJECT = "gretel-vertex-demo"

# Set BigFrames options
bpd.options.display.progress_bar = None
bpd.options.bigquery.project = BIGQUERY_PROJECT
Part 1: De-identifying and processing data with Gretel Transform v2
Before generating synthetic data, de-identifying personally identifiable information (PII) is a crucial first step towards data anonymization. Gretel’s Transform v2 (Tv2) provides a powerful and scalable framework for this and various other data processing tasks. Tv2 combines advanced transformation techniques with named entity recognition (NER) capabilities, enabling efficient handling of large datasets. Beyond PII de-identification, Tv2 can be used for data cleansing, formatting, and other preprocessing steps, making it a versatile tool in the data preparation pipeline. Learn more about Gretel Transform v2.
Step 1: Create a BigFrames DataFrame from your BigQuery table:
# Define the source project and dataset
project_id = "gretel-public"
dataset_id = "public"
table_id = "sample-patient-events"

# Construct the table path
table_path = f"{project_id}.{dataset_id}.{table_id}"

# Read the table into a BigFrames DataFrame
df = bpd.read_gbq_table(table_path)
df = df.dropna()

# Display the DataFrame
df.peek()
The table below is a subset of the DataFrame we will transform. We hash the `patient_id` column, generate replacement first names based on the value of the `sex` column, and replace last names with fake values.
patient_id     first_name  last_name  sex     race
pmc-6545753-1  Antonio     Fernandez  Male    Hispanic
pmc-6192350-1  Ana         Silva      Female  Other
pmc-6332555-4  Lina        Chan       Female  Asian
pmc-6089485-1  Omar        Hassan     Male    Black or African American
pmc-6100673-1  Aisha       Khan       Female  Asian
Step 2: Transform the data with Gretel:
# De-identification configuration
transform_config = """
schema_version: "1.0"
models:
  - transform_v2:
      steps:
        - rows:
            update:
              - name: patient_id
                value: this | hash | truncate(10, end="")
              - name: first_name
                value: >
                  fake.first_name_female() if row.sex == 'Female' else
                  fake.first_name_male() if row.sex == 'Male' else
                  fake.first_name()
              - name: last_name
                value: fake.last_name()
"""

# Submit a transform job against the BigFrames DataFrame
transform_results = gretel_bigframes.submit_transforms(transform_config, df)

# Check out our Model ID; we can re-use it later to restore results.
model_id = transform_results.model_id

print(f"Gretel Model ID: {model_id}\n")
print(f"Gretel Console URL: {transform_results.model_url}")

transform_results.wait_for_completion()
transform_results.refresh()
Step 3: Explore the de-identified data:
# Take a look at the newly transformed BigFrames DataFrame
transformed_df = transform_results.transformed_df
transformed_df.peek()
Below is a comparison of the original vs de-identified data.
Original:
patient_id     first_name  last_name  sex     race
pmc-6545753-1  Antonio     Fernandez  Male    Hispanic
pmc-6192350-1  Ana         Silva      Female  Other
pmc-6332555-4  Lina        Chan       Female  Asian
pmc-6089485-1  Omar        Hassan     Male    Black or African American
pmc-6100673-1  Aisha       Khan       Female  Asian
De-identified:
patient_id  first_name  last_name  sex     race
389b63f369  John        Hampton    Male    Hispanic
eff31024e6  Christine   Carlson    Female  Other
8af37475b6  Sarah       Moore      Female  Asian
7bd5f08fb8  Russell     Zhang      Male    Black or African American
1628622e23  Stacy       Wilkinson  Female  Asian
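Beyond eyeballing a few rows, you can spot-check the transformation directly from the two DataFrames. The snippet below is a minimal sketch (it is not part of the Gretel SDK) and assumes the `df` and `transformed_df` DataFrames created above; it verifies that no original `patient_id` values survived and that untouched quasi-identifiers such as `sex` keep their distribution.

# Spot-check the de-identification (assumes df and transformed_df from above)
original_ids = set(df["patient_id"].to_pandas())
transformed_ids = set(transformed_df["patient_id"].to_pandas())
print("Overlapping patient_ids:", len(original_ids & transformed_ids))  # expect 0

# Columns that were not transformed should keep their distributions
print(df["sex"].value_counts().to_pandas())
print(transformed_df["sex"].value_counts().to_pandas())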
Part 2: Generating synthetic data with Navigator Fine Tuning (LLM-based)
Gretel Navigator Fine Tuning (NavFT) generates high-quality, domain-specific synthetic data by fine-tuning pre-trained models on your datasets. Key features include:
- Handles multiple data modalities: numeric, categorical, free text, time series, and JSON
- Maintains complex relationships across data types and rows
- Can introduce meaningful new patterns, potentially improving ML/AI task performance
- Balances data utility with privacy protection
NavFT builds on Gretel Navigator’s capabilities, enabling the creation of synthetic data that captures the nuances of your specific data, including the distributions and correlations for numeric, categorical, and other column types, while leveraging the strengths of domain-specific pre-trained models. Learn more about Navigator Fine Tuning.
In this example, we will fine-tune a Gretel model on the de-identified data from Part 1.
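Before kicking off training, it can be helpful to confirm which modalities NavFT will see in the training data. The snippet below is a minimal sketch using the de-identified DataFrame from Part 1; the exact dtypes you see will depend on your table.

# Inspect the column types NavFT will model: numeric, categorical,
# datetime, free text, and JSON-like string columns are all candidates.
print(transformed_df.dtypes)
transformed_df.peek()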
Step 1: Fine-tune a model:
# Prepare the training configuration
base_config = "navigator-ft"  # Base configuration for training

# Define the generation parameters
generate_params = {
    "num_records": len(df),  # Number of records to generate
    "temperature": 0.7,      # Temperature parameter for data generation
}

# Submit the training job to Gretel
train_results = gretel_bigframes.submit_train(
    base_config=base_config,
    dataframe=transformed_df,
    job_label="synthetic_patient_data",
    generate=generate_params,
    group_training_examples_by="patient_id",  # Group training examples by patient_id
    order_training_examples_by="event_date",  # Order training examples by event_date
)

train_results.wait_for_completion()
train_results.refresh()
Step 2: Fetch the Gretel Synthetic Data Quality Report:
# Display the full report within this notebook
train_results.report.display_in_notebook()
The image below shows the high-level metrics from the Gretel Synthetic Data Quality Report. Please see the Gretel documentation for more details about how to interpret this report.
Step 3: Generate synthetic data from the fine-tuned model, evaluate data quality and privacy, and write the results back to a BigQuery table.
# Fetch the synthetically generated data
df_synth = train_results.fetch_report_synthetic_data()
df_synth.peek()

# Write to the destination table in BigQuery
df_synth.to_gbq(table_path)
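As a final check, you can read the table back from BigQuery to confirm the write succeeded. This is a minimal sketch; note that `table_path` above was defined for the source table, so in practice you would typically point it at a destination table in your own project before writing.

# Optional: read the freshly written table back from BigQuery
# (replace table_path with your own destination table if needed)
df_check = bpd.read_gbq_table(table_path)
df_check.peek()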
Below is a sample of the final synthetic data (columns shown in groups for readability):
patient_id  first_name  last_name  date_of_birth
c704235f91  Andrew      Sanchez    1986-01-19
c704235f91  Andrew      Sanchez    1986-01-19
c704235f91  Andrew      Sanchez    1986-01-19
c704235f91  Andrew      Sanchez    1986-01-19
a8e410d3ff  Jacqueline  Smith      2016-07-15

sex     race      weight  height
Male    Hispanic  190.0   70.0
Male    Hispanic  190.0   70.0
Male    Hispanic  190.0   70.0
Male    Hispanic  190.0   70.0
Female  Asian     89.0    48.0

event_id  event_type      event_date  event_name
1         Admission       01/21/2023  <NA>
2         Treatment       01/22/2023  IV Immunosuppression
3         Diagnosis Test  01/22/2023  Follow-up Examination
4         Discharge       01/26/2023  <NA>
1         Admission       07/15/2023  <NA>

provider_name       reason                          result
Dr. Angela Clinic   Elective right lower lobectomy  Transplant successful
Oral Health Center  Postoperative care              Stable with minimal side effects
Orthopedic Inst.    Routine check after surgery     No signs of infection or relapse
City Hospital ER    End of hospital stay            Stabilized with normal vitals
Main Hospital       Initial Checkup                 <NA>

details
{}
{"dosage":"Standard", "frequency":"Twice daily"}
{}
{"referral":"Outpatient clinic"}
{}
A few things to note about the synthetic data:
- The various modalities (JSON structures, free text) are preserved and fully synthetic while remaining semantically correct.
- Because of the group-by/order-by hyperparameters used during fine-tuning, the generated records are clustered on a per-patient basis (a quick check is sketched below).
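To see the per-patient clustering for yourself, a simple group-by over the synthetic DataFrame is enough. This is a minimal sketch assuming `df_synth` from Step 3; the exact counts will vary from run to run.

# Count synthetic events per patient to confirm records cluster by patient_id,
# mirroring group_training_examples_by="patient_id" used during fine-tuning.
events_per_patient = df_synth.groupby("patient_id")["event_id"].count().to_pandas()
print(events_per_patient.describe())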
How to use BigQuery with Gretel
This technical guide provides a foundation for leveraging Gretel AI and BigQuery DataFrames to generate and utilize synthetic data. By following these examples and exploring the Gretel documentation, you can unlock the power of synthetic data to enhance your data science, analytics, and AI development workflows while ensuring data privacy and compliance.
To learn more about generating synthetic data with BigQuery DataFrames and Gretel, explore the following resources:
- Gretel documentation
- BigQuery DataFrames documentation
- Overview and Architecture blog
Start generating your own synthetic data today and unlock the full potential of your data!
Googlers Firat Tekiner, Jeff Ferguson and Sandeep Karmarkar contributed to this blog post. Many Googlers contributed to make these features a reality.