Build a hybrid data processing footprint using Dataproc on Google Distributed Cloud
Google Cloud customers interested in building or modernizing their data lake infrastructure often need to maintain at least part of their workloads and data on-premises, because of regulatory or operational requirements.
Thanks to Dataproc on Google Distributed Cloud, introduced in preview at Google Cloud Next ‘24, you can now fully modernize your data lake with cloud-based technology, while building hybrid data processing footprints that allow you to store and process on-prem data that you can’t move to the cloud.
Dataproc on Google Distributed Cloud lets you run Apache Spark processing workloads on-prem, using Google-provided hardware located within your data center, while maintaining consistency between the technology you use in the cloud and locally.
For example, a large telecommunications company in Europe is modernizing their data lake on Google Cloud, while keeping Personally Identifiable Information (PII) data on-prem, on Google Distributed Cloud, to satisfy regulatory requirements.
In this blog, we will show how to use Dataproc on Google Distributed Cloud to read on-prem PII data, calculate aggregate metrics, and upload the resulting dataset to the cloud data lake in Google Cloud Storage.
Aggregate and anonymize sensitive data on-prem
In our demo scenario, the customer is a telecommunications company storing event logs that record users’ calls:
customer_id | customer_name | call_duration | call_type | signal_strength | device_type | location
1 | <redacted> | 141 | Voice | 379 | LG Q6 | Tammieview, PA
2 | <redacted> | 26 | Video | 947 | Kyocera Hydro Elite | New Angela, FL
3 | <redacted> | 117 | Voice | 625 | Huawei Y5 | Toddville, MO
4 | <redacted> | 36 | Video | 382 | iPhone X | Richmondview, NV
5 | <redacted> | 110 | Video | 461 | HTC 10 evo | Cowanchester, KS
6 | <redacted> | 0 | Video | 326 | Galaxy S7 | Nicholsside, NV
7 | <redacted> | 200 | Data | 448 | Kyocera Hydro Elite | New Taramouth, AR
8 | <redacted> | 178 | Data | 475 | Galaxy S7 | South Heather, CT
9 | <redacted> | 200 | Voice | 538 | Oppo Reno6 Pro+ 5G | Gregoryburgh, ID
10 | <redacted> | 113 | Voice | 878 | ZTE Axon 30 Ultra 5G | Karaview, NV
11 | <redacted> | 200 | Data | 722 | Huawei P10 Lite | Petersonstad, IA
12 | <redacted> | 200 | Voice | 1 | HTC 10 evo | West Danielport, CO
13 | <redacted> | 169 | Voice | 230 | Samsung Galaxy S10+ | North Jose, SD
14 | <redacted> | 198 | Voice | 1 | Kyocera DuraForce | East Matthewmouth, AS
15 | <redacted> | 155 | Data | 757 | Oppo Find X | Tuckerchester, MD
16 | <redacted> | 0 | Data | 1 | ZTE Axon 30 Ultra 5G | New Tammy, NC
17 | <redacted> | 200 | Data | 656 | Galaxy Note 7 | East Jeanside, NJ
18 | <redacted> | 15 | Data | 567 | Huawei Y5 | Lake Patrickburgh, OH
This dataset contains PII, which for regulatory compliance must remain on-prem in the customer's own data center. To satisfy this requirement, the customer uses S3-compatible object storage on-premises to store this data. However, the customer would now like to use their broader data lake in Google Cloud to analyze signal_strength by location and identify the best areas for new infrastructure investments.
To integrate with Google Cloud data analytics while still satisfying compliance requirements, Dataproc on Google Distributed Cloud supports full local execution of Spark jobs, so the aggregation over signal_strength runs entirely on-prem. Consider this sample Spark code, which computes a call-duration-weighted average signal strength per location:
import argparse

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

parser = argparse.ArgumentParser()

parser.add_argument("--input", default="s3a://event-logs/gdc-demo/dataset.tsv")
parser.add_argument("--output", default="")

args = parser.parse_args()

# Create SparkSession
spark = SparkSession.builder.appName("demo-query").getOrCreate()

# Read data
print("Reading data from %s" % args.input)
df = spark.read.csv(args.input, sep=r"\t", header=True)

# Find weighted average signal strength by location
out = df.select("call_duration", "signal_strength", "location").withColumn(
    "adj_duration", F.col("call_duration") + 1
).withColumn(
    "signal_x_duration", F.col("adj_duration") * F.col("signal_strength")
).groupBy(
    "location"
).agg(
    F.sum("adj_duration").alias("total_call_duration"),
    F.sum("signal_x_duration").alias("total_signal_x_duration"),
).withColumn(
    "weighted_avg_signal_strength",
    F.col("total_signal_x_duration") / F.col("total_call_duration"),
).select(
    "location", "weighted_avg_signal_strength"
).orderBy(
    F.asc("weighted_avg_signal_strength")
)

out.show()

if args.output:
    print("Saving output to %s" % args.output)
    out.coalesce(1).write.option("delimiter", "\t").csv(args.output)
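The script above addresses the on-prem object store through the s3a:// scheme and Cloud Storage through gs://, so the Spark session needs both connectors available. The following is a minimal sketch of how the standard Hadoop s3a settings could be supplied when building the session; the endpoint, credentials, and bucket paths are placeholders, not details of the demo environment, and on Dataproc on GDC this configuration may instead be handled by the platform or the referenced application environment:

from pyspark.sql import SparkSession

# Hypothetical connector configuration for an on-prem, S3-compatible object
# store; the endpoint and credentials below are placeholders.
spark = (
    SparkSession.builder.appName("demo-query")
    # Point the s3a connector at the local object storage endpoint.
    .config("spark.hadoop.fs.s3a.endpoint", "https://objectstore.example.internal:9000")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    # Many S3-compatible stores require path-style addressing.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Reading s3a:// paths also requires the hadoop-aws package on the classpath,
# and writing gs:// paths requires the Cloud Storage connector.
df = spark.read.csv("s3a://event-logs/gdc-demo/dataset.tsv", sep=r"\t", header=True)
df.show(5)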
Dataproc on GDC exposes custom resources in the Kubernetes Resource Manager API to support Spark application submission. First, users obtain credentials for the GDC cluster:
gcloud container hub memberships get-credentials event-logs-gdce-cluster
Then, users can run the job shown above by creating a SparkApplication custom resource, specifying an input location in local object storage and an output location in Cloud Storage:
kubectl apply -f - <<EOF
apiVersion: "dataprocgdc.cloud.google.com/v1alpha1"
kind: SparkApplication
metadata:
  name: demo-spark-app-local
  namespace: demo-ns
spec:
  applicationEnvironmentRef: demo-app-env
  pySparkApplicationConfig:
    mainPythonFileUri: "s3a://bucket-10/demo/demo-script.py"
    args:
      - "--input=s3a://event-logs/gdc-demo/dataset.tsv"
      - "--output=gs://telecom-datalake/gdc-demo/output/"
EOF
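Because SparkApplication is a Kubernetes custom resource, the submitted job can also be inspected programmatically with standard Kubernetes tooling. The sketch below uses the official Python Kubernetes client to read the resource back and print whatever status the controller reports; the resource plural "sparkapplications" is an assumption based on common CRD naming conventions rather than a documented API detail:

from kubernetes import client, config

# Use the kubeconfig context created by the get-credentials command above.
config.load_kube_config()

api = client.CustomObjectsApi()

# "sparkapplications" as the resource plural is an assumption; adjust it to
# match the CRD installed on your GDC cluster if it differs.
app = api.get_namespaced_custom_object(
    group="dataprocgdc.cloud.google.com",
    version="v1alpha1",
    namespace="demo-ns",
    plural="sparkapplications",
    name="demo-spark-app-local",
)

# The status schema is controller-specific, so simply print what is reported.
print(app.get("status", {}))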
The resulting output in Cloud Storage identifies several areas of low signal strength:
location | weighted_avg_signal_strength
Georgefurt, MS | 1.0
Scottside, MA | 1.0
Monroemouth, FL | 1.0
Lake Robert, OH | 1.0
East Lauren, VA | 1.0
Shelleyburgh, CT | 1.0
Buckville, ID | 1.0
Garzaton, WI | 3.32
North Danielle, NY | 3.99
Port Natalie, ID | 5.43
This dataset is now available in Cloud Storage, with PII removed, as part of the customer's broader GCP data lake strategy. It opens the door to additional analysis, such as trending over time or querying the data with other analytics products such as BigQuery and Dataproc Serverless.
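For example, one way to query the exported TSV from BigQuery without loading it is to define an external table over the Cloud Storage output. The sketch below uses the BigQuery Python client; the project and dataset names are hypothetical, and the schema mirrors the two columns produced by the Spark job (the output files are written without a header row):

from google.cloud import bigquery

# Hypothetical project and dataset names, for illustration only.
bq = bigquery.Client(project="telecom-analytics-project")

# Describe the TSV files in Cloud Storage as a CSV external source.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://telecom-datalake/gdc-demo/output/*"]
external_config.options.field_delimiter = "\t"
external_config.schema = [
    bigquery.SchemaField("location", "STRING"),
    bigquery.SchemaField("weighted_avg_signal_strength", "FLOAT"),
]

table = bigquery.Table("telecom-analytics-project.telecom.signal_strength_by_location")
table.external_data_configuration = external_config

# Once created, the table can be joined with other data lake tables in SQL.
bq.create_table(table)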
Learn more
In this blog, we saw how you can use Dataproc on Google Distributed Cloud to create hybrid data processing footprints, processing sensitive on-prem data that needs to remain in your data center while moving the rest of your data to the cloud. Dataproc on Google Distributed Cloud lets you modernize your data lake while respecting regulatory and operational data residency requirements. To learn more, visit the Dataproc (https://cloud.google.com/dataproc) and Google Distributed Cloud (https://cloud.google.com/distributed-cloud) product pages.