Build a hybrid data processing footprint using Dataproc on Google Distributed Cloud
Google Cloud customers interested in building or modernizing their data lake infrastructure often need to maintain at least part of their workloads and data on-premises, because of regulatory or operational requirements.
Thanks to Dataproc on Google Distributed Cloud, introduced in preview at Google Cloud Next ‘24, you can now fully modernize your data lake with cloud-based technology, while building hybrid data processing footprints that allow you to store and process on-prem data that you can’t move to the cloud.
Dataproc on Google Distributed Cloud lets you run Apache Spark processing workloads on-prem, using Google-provided hardware located within your data center, while maintaining consistency between the technology you use in the cloud and locally.
For example, a large telecommunications company in Europe is modernizing their data lake on Google Cloud, while keeping Personally Identifiable Information (PII) data on-prem, on Google Distributed Cloud, to satisfy regulatory requirements.
In this blog, we will show how to use Dataproc on Google Distributed Cloud to read on-prem PII data, calculate aggregate metrics, and upload the resulting dataset to the cloud data lake in Google Cloud Storage.
Aggregate and anonymize sensitive data on-prem
In our demo scenario, the customer is a telecommunications company storing event logs that record users’ calls:
customer_id | customer_name | call_duration | call_type | signal_strength | device_type | location
1 | <redacted> | 141 | Voice | 379 | LG Q6 | Tammieview, PA
2 | <redacted> | 26 | Video | 947 | Kyocera Hydro Elite | New Angela, FL
3 | <redacted> | 117 | Voice | 625 | Huawei Y5 | Toddville, MO
4 | <redacted> | 36 | Video | 382 | iPhone X | Richmondview, NV
5 | <redacted> | 110 | Video | 461 | HTC 10 evo | Cowanchester, KS
6 | <redacted> | 0 | Video | 326 | Galaxy S7 | Nicholsside, NV
7 | <redacted> | 200 | Data | 448 | Kyocera Hydro Elite | New Taramouth, AR
8 | <redacted> | 178 | Data | 475 | Galaxy S7 | South Heather, CT
9 | <redacted> | 200 | Voice | 538 | Oppo Reno6 Pro+ 5G | Gregoryburgh, ID
10 | <redacted> | 113 | Voice | 878 | ZTE Axon 30 Ultra 5G | Karaview, NV
11 | <redacted> | 200 | Data | 722 | Huawei P10 Lite | Petersonstad, IA
12 | <redacted> | 200 | Voice | 1 | HTC 10 evo | West Danielport, CO
13 | <redacted> | 169 | Voice | 230 | Samsung Galaxy S10+ | North Jose, SD
14 | <redacted> | 198 | Voice | 1 | Kyocera DuraForce | East Matthewmouth, AS
15 | <redacted> | 155 | Data | 757 | Oppo Find X | Tuckerchester, MD
16 | <redacted> | 0 | Data | 1 | ZTE Axon 30 Ultra 5G | New Tammy, NC
17 | <redacted> | 200 | Data | 656 | Galaxy Note 7 | East Jeanside, NJ
18 | <redacted> | 15 | Data | 567 | Huawei Y5 | Lake Patrickburgh, OH
This dataset contains PII, which for regulatory compliance must remain on-prem in the customer's own data center. To satisfy this requirement, the customer uses S3-compatible object storage on-premises to store this data. However, the customer would now like to use their broader data lake in Google Cloud to analyze signal_strength by location and identify the best areas for new infrastructure investments.
To integrate with Google Cloud data analytics while still satisfying compliance requirements, Dataproc on Google Distributed Cloud supports full local execution of Spark jobs, so the aggregation over signal_strength runs entirely on-prem. Consider this sample Spark code, which computes a call-duration-weighted average signal strength per location:
import argparse

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

parser = argparse.ArgumentParser()

parser.add_argument("--input", default="s3a://event-logs/gdc-demo/dataset.tsv")
parser.add_argument("--output", default="")

args = parser.parse_args()

# Create SparkSession
spark = SparkSession.builder.appName("demo-query").getOrCreate()

# Read data
print("Reading data from %s" % args.input)
df = spark.read.csv(args.input, sep=r"\t", header=True)

# Find weighted average signal strength by location
out = df.select("call_duration", "signal_strength", "location").withColumn(
    "adj_duration", F.col("call_duration") + 1
).withColumn(
    "signal_x_duration", F.col("adj_duration") * F.col("signal_strength")
).groupBy(
    "location"
).agg(
    F.sum("adj_duration").alias("total_call_duration"),
    F.sum("signal_x_duration").alias("total_signal_x_duration"),
).withColumn(
    "weighted_avg_signal_strength",
    F.col("total_signal_x_duration") / F.col("total_call_duration"),
).select(
    "location", "weighted_avg_signal_strength"
).orderBy(
    F.asc("weighted_avg_signal_strength")
)

out.show()

if args.output:
    print("Saving output to %s" % args.output)
    out.coalesce(1).write.option("delimiter", "\t").csv(args.output)
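The script above addresses the on-prem object store through the s3a:// scheme and Cloud Storage through gs://, so the Spark session needs both connectors available. The following is a minimal sketch of how the standard Hadoop s3a settings could be supplied when building the session; the endpoint, credentials, and bucket paths are placeholders, not details of the demo environment, and on Dataproc on GDC this configuration may instead be handled by the platform or the referenced application environment:

from pyspark.sql import SparkSession

# Hypothetical connector configuration for an on-prem, S3-compatible object
# store; the endpoint and credentials below are placeholders.
spark = (
    SparkSession.builder.appName("demo-query")
    # Point the s3a connector at the local object storage endpoint.
    .config("spark.hadoop.fs.s3a.endpoint", "https://objectstore.example.internal:9000")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    # Many S3-compatible stores require path-style addressing.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Reading s3a:// paths also requires the hadoop-aws package on the classpath,
# and writing gs:// paths requires the Cloud Storage connector.
df = spark.read.csv("s3a://event-logs/gdc-demo/dataset.tsv", sep=r"\t", header=True)
df.show(5)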
Dataproc on GDC exposes custom resources in the Kubernetes Resource Manager API to support Spark application submission. First, users obtain credentials for the GDC cluster:
gcloud container hub memberships get-credentials event-logs-gdce-cluster
Then, users can run the job shown above by creating a SparkApplication custom resource, specifying an input location in local object storage and an output location in Cloud Storage:
kubectl apply -f - <<EOF
apiVersion: "dataprocgdc.cloud.google.com/v1alpha1"
kind: SparkApplication
metadata:
  name: demo-spark-app-local
  namespace: demo-ns
spec:
  applicationEnvironmentRef: demo-app-env
  pySparkApplicationConfig:
    mainPythonFileUri: "s3a://bucket-10/demo/demo-script.py"
    args:
      - "--input=s3a://event-logs/gdc-demo/dataset.tsv"
      - "--output=gs://telecom-datalake/gdc-demo/output/"
EOF
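Because SparkApplication is a Kubernetes custom resource, the submitted job can also be inspected programmatically with standard Kubernetes tooling. The sketch below uses the official Python Kubernetes client to read the resource back and print whatever status the controller reports; the resource plural "sparkapplications" is an assumption based on common CRD naming conventions rather than a documented API detail:

from kubernetes import client, config

# Use the kubeconfig context created by the get-credentials command above.
config.load_kube_config()

api = client.CustomObjectsApi()

# "sparkapplications" as the resource plural is an assumption; adjust it to
# match the CRD installed on your GDC cluster if it differs.
app = api.get_namespaced_custom_object(
    group="dataprocgdc.cloud.google.com",
    version="v1alpha1",
    namespace="demo-ns",
    plural="sparkapplications",
    name="demo-spark-app-local",
)

# The status schema is controller-specific, so simply print what is reported.
print(app.get("status", {}))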
The resulting output in Cloud Storage identifies several areas of low signal strength:
location | weighted_avg_signal_strength
Georgefurt, MS | 1.0
Scottside, MA | 1.0
Monroemouth, FL | 1.0
Lake Robert, OH | 1.0
East Lauren, VA | 1.0
Shelleyburgh, CT | 1.0
Buckville, ID | 1.0
Garzaton, WI | 3.32
North Danielle, NY | 3.99
Port Natalie, ID | 5.43
This dataset is now available in Cloud Storage, with PII removed, as part of the customer's broader GCP data lake strategy. It opens the door to additional analysis, such as trending over time or querying the data with other analytics products such as BigQuery and Dataproc Serverless.
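For example, one way to query the exported TSV from BigQuery without loading it is to define an external table over the Cloud Storage output. The sketch below uses the BigQuery Python client; the project and dataset names are hypothetical, and the schema mirrors the two columns produced by the Spark job (the output files are written without a header row):

from google.cloud import bigquery

# Hypothetical project and dataset names, for illustration only.
bq = bigquery.Client(project="telecom-analytics-project")

# Describe the TSV files in Cloud Storage as a CSV external source.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://telecom-datalake/gdc-demo/output/*"]
external_config.options.field_delimiter = "\t"
external_config.schema = [
    bigquery.SchemaField("location", "STRING"),
    bigquery.SchemaField("weighted_avg_signal_strength", "FLOAT"),
]

table = bigquery.Table("telecom-analytics-project.telecom.signal_strength_by_location")
table.external_data_configuration = external_config

# Once created, the table can be joined with other data lake tables in SQL.
bq.create_table(table)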
Learn more
In this blog, we saw how you can use Dataproc on Google Distributed Cloud to create hybrid data processing footprints, processing sensitive on-prem data that needs to remain in your data center while moving the rest of your data to the cloud. Dataproc on Google Distributed Cloud lets you modernize your data lake while respecting regulatory and operational data residency requirements. To learn more, visit the Dataproc (https://cloud.google.com/dataproc) and Google Distributed Cloud (https://cloud.google.com/distributed-cloud) product pages.