Dataproc 2.3 on Google Compute Engine: A lightweight image with improved security
Google Cloud Dataproc is a managed service for Apache Spark and Hadoop, providing a fast, easy-to-use, and cost-effective platform for big data analytics. In June, we announced the general availability (GA) of the Dataproc 2.3 image on Google Compute Engine, whose lightweight design offers enhanced security and operational efficiency.
“With Dataproc 2.3, we have a cutting edge, high performance and trusted platform that empowers our machine learning scientists and analysts to innovate at scale.” – Sela Samin, Machine Learning Manager, Booking.com
The Dataproc 2.3 image represents a deliberate shift towards a more streamlined and secure environment for your big data workloads. Today, let’s take a look at what makes this lightweight approach so impactful:
1. Reduced attack surface and enhanced security
Dataproc 2.3 on Google Compute Engine is a FedRAMP High-compliant image designed for superior security and efficiency.
At its core, we designed Dataproc 2.3 to be lightweight, meaning it contains only the essential core components required for Spark and Hadoop operations. This minimalist approach drastically reduces the exposure to Common Vulnerabilities and Exposures (CVEs). For organizations with strict security and compliance requirements, this is a game-changer, providing a robust and hardened environment for sensitive data.
We maintain a robust security posture through a dual-pronged approach to CVE remediation, so that our images consistently meet compliance standards. This approach combines automated processes with targeted manual intervention:
- Automated remediation: We use a continuous scanning system to automatically build and patch our images with fixes for known vulnerabilities, enabling us to handle issues efficiently at scale.
- Manual intervention: For complex issues where an automated fix could cause breaking changes or involves intricate dependencies, our engineers perform deep analysis and apply targeted fixes to preserve stability and security.
2. On-demand flexibility for optional components
While the 2.3 image is lightweight, it doesn’t sacrifice functionality. Instead of pre-packaging every possible component, Dataproc 2.3 adopts an on-demand model for optional components. If your workload requires specific tools such as Apache Flink, Hive WebHCat, Hudi, Pig, Docker, Ranger, Solr, or Zeppelin, you can simply deploy them when creating your cluster. This keeps your clusters lean by default, while still offering the full breadth of Dataproc’s capabilities when you need them.
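As a rough illustration of the on-demand model, the sketch below composes a cluster-creation command that requests two optional components and prints it for review instead of running it. The cluster name, region, and the choice of `FLINK` and `ZEPPELIN` are placeholders picked for this example, not values from the announcement.

```shell
#!/bin/sh
# Dry-run sketch: build (but do not execute) a cluster-creation command
# that requests optional components on top of the lightweight 2.3 image.

# Assumption: these two components are chosen only for illustration.
components="FLINK ZEPPELIN"

# --optional-components takes a comma-separated list, so swap spaces for commas.
component_list=$(printf '%s' "$components" | tr ' ' ',')

# Placeholder cluster name and region; replace them with your own values.
cmd="gcloud dataproc clusters create my-cluster --region=your-region --image-version=2.3-ubuntu22 --optional-components=${component_list}"

# Print the command so it can be reviewed before it is run by hand.
printf '%s\n' "$cmd"
```

Printing the command first keeps the example safe to run anywhere; once the component list looks right, the printed line can be pasted into a terminal with the Google Cloud CLI installed.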
3. Faster cluster creation (with custom images)
When you deploy optional components on-demand, they are downloaded and installed while the cluster is being created, which may increase the startup time a bit. However, Dataproc 2.3 offers a powerful solution to this: custom images. You can now create custom Dataproc images with your required optional components pre-installed. This allows you to combine the security benefits of the lightweight base image with the speed and convenience of pre-configured environments, drastically reducing cluster provisioning and setup time for your specific use cases.
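Once a custom image exists, a cluster can be pointed at it with the `--image` flag, which takes the place of `--image-version`. The sketch below only assembles and prints such a command. The project ID, image name, cluster name, and region are hypothetical, and building the image itself (done with Dataproc's custom image tooling) is not shown.

```shell
#!/bin/sh
# Dry-run sketch: build (but do not execute) a command that creates a
# cluster from a previously built custom image.

# Hypothetical identifiers; substitute your own project and image name.
project_id="my-project"
image_name="my-dataproc-2-3-custom"

# Custom images are referenced by their Compute Engine image resource path.
image_uri="projects/${project_id}/global/images/${image_name}"

# --image is used instead of --image-version; the two are not combined.
cmd="gcloud dataproc clusters create my-cluster --region=your-region --image=${image_uri}"

# Print the command so it can be reviewed before it is run by hand.
printf '%s\n' "$cmd"
```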
Getting started with Dataproc 2.3
Using the new lightweight Dataproc 2.3 image is straightforward. When creating your Dataproc clusters, simply specify 2.3 (or a specific sub-minor version like 2.3.10-debian12, 2.3.10-ubuntu22, or 2.3.10-rocky9).
Here’s an example using the gcloud CLI:
    gcloud dataproc clusters create my-cluster \
        --region=your-region \
        --image-version=2.3-ubuntu22 \
        --network my-network \
        --optional-components […]
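Since the image version is passed as a plain string, a small pre-flight check in a provisioning script can catch typos before a create call is made. The sketch below is one possible check. The pattern is an assumption inferred from the version strings mentioned in this post (`2.3`, `2.3-ubuntu22`, `2.3.10-debian12`, `2.3.10-rocky9`), and the function name is made up for this example; the image version lists remain the authoritative source.

```shell
#!/bin/sh
# Sketch of a pre-flight check for 2.3 image version strings.
# Assumption: the accepted shapes are 2.3, 2.3-<os>, 2.3.<n>, and 2.3.<n>-<os>,
# with <os> limited to the three suffixes named in this post.

# Hypothetical helper: succeeds when $1 looks like a 2.3 image version.
is_dataproc_23_version() {
  printf '%s\n' "$1" | grep -Eq '^2\.3(\.[0-9]+)?(-(debian12|ubuntu22|rocky9))?$'
}

# Try a few candidate strings, including one that should be rejected.
for v in 2.3 2.3-ubuntu22 2.3.10-rocky9 2.2-debian12; do
  if is_dataproc_23_version "$v"; then
    echo "$v: looks like a 2.3 image version"
  else
    echo "$v: not a recognized 2.3 image version"
  fi
done
```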
For comprehensive details on image versions and available components, refer to the Dataproc cluster image version lists.
The Dataproc 2.3 image sets a new standard for big data processing on Google Cloud by prioritizing a lightweight, secure and efficient foundation. By minimizing the included components by default and offering flexible on-demand installation or custom image creation, Dataproc 2.3 can help you achieve higher security compliance and optimized cluster performance.
Start leveraging the enhanced security and operational efficiency of Dataproc 2.3 today and experience a new level of confidence in your big data initiatives!