GCP – Google Cloud Serverless for Apache Spark: high-performance, unified with BigQuery
At Google Cloud, we’re committed to providing the most streamlined, powerful, and cost-effective production- and enterprise-ready serverless Spark experience. To that end, we’re thrilled to announce a significant evolution for Apache Spark on Google Cloud, with Google Cloud Serverless for Apache Spark.
Serverless Spark is now also generally available directly within the BigQuery experience. This deeply integrated experience brings the full power of Google Cloud Serverless for Apache Spark into the BigQuery unified data-to-AI platform, offering a unified developer experience in BigQuery Studio, seamless interoperability, and industry-leading price/performance.
Why Google Cloud Serverless for Apache Spark?
Apache Spark is an incredibly popular and powerful open-source engine for data processing, analytics and AI/ML. However, developers often get bogged down managing clusters, optimizing jobs, and troubleshooting, taking valuable time away from building business logic.
By simplifying your Spark experience, you can focus on deriving insights, not managing infrastructure. Google Cloud Serverless for Apache Spark (formerly Dataproc Serverless) addresses these challenges with:
- On-demand Spark for reduced total cost of ownership (TCO):
  - No cluster management. Develop business logic in Spark for interactive, batch, and AI workloads, without worrying about infrastructure.
  - Pay only for the job’s runtime, not for environment spin-up and teardown.
  - On-demand Spark environments, so no more long-running, under-utilized clusters.
- Exceptional performance:
  - Support for Lightning Engine (in Preview), a Spark processing engine with vectorized execution, intelligent caching, and optimized storage I/O, for up to 3.6x faster query performance on industry benchmarks*
  - Highly optimized BigQuery, Google Cloud Storage, and Spanner connectors
  - Full support (DDL, DML, schema evolution) for open data formats such as Apache Iceberg and Delta Lake
- Openness and flexibility:
  - Full OSS compatibility for your existing Spark code and libraries
  - Support for Google Cloud-native (BigQuery, Spanner, Bigtable) and open-source (Apache Iceberg, Apache Parquet, Delta Lake) data formats
  - Choice of language (Python, Java, Scala, R) and development environment (BigQuery Studio, Vertex AI Workbench, your own Jupyter or VS Code)
- Gemini-powered productivity and assistance at every step:
  - Gemini-based PySpark code generation for developer assistance (in Preview)
  - Gemini Cloud Assist for troubleshooting recommendations (in Preview)
- Easily distributed AI/ML:
  - Popular ML libraries like XGBoost, PyTorch, Transformers, and many more, pre-packaged with Google-certified serverless Spark images, boosting productivity, improving startup times, and reducing potential security issues from custom image management
  - GPU acceleration for distributed training and inference workloads
- Enterprise-grade security capabilities:
  - No SSH access to VMs
  - Encryption by default, including support for Customer-Managed Encryption Keys (CMEK)
  - Custom Org Policies for setting and enforcing enterprise guardrails
  - End-user credential support to ensure traceability for all data access
- Production-ready capabilities:
  - Support for job isolation, so jobs do not contend for resources
  - Full control over Spark job configuration for Spark experts
  - On-demand Spark monitoring for all jobs, so you don’t have to set up your own Persistent History Server (PHS)
  - Easy deployment using Apache Airflow/Cloud Composer operators, or the orchestration/scheduling tool of your choice
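To illustrate the Airflow/Cloud Composer path mentioned above, here is a minimal sketch of a DAG that submits a serverless Spark batch using the `DataprocCreateBatchOperator` from the Airflow Google provider package. The project ID, region, batch settings, and the Cloud Storage path to the PySpark script are all placeholder assumptions; adapt them to your environment.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateBatchOperator,
)

# Placeholder values -- substitute your own project, region, and script path.
PROJECT_ID = "your-project"
REGION = "us-central1"

with DAG(
    dag_id="serverless_spark_batch",
    start_date=datetime(2025, 1, 1),
    schedule=None,  # trigger manually, or set a cron expression
    catchup=False,
) as dag:
    # Submits a serverless Spark batch; no cluster is provisioned up front.
    create_batch = DataprocCreateBatchOperator(
        task_id="run_spark_batch",
        project_id=PROJECT_ID,
        region=REGION,
        batch_id="example-batch-{{ ds_nodash }}",
        batch={
            "pyspark_batch": {
                "main_python_file_uri": "gs://your-bucket/jobs/etl_job.py",
            },
        },
    )
```

The same batch definition can be submitted from any scheduler via the `gcloud dataproc batches submit` command or the Dataproc API, so the DAG above is one option rather than a requirement.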
A Unified Spark and BigQuery experience
Building on the power of serverless Spark, we’ve reimagined how you work with Spark and BigQuery, giving you the flexibility to use the right engine for the right job, on a unified platform with a shared notebook interface and a single copy of your data.
With the general availability of serverless Apache Spark in BigQuery, we’re bringing Apache Spark directly into the BigQuery unified data platform. This means you can now develop, run, and deploy Spark code interactively in BigQuery Studio, offering an alternative, scalable OSS processing framework alongside BigQuery’s renowned SQL engine.
“We rely on machine learning for connecting our customers with the greatest travel experiences at the best prices. With Google Serverless for Apache Spark, our platform engineers save countless hours configuring, optimizing, and monitoring Spark clusters, while our data scientists can now spend their time on true value-added work like building new business logic. We can seamlessly interoperate between engines and use BigQuery, Spark and Vertex AI capabilities for our AI/ML workflows. The unified developer experience across Spark and BigQuery, with built-in support for popular OSS libraries like PyTorch, TensorFlow, Transformers, etc., greatly reduces toil and allows us to iterate quickly.” – Andrés Sopeña Pérez, Head of Content Engineering, trivago
Key capabilities and benefits of Spark in BigQuery
Apart from all the features and benefits of Google Cloud Serverless for Apache Spark outlined above, Spark in BigQuery offers deep unification:
- Unified developer experience in BigQuery Studio:
  - Develop SQL and Spark code side-by-side in BigQuery Studio notebooks.
  - Leverage Gemini-based PySpark code generation (Preview), with the intelligent context of your data to prevent hallucination in generated code.
  - Use Spark Connect for remote connectivity to serverless Spark sessions.
  - Because Spark permissions are unified with default BigQuery roles, you can get started without needing additional permissions.
- Unified data access and engine interoperability:
  - Powered by the BigLake metastore, Spark and BigQuery can operate on a single copy of your data, whether it’s BigQuery managed tables or open formats like Apache Iceberg. No more juggling separate security policies or data governance models across engines. Refer to the documentation on using BigLake metastore with Spark.
  - Additionally, all data access to BigQuery, in both native and OSS formats, is unified via the BigQuery Storage Read API. Reads from serverless Spark jobs via the Storage API are now available at no additional cost.
- Easy operationalization:
  - Collaborate with your team and integrate into your Git-based CI/CD workflows using BigQuery repositories.
  - Orchestrate your Spark jobs with the rest of your business logic using BigQuery Pipelines and Schedules.
  - In addition to functional unification, BigQuery spend-based committed use discounts (CUDs) now apply to all usage from serverless Spark jobs. For more information about serverless Spark pricing, please visit our pricing page.
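As a sketch of the engine interoperability described above: because BigLake metastore gives Spark and BigQuery a shared catalog, a table written or registered there can be queried from a Spark session with ordinary Spark SQL. The catalog, dataset, and table names below are hypothetical placeholders, and the exact catalog configuration for your session should be taken from the BigLake metastore documentation.

```python
# Hypothetical example: "my_catalog.sales_dataset.orders" is a placeholder
# for an Iceberg table registered in BigLake metastore. The same table can
# also be queried from BigQuery SQL, since both engines share one copy of
# the data through the metastore.
from google.cloud.dataproc_spark_connect import DataprocSparkSession

spark = DataprocSparkSession.builder.getOrCreate()

spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM my_catalog.sales_dataset.orders
    GROUP BY order_date
    ORDER BY order_date
""").show()
```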
How to get started with Spark in BigQuery Studio
Getting started is incredibly easy. Within BigQuery Studio, you can spin up a Spark session using one of the templates in the notebook.
Creating a default Spark session:
You can create a default Spark session with a single line of code, as shown below.
```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession

# This line creates a default serverless Spark session powered by
# Google Cloud Serverless for Apache Spark.
spark = DataprocSparkSession.builder.getOrCreate()

# Now you can use the 'spark' variable to run your Spark code.
# For example, reading a BigQuery table:
df = spark.read.format("bigquery") \
    .option("table", "your-project.your_dataset.your_table") \
    .load()
df.show()
```
Customizing your Spark session:
If you want to customize your session — for example, use a different VPC network, or a service account — you can get full control over the session’s configuration, using existing session templates or by providing configurations inline. For detailed instructions on configuring your Spark sessions, reading from and writing to BigQuery, and more, please refer to the documentation.
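As one illustration of inline configuration, the sketch below attaches a `Session` object from the `google.cloud.dataproc_v1` client to the session builder. The builder method and field names follow the Dataproc Spark Connect client as we understand it, and the subnet and service account values are placeholders; confirm the exact names against the documentation before relying on them.

```python
# A sketch of customizing a serverless Spark session inline.
# The subnet and service account values are placeholder assumptions.
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session

session_config = Session()
exec_config = session_config.environment_config.execution_config
exec_config.subnetwork_uri = "your-custom-subnet"  # run in a specific VPC subnet
exec_config.service_account = "spark-sa@your-project.iam.gserviceaccount.com"

# Build the session with the custom configuration applied.
spark = (
    DataprocSparkSession.builder
    .dataprocSessionConfig(session_config)
    .getOrCreate()
)
```

Alternatively, the same settings can be captured once in a session template and referenced from the builder, which keeps notebooks free of environment-specific values.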
And that’s it, you are now ready to develop your business logic using the Spark session.
The bigger picture: A unified and open data cloud
With Google Cloud Serverless for Apache Spark and its new, deep integration with BigQuery, we’re breaking down barriers between powerful analytics engines, enabling you to choose the best tool for your specific task, all within a cohesive and managed environment.
We invite you to experience the power and simplicity of Google Cloud Serverless for Apache Spark and its new, deep integration with BigQuery.
We are incredibly excited to see what you will build. Stay tuned for more innovations as we continue to enhance Google Cloud Serverless for Apache Spark and its integrations across the Google Cloud ecosystem.
* The queries are derived from the TPC-H standard and as such are not comparable to published TPC-H standard results, as these runs do not comply with all requirements of the TPC-H standard specification.