GCP – Google Cloud’s open lakehouse: Architected for AI, open data, and unrivaled performance
The Google Data Cloud is a uniquely integrated platform built on Google’s planet-scale infrastructure, infused with AI, and featuring an open lakehouse architecture for multimodal data. Already, organizations like Snap Inc. credit Google’s Data Cloud and its open lakehouse architecture with empowering their data engineers and data scientists to do more with their data assets.
“Partnering with Google Cloud has been instrumental in our journey to build Snap’s next-generation, open lakehouse and democratize Spark and Iceberg in our developer community!” – Zhengyi Liu, Senior Manager – Software Engineering, Snap Inc.
Today, we’re excited to announce a series of innovations to our AI-powered lakehouse that sets a new standard for openness, intelligence, and performance. These innovations include:
- BigLake Iceberg native storage: leverages Google’s Cloud Storage (GCS) to provide an enterprise-grade experience for managing and interoperating with Iceberg data. This includes BigLake tables for Apache Iceberg (GA) and BigLake metastore with a new REST Catalog API (Preview).
- Unified operational and analytical engines: building on the BigLake foundation, customers can seamlessly interoperate on the same Iceberg open data foundation, using BigQuery for analytical workloads (GA) and AlloyDB for PostgreSQL (Preview) for operational needs.
- Performance acceleration for BigQuery SQL: delivering a suite of automated SQL engine enhancements for significantly faster and more agile data processing, featuring the BigQuery advanced runtime, a low-latency query API, column metadata indexing, and an order-of-magnitude speedup for fine-grained updates/deletes.
- High-performance Lightning Engine for Apache Spark: our new Lightning Engine (Preview) is designed to supercharge Apache Spark, leveraging optimized data connectors, efficient columnar shuffle operations, in-built caching, and vectorized execution.
- Dataplex Universal Catalog: extends AI-powered intelligence and unified governance across the Google Cloud data estate by automatically discovering and organizing metadata from data to AI (including BigLake Iceberg, BigQuery, Spanner, and Vertex AI models), enabling central policy enforcement via BigLake, and supporting AI-driven curation, data insights, and semantic search.
- AI-native notebooks and tooling: developer experiences are improved with Gemini-powered notebooks, PySpark code generation, and code extensions for JupyterLab and Visual Studio Code. Additionally, third-party notebook interfaces now offer enhanced and integrated experiences.
Let’s explore these new innovations.
Expanded BigLake services: Open, unified, and interoperable
We are actively reimagining BigLake into a comprehensive storage runtime for Google Data Cloud using Google’s Cloud Storage. This approach lets you build open, managed and high-performance lakehouses that span Google native storage and data stored in open formats. As part of BigLake, we are announcing our new Iceberg native storage, which provides enterprise-grade support for Iceberg on Google’s Cloud Storage through BigLake tables for Apache Iceberg (GA). BigLake natively supports Google’s Cloud Storage management capabilities and extends these to Iceberg data, enabling you to use storage Autoclass for efficient data tiering to colder storage classes and apply customer-managed encryption keys (CMEK) to your storage buckets. BigLake is also natively supported in our Dataplex Universal Catalog, helping to ensure that centralized governance is consistently enforced across your entire data estate.
Underlying BigLake, the new BigLake metastore (GA) with an Apache Iceberg REST Catalog API (Preview) allows you to achieve true openness and interoperability across your data ecosystem while simplifying management and governance. BigLake metastore is built on Google’s planet-scale infrastructure, offering a unified, managed, serverless, and scalable service that brings together enterprise metadata spanning BigQuery, Iceberg native storage, and self-managed open formats to support analytics, operational querying, streaming, and AI. The BigLake solution enables universal engine interoperability, supporting a range of query engines — including first-party Google Cloud services such as BigQuery, AlloyDB, and Google Cloud Serverless for Apache Spark, as well as third-party and open-source engines — to consistently operate on Iceberg data managed by BigLake.
In addition, it is now easier than ever to bring data into the Iceberg native storage through our enhanced Migration Services that feature automated Iceberg table and metadata migration from Hadoop/Cloudera (Preview) and a push-button Delta to Iceberg service (Preview).
Analytical and operational engines unite on open data
When you need to perform deep analytics, BigQuery can now read and write Iceberg data using BigLake tables for Apache Iceberg. BigQuery further enhances Iceberg tables with features traditionally associated with proprietary data warehouses, offering high-throughput streaming for zero-latency queries, enhanced table management with automatic data reclustering, and the ability to build advanced ETL use cases with support for multi-table transactions (Preview). In addition, you can leverage BigQuery’s built-in AI capabilities (BQML, AI Query Engine, multimodal analysis) directly on your open datasets. Through this integration, you benefit from the openness and data ownership associated with native Iceberg storage, while simultaneously gaining access to BigQuery’s expansive capabilities. In fact, customer usage of BigLake Iceberg with BigQuery has grown nearly 3x in 18 months, and now spans hundreds of petabytes under management.
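As a minimal sketch of what this looks like in practice, the DDL below creates a BigLake table for Apache Iceberg from BigQuery, with the data stored as Parquet in your own Cloud Storage bucket. The project, dataset, connection, column, and bucket names are all placeholders:

```sql
-- Create an Iceberg table in BigQuery, backed by files in your own
-- Cloud Storage bucket (all identifiers here are illustrative).
CREATE TABLE my_dataset.orders (
  order_id INT64,
  customer_id INT64,
  order_ts TIMESTAMP,
  amount NUMERIC
)
WITH CONNECTION `my_project.us.my_connection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://my-bucket/iceberg/orders'
);
```

Because the table lives in open Iceberg format on Cloud Storage, the same data remains queryable by other engines registered against the BigLake metastore.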
Unified data management extends beyond analytics into the operational heart of your business, with AlloyDB for PostgreSQL, our high-performance operational database, which can now natively query the same BigLake-managed Iceberg data. Now, your operational applications can tap into the richness of BigLake without complex ETL, and you can apply AlloyDB AI capabilities such as semantic search and natural language querying to your Iceberg data.
Customers like Bayer have modernized their data cloud to store and analyze vast amounts of observational data using a combination of AlloyDB and BigQuery. They use BigQuery to produce real-time analytics and insights, which are operationalized by AlloyDB, delivering 50% better response rates and 5x more throughput than their previous solution.
Unleashing high-performance BigQuery SQL and serverless Spark on open data
We’re also excited to deliver new high-performance data processing, so that all data can be activated quickly and intelligently. We continue to innovate on BigQuery’s SQL engine with a suite of unique, automated performance enhancements. The BigQuery advanced runtime (Preview) can automatically accelerate analytical workloads using enhanced vectorization and a short-query-optimized mode, without requiring any user action or code changes. This is complemented by the BigQuery API optional job creation mode (GA), which optimizes query paths for short-duration, interactive queries, reducing latency. Further query efficiency is unlocked by the BigQuery column metadata index (CMETA) (GA), which helps process queries on large tables through more efficient, system-managed data pruning. Architectural improvements also mean that BigQuery fine-grained updates/deletes (Preview) now operate an order of magnitude faster, increasing agility for large-scale data operations, including on open formats.
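The fine-grained mutations in question are ordinary DML statements; no new syntax is involved. A hedged sketch against an illustrative `my_dataset.orders` table (names and values are placeholders):

```sql
-- Fine-grained, row-level mutations on a large table.
-- The speedup applies to targeted updates/deletes like these,
-- which previously rewrote far more data than they touched.
UPDATE my_dataset.orders
SET amount = amount * 0.9
WHERE order_id = 1001;

DELETE FROM my_dataset.orders
WHERE order_ts < TIMESTAMP '2024-01-01';
```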
Simultaneously, we’re launching an accelerated Apache Spark experience with our new Lightning Engine (Preview) for Apache Spark. The Lightning Engine accelerates Apache Spark performance through highly optimized data connectors for Cloud Storage and BigQuery storage, efficient columnar shuffle operations, and intelligent in-built caching mechanisms. Furthermore, our Lightning Engine leverages vectorized execution built with native C++ libraries (Velox and Gluten), optimized for Apache Spark. This powerful combination delivers 3.6x faster Spark performance on TPC-H-like benchmarks. In addition, our Spark offering is AI/ML-ready, providing pre-packaged AI libraries, updated ML runtimes, and easy GPU support, establishing Apache Spark — available via our Google Cloud Serverless for Apache Spark offering or via Dataproc cluster deployments — as a first-class, high-performance citizen in a Google Data Cloud lakehouse environment.
Dataplex Universal Catalog: AI-powered intelligence across Google Cloud
An effective AI-driven data strategy hinges on having an intelligent and active universal catalog that can operate at any scale. This is what Dataplex Universal Catalog now provides for the Google Data Cloud, transforming your entire distributed data estate into trusted, discoverable, and actionable resources.
Dataplex Universal Catalog automatically discovers, understands, and organizes metadata across your whole analytical and operational landscape. This comprehensive view now includes BigLake-native Iceberg storage, other open formats like Delta and Hudi on Cloud Storage, analytical data in BigQuery, transactional data from databases like Spanner, and metadata from machine learning models in Vertex AI—showcasing pervasive governance across Google’s Data Cloud.
Dataplex Universal Catalog is also integral to the lakehouse, enabling users to define governance policies centrally and enforce them consistently across multiple data engines through BigLake. This integration supports fine-grained access controls and strengthens governance across all engines of choice in Google’s Data Cloud. The BigLake solution supports credential vending, which allows users to securely extend centrally defined policies all the way to data in Cloud Storage.
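As one hedged illustration of what fine-grained control looks like in BigQuery SQL, a row access policy restricts which rows a principal can see; the table, column, policy, and group names below are placeholders:

```sql
-- Limit a group to US rows only (all identifiers are illustrative).
-- Once a row access policy exists on a table, queries from principals
-- not covered by any policy return zero rows.
CREATE ROW ACCESS POLICY us_only
ON my_dataset.orders
GRANT TO ('group:us-analysts@example.com')
FILTER USING (region = 'US');
```

Centrally defined policies like this can then be enforced consistently for other engines reading the same data through BigLake.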
Dataplex Universal Catalog is powered by AI, with a Gemini-enhanced knowledge graph, transforming metadata into dynamic, actionable intelligence. Here, AI automates metadata curation, infers hidden relationships between data elements, proactively recommends insights from data backed by complex queries, and enables semantic search with natural language. It also fuels new AI-powered experiences and autonomous agents. For instance, Gemini-powered assistance using Dataplex Universal Catalog shows 50% greater precision in identifying datasets, significantly accelerating insights. Dataplex Universal Catalog is also the foundation of an open ecosystem with seamless metadata federation to platforms like Collibra, and ensures broad connectivity through Dataplex Universal Catalog APIs.
Empowering practitioners with AI-native notebooks and tooling
At Google Cloud, our goal is to revolutionize the data practitioner’s experience by embedding sophisticated AI and lakehouse integrations directly into their preferred tools and workflows. This commitment to an open, flexible, and intelligent environment lets data scientists, engineers, and analysts unlock new levels of productivity and innovation.
Making this possible are our next-gen, AI-native BigQuery Notebooks, which offer a unified and interoperable development experience across SQL, Python, and Apache Spark. This experience is enhanced by deeply embedded Gemini assistive capabilities. Gemini acts as an intelligent collaborator, offering advanced PySpark code generation, insightful explanations of complex code, and direct integration with Cloud Assist Investigations for serverless Spark troubleshooting (Preview), dramatically reducing development friction and accelerating the path from data to insight.
Furthermore, new JupyterLab and Visual Studio Code extensions for BigQuery, Dataproc and Google Cloud Serverless for Apache Spark (Preview) allow developers to connect to Google Cloud’s open lakehouse capabilities directly from their preferred IDEs with minimal setup. Users can start developing within minutes with access to all their lakehouse datasets and files in their preferred tool, supporting their end-to-end journey from development to deployment. The consumption of notebooks using serverless Spark more than quadrupled from Q1 2024 to Q1 2025.
Together, these integrated advancements help deliver an adaptable, intelligent, high-performance Data Cloud anchored on the lakehouse architecture, equipping organizations to connect all of their data to Google’s AI, unlock its full potential, and define innovation in the AI era. Click here to learn more and sign up for early access to these new capabilities. We’re excited to see the solutions you’ll build.