GCP – A new top score: Advancing Text-to-SQL on the BIRD benchmark
In the fast-evolving world of agentic development, natural language is becoming the standard for interaction. This shift is deeply connected to the power of operational databases, where a more accurate text-to-SQL capability is a major catalyst for building better, more capable agents. From empowering non-technical users to self-serve data, to accelerating analyst productivity, the ability to accurately translate natural language questions into SQL is a game-changer. As end-user engagements increasingly happen over chat, conversations become the fundamental connection between businesses and their customers.
In an earlier post, “Getting AI to write good SQL: Text-to-SQL techniques explained,” we explored the core challenges of text-to-SQL — handling complex business context, ambiguous user intent, and subtle SQL dialects — and the general techniques used to solve them.
Today, we’re moving from theory to practice. We’re excited to share that Google Cloud has achieved a new state-of-the-art result on the BIRD benchmark’s Single Trained Model Track, scoring 76.13 (higher is better), ahead of any other single-model solution. In general, the closer you get to the benchmark of human performance (92.96), the harder it is to score incremental gains.
BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) is an industry standard for testing text-to-SQL solutions. BIRD spans over 12,500 unique question-SQL pairs from 95 databases with a total size of 33 GB. The Single Trained Model Track is designed to measure the raw, intrinsic capability of the model itself, restricting the use of complex preprocessing, retrieval, or agentic frameworks often used to boost model accuracy. In other words, success here reflects an advancement in the model’s core ability to generate SQL.
Gemini takes the #1 spot on the BIRD leaderboard (October ‘25)
From research to industry-leading products
This leap in more accurate natural-language-to-SQL capability, often referred to as NL2SQL, isn’t just an internal research or engineering win; it fundamentally elevates the customer experience across several key data services. Our state-of-the-art research in this field enables us to create industry-leading products that customers leverage to activate their data with agentic AI.
Consider AlloyDB AI’s natural language capability, a tool that customers use to allow end users to query the most current operational data using natural language. For instance, companies like Hughes, an EchoStar company, depend on AlloyDB’s NL2SQL capability for critical tasks like call analytics. Numerous other retail, technology, and industry players also integrate this capability into their customer-facing applications. With NL2SQL that is near-100% accurate, customers gain the confidence to build and deploy production applications that rely on real-time data access.
The benefits of NL2SQL extend to analysis, as exemplified by conversational analytics in BigQuery. This service lets business users and data analysts explore data, run reports, and extract business intelligence from vast historical datasets using natural language. The introduction of a multi-turn chat experience, combined with a highly accurate NL2SQL engine, helps them make informed decisions with the confidence that the responses from BigQuery-based applications are consistently accurate.
Finally, developers are finding new efficiencies. They have long relied on Gemini Code Assist (GCA) for code generation, aiding their application development with databases across Spanner, AlloyDB, and Cloud SQL Studio. With more accurate NL2SQL, developers will be able to use AI coding assistance to generate SQL code, too.
BIRD: a proving ground for core model capability
BIRD benchmark is one of the most commonly used benchmarks in the text-to-SQL field. It moves beyond simple, single-table queries to cover real-world challenges our models must handle, such as reasoning over very large schemas, dealing with ambiguous values, and incorporating external business knowledge. Crucially, BIRD measures a critical standard: execution-verified accuracy. This means a query is not just considered ‘correct’ if it appears right; it must also successfully run and return the correct data.
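To make “execution-verified” concrete, here is a minimal sketch of that check against a SQLite file (the format in which BIRD databases are distributed): a prediction counts as correct only if it runs and returns the same rows as the reference query. The function and comparison details are our own simplification, not the benchmark’s official evaluation script.

```python
# Minimal sketch of an execution-verified check (our simplification, not
# BIRD's official scorer): a predicted query is correct only if it executes
# and returns the same result set as the gold query.
import sqlite3

def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as incorrect
    finally:
        conn.close()
    # Compare as unordered multisets so row order does not matter
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))
```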
We specifically targeted the Single Trained Model Track because it allows us to isolate and measure the model’s core ability to solve the text-to-SQL task, rather than the ability of an ensemble (a system with multiple components such as parallel models, re-rankers, and so on). This distinction is critical: while text-to-SQL accuracy can be improved with techniques like dynamic few-shot retrieval or schema preprocessing, this track reflects the model’s true reasoning power. By focusing on a single-model solution, these BIRD results demonstrate that enhancing the core model creates a stronger foundation for systems built on top of it.
Our method: Specializing the model
Achieving a state-of-the-art score takes more than a powerful base model. The key is to specialize the model. We developed a recipe designed to transform the model from a general-purpose reasoner into a highly specialized SQL-generation expert.
This recipe consisted of three critical phases applied before inference:
- Rigorous data filtering: Ensuring the model learns from a flawless, “gold standard” dataset.
- Multitask learning: Teaching the model not just to translate, but to understand the implicit subtasks required for writing a correct SQL query.
- Test-time scaling: Using self-consistency to pick the best answer from multiple candidates.
Let’s break down each step.
Our process for achieving a SOTA result
Step 1: Start with a clean foundation (data filtering)
One important tenet of fine-tuning is “garbage in, garbage out.” A model trained on a dataset with incorrect, inefficient, or ambiguous queries may learn incorrect patterns. The training data provided by the BIRD benchmark is powerful, but like most large-scale datasets, it’s not perfect.
Before we could teach the model to be a SQL expert, we had to curate a gold-standard dataset. We used a rigorous two-stage pipeline: first, execution-based validation to execute every query and discard any that failed, returned an error, or gave an empty result. Second, we used LLM-based validation, where multiple LLMs act as a “judge” to validate the semantic alignment between the question and the SQL, catching queries that run but don’t actually answer the user’s question. This aggressive filtering resulted in a smaller, cleaner, and more trustworthy dataset that helped our model learn from a signal of pure quality rather than noise.
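For illustration, here is a minimal sketch of that two-stage filter, assuming SQLite databases; run_query and llm_judges_agree are hypothetical helpers standing in for the real pipeline and judging prompts, which aren’t detailed in this post.

```python
# Illustrative two-stage filtering sketch. The helpers below are placeholders:
# the real pipeline, judging models, and prompts are not described here.
import sqlite3

def run_query(db_path: str, sql: str):
    """Stage 1 helper: execute a query; return its rows, or None on any error."""
    try:
        with sqlite3.connect(db_path) as conn:
            return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None

def llm_judges_agree(question: str, sql: str) -> bool:
    """Stage 2 placeholder: have one or more LLM 'judges' decide whether the
    SQL semantically answers the question."""
    raise NotImplementedError("call your preferred LLM judge(s) here")

def filter_training_pairs(pairs, db_path: str):
    """Keep only (question, sql) pairs that execute, return data, and pass the judges."""
    clean = []
    for question, sql in pairs:
        rows = run_query(db_path, sql)
        if not rows:                              # failed, errored, or empty result
            continue
        if not llm_judges_agree(question, sql):   # runs, but doesn't answer the question
            continue
        clean.append((question, sql))
    return clean
```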
Step 2: Make the model a SQL specialist (multitask learning)
With a clean dataset, we could move on to the supervised fine-tuning itself. This is the process of taking a large, general-purpose model (in our case, Gemini 2.5 Pro) and training it further on our narrow, specialized dataset to make it an expert in a specific task.
To build these skills directly into the model, we leveraged the publicly available Supervised Tuning API for Gemini on Vertex AI. This service provided the foundation for our multitask supervised fine-tuning (SFT) approach, where we trained Gemini 2.5 Pro on several distinct-but-related tasks simultaneously.
We also extended our training data to cover tasks outside of the main Text-to-SQL realm, helping enhance the model’s reasoning, planning, and self-correction capabilities.
By training on this combination of tasks in parallel, the model learns a much richer, more robust set of skills. It goes beyond simple question-to-query mapping — it learns to deeply analyze the problem, plan its approach, and refine its own logic, leading to drastically improved accuracy and fewer errors.
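As a rough illustration of this setup, the sketch below shows one way to assemble a multitask JSONL dataset and launch a supervised tuning job with the Vertex AI SDK. The task names, prompt wording, placeholder schema and targets, bucket path, and model identifier are assumptions for the example, not our exact recipe or hyperparameters.

```python
# Hedged sketch: building a multitask SFT dataset and launching a Vertex AI
# supervised tuning job. The task mix, prompts, and paths are illustrative only.
import json
import vertexai
from vertexai.tuning import sft

# Distinct-but-related tasks trained together (illustrative names).
TASKS = {
    "generate_sql": "Write a SQL query that answers the question.",
    "plan_steps": "Explain, step by step, how you would answer the question with SQL.",
    "fix_sql": "The SQL below is incorrect. Rewrite it so it answers the question.",
}

def make_example(instruction: str, schema: str, question: str, target: str) -> dict:
    """One training record in the Gemini tuning JSONL format."""
    prompt = f"{instruction}\n\nSchema:\n{schema}\n\nQuestion: {question}"
    return {
        "contents": [
            {"role": "user", "parts": [{"text": prompt}]},
            {"role": "model", "parts": [{"text": target}]},
        ]
    }

# Write one JSONL line per (task, example) pair, then upload to Cloud Storage.
with open("multitask_train.jsonl", "w") as f:
    for task, instruction in TASKS.items():
        record = make_example(instruction, schema="CREATE TABLE ...",
                              question="...", target="...")
        f.write(json.dumps(record) + "\n")

# Launch the tuning job (project, region, and dataset URI are placeholders).
vertexai.init(project="your-project", location="us-central1")
tuning_job = sft.train(
    source_model="gemini-2.5-pro",
    train_dataset="gs://your-bucket/multitask_train.jsonl",
)
```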
Step 3: Inference accuracy + test-time scaling with self-consistency
The final step was to ensure we could reliably pick the model’s single best answer at test time. For this, we used a technique called self-consistency.
With self-consistency, instead of asking the model for just one answer, we ask it to generate several query candidates for the same question. We then execute these queries, cluster them by their execution results, and select a representative query from the largest cluster. This approach is powerful because if the model arrives at the same answer through different reasoning paths, that answer has a much higher probability of being correct.
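A minimal sketch of that selection loop, assuming SQLite databases and a placeholder candidate generator standing in for the tuned model, might look like this:

```python
# Self-consistency sketch: sample several SQL candidates, execute each, cluster
# them by their result set, and return a representative from the largest cluster.
# generate_candidates() is a placeholder for sampling from the tuned model.
import sqlite3
from collections import defaultdict

def generate_candidates(question: str, n: int = 5) -> list[str]:
    """Placeholder: sample n SQL candidates (e.g., with temperature > 0)."""
    raise NotImplementedError("call the tuned model here")

def self_consistent_query(db_path: str, question: str, n: int = 5):
    clusters = defaultdict(list)
    for sql in generate_candidates(question, n):
        try:
            with sqlite3.connect(db_path) as conn:
                rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue                                # discard candidates that fail to run
        result_key = repr(sorted(map(repr, rows)))  # execution result as the cluster key
        clusters[result_key].append(sql)
    if not clusters:
        return None
    # Representative query from the largest cluster (the most common result).
    return max(clusters.values(), key=len)[0]
```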
It’s important to note that self-consistency is a standard, efficient method, but it is not the only way to select a query. More complex, agentic frameworks can achieve even higher accuracy. For example, our team’s own research on CHASE-SQL (our state-of-the-art ensembling methodology) demonstrates that using diverse candidate generators and a trained selection agent can significantly outperform consistency-based methods.
For this benchmark, we wanted to focus on the model’s core performance. Therefore, we used the more direct self-consistency method: we generated several queries, executed them, and selected a query from the group that produced the most common result. This approach allowed us to measure the model’s raw text-to-SQL ability, minimizing the influence of a more complex filtering or reranking system.
The BIRD Single-Model Track explicitly allows for self-consistency, which reflects the model’s own internal capabilities. The benchmark categorizes submissions based on the number of candidates used (‘Few’, ‘Many’, or ‘Scale’). We found our “sweet spot” in the “Few” (1-7 candidates) category.
This approach gave us the final, critical boost in execution accuracy that pushed our model to the top of the leaderboard. More importantly, it proves our core thesis: by investing in high-quality data and instruction tuning, you can build a single model that is powerful enough to be production-ready without requiring a heavy, high-latency inference framework.
A recipe for customizing Gemini for text-to-SQL
A combination of clean data, multitask learning, and efficient self-consistency allowed us to take the powerful Gemini 2.5 Pro model and build a specialist that achieved the top-ranking score on the BIRD single-model benchmark.
Our fine-tuned model represents a much stronger baseline for text-to-SQL. However, it’s important to note that this score is not the upper bound of accuracy. Rather, it is the new, higher baseline we have established for the core model’s capability in a constrained setting. These results can be further amplified by either:
- creating an ensemble, that is, integrating this specialist model into a broader system that employs preprocessing (like example retrieval) or agentic scaffolding (like our CHASE-SQL research), or
- optimizing model quality for your unique database by enhancing metadata and/or query examples (which is how our customers typically deploy production workloads).
Nevertheless, the insights from this research are actively informing how we build our next-generation AI-powered products for Google Data Cloud, and we’ll continue to deliver these enhancements in our data services.
Explore advanced text-to-SQL capabilities today
We’re constantly working to infuse our products with these state-of-the-art capabilities, starting with bringing natural language queries to applications built on AlloyDB and BigQuery. For AI-enhanced retrieval, customers especially value AlloyDB and its AI functions. AlloyDB integrates AI capabilities directly into the database, allowing developers to run powerful AI models using standard SQL queries without moving data. It offers specialized operators such as AI.IF() for intelligent filtering, AI.RANK() for semantic reranking of search results, and AI.GENERATE() for in-database text generation and data transformation.
And if you want to write some SQL yourself, Gemini Code Assist can help. With a simple prompt, you can describe the query you want to create; Gemini will generate the code, and you can immediately test it by executing it against your database. We look forward to hearing about what you build with it!

