How good is your AI? Gen AI evaluation at every stage, explained
As AI moves from promising experiments to delivering core business impact, the most critical question is no longer “What can it do?” but “How well does it do it?”
Ensuring the quality, reliability, and safety of your AI applications is a strategic imperative. To guide you, evaluation must be your North Star—a constant process that validates your direction throughout the entire development lifecycle. From crafting the perfect prompt and choosing the right model to deciding if tuning is worthwhile and evaluating your agents, robust evaluation provides the answers.
One year ago, we launched the Gen AI evaluation service, offering capabilities to evaluate a wide range of models: Google’s foundation models, open models, proprietary foundation models, and customized models. It provided online evaluation with pointwise and pairwise criteria, using both computation-based and autorater-based methods.
Since then, we’ve listened closely to your feedback and focused on addressing your most important needs. That’s why today we’re excited to dive into the new features of the Gen AI evaluation service, designed to help you scale your evaluations, evaluate your autorater, customize your autorater with rubrics, and evaluate your agents in production.
Framework to evaluate your generative AI
1. Scale your evaluation with Gen AI batch evaluation
One of the most pressing questions for AI developers is, “How can I run evaluation at scale?” Previously, scaling evaluations could be engineering-heavy, hard to maintain, and expensive: you had to build your own batch evaluation pipelines by combining multiple Google Cloud services.
The new batch evaluation feature simplifies this process, providing a single API for large datasets. This means you can evaluate large volumes of data efficiently, supporting all methods and metrics available in the Gen AI evaluation service in Vertex AI. It’s designed to be cheaper and more efficient than previous approaches.
You can learn more about how to run batch evaluation with the Gemini API in Vertex AI in this tutorial.
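As a rough illustration of the dataset-plus-metrics pattern that evaluation runs on Vertex AI build on, here is a minimal sketch using the vertexai SDK’s EvalTask interface; the project ID, dataset contents, and experiment name are placeholders, and a real batch run would point at a far larger dataset than this toy example.

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

# Placeholders: substitute your own project and location.
vertexai.init(project="your-project-id", location="us-central1")

# A tiny illustrative dataset of prompts and model responses.
eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "Summarize the refund policy in one sentence.",
            "List three risks of shipping an unevaluated model.",
        ],
        "response": [
            "Refunds are issued within 30 days of purchase with a receipt.",
            "Hallucinations, biased outputs, and unpredictable costs.",
        ],
    }
)

# Model-based (autorater) metrics taken from the built-in example templates.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.FLUENCY,
        MetricPromptTemplateExamples.Pointwise.COHERENCE,
    ],
    experiment="batch-eval-demo",
)

eval_result = eval_task.evaluate()
print(eval_result.summary_metrics)  # aggregate scores
print(eval_result.metrics_table)    # per-row scores and explanations
```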
2. Scrutinize your autorater and build trust
A common and critical concern we hear from developers is, “How can I customize and truly evaluate my autorater?” While using an LLM to assess an LLM-based application offers scale and efficiency, it also introduces valid questions about its limitations, robustness, and potential biases. The fundamental challenge is building trust in its results.
We believe that trust isn’t given; it’s built through transparency and control. Our features are designed to empower you to rigorously scrutinize and refine your autorater. This is achieved through two key capabilities:
- First, you can evaluate your autorater’s quality. By creating a benchmark dataset of human-rated examples, you can directly compare the autorater’s judgments against your “source of truth.” This allows you to calibrate its performance, measure its alignment with your human raters, and gain a clear understanding of the areas that need improvement (a simple agreement check is sketched after this list).
- Second, you can actively improve its alignment. We provide several approaches to customize your autorater’s behavior. You can refine the autorater’s prompt with specific criteria, chain-of-thought reasoning, and detailed scoring guidelines (see the custom-criteria sketch after this list). Furthermore, advanced settings and the ability to bring your own autorater and tune it with your own reference data ensure it meets your specific needs and captures unique use cases.
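To make both ideas concrete, the sketch below defines a custom pointwise metric whose criteria and rating rubric you control, then runs a toy agreement check between autorater scores and human scores. The criteria text, rubric wording, and benchmark numbers are illustrative assumptions, not recommended values.

```python
import pandas as pd
from vertexai.evaluation import PointwiseMetric, PointwiseMetricPromptTemplate

# A hypothetical custom metric: you decide what the judge looks for and
# how it maps observations onto scores.
grounded_helpfulness = PointwiseMetric(
    metric="grounded_helpfulness",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "helpfulness": (
                "The response directly answers the user's question with "
                "actionable information."
            ),
            "groundedness": (
                "Every claim in the response is supported by the provided context."
            ),
        },
        rating_rubric={
            "5": "Fully helpful and fully grounded.",
            "3": "Partially helpful, or contains minor unsupported claims.",
            "1": "Unhelpful or contradicts the provided context.",
        },
    ),
)

# Calibrating the autorater: compare its judgments with a human-rated
# benchmark (scores below are made up for illustration).
benchmark = pd.DataFrame(
    {
        "human_score": [5, 3, 1, 5, 3],
        "autorater_score": [5, 3, 3, 5, 1],
    }
)
agreement = (benchmark["human_score"] == benchmark["autorater_score"]).mean()
print(f"Exact agreement with human raters: {agreement:.0%}")
```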
Here is an example of analysis you can build with the new autorater customization features.
Check out the Advanced judge model customization series in the official documentation to learn more about how to evaluate and configure the judge model. For a practical example, here is a tutorial on how to customize your evaluations using an open autorater with Vertex AI Gen AI Evaluation.
3. Rubrics-driven evaluation
Evaluating complex AI applications can sometimes present a frustrating challenge: how can you use a fixed set of criteria when every input is different? A generic list of evaluation criteria often fails to capture the nuance of a complex multimodal use case, such as image understanding.
To solve this, our rubrics-driven evaluation feature breaks evaluation down into a two-step approach.
- Step 1 – Rubric generation: First, instead of asking users to provide a static list of criteria, the system acts like a tailored test-maker. For each individual data point in your evaluation set, it automatically generates a unique set of rubrics: specific, measurable criteria adapted to that entry’s content. You can review and customize these rubrics if needed.
- Step 2 – Targeted autorating: Next, the autorater uses these custom-generated rubrics to assess the AI’s response. This is like a teacher writing unique questions for each student’s essay based on its specific topic, rather than using the same generic questions for the whole class.
This process ensures that every evaluation is contextual and insightful. It enhances interpretability by tying every score to criteria that are directly relevant to the specific task, giving you a far more accurate measure of your model’s true performance.
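The managed rubric-based metrics handle both steps for you. Purely to make the two-step pattern concrete, here is a hand-rolled sketch against the Gemini API via the google-genai SDK; the model name, project placeholders, and prompt wording are assumptions for illustration, not the service’s actual implementation.

```python
from google import genai

# Placeholders: substitute your own project and location.
client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")
MODEL = "gemini-2.0-flash"  # assumed model name for illustration


def generate_rubrics(prompt: str) -> str:
    """Step 1: derive specific, measurable criteria from the input itself."""
    instruction = (
        "You are designing an evaluation. For the user prompt below, write "
        "three to five specific, measurable pass/fail rubrics a rater could check.\n\n"
        f"User prompt:\n{prompt}"
    )
    return client.models.generate_content(model=MODEL, contents=instruction).text


def rate_against_rubrics(prompt: str, response: str, rubrics: str) -> str:
    """Step 2: judge the response strictly against the rubrics for this example."""
    instruction = (
        "Evaluate the response against each rubric. For every rubric, answer "
        "PASS or FAIL with a one-sentence justification.\n\n"
        f"User prompt:\n{prompt}\n\nResponse:\n{response}\n\nRubrics:\n{rubrics}"
    )
    return client.models.generate_content(model=MODEL, contents=instruction).text


prompt = "Describe the trend shown in the attached chart of quarterly revenue."
response = "Revenue grew every quarter, with the largest jump in Q4."
rubrics = generate_rubrics(prompt)
print(rate_against_rubrics(prompt, response, rubrics))
```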
Here, you can see an example of the rubric-driven pairwise evaluation you can produce with the Gen AI evaluation service on Vertex AI.
Check out these examples of running rubric-based evaluation for instruction-following, multimodal, and text quality use cases. We have also worked with our research team to implement rubric-based autoraters for text-to-image and text-to-video.
4. Agent evaluation
We are at the beginning of the agentic era, where agents reason, plan, and use tools to accomplish complex tasks. However, evaluating these agents presents a unique challenge. It’s no longer sufficient to just assess the final response; we need to validate the entire decision-making process. “Did the agent choose the right tool?”, “Did it follow a logical sequence of steps?”, “Did it effectively store and use information to provide personalized answers?”. These are some of the critical questions that determine an agent’s reliability.
To address some of these challenges, the Gen AI evaluation service in Vertex AI introduces capabilities specifically for agent evaluation. You can evaluate not only the agent’s final output but also gain insights into its “trajectory”—the sequence of actions and tool calls it makes. With specialized metrics for trajectory, you can assess your agent’s reasoning path. Whether you’re building with Agent Development Kit, LangGraph, CrewAI, or other frameworks, and hosting them locally or on Vertex AI Agent Engine, you can analyze if the agent’s actions were logical and if the right tools were used at the right time. All results are integrated with Vertex AI Experiments, providing a robust system to track, compare, and visualize performance, enabling you to build more reliable and effective AI agents.
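As a sketch of what a trajectory evaluation can look like, the snippet below scores an agent’s predicted sequence of tool calls against a reference trajectory. The metric names and column layout follow the agent-evaluation tutorials; the project ID and tool calls are illustrative assumptions, so double-check the details against the documentation linked below.

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")  # placeholders

# One example: the tool calls the agent actually made vs. the calls we expected.
eval_dataset = pd.DataFrame(
    {
        "predicted_trajectory": [
            [
                {"tool_name": "search_orders", "tool_input": {"customer_id": "42"}},
                {"tool_name": "issue_refund", "tool_input": {"order_id": "A-17"}},
            ],
        ],
        "reference_trajectory": [
            [
                {"tool_name": "search_orders", "tool_input": {"customer_id": "42"}},
                {"tool_name": "issue_refund", "tool_input": {"order_id": "A-17"}},
            ],
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "trajectory_exact_match",  # same tool calls, in the same order
        "trajectory_precision",    # share of predicted calls that were expected
        "trajectory_recall",       # share of expected calls that were made
    ],
    experiment="agent-trajectory-eval",
)

print(eval_task.evaluate().summary_metrics)
```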
Here you can find detailed documentation with several examples of agent evaluation with the Gen AI evaluation service on Vertex AI.
Finally, we recognize that evaluation remains a research frontier. We believe that collaborative efforts are key to addressing current challenges. Therefore, we are actively working with companies like Weights & Biases, Arize, and Maxim AI. Together, we aim to find solutions for open challenges such as the cold-start data problem, multi-agent evaluation, and real-world agent simulation for validation.
Get started today
Ready to build reliable, production-ready LLM applications on Vertex AI? The Gen AI evaluation service in Vertex AI addresses the most requested features from users, providing a powerful, comprehensive suite for evaluating your AI applications. By enabling you to scale evaluations, build trust in your autorater, and assess multimodal and agentic use cases, we want to foster confidence and efficiency, ensuring your LLM-based applications perform as expected in production.
Check the comprehensive documentation and code examples for the Gen AI evaluation service.