Introducing agent evaluation in Vertex AI Gen AI evaluation service
Comprehensive agent evaluation is essential for building the next generation of reliable AI. It’s not enough to simply check the outputs; we need to understand the “why” behind an agent’s actions – its reasoning, decision-making process, and the path it takes to reach a solution.
That’s why today, we’re thrilled to announce that agent evaluation in the Vertex AI Gen AI evaluation service is now in public preview. This new capability empowers developers to rigorously assess and understand their AI agents. It includes a powerful set of evaluation metrics specifically designed for agents built with different frameworks, and provides native agent inference capabilities to streamline the evaluation process.
In this post, we’ll explore how evaluation metrics work and share an example of how you can apply this to your agents.
Evaluate agents using Vertex AI Gen AI evaluation service
Our evaluation metrics can be grouped into two categories: final response evaluation and trajectory evaluation.
Final response asks a simple question: does your agent achieve its goals? You can define custom final response criteria to measure success according to your specific needs. For example, you can assess whether a retail chatbot provides accurate product information or if a research agent summarizes findings effectively, using appropriate tone and style.
To look below the surface, we offer trajectory evaluation to analyze the agent’s decision-making process. Trajectory evaluation is crucial for understanding your agent’s reasoning, identifying potential errors or inefficiencies, and ultimately improving performance. We offer six trajectory evaluation metrics to help you answer these questions (a short code sketch after the list illustrates how these comparisons work):
1. Exact match: Requires the AI agent to produce a sequence of actions (a “trajectory”) that perfectly mirrors the ideal solution.
2. In-order match: The agent’s trajectory needs to include all the necessary actions in the correct order, but it might also include extra, unnecessary steps. Imagine following a recipe correctly but adding a few extra spices along the way.
3. Any-order match: Even more flexible, this metric only cares that the agent’s trajectory includes all the necessary actions, regardless of their order. It’s like reaching your destination, regardless of the route you take.
4. Precision: This metric focuses on the accuracy of the agent’s actions. It calculates the proportion of actions in the predicted trajectory that are also present in the reference trajectory. A high precision means the agent is making mostly relevant actions.
5. Recall: This metric measures the agent’s ability to capture all the essential actions. It calculates the proportion of actions in the reference trajectory that are also present in the predicted trajectory. A high recall means the agent is unlikely to miss crucial steps.
6. Single-tool use: This metric checks for the presence of a specific action within the agent’s trajectory. It’s useful for assessing whether an agent has learned to utilize a particular tool or capability.
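To make these definitions concrete, here is a minimal, illustrative sketch of how these trajectory comparisons could be computed by hand, assuming a trajectory is simply an ordered list of tool-call names. The helper functions are hypothetical and not part of the Vertex AI SDK; the evaluation service computes the official metrics for you.

```python
# Illustrative only: hand-rolled trajectory comparisons, assuming a trajectory
# is an ordered list of tool-call names. These helpers are hypothetical; the
# Vertex AI Gen AI evaluation service computes the official metrics for you.

def exact_match(predicted: list[str], reference: list[str]) -> bool:
    # The predicted trajectory must mirror the reference perfectly.
    return predicted == reference

def in_order_match(predicted: list[str], reference: list[str]) -> bool:
    # All reference actions appear in order; extra steps are allowed.
    it = iter(predicted)
    return all(action in it for action in reference)

def any_order_match(predicted: list[str], reference: list[str]) -> bool:
    # All reference actions appear somewhere, in any order.
    return set(reference).issubset(set(predicted))

def precision(predicted: list[str], reference: list[str]) -> float:
    # Share of predicted actions that also appear in the reference.
    return sum(a in reference for a in predicted) / len(predicted) if predicted else 0.0

def recall(predicted: list[str], reference: list[str]) -> float:
    # Share of reference actions that appear in the prediction.
    return sum(a in predicted for a in reference) / len(reference) if reference else 0.0

def single_tool_use(predicted: list[str], tool_name: str) -> bool:
    # Checks whether a specific tool appears anywhere in the trajectory.
    return tool_name in predicted

predicted = ["lookup_order", "check_inventory", "send_reply"]
reference = ["lookup_order", "send_reply"]
print(in_order_match(predicted, reference))  # True: extra step is allowed
print(precision(predicted, reference))       # ≈ 0.67: one extra action
```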
Compatibility meets flexibility
Vertex AI Gen AI evaluation service supports a variety of agent architectures.
With today’s launch, you can evaluate agents built with Reasoning Engine (LangChain on Vertex AI), the managed runtime for your agentic applications on Vertex AI. We also support agents built with open-source frameworks, including LangChain, LangGraph, and CrewAI – and we plan to support upcoming Google Cloud services for building agents.
For maximum flexibility, you can evaluate agents using a custom function that processes prompts and returns responses. To make your evaluation experience easier, we offer native agent inference and automatically log all results in Vertex AI experiments.
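For example, such a custom function might look roughly like the sketch below. The function name, the assumed `my_langgraph_app` helper, and the output keys are illustrative assumptions; check the documentation for the exact contract the service expects.

```python
# Hypothetical sketch of a custom agent function used for evaluation.
# The output keys expected by the service may differ; see the documentation.
def my_agent_runnable(prompt: str) -> dict:
    # Call your agent however it is hosted (LangGraph app, REST endpoint, etc.).
    result = my_langgraph_app.invoke({"input": prompt})  # assumed helper
    return {
        "response": result["output"],                  # final answer text
        "predicted_trajectory": result["tool_calls"],  # sequence of actions taken
    }
```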
Agent evaluation in action
Let’s say you have the following LangGraph customer support agent, and you aim to assess both the responses it generates and the sequence of actions (or “trajectory”) it takes to produce those responses.
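The agent definition itself isn’t reproduced here, but as a stand-in, a minimal LangGraph customer-support agent might look roughly like this (the tools, model, and wiring below are illustrative assumptions, not the exact agent from the post):

```python
# Illustrative stand-in for a LangGraph customer support agent.
from langchain_core.tools import tool
from langchain_google_vertexai import ChatVertexAI
from langgraph.prebuilt import create_react_agent

@tool
def get_product_details(product_name: str) -> str:
    """Looks up details for a product in the catalog."""
    catalog = {"smartphone": "A 6.1-inch phone with 128GB of storage."}
    return catalog.get(product_name.lower(), "Product not found.")

@tool
def get_order_status(order_id: str) -> str:
    """Returns the shipping status for an order."""
    return f"Order {order_id} is out for delivery."

# A prebuilt ReAct-style agent: the model decides which tool to call at each step.
agent = create_react_agent(
    ChatVertexAI(model_name="gemini-1.5-pro"),
    tools=[get_product_details, get_order_status],
)

result = agent.invoke({"messages": [("user", "Where is order 12345?")]})
```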
To assess an agent using Vertex AI Gen AI evaluation service, you start by preparing an evaluation dataset. This dataset should ideally contain the following elements:
- User prompt: This represents the input that the user provides to the agent.
- Reference trajectory: This is the expected sequence of actions that the agent should take to provide the correct response.
- Generated trajectory: This is the actual sequence of actions that the agent took to generate a response to the user prompt.
- Response: This is the generated response, given the agent’s sequence of actions.
A sample evaluation dataset is shown below.
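Since the sample table isn’t reproduced here, an illustrative version of such a dataset might look like the sketch below. The values are hypothetical, and the column names simply follow the elements above; confirm the exact column names the service expects in the documentation.

```python
import pandas as pd

# Illustrative example only: hypothetical prompts, trajectories, and responses.
byod_eval_sample_dataset = pd.DataFrame({
    "prompt": [
        "Tell me about the smartphone you sell.",
        "Where is my order 12345?",
    ],
    "reference_trajectory": [
        [{"tool_name": "get_product_details", "tool_input": {"product_name": "smartphone"}}],
        [{"tool_name": "get_order_status", "tool_input": {"order_id": "12345"}}],
    ],
    "predicted_trajectory": [
        [{"tool_name": "get_product_details", "tool_input": {"product_name": "smartphone"}}],
        [{"tool_name": "get_order_status", "tool_input": {"order_id": "12345"}}],
    ],
    "response": [
        "It's a 6.1-inch phone with 128GB of storage.",
        "Order 12345 is out for delivery.",
    ],
})
```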
After you gather your evaluation dataset, define the metrics that you want to use to evaluate the agent. For a complete list of metrics and their interpretations, refer to Evaluate Gen AI agents. Some metrics you can define are listed here:
```python
response_tool_metrics = [
    "trajectory_exact_match",
    "trajectory_in_order_match",
    "safety",
    response_follows_trajectory_metric,
]
```
Notice that `response_follows_trajectory_metric` is a custom metric that you can define to evaluate your agent.

Standard text generation metrics, such as coherence, may not be sufficient when evaluating AI agents that interact with environments, as these metrics primarily focus on text structure. Agent responses should be assessed based on their effectiveness within the environment. Vertex AI Gen AI evaluation service allows you to define custom metrics, like `response_follows_trajectory_metric`, that assess whether the agent’s response logically follows from its tool choices. For more information on these metrics, please refer to the official notebook.
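As one possible shape for such a metric, here is a hedged sketch that defines it as a model-based pointwise metric using `PointwiseMetric` and `PointwiseMetricPromptTemplate` from the evaluation SDK. The criteria and rubric wording below are placeholders, not the definition used in the official notebook.

```python
from vertexai.preview.evaluation import PointwiseMetric, PointwiseMetricPromptTemplate

# Sketch of a custom, model-based metric; criteria and rubric text are placeholders.
response_follows_trajectory_metric = PointwiseMetric(
    metric="response_follows_trajectory",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "follows_trajectory": (
                "Evaluate whether the agent's response logically follows from "
                "the sequence of tool calls it made."
            ),
        },
        rating_rubric={
            "1": "The response follows directly from the tool calls and their outputs.",
            "0": "The response contradicts or ignores the tool calls it made.",
        },
        input_variables=["prompt", "predicted_trajectory"],
    ),
)
```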
With your evaluation dataset and metrics defined, you can now run your first agent evaluation job on Vertex AI. Please see the code sample below.
```python
# Import libraries
import vertexai
from vertexai.preview.evaluation import EvalTask

# Initiate Vertex AI session
vertexai.init(
    project="my-project-id",
    location="my-location",
    experiment="evaluate-langgraph-agent",
)

# Define an EvalTask
response_eval_tool_task = EvalTask(
    dataset=byod_eval_sample_dataset,
    metrics=response_tool_metrics,
)

# Run evaluation
response_eval_tool_result = response_eval_tool_task.evaluate(
    experiment_run_name="response-over-tools"
)
```
To run the evaluation, initiate an `EvalTask` using the predefined dataset and metrics. Then, run an evaluation job using the `evaluate` method. Vertex AI Gen AI evaluation tracks the resulting evaluation as an experiment run within Vertex AI Experiments, the managed experiment tracking service on Vertex AI. The evaluation results can be viewed both in the notebook and in the Vertex AI Experiments UI. If you’re using Colab Enterprise, you can also view the results in the Experiment side panel, as shown below.
Vertex AI Gen AI evaluation service offers summary and metrics tables, providing detailed insights into agent performance. This includes individual user input, trajectory results, and aggregate results for all user input and trajectory pairs across all requested metrics.
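For example, you can pull these tables out of the result object returned by `evaluate` (a sketch assuming the result exposes `summary_metrics` and `metrics_table` fields):

```python
# Aggregate scores across the whole dataset (sketch; assumes the result object's
# summary_metrics dict and metrics_table DataFrame).
print(response_eval_tool_result.summary_metrics)

# Per-example breakdown: prompts, trajectories, responses, and metric scores.
print(response_eval_tool_result.metrics_table.head())
```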
Access to these granular evaluation results enables you to create meaningful visualizations of agent performance, including bar and radar charts like the one below:
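As a sketch, a quick bar chart of the aggregate scores could be built from the summary metrics like this (the `/mean` key suffix is an assumption about how aggregate scores are named):

```python
import matplotlib.pyplot as plt

# Plot aggregate metric scores as a bar chart (illustrative sketch).
summary = response_eval_tool_result.summary_metrics
# Assumption: aggregate scores use keys such as "<metric_name>/mean".
scores = {name: value for name, value in summary.items() if name.endswith("/mean")}

plt.figure(figsize=(8, 4))
plt.bar(list(scores.keys()), list(scores.values()))
plt.xticks(rotation=45, ha="right")
plt.ylabel("Mean score")
plt.title("Agent evaluation summary")
plt.tight_layout()
plt.show()
```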
Get started today
Explore the Vertex AI Gen AI evaluation service in public preview and unlock the full potential of your agentic applications.
Documentation
Notebooks