GCP – Evaluate your gen media models with multimodal evaluation on Vertex AI
The world of generative AI is moving fast, with models like Lyria, Imagen, and Veo now capable of producing stunningly realistic and imaginative images and videos from simple text prompts. However, evaluating these models is still a steep challenge. Traditional human evaluation, while the gold standard, can be slow and costly, hindering rapid development cycles.
To address this, we’re thrilled to introduce Gecko, now available through Google Cloud’s Vertex AI Evaluation Service. Gecko is a rubric-based, interpretable auto-rater for evaluating generative AI models that gives developers a more nuanced, customizable, and transparent way to assess the performance of image and video generation models.
The challenge of evaluating generative models with auto-raters
Creating useful, performant auto-raters becomes more challenging as the quality of generation dramatically improves. While specialised models can be efficient, they lack the interpretability developers need to understand model behavior and pinpoint areas for improvement. For instance, when evaluating how accurately a generated image depicts a prompt, a single score doesn’t reveal why a model succeeded or failed.
Introducing Gecko: Interpretable, customizable, and performant evaluation
Gecko offers a fine-grained, interpretable, and customizable auto-rater. This Google DeepMind research paper shows that such an auto-rater can reliably evaluate image and video generation across a range of skills, reducing the dependency on costly human judgment. Notably, beyond its interpretability, Gecko exhibits strong performance and has already been instrumental in benchmarking the progress of leading models like Imagen.
Gecko makes evaluation interpretable with its clear, step-by-step rubric-based approach. Let’s take an example and use Gecko to evaluate the generated media of a cup of coffee and a croissant on a table.
Figure 1: Prompt and image pair we will use as our running example
Step 1: Semantic prompt decomposition.
Gecko leverages a Gemini model to first break down the input text prompt into key semantic elements that need to be verified in the generated media. This includes identifying entities, their attributes, and the relationships between them.
For the running example, the prompt is broken down into keywords: Steaming, cup of coffee, croissant, table.
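To make this concrete, the decomposition output for the running example can be pictured as a simple record like the sketch below; the field names are illustrative, not the service's actual schema.

# Illustrative sketch of the decomposition output for the running example.
# Field names are assumptions for readability, not the real output format.
decomposed_prompt = {
    "prompt": "steaming cup of coffee and a croissant on a table",
    "keywords": ["steaming", "cup of coffee", "croissant", "table"],
}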
Step 2: Question generation.
Based on the decomposed prompt, the Gemini model then generates a series of question-answer pairs. These questions are specifically designed to probe the generated image or video for the presence and accuracy of the identified elements and relationships. Optionally, Gemini can provide justifications for why a particular answer is correct, further enhancing transparency.
Let’s take a look at the running example and generate question-answer pairs for each keyword. For the keyword Steaming, the generated question is ‘Is the cup of coffee steaming?’ with answer choices [‘yes’, ‘no’] and the ground-truth answer ‘yes’.
Figure 2: Visualisation of the outputs from the semantic prompt decomposition and question-answer generation steps.
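To make the structure of these outputs concrete, here is a hedged sketch of the kind of question-answer records Step 2 can produce for the running example; the field names, the justification text, and the remaining records are illustrative assumptions, not the service's exact output format.

# Illustrative question-answer records for the running example.
# Field names and justifications are assumptions, not the service's schema.
qa_records = [
    {
        "keyword": "steaming",
        "question": "Is the cup of coffee steaming?",
        "choices": ["yes", "no"],
        "ground_truth": "yes",
        "justification": "The prompt asks for a steaming cup of coffee.",
    },
    # ... one record per remaining keyword: cup of coffee, croissant, table
]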
Step 3: Scoring
Finally, the Gemini model scores the generated media against each question-answer pair. These individual scores are then aggregated to produce a final evaluation score.
For the running example, every question is answered correctly, giving a perfect final score.
Figure 3: Visualisation of the outputs from the scoring step, giving scores for each question which are aggregated to give a final overall score.
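For intuition about the aggregation, here is a minimal sketch that assumes a simple average of per-question correctness scores (the service's exact aggregation may differ); with the running example, where every question is answered correctly, it reproduces the perfect final score.

# Assumed aggregation: average the per-question correctness scores.
# The questions below are illustrative, matching the running example where
# every question is answered correctly.
per_question_scores = {
    "Is the cup of coffee steaming?": 1.0,
    "Is there a cup of coffee?": 1.0,
    "Is there a croissant?": 1.0,
    "Are the coffee and croissant on a table?": 1.0,
}
final_score = sum(per_question_scores.values()) / len(per_question_scores)
print(final_score)  # 1.0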
Evaluate with Gecko on Vertex AI
Gecko is now available via the Gen AI Evaluation Service in Vertex AI, empowering you to evaluate image or video generative models. Here’s how you can get started with Gecko evaluation for images and videos on Vertex AI:
First, you’ll need to set up configurations for both rubric generation and rubric validation.
# Rubric Generation
rubric_generation_config = RubricGenerationConfig(
    prompt_template=RUBRIC_GENERATION_PROMPT,
    parsing_fn=parse_json_to_qa_records,
)

# Rubric Validation
pointwise_metric = PointwiseMetric(
    metric="gecko_metric",
    metric_prompt_template=RUBRIC_VALIDATOR_PROMPT,
    custom_output_config=CustomOutputConfig(
        return_raw_output=True,
        parsing_fn=parse_rubric_results,
    ),
)

# Rubric Metric
rubric_based_gecko = RubricBasedMetric(
    generation_config=rubric_generation_config,
    critique_metric=pointwise_metric,
)
Next, prepare your dataset for evaluation. This involves creating a Pandas DataFrame with columns for your prompts and the corresponding generated images or videos.
prompts = [
    "steaming cup of coffee and a croissant on a table",
    "steaming cup of coffee and toast in a cafe",
    # ... more prompts
]
images = [
    '{"contents": [{"parts": [{"file_data": {"mime_type": "image/png", "file_uri": "gs://cloud-samples-data/generative-ai/evaluation/images/coffee.png"}}]}]}',
    '{"contents": [{"parts": [{"file_data": {"mime_type": "image/png", "file_uri": "gs://cloud-samples-data/generative-ai/evaluation/images/coffee.png"}}]}]}',
    # ... more image URIs
]
eval_dataset = pd.DataFrame(
    {
        "prompt": prompts,
        "image": images,  # or "video": videos for video evaluation
    }
)
Now, you can generate the rubrics based on your prompts using the configured rubric_based_gecko metric.
dataset_with_rubrics = rubric_based_gecko.generate_rubrics(eval_dataset)
Finally, run the evaluation using the generated rubrics and your dataset. The evaluate method of EvalTask will use the rubric validator to score the generated content.
eval_task = EvalTask(
    dataset=dataset_with_rubrics,
    metrics=[rubric_based_gecko],
)
eval_result = eval_task.evaluate(response_column_name="image")  # or "video"
After the evaluation runs, you can compute and analyze the final scores to understand how well your generated content aligns with the detailed criteria derived from your prompts.
dataset_with_final_scores = compute_scores(eval_result.metrics_table)
np.mean(dataset_with_final_scores["final_score"])
The Vertex AI Gen AI evaluation service offers summary and metrics tables that provide detailed insights into evaluation performance. Beyond that, for Gecko you will also find the category or concept each question falls under, along with the score the generated image or video achieved against that category. For example, “is the cat grey?” is a question that falls under the question category “color”.
Access to these granular evaluation results enables you to create meaningful visualizations of model performance across various criteria, including bar and radar charts like the one below:
Figure 4: Visualisation of the aggregate performance of the generated media across various categories/criteria
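As a starting point for such a chart, the sketch below aggregates hypothetical per-question results by category and plots them with matplotlib; the DataFrame layout and column names (question_category, score) are assumptions about how you might organize the granular results, not the exact schema returned by the service.

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-question results drawn from the granular evaluation output.
# Column names and values are illustrative, not the service's exact schema.
per_question_results = pd.DataFrame(
    {
        "question_category": ["color", "entity", "relation", "entity"],
        "score": [1.0, 1.0, 0.5, 1.0],
    }
)

# Average the per-question scores within each category and plot a bar chart.
category_scores = per_question_results.groupby("question_category")["score"].mean()
category_scores.plot(kind="bar", ylim=(0, 1), title="Gecko scores by question category")
plt.ylabel("Mean score")
plt.tight_layout()
plt.show()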
With Gecko on Vertex AI, you gain access to a robust framework for assessing a model’s capabilities in finer detail. You can refer to the text-to-image and text-to-video evaluation Colabs to get first-hand experience today.