AWS – Amazon Bedrock now supports RAG Evaluation (generally available)
Amazon Bedrock RAG evaluation is now generally available. You can evaluate retrieval-augmented generation (RAG) applications, whether they are built on Amazon Bedrock Knowledge Bases or on a custom RAG system, and you can evaluate either the retrieval step alone or end-to-end retrieval and generation. Evaluations are powered by an LLM-as-a-judge, with a choice of several judge models. For retrieval, you can select metrics such as context relevance and coverage. For end-to-end retrieve-and-generate, you can select quality metrics such as correctness, completeness, and faithfulness (hallucination detection), as well as responsible AI metrics such as harmfulness, answer refusal, and stereotyping. You can also compare across evaluation jobs to iterate on your Knowledge Bases or custom RAG applications with different settings, such as chunking strategy, vector length, rerankers, or different generator models.
*Brand new – more flexibility!* As of today, in addition to Bedrock Knowledge Bases, Amazon Bedrock RAG evaluation supports custom RAG pipelines. Customers evaluating custom RAG pipelines can now bring their own input-output pairs and retrieved contexts directly in the evaluation job's input dataset, bypassing the call to a Bedrock Knowledge Base ("bring your own inference responses"). We also added citation precision and citation coverage metrics for Bedrock Knowledge Bases evaluation. If you use a Bedrock Knowledge Base as part of your evaluation, you can incorporate Amazon Bedrock Guardrails directly.
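To make the bring-your-own-inference-responses flow concrete, here is a minimal sketch of how a custom RAG pipeline's outputs might be packaged into one line of a JSONL input dataset. The field names (`conversationTurns`, `referenceResponses`, `retrievedPassages`, and so on) are illustrative assumptions based on the general shape of Bedrock evaluation datasets; check the Amazon Bedrock Evaluations documentation for the exact schema.

```python
import json

# One JSONL line per evaluation example. Field names below are assumptions
# for illustration; consult the Bedrock Evaluations documentation for the
# authoritative dataset schema.
record = {
    "conversationTurns": [
        {
            # The user question posed to your custom RAG pipeline.
            "prompt": {"content": [{"text": "What is the refund policy for damaged items?"}]},
            # Ground-truth answer(s) the judge uses for correctness and completeness.
            "referenceResponses": [{"content": [{"text": "Damaged items can be refunded within 30 days."}]}],
            # Output produced by your own pipeline ("bring your own inference responses").
            "output": {
                "text": "You can return damaged items for a full refund within 30 days.",
                # Contexts your retriever returned, so retrieval and faithfulness
                # metrics can be computed without calling a Bedrock Knowledge Base.
                "retrievedPassages": {
                    "retrievalResults": [
                        {"content": {"text": "Refunds for damaged goods are issued within 30 days of purchase."}}
                    ]
                },
            },
        }
    ]
}

with open("rag_eval_dataset.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```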
To learn more, visit the Amazon Bedrock Evaluations page and documentation. To get started, log in to the Amazon Bedrock console or use the Amazon Bedrock APIs.
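If you prefer the API route, evaluation jobs are created with the `create_evaluation_job` operation on the Bedrock control-plane client in boto3. The sketch below is a minimal outline under stated assumptions: the top-level parameters (`jobName`, `roleArn`, `evaluationConfig`, `outputDataConfig`) follow the Bedrock evaluation API, but the nested RAG-specific fields, metric names, and judge-model configuration are illustrative placeholders; take the exact request shape from the Amazon Bedrock documentation.

```python
import boto3

# Bedrock control-plane client (evaluation jobs are managed here, not in bedrock-runtime).
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Minimal sketch of a RAG evaluation job request. Nested field names and metric
# identifiers are assumptions for illustration; see the Bedrock Evaluations
# documentation for the exact schema. For a Knowledge Base evaluation, you would
# also pass an inference configuration referencing the Knowledge Base and the
# generator model, per the documentation.
response = bedrock.create_evaluation_job(
    jobName="my-rag-eval-job",                                   # your job name
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",    # IAM role with S3 and Bedrock access
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "Generation",
                    "dataset": {
                        "name": "my-rag-dataset",
                        "datasetLocation": {"s3Uri": "s3://my-bucket/rag_eval_dataset.jsonl"},
                    },
                    # Metric names are placeholders; use those listed in the docs.
                    "metricNames": ["Builtin.Correctness", "Builtin.Faithfulness"],
                }
            ],
            # Judge model selection for LLM-as-a-judge (structure assumed).
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                ]
            },
        }
    },
    outputDataConfig={"s3Uri": "s3://my-bucket/rag-eval-results/"},
)

print(response["jobArn"])
```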