Wed Nov 20, 2024 4:57pm PST

Show HN: Agentic Evaluators for Agentic Workflows (Starting with RAG)

Hey all! Thought this group might find this interesting: a new approach to evaluating RAG pipelines using 'agent as a judge'. We got excited by the findings in this paper (https://arxiv.org/abs/2410.10934), which reports that agent judges produce evaluations closer to those of human evaluators, especially for multi-step workflows.

Our first use case was RAG pipelines, specifically evaluating whether your agent MISSED any important chunks from the source document. While many RAG evaluators check whether your model USED its retrieved chunks in the output, they give no visibility into whether it retrieved all the right chunks in the first place. So we tested the 'agent as judge' approach with a new metric called 'potential sources missed', which flags important chunks in the source of truth that your agent failed to retrieve.
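
To make the idea concrete, here is a minimal sketch of how a 'potential sources missed' style check could be computed. This is an illustration under assumptions, not the actual implementation: the function names, the chunk IDs, and the judge signature are all hypothetical, and the keyword-matching judge is only a toy stand-in for a real agent judge.

```python
from typing import Callable, Dict, List, Set

# Hypothetical judge signature: (question, chunk_text) -> True if the chunk is relevant.
# In practice this would be an agent or LLM call; here it is a plain callable.
Judge = Callable[[str, str], bool]


def keyword_judge(question: str, chunk: str) -> bool:
    """Toy stand-in for an agent judge: a chunk counts as relevant
    if any question word longer than 4 characters appears in it."""
    words = {w.strip("?.,!").lower() for w in question.split()}
    keywords = {w for w in words if len(w) > 4}
    text = chunk.lower()
    return any(k in text for k in keywords)


def potential_sources_missed(
    question: str,
    source_chunks: Dict[str, str],
    retrieved_ids: Set[str],
    judge: Judge,
) -> List[str]:
    """Return IDs of source chunks the judge considers relevant
    but that the retriever did not return."""
    relevant = [cid for cid, text in source_chunks.items() if judge(question, text)]
    return [cid for cid in relevant if cid not in retrieved_ids]


# Made-up example data for illustration only.
source_chunks = {
    "c1": "Refunds are available within 30 days of purchase.",
    "c2": "Annual plans can be refunded within 60 days.",
    "c3": "Our office is located in Berlin.",
}
question = "What is the refund window for annual plans?"
retrieved_ids = {"c1"}  # what the RAG pipeline actually pulled

missed = potential_sources_missed(question, source_chunks, retrieved_ids, keyword_judge)
print(missed)  # ['c2']
```

The judge is passed in as a parameter so the toy keyword matcher can be swapped for an agent that reads the whole source document, which is where the 'agent as judge' approach would do the real work.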