Our first use case was RAG pipelines, specifically evaluating whether your agent MISSED any important chunks from the source document. Many RAG evaluators check whether your model USED its retrieved chunks in the output, but they give no visibility into whether it grabbed all the right chunks in the first place. So we tested the 'agent as judge' approach with a new metric called 'potential sources missed', which flags important chunks in the source of truth that your agent never retrieved.
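To make the idea concrete, here's a rough sketch of how a metric like this could be computed. This is not our actual implementation: `Chunk`, `potential_sources_missed`, and `keyword_judge` are hypothetical names, and the keyword-overlap judge is only a stand-in for a real agent judge so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str


def potential_sources_missed(
    source_chunks: Iterable[Chunk],
    retrieved_ids: Iterable[str],
    is_relevant: Callable[[Chunk], bool],
) -> List[Chunk]:
    """Return chunks the judge flags as relevant that were never retrieved."""
    retrieved = set(retrieved_ids)
    return [
        chunk
        for chunk in source_chunks
        if chunk.chunk_id not in retrieved and is_relevant(chunk)
    ]


def keyword_judge(question: str) -> Callable[[Chunk], bool]:
    """Stand-in judge (assumption): relevant if the chunk shares a longer word with the question."""
    terms = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}

    def judge(chunk: Chunk) -> bool:
        words = {w.lower().strip("?.,") for w in chunk.text.split()}
        return bool(terms & words)

    return judge


# Toy source document split into chunks; the retriever only pulled c1.
chunks = [
    Chunk("c1", "Refunds are issued within 30 days."),
    Chunk("c2", "Refunds require a receipt."),
    Chunk("c3", "Our office is in Berlin."),
]
missed = potential_sources_missed(
    chunks, retrieved_ids=["c1"], is_relevant=keyword_judge("How do refunds work?")
)
print([c.chunk_id for c in missed])
```

In practice you'd swap `keyword_judge` for a call to your judge agent. The key point is that the judge looks at the whole source, not just the chunks the retriever happened to return.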
Curious what you all think!