TensorZero study shows noisy LLM evaluators can reliably rank AI agents offline by averaging scores across many outputs, despite poor single-output accuracy.
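To make the averaging claim concrete, here is a minimal simulation sketch. It is not the study's methodology. Everything in it is an illustrative assumption: the two hypothetical agents with true success rates of 0.7 and 0.5, an evaluator that agrees with the ground truth only 65% of the time (symmetric label-flip noise), and the helper names `expected_score`, `noisy_evaluate`, and `mean_noisy_score`.

```python
import random


def expected_score(success_rate, evaluator_accuracy):
    """Expected mean verdict for an agent under symmetric label-flip noise.

    The evaluator says "good" when the output is good and it judges correctly,
    or when the output is bad and it judges incorrectly.
    """
    return (success_rate * evaluator_accuracy
            + (1 - success_rate) * (1 - evaluator_accuracy))


def noisy_evaluate(is_good, evaluator_accuracy, rng):
    """Return a 0/1 verdict that matches the truth with the given probability."""
    judged_correctly = rng.random() < evaluator_accuracy
    return int(is_good) if judged_correctly else int(not is_good)


def mean_noisy_score(success_rate, evaluator_accuracy, n_outputs, rng):
    """Average the noisy verdicts over n_outputs simulated agent outputs."""
    total = 0
    for _ in range(n_outputs):
        is_good = rng.random() < success_rate  # hidden ground truth
        total += noisy_evaluate(is_good, evaluator_accuracy, rng)
    return total / n_outputs


# Assumed, illustrative numbers (not taken from the study).
rng = random.Random(0)  # fixed seed so the run is reproducible
score_a = mean_noisy_score(0.7, 0.65, 10_000, rng)  # hypothetical stronger agent
score_b = mean_noisy_score(0.5, 0.65, 10_000, rng)  # hypothetical weaker agent

print(f"agent A mean score: {score_a:.3f} (expected {expected_score(0.7, 0.65):.3f})")
print(f"agent B mean score: {score_b:.3f} (expected {expected_score(0.5, 0.65):.3f})")
```

Under these assumptions, any single verdict is wrong 35% of the time, yet the expected mean scores differ by 0.06 (0.56 versus 0.50). With 10,000 outputs per agent, sampling noise is small next to that gap, so the averaged scores should order the two agents correctly even though individual verdicts are unreliable. The sketch assumes noise that is unbiased across agents. An evaluator that systematically favors one agent's style would not average out this way.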

Even (very) noisy LLM evaluators are useful for improving AI agents · TensorZero

May 12, 2026 · Alan Mishler

It’s surprisingly hard to develop reliable LLM evaluators: they’re often noisy and poorly correlated with the metrics or outcomes practitioners actually care about. Sometimes the target is directly measurable but evaluators still disagree with experts (e.g. on correctness or faithfulness to a source document). Other times the target is only accessible through a proxy (e.g. whether code that passes tests satisfies user needs). And sometimes the target is hard to observ...