Tau² Benchmark: How a Prompt Rewrite Boosted GPT-5-mini by 22%
In a recent post, we introduced the Tau² benchmark, a framework for benchmarking LLMs. Today we’re sharing a surprising discovery we made while using it: a simple prompt rewrite boosted a small model’s success rate by over 20%. This post is a deep dive into how we found and fixed this performance bottleneck by making subtle changes to agent policies.
Benchmarking LLMs with Tau²
At the recent OpenAI Summer Update, we saw that the GPT-5 model has made significant strides in agentic tasks. To valida...
Read more at quesma.com