News Score: Score the News, Sort the News, Rewrite the Headlines

Tau² Benchmark: How a Prompt Rewrite Boosted GPT-5-mini by 22%

In a recent post, we introduced the Tau² benchmark, a framework for benchmarking LLMs. Today we’re sharing a surprising discovery we made while using it: a simple prompt rewrite boosted a small model’s success rate by over 20%. This post is a deep dive into how we found and fixed this performance bottleneck by making subtle changes to agent policies.

Benchmarking LLMs with Tau²

In the recent OpenAI Summer Update, we saw that the GPT-5 model has made significant strides in agentic tasks. To valida...

Read more at quesma.com
