AbsenceBench reveals LLMs struggle to detect missing info; Claude-3.7-Sonnet scores only 69.6% F1 on 5K-token tasks

AbsenceBench: Language Models Can't Tell What's Missing

Abstract: Large language models (LLMs) are increasingly capable of processing long inputs and locating specific information within them, as evidenced by their performance on the Needle in a Haystack (NIAH) test. However, while models excel at recalling surprising information, they still struggle to identify clearly omitted information. We introduce AbsenceBench to assess LLMs' capacity to detect missing information across three domains: numerical sequences, poetry, ...
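To make the task concrete, here is a minimal sketch of how an omission-detection example over a numerical sequence could be built and scored with F1. It is an illustration under stated assumptions, not the benchmark's actual protocol: the helper names `make_absence_task` and `f1_score`, the `omit_fraction` parameter, and the set-based matching of predicted items against omitted items are all hypothetical.

```python
import random


def make_absence_task(items, omit_fraction=0.2, seed=0):
    """Drop a random subset of items (hypothetical task construction).

    Returns the modified sequence and the list of omitted items.
    """
    rng = random.Random(seed)
    n_omit = max(1, int(len(items) * omit_fraction))
    omitted_idx = set(rng.sample(range(len(items)), n_omit))
    modified = [x for i, x in enumerate(items) if i not in omitted_idx]
    omitted = [items[i] for i in sorted(omitted_idx)]
    return modified, omitted


def f1_score(predicted, gold):
    """Set-based F1 between predicted missing items and truly omitted items."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Assumed example: an arithmetic sequence with a fifth of its elements removed.
original = list(range(0, 50, 5))
modified, omitted = make_absence_task(original, omit_fraction=0.2, seed=0)

# A model would be shown `original` and `modified` and asked what is missing;
# here a perfect answer stands in for the model's prediction.
perfect_f1 = f1_score(omitted, omitted)
```

In this sketch a prediction is credited only when it exactly matches an omitted item, so listing items that are still present lowers precision, and overlooking omitted items lowers recall.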