OpenAI stops using the SWE-bench Verified coding benchmark after finding that 59.4% of audited problems have flawed tests that reject correct solutions and that all frontier models were trained on evaluation data; it recommends SWE-bench Pro instead.

Why SWE-bench Verified no longer measures frontier coding capabilities

Since we first published SWE-bench Verified in August 2024, the industry has widely used it to measure the progress of models on autonomous software engineering tasks. After its release, SWE-bench Verified provided a strong signal of capability progress and became a standard metric reported in frontier model releases. Tracking and forecasting progress of these capabilities is also an important part of OpenAI’s Preparedness Framework. When we created the Verified benchmark initially, we attempted...