News Score: Score the News, Sort the News, Rewrite the Headlines

Are LLMs not getting better?

I was reading the METR article on how LLM-written code passes tests much more often than it reaches mergeable quality. They measure LLM performance on programming tasks when the success criterion is “passes all tests” and compare it with performance when the criterion is “would be approved by the maintainer”. Unsurprisingly, LLMs do much worse under the more stringent criterion: their 50% success horizon drops from 50 minutes to 8 minutes. As part of this they have included figures...
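The teaser doesn't say how a “50% success horizon” is computed. A minimal sketch follows. It assumes the horizon is estimated roughly the way METR describes in its time-horizon work: fit a logistic curve of success probability against the log of how long each task takes, then read off the task length where the curve crosses 50%. The helper names and the toy data are hypothetical, made up purely for illustration.

```python
import math


def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit p(success) = sigmoid(a + b * x) by plain gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def fifty_percent_horizon(task_minutes, successes):
    """Hypothetical helper: task length (minutes) where the fitted success probability is 0.5."""
    # Assumption: success is modelled against log2(task length), not raw minutes.
    xs = [math.log2(m) for m in task_minutes]
    a, b = fit_logistic(xs, successes)
    # sigmoid(a + b*x) = 0.5 exactly when a + b*x = 0, i.e. x = -a / b
    return 2 ** (-a / b)


if __name__ == "__main__":
    # Made-up toy data: task lengths in minutes, and whether the model succeeded (1) or not (0).
    minutes = [4, 8, 8, 32, 32, 64]
    outcomes = [1, 1, 0, 1, 0, 0]
    print(round(fifty_percent_horizon(minutes, outcomes), 2))
```

The toy outcomes are symmetric around 16 minutes on a log scale, so the sketch reports a horizon of about 16 minutes. Switching to a stricter success criterion turns some 1s into 0s, which pushes the crossing point toward shorter tasks. Note that plain logistic regression has no finite fit when successes and failures are perfectly separated by task length, so real data needs some overlap or a regularized fit.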

Read more at entropicthoughts.com
