Evaluating Claude’s bioinformatics research capabilities with BioMysteryBench
In this post, Brianna, a researcher on the discovery team, shares results from a recent bioinformatics benchmarking effort.

Almost as soon as large language models could hold a conversation, people started asking how they’d stack up against human experts. Could models pass the bar exam? Could they answer medical licensing questions, or solve Olympiad math problems? Such benchmarks—self-contained sets of human-vetted problems designed to evaluate a capability of a model—have now become a source of...
Read more at anthropic.com