Benchmarks are saturated!!
“So we’re seeing a saturation among these benchmarks – there just isn’t really any improvement to be made.” - Stanford
“Even if it looks like it’s tapering off in some ways, that’s because we’re enabling entirely new classes [of functionality], but we’ve saturated the benchmarks, and the ability to do older tasks,” - Anthropic
The overwhelming rhetoric in literature is that many benchmarks are saturated and we need new ones. This isn’t news. There’s plenty of research coming out with harder benchmarks that push the boundaries (e.g. Frontier Math, Nov 2024, ARC-AGI-2 expected 2025). It’s exciting to show how badly we’re doing and then show how big an improvement we can make. That’s the sort of work that get’s published.
But what about that old benchmark that’s at 96%? Saturation is not completion. There are hundreds of once popular benchmarks which remain incomplete, even with the latest models. As a community, we don’t often push the last mile and that might be hurting us. Agents is a clear application where the small inaccuracies can accumulate over time to hurt us. More generally, since LLM’s are autoregressive, Yann LeCun points out clearly the same problem of error accumulation. So there may be value in pushing our benchmarks even further.
So why don’t we usually push benchmarks further?
- Diminishing returns: Many things in life are governed by power laws which state that as you get closer to completion the amount of effort required to cross the line drastically increases. A classic framing of this is the Pareto principle, or 80/20 rule.
- Not all benchmarks should be 100%: ImageNet is a classic example of erroneous labels. This repo makes an effort to correct ImageNet and other top labelling benchmarks but there is always the possibility that errors exist in a benchmark and getting to 100 would be over-fitting to an incorrect model.
- The best is undefined: if you’re working on an RL problem, it may not be obvious what perfection is. The maximum return may be unknown, i.e. perfect play is something to be discovered, not known a-priori. Another example is scissors-paper-rock agents where it’s almost impossible to have a ‘perfect’ agent, as every strategy has a weakness to some other strategy.
- Benchmarks are proxies: ultimately benchmarks represent a goal we care about it, they themselves are not what’s important. Getting 100% on a subset of science questions for high schoolers might not be meaningful. That said, it’s not clear what the meaningful point is; is it 90%, 99%, 99.9%?
- Benchmark death: benchmarks are often published and there’s some initial excitement around a new challenge. This quickly dies down after a few approaches get decent results. A new benchmark is launched and the old one is forgotten. There are thousands of benchmarks out there so it’s easy to get distracted. This study looks out 1,688 of them, from CV and NLP.
- Beating human performance: it’s also often the case that if we can show performance better than human results it may be enough. For some purposes this is true, but ultimately we need to strive to surpass humans by a long way, not just enough.
- Research value: research that moves benchmarks a lot is valued more.
So there’s a lot holding us back from pushing benchmarks, but as mentioned, there’s still value to be had. I believe practicing good statistics in reporting results and clearly understanding the failure modes of our work is an important first step. But reaching 100% on a benchmark isn’t just a matter of scaling up current approaches—it often involves addressing deeper, fundamental challenges in AI, such as robustness, generalization, and real-world alignment. These challenges are worth tackling and may lead to interesting outcomes. Getting to 100% isn’t just someone else’s (usually industry) problem, it should be an interesting research problem for us all to tackle.