It's not easy being one of Silicon Valley's favorite benchmarks.
SWE-Bench (pronounced "swee bench") launched in November 2024 to evaluate an AI model's coding skill, using more than 2,000 real-world programming problems pulled from the public GitHub repositories of 12 different Python-based projects.
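In rough terms, each problem pairs a real GitHub issue with the project's own test suite: a model proposes a patch, and the patch counts as resolved only if the tests that the issue broke now pass. The sketch below is a simplified, hypothetical illustration of that loop; the function, paths, and test names are placeholders, not SWE-Bench's actual harness.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, fail_to_pass_tests: list[str]) -> bool:
    """Hypothetical sketch of a SWE-Bench-style check: apply a model-generated
    patch to a checkout of the project, then re-run the tests tied to the issue."""
    # Apply the model's proposed fix.
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)

    # Re-run only the tests that the original issue caused to fail;
    # the problem counts as resolved if they now pass.
    result = subprocess.run(
        ["python", "-m", "pytest", *fail_to_pass_tests],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0

# Placeholder usage: a model "resolves" the issue only if this returns True.
# resolved = evaluate_patch("project-checkout", "model_patch.diff", ["tests/test_issue.py::test_fix"])
```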
In the months since then, it has quickly become one of the most popular tests in AI. A SWE-Bench score has become a mainstay of major model releases from OpenAI, Anthropic, and Google. Outside of foundation models, the fine-tuners at AI companies are in constant competition to see who can rise above the pack. The top of the leaderboard is a pileup between three different fine-tunings of Anthropic's Claude Sonnet model and Amazon's Q developer agent. Auto Code Rover, one of the Claude modifications, nabbed the number two spot in November and was acquired just three months later.
Despite all the fervor, this isn't exactly a fair assessment of which model is "better." As the benchmark has gained prominence, "you start to see that people really want that top spot," says John Yang, a researcher on the team that developed SWE-Bench at Princeton University. As a result, entrants have begun to game the system, which is pushing many others to wonder whether there's a better way to actually measure AI achievement.
Developers of these coding agents aren't necessarily doing anything as straightforward as cheating, but they're crafting approaches that are too neatly tailored to the specifics of the benchmark. The initial SWE-Bench test set was limited to programs written in Python, which meant developers could gain an advantage by training their models exclusively on Python code. Soon, Yang noticed that high-scoring models would fail completely when tested on different programming languages, revealing an approach to the test that he describes as "gilded."
"It looks nice and shiny at first glance, but then you try to run it on a different language and the whole thing just kind of falls apart," Yang says. "At that point, you're not designing a software engineering agent. You're designing a make-SWE-Bench-happy agent, which is much less interesting."
The SWE-Bench problem is a symptom of a more sweeping and complicated problem in AI evaluation, one that's increasingly sparking heated debate: the benchmarks the industry uses to guide development are drifting further and further away from evaluating actual capabilities, calling their basic value into question. Making the situation worse, several benchmarks, most notably FrontierMath and Chatbot Arena, have recently come under fire for an alleged lack of transparency. Still, benchmarks play a central role in model development, even if few experts are willing to take their results at face value. OpenAI cofounder Andrej Karpathy recently described the situation as "an evaluation crisis": the industry has fewer trusted methods for measuring capabilities and no clear path to better ones.
"Historically, benchmarks were the way we evaluated AI systems," says Vanessa Parli, director of research at Stanford University's Institute for Human-Centered AI. "Is that the way we want to evaluate systems going forward? And if it's not, what is the way?"
A growing group of academics and AI researchers are making the case that the answer is to go smaller, trading sweeping ambition for an approach inspired by the social sciences. Specifically, they want to focus more on testing validity, which for quantitative social scientists refers to how well a given questionnaire measures what it claims to measure, and, more fundamentally, whether what it's measuring has a coherent definition. That could cause trouble for benchmarks assessing hazily defined concepts like "reasoning" or "scientific knowledge," and for developers aiming to reach the much-hyped goal of artificial general intelligence, but it would put the industry on firmer ground as it looks to prove the worth of individual models.
"Taking validity seriously means asking folks in academia, industry, or wherever to show that their system does what they say it does," says Abigail Jacobs, a University of Michigan professor who is a central figure in the new push for validity. "I think it points to a weakness in the AI world if they want to back off from showing that they can support their claim."
The limits of traditional testing
If AI companies have been slow to respond to the growing failure of benchmarks, it's partly because the test-scoring approach has been so effective for so long.
One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to modern benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.
Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then-unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet's convolutional neural nets would be the secret to unlocking image recognition, but after it scored well, no one dared dispute it. (One of AlexNet's developers, Ilya Sutskever, would go on to cofound OpenAI.)
A large part of what made this challenge so effective was that there was little practical difference between ImageNet's object classification challenge and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.
But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures, which in turn makes it hard to use the findings responsibly.
Where things break down
Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced the evaluation problem is the result of this push toward generality. "We've moved from task-specific models to general-purpose models," Reuel says. "It's not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder."
Like the University of Michigan's Jacobs, Reuel thinks "the main issue with benchmarks is validity, even more than the practical implementation," noting: "That's where a lot of things break down." For a task as complicated as coding, for instance, it's nearly impossible to incorporate every possible scenario into your problem set. As a result, it's hard to gauge whether a model is scoring better because it's more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.
For developers, the hope is that success on lots of specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. "There's just many more knobs you can turn," says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. "When it comes to agents, they have sort of given up on the best practices for evaluation."
In a paper from last July, Kapoor called out specific issues in how AI models were approaching the WebArena benchmark, designed by Carnegie Mellon University researchers in 2024 as a test of an AI agent's ability to traverse the web. The benchmark consists of more than 800 tasks to be performed on a set of cloned websites mimicking Reddit, Wikipedia, and others. Kapoor and his team identified an apparent hack in the winning model, called STeP. STeP included specific instructions about how Reddit structures URLs, allowing STeP models to jump directly to a given user's profile page (a frequent element of WebArena tasks).
This shortcut wasn't exactly cheating, but Kapoor sees it as "a serious misrepresentation of how well the agent would work had it seen the tasks in WebArena for the first time." Because the technique was successful, though, a similar policy has since been adopted by OpenAI's web agent Operator. ("Our evaluation setting is designed to assess how well an agent can solve tasks given some instruction about website structures and task execution," an OpenAI representative said when reached for comment. "This approach is consistent with how others have used and reported results with WebArena." STeP did not respond to a request for comment.)
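The gap Kapoor is pointing to is roughly the one sketched below: a general agent has to actually browse the cloned site to find a user's profile, while a benchmark-tailored agent that has been told how the clone formats its URLs can simply construct the address directly. This is an illustrative toy under that assumption, not STeP's or Operator's actual code, and the `agent` object is a stand-in for whatever browsing model is being evaluated.

```python
# Toy illustration of the WebArena-style shortcut described above -- not real STeP code.

def profile_url_via_shortcut(base_url: str, username: str) -> str:
    """Benchmark-tailored approach: rely on prior knowledge that the cloned
    forum exposes profiles at /user/<name>, and jump straight there."""
    return f"{base_url}/user/{username}"

def profile_url_via_navigation(agent, base_url: str, username: str) -> str:
    """General approach: browse the site with no site-specific hints --
    open it, search for the user, and follow the matching link.
    `agent` is a placeholder for the browsing model under evaluation."""
    agent.open(base_url)
    agent.search(username)
    return agent.follow_link_matching(username)
```

The first function will score well on tasks that happen to involve profile pages; only the second tells you anything about how the agent would fare on a site it has never seen.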
Further highlighting the problem with AI benchmarks, late last month Kapoor and a team of researchers wrote a paper that revealed significant problems in Chatbot Arena, the popular crowdsourced evaluation system. According to the paper, the leaderboard was being manipulated; many top foundation models were conducting undisclosed private testing and releasing their scores selectively.
Today, even ImageNet itself, the mother of all benchmarks, has started to fall victim to validity problems. A 2023 study from researchers at the University of Washington and Google Research found that when ImageNet-winning algorithms were pitted against six real-world data sets, the architecture improvements "resulted in little to no progress," suggesting that the external validity of the test had reached its limit.
Going smaller
For those who believe the main problem is validity, the best fix is reconnecting benchmarks to specific tasks. As Reuel puts it, AI developers "have to resort to these high-level benchmarks that are almost meaningless for downstream consumers, because the benchmark developers can't anticipate the downstream task anymore." So what if there were a way to help downstream consumers identify this gap?
In November 2024, Reuel launched a public ranking project called BetterBench, which rates benchmarks on dozens of different criteria, such as whether the code has been publicly documented. But validity is a central theme, with particular criteria challenging designers to spell out what capability their benchmark is testing and how it relates to the tasks that make up the benchmark.
"You need to have a structural breakdown of the capabilities," Reuel says. "What are the actual skills you care about, and how do you operationalize them into something we can measure?"
The results are surprising. One of the highest-scoring benchmarks is also the oldest: the Arcade Learning Environment (ALE), established in 2013 as a way to test models' ability to learn to play a library of Atari 2600 games. One of the lowest-scoring is the Massive Multitask Language Understanding (MMLU) benchmark, a widely used test for general language skills; by the standards of BetterBench, the connection between the questions and the underlying skill was too poorly defined.
BetterBench hasn't meant much for the reputations of specific benchmarks, at least not yet; MMLU is still widely used, and ALE is still marginal. But the project has succeeded in pushing validity into the broader conversation about how to fix benchmarks. In April, Reuel quietly joined a new research group hosted by Hugging Face, the University of Edinburgh, and EleutherAI, where she'll develop her ideas on validity and AI model evaluation with other figures in the field. (An official announcement is expected later this month.)
Irene Solaiman, Hugging Face's head of global policy, says the group will focus on building valid benchmarks that go beyond measuring straightforward capabilities. "There's just so much hunger for a good benchmark off the shelf that already works," Solaiman says. "A lot of evaluations are trying to do too much."
Increasingly, the rest of the industry seems to agree. In a paper in March, researchers from Google, Microsoft, Anthropic, and others laid out a new framework for improving evaluations, with validity as the first step.
"AI evaluation science must," the researchers argue, "move beyond coarse-grained claims of 'general intelligence' toward more task-specific and real-world-relevant measures of progress."
Measuring the "squishy" things
To help make this shift, some researchers are looking to the tools of social science. A February position paper argued that "evaluating GenAI systems is a social science measurement challenge," specifically unpacking how the validity systems used in social measurement can be applied to AI benchmarking.
The authors, largely employed by Microsoft's research division but joined by academics from Stanford and the University of Michigan, point to the standards that social scientists use to measure contested concepts like ideology, democracy, and media bias. Applied to AI benchmarks, those same procedures could offer a way to measure concepts like "reasoning" and "math proficiency" without slipping into hazy generalizations.
In the social science literature, it's particularly important that metrics begin with a rigorous definition of the concept measured by the test. For instance, if a test is meant to measure how democratic a society is, it first needs to establish a definition of a "democratic society" and then establish questions that are relevant to that definition.
To apply this to a benchmark like SWE-Bench, designers would need to set aside the classic machine learning approach, which is to collect programming problems from GitHub and create a scheme to validate answers as true or false. Instead, they'd first need to define what the benchmark aims to measure ("ability to resolve flagged issues in software," for instance), break that down into subskills (different types of problems or types of program that the AI model can successfully process), and then finally assemble questions that accurately cover the different subskills.
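As a hypothetical sketch of what that might look like in practice, the specification below writes the construct and its subskills down explicitly before any test items are attached; the construct wording comes from the example above, but the subskills and items are invented for illustration, not drawn from the paper or from SWE-Bench itself.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkSpec:
    """Hypothetical validity-first benchmark spec: name the construct,
    break it into subskills, and only then attach test items to each one."""
    construct: str  # what the benchmark claims to measure
    subskills: dict[str, list[str]] = field(default_factory=dict)  # subskill -> test items

swe_spec = BenchmarkSpec(
    construct="ability to resolve flagged issues in software",
    subskills={
        # Invented examples of subskills and the kinds of items that would cover them.
        "bug localization": ["find the faulty function given a failing test"],
        "patch writing": ["produce a minimal diff that makes the failing test pass"],
        "regression safety": ["avoid breaking the rest of the project's test suite"],
    },
)

# Coverage check: every subskill the spec names should have at least one test item.
assert all(items for items in swe_spec.subskills.values())
```

The point of the exercise isn't the code; it's that the benchmark's claim is stated up front, so anyone can check whether the test items actually cover it.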
It's a profound change from how AI researchers typically approach benchmarking, but for researchers like Jacobs, a coauthor on the February paper, that's the whole point. "There's a mismatch between what's happening in the tech industry and these tools from social science," she says. "We have decades and decades of thinking about how we want to measure these squishy things about humans."
Even though the idea has made a real impact in the research world, it has been slow to influence the way AI companies actually use benchmarks.
The last two months have seen new model releases from OpenAI, Anthropic, Google, and Meta, and all of them lean heavily on multiple-choice knowledge benchmarks like MMLU, the exact approach that validity researchers are trying to move past. After all, model releases are, for the most part, still about showing increases in general intelligence, and broad benchmarks continue to be used to back up those claims.
For some observers, that's good enough. Benchmarks, Wharton professor Ethan Mollick says, are "bad measures of things, but also they're what we've got." He adds: "At the same time, the models are getting better. A lot of sins are forgiven by fast progress."
For now, the industry's long-standing focus on artificial general intelligence seems to be crowding out a more focused, validity-based approach. As long as AI models can keep growing in general intelligence, specific applications don't seem as compelling, even if that leaves practitioners relying on tools they no longer fully trust.
"This is the tightrope we're walking," says Hugging Face's Solaiman. "It's too easy to throw the system out, but evaluations are really helpful in understanding our models, even with these limitations."
Russell Brandom is a freelance writer covering artificial intelligence. He lives in Brooklyn with his wife and two cats.
This story was supported by a grant from the Tarbell Center for AI Journalism.

