Thu Jul 31, 2025 2:30pm PST
Show HN: New SWE-bench leaderboard compares LMs without fancy agent scaffolds
Hello from the SWE-bench/SWE-agent team at Princeton/Stanford.

When we created the SWE-bench benchmark in 2023 from hundreds of real-life GitHub issues/pull requests, the highest score was just a couple of percent. The tasks were so challenging for LMs that most people didn't even want to work on them.

Half a year later, SWE-agent showed that the early 2024 LMs were actually good enough to resolve up to 20% of the GitHub issues in the benchmark. This kicked off a whole wave of coding agents.

Back then, developing agents was all about working around tons of silly behavior from the LMs. For example, if a command didn't work, they would try running the exact same command again. If a command returned no output, they would assume it never ran. They also couldn't get whitespace right in their edits, would get stuck in repetitive loops, and much more.

So agents got pretty complicated to work around all of that bad LM behavior.

But now it's 2025, and LM companies have invested a whole lot of money to make their LMs really good at being agents.

So we asked two questions:

1. What's the simplest agent we can write that still scores near SotA?

2. How do LMs compare when we evaluate them using this simple agent?

Turns out, the agent can be very simple indeed! mini-swe-agent (https://github.com/SWE-agent/mini-swe-agent) has only 100 lines of code for the agent class (plus some 100 lines for environment etc.). It is little more than a loop that parses the LM output for a shell command, executes it in a subshell, feeds the result back, and continues.
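To make the shape of that loop concrete, here is a minimal sketch. This is not the actual mini-swe-agent code: the query_lm stub, the fenced-code-block command convention, and the DONE stop signal are all assumptions for illustration.

    import re
    import subprocess

    def query_lm(messages):
        """Stub for a chat-completion call to your LM provider (assumption)."""
        raise NotImplementedError

    def extract_command(text):
        """Pull the first fenced shell block out of the LM reply, if any."""
        m = re.search(r"```(?:bash|sh)?\n(.*?)```", text, re.DOTALL)
        return m.group(1).strip() if m else None

    def run_agent(task, max_steps=50):
        messages = [
            {"role": "system", "content": "Emit one shell command per turn in a "
             "fenced code block. Say DONE when the task is solved."},
            {"role": "user", "content": task},
        ]
        for _ in range(max_steps):
            reply = query_lm(messages)
            messages.append({"role": "assistant", "content": reply})
            if "DONE" in reply:
                break
            cmd = extract_command(reply)
            if cmd is None:
                messages.append({"role": "user", "content":
                                 "No command found; emit exactly one fenced shell block."})
                continue
            # Each command runs in a subshell; only its stdout/stderr
            # flows back into the conversation as the next user turn.
            result = subprocess.run(cmd, shell=True, capture_output=True,
                                    text=True, timeout=120)
            messages.append({"role": "user",
                             "content": result.stdout + result.stderr})
        return messages

The real agent is of course more careful than this sketch, but per the numbers above it is roughly the same size, and the control flow is essentially this loop.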

We then took various LMs and put them to the test in a real apples-to-apples comparison without a fancy agent scaffold to prop up bad LMs.

Our new leaderboard https://www.swebench.com/ shows the results.

The highest score is currently 65%, achieved by Claude Sonnet 4 (not far below the roughly 70% that fancier agents achieve).

o3, o4-mini, and Gemini 2.5 Pro are significantly behind, but not hopeless, achieving 50-60%.

We were really surprised by how strong these numbers are overall: it shows that as LMs get stronger and better adapted to performing difficult, highly iterative tasks, we can take our hands off the steering wheel, provide the minimal necessary environment, and let the LM figure out the rest.

Let us know if you have any questions, our team is here on HN today :)
