What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately.
GPT-5 by itself gets 65.0%, Sonnet 4 64.8%, but randomly switching at every step gets us 67.2%
This result came pretty surprising to us. There's a few more experiments in the blog post.