SWE-bench Verified

Source: research/patterns.md

Claude Mythos Preview: 93.9%
Claude Opus 4.7 (Adaptive): 87.6%
GPT-5.3 Codex: 85.0%
OpenHands + Claude 4 (OSS top): 72.0%
Average across 83 models: 63.4%

SWE-bench Verified, May 2026. On SWE-bench Pro, scaffolding (the agent harness around the model) alone moves the same model's score by 5–15 points. Context retrieval is the bottleneck, not raw capability.

Don't let clients pick a model by SWE-bench score alone. Cite the 5–15 point harness delta: it reinforces the "invest in harness, swap models" thesis.
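To make the harness-delta argument concrete for a client, the comparison can be sketched as simple arithmetic: take each pair of leaderboard scores and check whether the gap is smaller than, within, or larger than the quoted 5–15 point range. This is only an illustrative sketch. The function names `classify_gap` and `pairwise_gaps` are hypothetical, and applying the SWE-bench Pro harness range to SWE-bench Verified scores is an assumption, since the two benchmarks are not the same.

```python
# Scores copied from the leaderboard above (SWE-bench Verified, percent).
# The 83-model average is left out because it is not a single system.
SCORES = {
    "Claude Mythos Preview": 93.9,
    "Claude Opus 4.7 (Adaptive)": 87.6,
    "GPT-5.3 Codex": 85.0,
    "OpenHands + Claude 4 (OSS top)": 72.0,
}

# Harness delta range quoted above for SWE-bench Pro. Reusing it for
# Verified scores is an assumption made only for illustration.
HARNESS_DELTA_MIN = 5.0
HARNESS_DELTA_MAX = 15.0


def classify_gap(gap: float) -> str:
    """Label a score gap relative to the quoted harness delta range."""
    if gap < HARNESS_DELTA_MIN:
        return "smaller than the quoted harness delta"
    if gap <= HARNESS_DELTA_MAX:
        return "within the quoted harness delta range"
    return "larger than the quoted harness delta range"


def pairwise_gaps(scores: dict[str, float]) -> list[tuple[str, str, float, str]]:
    """Return (higher, lower, gap, label) for every pair of entries."""
    names = sorted(scores, key=scores.get, reverse=True)
    rows = []
    for i, higher in enumerate(names):
        for lower in names[i + 1:]:
            # Round to one decimal to match the leaderboard's precision.
            gap = round(scores[higher] - scores[lower], 1)
            rows.append((higher, lower, gap, classify_gap(gap)))
    return rows


if __name__ == "__main__":
    for higher, lower, gap, label in pairwise_gaps(SCORES):
        print(f"{higher} vs {lower}: {gap:.1f} points ({label})")
```

Under these assumptions, the gap between Claude Opus 4.7 (Adaptive) and GPT-5.3 Codex is 2.6 points, below even the low end of the harness range, which is the kind of pairing where the score alone should not decide the choice.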