Pattern · Process

SWE-bench Verified

Princeton · OpenAI

Claude Mythos Preview: 93.9%
Claude Opus 4.7 (Adaptive): 87.6%
GPT-5.3 Codex: 85.0%
OpenHands + Claude 4 (OSS top): 72.0%
Average across 83 models: 63.4%
SWE-bench Verified, May 2026. On SWE-bench Pro, scaffolding alone moves the same model's score by 5–15 points. Context retrieval is the bottleneck, not raw capability.

Don't let clients pick a model by SWE-bench score alone. Cite the 5–15 point harness delta: it reinforces the "invest in harness, swap models" thesis.
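
One way to make the harness-delta point concrete is to compare the gaps between adjacent leaderboard entries with the 5–15 point range. The sketch below is illustrative only: the scores are copied from the list above, the function names are hypothetical, and it assumes the delta quoted for SWE-bench Pro is roughly comparable on SWE-bench Verified, which the source does not state.

```python
# Scores copied from the leaderboard above (SWE-bench Verified, percent).
SCORES = {
    "Claude Mythos Preview": 93.9,
    "Claude Opus 4.7 (Adaptive)": 87.6,
    "GPT-5.3 Codex": 85.0,
    "OpenHands + Claude 4 (OSS top)": 72.0,
}

# Harness delta quoted for SWE-bench Pro. Applying it to Verified
# scores is an assumption made for illustration.
HARNESS_DELTA_MIN = 5.0
HARNESS_DELTA_MAX = 15.0


def adjacent_gaps(scores):
    """Return (higher, lower, gap) for each neighbouring pair, ranked by score."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (hi_name, lo_name, round(hi_score - lo_score, 1))
        for (hi_name, hi_score), (lo_name, lo_score) in zip(ranked, ranked[1:])
    ]


def classify_gap(gap):
    """Label a score gap relative to the assumed harness delta range."""
    if gap < HARNESS_DELTA_MIN:
        return "smaller than the harness delta"
    if gap <= HARNESS_DELTA_MAX:
        return "within the harness delta"
    return "larger than the harness delta"


if __name__ == "__main__":
    for hi_name, lo_name, gap in adjacent_gaps(SCORES):
        print(f"{hi_name} vs {lo_name}: {gap} pts, {classify_gap(gap)}")
```

Under these assumptions, no gap between neighbouring entries exceeds the upper end of the harness delta, which is the argument for not ranking models on score alone.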