Definition. Frontier models pass the hardest PhD-level exams but lack the taste, judgment, and reliability of a human expert in the same domain. RL on benchmark distributions creates models that are "superhuman at test-taking" but unreliable in practice.
Why now. Sutskever also argues the era of pure scaling is over — another 100× would help but not transform. Returns now come from new learning paradigms, not raw compute.
Ilya Sutskever · Dwarkesh Podcast (Nov 25 2025) · People · Sutskever
Use jaggedness as the vocabulary when execs are dazzled by SWE-bench scores. Pairs with the SWE-bench Pro finding that scaffolding moves the same model 5–15 points.