Pattern · Discipline

Jaggedness

Ilya Sutskever · Nov 2025

Definition. Frontier models pass the hardest PhD-level exams but lack the taste, judgment, and reliability of a human expert in the same domain. RL on benchmark distributions creates models that are "superhuman at test-taking" but unreliable in practice.

Why now. Sutskever also argues the era of pure scaling is over — another 100× would help but not transform. Returns now come from new learning paradigms, not raw compute.

benchmark task type production 100% 0% Frontier model — jagged Human expert — even
Jaggedness. Frontier models spike past human experts on benchmarks but crater on neighbouring tasks the benchmark didn't cover. This is why agent-authored code remains junior-to-mid.

Ilya Sutskever · Dwarkesh Podcast (Nov 25 2025) · People · Sutskever

Use jaggedness as the vocabulary when execs are dazzled by SWE-bench scores. Pairs with the SWE-bench Pro finding that scaffolding moves the same model 5–15 points.