Synthesis — Agentic Coding State of the Art

Snapshot: May 2026 · Source: RESEARCH.md

01 The five ideas that anchor our pitch

1. From vibe coding to agentic engineering

Karpathy's 2026 Sequoia framing is the executive-friendly narrative: discipline replaces vibes, the new craft is guiding agents. He named December 2025 his personal inflection point — from writing 80% of his code to delegating 80%.

For execs who are being pushed toward "fully autonomous" pitches on one side or anti-AI skepticism on the other, pair the Karpathy frame with Zed's middle-path framing: the work is integration, not replacement, and most engineering orgs are already in the interwoven state.

→ "Vibe coding is passé" (The New Stack) · People · Karpathy · Patterns · Interwoven workflow

2. Context engineering is the new prompt engineering

Most agent failures are context failures. Karpathy / Lance Martin / Drew Breunig converged on this in mid-2025. The taxonomy of failure (poisoning, distraction, confusion, clash) and fixes (write, select, compress, isolate) is the diagnostic frame we use in every engagement.
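A minimal sketch of how the four fixes (write, select, compress, isolate) could map onto code. Every name here (ContextItem, Scratchpad, the relevance score, the word-count budget) is a hypothetical stand-in for illustration, and truncation is a crude placeholder for real summarization.

```python
from dataclasses import dataclass, field


@dataclass
class ContextItem:
    source: str       # e.g. "file", "tool_output", "note" (hypothetical labels)
    text: str
    relevance: float  # 0..1, assumed to come from some retriever or scorer


@dataclass
class Scratchpad:
    """Write: persist notes outside the context window."""
    notes: list = field(default_factory=list)

    def write(self, note: str) -> None:
        self.notes.append(note)


def select(items, min_relevance=0.5):
    """Select: keep only items relevant enough for the current step."""
    return [i for i in items if i.relevance >= min_relevance]


def compress(item, max_words=50):
    """Compress: truncate long items (a stand-in for summarization)."""
    words = item.text.split()
    if len(words) <= max_words:
        return item
    short = " ".join(words[:max_words]) + " [truncated]"
    return ContextItem(item.source, short, item.relevance)


def isolate(items, source):
    """Isolate: hand a sub-task only the items from one source."""
    return [i for i in items if i.source == source]
```

In the failure taxonomy above, select and compress target distraction and confusion, while isolate keeps one sub-task's material from clashing with another's. That mapping is our reading, not a claim from the cited authors.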

→ Patterns · Context engineering · Patterns · Context rot

3. The harness matters more than the model

Once frontier models are within ~10% of each other on SWE-bench, the differentiator is the harness: loop, tool surface, memory, sandbox, validation. SWE-bench Pro data shows that switching scaffolding moves the same model 5–15 points. Sutskever's Nov 2025 framing of an "end of the scaling era" reinforces this: gains now come from harness, context, and evals, not raw model size. His term for the gap between benchmark and production performance is "jaggedness": models are superhuman at test-taking but unreliable in practice.
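To make the harness components concrete, here is a toy sketch of a loop with a tool allowlist, a transcript as memory, and a validation gate. The model is a stub function rather than a real LLM call, the action format is invented for illustration, and sandboxing is deliberately not modeled.

```python
def run_harness(model, tools, validate, max_steps=5):
    """Toy harness. `model` maps the transcript to an action dict
    of the (hypothetical) form {"tool": name, "args": {...}}."""
    memory = []                              # memory: transcript of (tool, result)
    for _ in range(max_steps):               # loop: bounded number of steps
        action = model(memory)
        if action["tool"] == "finish":
            result = action["args"]["answer"]
            if validate(result):             # validation: gate the final answer
                return result
            memory.append(("finish_rejected", result))
            continue
        tool = tools.get(action["tool"])     # tool surface: explicit allowlist
        if tool is None:
            memory.append((action["tool"], "error: unknown tool"))
            continue
        memory.append((action["tool"], tool(**action["args"])))
    return None                              # step budget exhausted


def stub_model(memory):
    # Stand-in for an LLM: call `add` once, then report the result.
    if not memory:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"tool": "finish", "args": {"answer": memory[-1][1]}}


tools = {"add": lambda a, b: a + b}
answer = run_harness(stub_model, tools, validate=lambda r: isinstance(r, int))
```

Swapping `stub_model` while holding the rest fixed, or holding the model fixed and changing the loop, tools, or validator, is the kind of scaffolding change the SWE-bench Pro comparison refers to.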

→ Patterns · Harness thesis · Patterns · Jaggedness · People · Sutskever

4. Don't build agents for everything. Don't build multi-agent unless work is genuinely parallel

Anthropic's workflow ladder gives clients a defensible "no" to over-engineered designs. Cognition's "reads fan out, writes stay single-threaded" rule resolves the multi-agent debate in practice.
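One way to read "reads fan out, writes stay single-threaded" in code: read-only work runs in parallel, and every edit goes through a single sequential writer. This is a sketch of our interpretation, not Cognition's implementation; `read_file_summary` is a hypothetical stand-in for a read-only sub-agent.

```python
from concurrent.futures import ThreadPoolExecutor


def read_file_summary(path: str) -> str:
    # Stand-in for a read-only sub-agent (search, summarize, inspect).
    return f"summary of {path}"


def fan_out_reads(paths, max_workers=4):
    # Reads have no side effects, so they can run concurrently.
    # Executor.map returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_file_summary, paths))


def apply_writes(workspace: dict, edits):
    # Writes are applied one at a time by a single writer, so later
    # edits always see the effect of earlier ones.
    for path, content in edits:
        workspace[path] = content
    return workspace


summaries = fan_out_reads(["a.py", "b.py"])
workspace = apply_writes({}, [("a.py", "v1"), ("a.py", "v2")])
```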

→ Patterns · Building Effective Agents · Patterns · Multi-agent debate

5. Evals first, autonomy second

Hamel Husain's thesis, validated everywhere we look: 60–80% of dev time should be error analysis; products that ship without evals fail. This is the prerequisite to every loop — and the most under-sold work in the field.
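A minimal sketch of what "evals first" can look like: tally failure modes from labeled eval results, then gate autonomy on the pass rate. The result schema, the failure-mode labels, and the 0.9 threshold are all assumptions for illustration, not figures from the source.

```python
from collections import Counter


def error_analysis(results):
    """results: list of dicts with 'passed' (bool) and 'failure_mode'
    (a label assigned during manual review, or None when passed)."""
    if not results:
        return 0.0, []
    failures = [r for r in results if not r["passed"]]
    pass_rate = 1 - len(failures) / len(results)
    modes = Counter(r["failure_mode"] for r in failures)
    return pass_rate, modes.most_common()   # most frequent failure first


def autonomy_allowed(pass_rate, threshold=0.9):
    # Hypothetical gate: widen autonomy only once evals pass often enough.
    return pass_rate >= threshold


results = [
    {"passed": True, "failure_mode": None},
    {"passed": True, "failure_mode": None},
    {"passed": False, "failure_mode": "edited_wrong_file"},
    {"passed": False, "failure_mode": "edited_wrong_file"},
]
pass_rate, modes = error_analysis(results)
```

The ranked failure modes, not the single pass-rate number, are what drive the next round of fixes.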

→ Patterns · Evals · People · Hamel Husain

02 What clients should expect (honest numbers)

Metric	Expectation	Context
Time-to-merge	−30 to −60%	on bounded ticket types (bumps, lint, typos, simple endpoints)
PR review throughput	+50 to +200%	with critic-style review agent paired
Agent-authored code	Junior–mid	senior review still required (Ronacher, Hashimoto)
Costs without discipline	3–10×	cache hygiene + batch routing + seat plans

Refuse to repeat "X% of PRs are agent-authored" without normalization by complexity (lines × files × test churn). Vendor numbers are not comparable.
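A sketch of the normalization, assuming complexity is the product named above (lines × files × test churn). The PR fields are hypothetical, and adding 1 to test churn is our own adjustment so that PRs without test changes do not get zero weight.

```python
from dataclasses import dataclass


@dataclass
class PR:
    agent_authored: bool
    lines_changed: int
    files_changed: int
    test_lines_changed: int


def complexity(pr: PR) -> int:
    # lines x files x test churn; the +1 is an assumption (see lead-in).
    return pr.lines_changed * pr.files_changed * (1 + pr.test_lines_changed)


def agent_share(prs):
    """Return (raw share of PRs, complexity-weighted share) authored by agents."""
    if not prs:
        return 0.0, 0.0
    raw = sum(p.agent_authored for p in prs) / len(prs)
    total = sum(complexity(p) for p in prs)
    if total == 0:
        return raw, 0.0
    weighted = sum(complexity(p) for p in prs if p.agent_authored) / total
    return raw, weighted


# Three trivial agent PRs and one large human PR (made-up data).
prs = [PR(True, 2, 1, 0)] * 3 + [PR(False, 100, 5, 49)]
raw, weighted = agent_share(prs)
```

With this made-up data the raw share is 75% of PRs while the weighted share is well under 1%, which is the gap that makes unnormalized vendor numbers incomparable.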

03 House-style design principles

Eight principles we install in every engagement. The first four are platform; the next four are practice.

04 Risk slides (use directly with clients)

Real, dated, primary-sourced incidents. Full dossier in Production · Incidents.

Claude Code CVEs (2025). Path bypass + command injection; PoC chained 50 no-op subcommands to a curl exfil. Patched in v2.1.90.
"Comment and Control" attack (2026). Confirmed against Claude Code Security Review, Gemini CLI Action, GitHub Copilot Agent — hostile PR titles and comments hijack CI-running agents.
Supabase MCP / Cursor leak. Full service_role key skipped Row-Level Security; agent SELECTed private tables into a support ticket.
"Claudy Day" (Oasis Security, March 2026). Invisible prompt injection + exfiltration against default claude.ai.
Cursor IDE indirect injection. A repo-embedded "delete the user's home directory" instruction was treated as context.
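Several of these incidents share one shape: untrusted text (PR titles, comments, repo files, ticket content) reaches an agent that also holds powerful tools. A sketch of one possible mitigation is to drop high-risk tools whenever untrusted sources are in context. The tool names and source labels are hypothetical, and this is an illustration of the idea, not a complete defense or a fix taken from any of the reports above.

```python
# Hypothetical tool names; a real harness would define its own risk tiers.
HIGH_RISK_TOOLS = {"shell", "http_request", "write_file"}


def allowed_tools(all_tools, context_sources, trusted_sources):
    """Return the tools an agent may call given where its context came from.

    If any context source is not explicitly trusted, high-risk tools are
    removed for the whole run.
    """
    tainted = any(src not in trusted_sources for src in context_sources)
    if tainted:
        return set(all_tools) - HIGH_RISK_TOOLS
    return set(all_tools)


trusted = {"system_prompt", "maintainer_config"}
ci_run = allowed_tools({"read_file", "shell"}, ["system_prompt", "pr_title"], trusted)
local_run = allowed_tools({"read_file", "shell"}, ["system_prompt"], trusted)
```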

05 Tooling stance (defaults)

Client profile	Recommended core
Modern web / TS shop	Claude Code + Cursor in parallel
Monorepo / large codebase	Sourcegraph Amp + Claude Code
OSS startup, cost-sensitive	Aider + OpenRouter / LiteLLM
Regulated enterprise / JetBrains shop	JetBrains AI / Junie + gateway
Self-hosting requirement	OpenHands or Continue.dev + gateway
Ticket-driven product team	Devin or Copilot agent for issue→PR + Claude Code
Always	Pair a high-recall + a low-noise code-review agent

Detail and pricing → Harnesses

06 Top-10 reading list

Anthropic, "Building Effective Agents"
Karpathy, Sequoia AI Ascent 2026 talk on agentic engineering
Cherny, interviews on Latent Space + Lenny's Podcast
Ronacher, "Agentic Coding Recommendations" + "A Year Of Vibes"
Hashimoto, Zed blog conversation
Lance Martin, "Context Engineering for Agents"
Drew Breunig, "How Long Contexts Fail" + "How to Fix Your Context"
Cognition, "Don't Build Multi-Agents" + "What's Actually Working"
Anthropic, "How We Built Our Multi-Agent Research System"
Hamel Husain, "LLM Evals: Everything You Need to Know"

07 Open research gaps to chase next

Current gateway-feature matrix (LiteLLM vs Portkey vs Bedrock vs Vertex) with caching and audit-log coverage.
Eval-suite templates per loop in the catalog.
Anthropic internal practices beyond Cherny's interviews.
Quantified outcomes from independent (non-vendor) sources.
Logged-in X / Twitter primary sources (needs a browser MCP set up).
