01 IDE & local developer workflow #
Harness selection & house style
- Help the team pick a coding-agent harness (Claude Code, Cursor, Codex, Aider, Continue, Cline, Zed AI). See the harness landscape.
- Define defaults: model tier, permission mode, sandbox settings, MCP servers enabled, hook setup.
- Onboarding kit: one-pager + screen recording so a new hire is productive on day one.
Project prompt files (CLAUDE.md, AGENTS.md, .cursorrules)
- Author and maintain per-repo prompt files: project overview, architecture, build/test/lint commands, do-not-touch zones, naming conventions, review expectations.
- Prompt-file linter / analyzer — flags bloat, stale references, "rule density", and coverage gaps.
- Versioning + review process for prompt files. Treat them as code, not lore.
- Reference template. The viral Karpathy CLAUDE.md (Karpathy's observations distilled by Forrest Chang, 110k+ stars) is a fine starting boilerplate. Strip the unverified "65→94%" claim and tailor to the client's stack and voice. → Patterns · our position.
- MEMORY.md / ERRORS.md artifacts. Persistent decision and failure logs across sessions: the concrete instantiation of the Write operation in context engineering. The most under-installed piece in mature Claude Code deployments.
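A minimal sketch of what a prompt-file linter could check, assuming three heuristics: a line-count limit for bloat, backticked file paths that no longer exist for stale references, and the share of lines containing imperative words for rule density. The thresholds, regexes, and the `lint_prompt_file` name are illustrative assumptions, not an existing tool.

```python
import re
from pathlib import Path

# Hypothetical thresholds; tune per client.
MAX_LINES = 200
MAX_RULE_DENSITY = 0.5
RULE_WORDS = re.compile(r"\b(must|never|always|do not|don't)\b", re.IGNORECASE)
PATH_REF = re.compile(r"`([\w./-]+\.\w+)`")  # backticked tokens that look like file paths


def lint_prompt_file(text: str, repo_root: Path) -> list[str]:
    """Return human-readable findings for one prompt file (e.g. CLAUDE.md)."""
    findings = []
    lines = text.splitlines()

    # Bloat: the file has grown past the agreed size.
    if len(lines) > MAX_LINES:
        findings.append(f"bloat: {len(lines)} lines (limit {MAX_LINES})")

    # Stale references: backticked paths that no longer exist in the repo.
    for ref in sorted(set(PATH_REF.findall(text))):
        if not (repo_root / ref).exists():
            findings.append(f"stale reference: {ref}")

    # Rule density: share of non-empty lines that contain an imperative rule word.
    non_empty = [ln for ln in lines if ln.strip()]
    if non_empty:
        density = sum(bool(RULE_WORDS.search(ln)) for ln in non_empty) / len(non_empty)
        if density > MAX_RULE_DENSITY:
            findings.append(f"rule density: {density:.0%} of lines are hard rules")
    return findings
```

Coverage-gap detection (for example, a missing build/test/lint section) would need a per-client checklist and is left out of this sketch.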
Plan-mode adoption as a team norm
- "No code without an approved plan" installed as a hard team norm during pilots — not just a tool feature.
- Measure compliance: % of agent PRs where the plan was approved before code was written.
- Coaching for tech leads on how to push back when team members skip planning. See → Patterns · plan mode.
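The compliance measure above can be computed from two timestamps per PR. A sketch, assuming the team can export when the plan was approved and when the first code commit landed; the `AgentPR` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class AgentPR:
    pr_id: str
    first_code_commit_at: datetime
    plan_approved_at: Optional[datetime]  # None if no plan was ever approved


def plan_compliance(prs: list[AgentPR]) -> float:
    """Fraction of agent PRs whose plan was approved before the first code commit."""
    if not prs:
        return 0.0
    compliant = sum(
        1
        for pr in prs
        if pr.plan_approved_at is not None
        and pr.plan_approved_at <= pr.first_code_commit_at
    )
    return compliant / len(prs)
```

A plan approved after the first commit counts as non-compliant here; that strictness is a design choice the pilot team should agree on up front.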
Skills / commands library
- Per-repo skills for repeatable tasks: "scaffold endpoint", "add migration", "add feature flag", "write integration test in our style", "ship a hotfix".
- Shared cross-repo skills: security review, dependency upgrade, changelog generation.
- Catalog with usage metrics so unused skills get pruned.
Sub-agent recipes
- Pre-built sub-agents: code reviewer, security reviewer, test generator, doc generator, refactor planner.
02 Agent loops in the SDLC #
Issue / bug-tracker loops
- New bug → draft PR. Agent pulls the issue, reproduces, writes a PR, requests human review.
- Pilot scope guidance. Start with 3–5 of the most repetitive ticket categories (CVE bumps, typo fixes, lint cleanup, small endpoint additions). Avoid "X% of all PRs" framing. Measure: time-to-merge, reviewer-edit count, post-merge incident rate. See production · issue→PR.
- Triage loop. Incoming issues labeled, prioritized, deduped, linked to related code/tickets.
- Stale-issue loop. Pings owners, summarizes status, closes truly dead items.
Support ↔ engineering loops
- Ticket investigation. Pulls logs, traces, recent deploys, similar past tickets → posts structured handoff or files a bug.
- Recurring-issue detector. Clusters support tickets, surfaces product/code root causes.
PR / code review loops
- Automated first-pass review on every PR (style, security, test gaps, regressions, perf). See tools in production.
- House recommendation: layer two reviewers. High-recall (Greptile or Sourcegraph Amp) on PR open, low-noise (CodeRabbit) summary at ready-for-review. Track human-edit distance on agent comments to measure usefulness.
- Add a verifier / critic pass over agent-authored PRs before they reach human review — a separate (often cheaper) model whose job is to find faults. Measurably reduces noise. See → Patterns · verifier.
- Description / changelog drafting from the diff.
- Reviewer assignment based on CODEOWNERS + recent activity.
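One way to approximate human-edit distance on agent comments, sketched with the standard library. This uses `difflib.SequenceMatcher` similarity rather than a true Levenshtein distance, which is an assumption made for simplicity; the function name is illustrative.

```python
from difflib import SequenceMatcher


def human_edit_distance(agent_text: str, final_text: str) -> float:
    """0.0 means the human kept the agent's text unchanged; 1.0 means nothing survived."""
    return 1.0 - SequenceMatcher(None, agent_text, final_text).ratio()
```

Tracked per comment and averaged per week, a falling value suggests the review agent's output needs less rework.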
CI / test loops
- Flaky-test loop. Detect flake → quarantine → agent investigates and proposes fix.
- Failed-build triage. On red CI, agent classifies failure (infra vs test vs real bug) and pings the right person.
- Coverage backfill. Writes tests for low-coverage modules touched recently; humans review.
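The detection step of the flaky-test loop can be a simple rule. A sketch, assuming CI history is available as `(test_name, commit_sha, passed)` tuples and treating a test as flaky when it both passed and failed on the same commit; the data shape is an assumption.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """runs: iterable of (test_name, commit_sha, passed).

    A test is reported as flaky if it has both a pass and a fail on the same commit.
    """
    outcomes = defaultdict(set)
    for test, sha, passed in runs:
        outcomes[(test, sha)].add(bool(passed))
    return sorted({test for (test, _), seen in outcomes.items() if seen == {True, False}})
```

Tests returned here are candidates for quarantine; the agent investigation and proposed fix happen downstream.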
Dependency & security loops
- Dependabot-style upgrades with agent-written changelogs, test runs, risk notes, and PRs.
- CVE response. New CVE → agent checks affected paths → proposes patch or mitigation.
- SBOM / license-drift monitor.
Incident / on-call loops
- Page fires → agent pulls dashboards, recent deploys, similar incidents → drafts summary and proposed first steps.
- Post-incident: drafts the postmortem from chat transcript + timeline + commits.
- The highest-leverage first agent for risk-averse clients: read-only, high value, bounded downside. Vendors: Cleric, Resolve, Parity. See production · on-call.
Release / deploy loops
- Auto-generated release notes (user-facing + internal).
- Pre-deploy risk summary: what changed, who owns it, what to watch.
- Canary watcher. Observes metrics during canary; recommends promote / rollback.
Multi-agent shape (when a loop fans out)
- Default: single-threaded with subagent reads — many agents gather/critique in parallel; one trunk agent owns the write.
- Multi-agent writes only when work is genuinely parallel and write-conflict-free. Always run a cost ratio first: Anthropic reports that its multi-agent research system uses roughly 15× the tokens of a chat interaction (a single agent uses roughly 4×).
- See → Patterns · multi-agent debate and reads fan out.
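The default shape can be sketched with `asyncio`: read-only subagents run in parallel, and a single trunk step performs the only write. `read_agent` is a stand-in for a real subagent call, and the agent names are hypothetical.

```python
import asyncio


async def read_agent(name: str, task: str) -> str:
    # Stand-in for a read-only subagent call (hypothetical); returns findings as text.
    await asyncio.sleep(0)
    return f"{name}: notes on {task}"


async def run_loop(task: str) -> dict:
    # Fan out: read-only subagents gather and critique in parallel.
    findings = await asyncio.gather(
        read_agent("code-search", task),
        read_agent("test-history", task),
        read_agent("critic", task),
    )
    # Fan in: one trunk agent owns the single write, so there are no write conflicts.
    patch = f"patch for {task} informed by {len(findings)} findings"
    return {"findings": list(findings), "patch": patch}
```

Usage: `asyncio.run(run_loop("fix flaky login test"))` returns the three findings plus the one patch produced by the trunk step.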
03 Agent surface UX (Defensive UX) #
Engineering clients under-invest here because their users are other engineers — "they'll figure it out." They won't; they'll just stop using the agent. 1–2 week deliverable that lifts adoption more than another month of prompt tuning. See → Patterns · Defensive UX.
- Eng-bot. Show confidence; cite which files / commits backed the answer; make replies trivially dismissible, especially on mobile. Don't reply unless asked.
- PR-review agent. Never approve, only comment. Require human "resolve" on each thread. Collapse low-confidence comments by default. Make agent-authorship visible at a glance.
- Issue → PR loop. Open a draft PR with a clearly-marked "agent-authored, plan attached" badge; provide an explicit "looks wrong, please revise" button.
- Production access (MCP). Surface every action as "this will do X" before execution. Default dry-run. One-click rollback. Cite source records.
Universal principles (from Yan, Microsoft HAI, Google PAIR, Apple HIG): set right expectations · enable efficient dismissal · provide attribution · anchor on familiarity · collect feedback in-flow.
04 Production access for dev agents #
- MCP server (or CLI) for the product. Typed, auditable surface for production reads + sandboxed actions. Avoids agents "shelling into prod."
- Tool tiering. Read-only / reversible / destructive — destructive requires explicit human approval.
- Read-replica or sanitized prod mirrors for agent investigation.
- Per-agent identities with scoped tokens, full audit log, default dry-run.
- Secret broker so agents never see raw credentials.
Security note. The Supabase MCP / Cursor incident showed what happens when an MCP server holds credentials more privileged than the user the agent represents. See incident dossier.
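A sketch of how tool tiering and default dry-run might combine into one authorization check. The tool names and the `authorize` function are hypothetical, and unknown tools are assumed to fall into the strictest tier.

```python
from enum import Enum


class Tier(Enum):
    READ_ONLY = 1
    REVERSIBLE = 2
    DESTRUCTIVE = 3


# Hypothetical tool registry for illustration.
TOOL_TIERS = {
    "query_orders": Tier.READ_ONLY,
    "toggle_flag": Tier.REVERSIBLE,
    "delete_user": Tier.DESTRUCTIVE,
}


def authorize(tool: str, *, dry_run: bool = True, human_approved: bool = False) -> str:
    """Return 'execute', 'dry-run', or 'needs-approval' for a requested tool call."""
    tier = TOOL_TIERS.get(tool, Tier.DESTRUCTIVE)  # unknown tools get the strictest tier
    if tier is Tier.READ_ONLY:
        return "execute"
    if dry_run:  # default: anything that mutates state is only simulated
        return "dry-run"
    if tier is Tier.DESTRUCTIVE and not human_approved:
        return "needs-approval"
    return "execute"
```

The decision is deliberately made outside the model: the agent can request any tool, but the tier table and approval flag decide what actually runs.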
05 Code quality, refactors & migrations #
- Large-scale codemod campaigns: framework upgrades, monorepo splits/merges, deprecated-API removal, language-version bumps.
- Dead-code detection and removal with safety checks.
- Test backfill for legacy modules; doc backfill (docstrings, ADRs, runbooks) from code + git history.
- "Architecture-drift" reports — where the code has diverged from the documented design.
06 Developer-facing chat assistant #
- Eng-bot in Slack/Teams/Discord with a secure sandbox and scoped repo, docs, and ticketing access. Use cases:
- "Who owns this service?" / "What changed in prod yesterday?" / "Find the code that handles X."
- "Summarize this thread into a ticket." / "Open a PR to fix the typo in foo.ts."
- "Why did build #1234 fail?"
- Persona separation when multiple bots share infra (e.g. read-only repo-bot vs write-capable PR-bot).
07 Internal engineering knowledge #
- Connect engineering knowledge surfaces (repos, ADRs, runbooks, RFC archive, design docs, eng Slack) into a retrieval layer.
- Stale-docs agent. Flags docs that contradict current code or are untouched since the code they describe was rewritten.
- Onboarding agent. Answers from internal docs, pings humans when uncertain, tracks where new hires struggle.
08 Cross-cutting · Platform, security, governance #
Harness & provider strategy
- Matrix of available harnesses with selection criteria. See landscape.
- Provider independence. Route through a gateway (LiteLLM, OpenRouter, Bedrock, Vertex, in-house) so the team isn't locked to one vendor.
- Avoid paying API-token prices. Enterprise/seat plans, batch tiers, aggressive caching, self-hosted open-weights for bulk non-sensitive work.
Sandboxing & permissions
- Standard agent sandbox spec: filesystem boundary, network allowlist, secret broker, time/budget limits.
- Per-tool risk tiering with escalating approval flows.
- Audit log everything; weekly review of what agents actually did.
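The sandbox spec could be expressed as a small typed object that the harness consults before each action. A sketch under stated assumptions: the field names, defaults, and `SandboxSpec` class are illustrative, and the path check is purely lexical (it does not resolve symlinks).

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from urllib.parse import urlparse


@dataclass(frozen=True)
class SandboxSpec:
    fs_root: str                  # filesystem boundary
    network_allowlist: frozenset  # hostnames the agent may reach
    secret_broker_url: str        # agent asks the broker; it never holds raw credentials
    max_seconds: int = 900        # time limit per task (assumed default)
    max_usd: float = 5.0          # budget limit per task (assumed default)

    def allows_path(self, path: str) -> bool:
        # Lexical check only: reject ".." segments and anything outside fs_root.
        p = PurePosixPath(path)
        return ".." not in p.parts and p.is_relative_to(self.fs_root)

    def allows_url(self, url: str) -> bool:
        return urlparse(url).hostname in self.network_allowlist

    def within_budget(self, elapsed_seconds: float, spent_usd: float) -> bool:
        return elapsed_seconds <= self.max_seconds and spent_usd <= self.max_usd
```

A real deployment would enforce these limits at the OS or container layer as well; this object only documents and checks the intended policy.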
Guardrails (distinct from sandboxing & hooks)
- Hooks are deterministic and fire on tool actions. Guardrails are probabilistic and fire on text. Most clients conflate them; we don't. See → Patterns · guardrails.
- Four layers, in order of preference:
- Structural guidance — constrain generation to a valid format (Microsoft Guidance, OpenAI structured outputs, JSON-mode). Beats post-hoc validation.
- Syntactic — valid JSON, parseable SQL, value-in-range, diff applies cleanly, tests pass.
- Semantic — second LLM checks for hallucination / alignment with source.
- Safety — bad-word lists for easy cases; LLM evaluator for nuanced (toxicity, PII, prompt-injection echoes). Apply to inputs too — the entire incident dossier is what happens otherwise.
- Sellable as a single audit deliverable: "your agent's defense-in-depth map, with the gaps named."
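As an illustration of the syntactic layer, here is a sketch that validates a hypothetical triage output of the form `{"label": ..., "priority": ...}`. The schema, the allowed labels, and the 1 to 4 priority range are assumptions made up for the example.

```python
import json

ALLOWED_LABELS = ("bug", "feature", "question")  # hypothetical label set


def syntactic_check(raw: str) -> list[str]:
    """Return a list of syntactic violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["expected a JSON object"]

    errors = []
    if data.get("label") not in ALLOWED_LABELS:
        errors.append("label not in allowed set")
    priority = data.get("priority")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(priority, int) or isinstance(priority, bool) or not 1 <= priority <= 4:
        errors.append("priority must be an integer from 1 to 4")
    return errors
```

Because this layer is deterministic and cheap, it runs before any semantic or safety check that needs a second model call.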
Context engineering audit
- Use the Write / Select / Compress / Isolate taxonomy (Martin) as the diagnostic frame for any agent that's "behaving badly."
- Diagnose context-rot failure modes: Poisoning · Distraction · Confusion · Clash (Breunig).
- Most clients have no visibility into what was in the window when the agent failed. Trace-storage + context-window dumps are a small but high-ROI install.
Security & compliance
- Threat model for dev-agent deployments: prompt injection, exfiltration, tool abuse, supply-chain risk in MCP servers and skills.
- Code-leak prevention: which repos can be sent to which provider; PII/secret scanning before model calls.
- Eval & red-team suite for each deployed agent.
- Incident risk-slide pack for every engagement, drawn from the public incident dossier — Claude Code CVEs, "Comment & Control", Supabase/Cursor leak, Claudy Day, Cursor indirect injection.
Evaluation & observability (the prerequisite, not the polish)
- Eval suite is the first deliverable in nearly every engagement — before any autonomous loop ships. See → Patterns · evals.
- Match metric to task. Classification → recall / precision / PR-AUC. Reference-bearing → BLEU / ROUGE / BERTScore (cite, don't trust). Open-ended → LLM-as-evaluator.
- LLM-as-judge bias mitigations to install from day one — clients won't think of them:
- Position bias → swap order, only count consistent wins.
- Verbosity bias → equalize response length.
- Self-enhancement bias → never judge with the same model family that generated.
- Score noise → ask for pairwise comparisons, not numeric scores.
- Agent-level metrics: PR acceptance rate, human-edit distance, time-to-resolution, escalation rate, cost per task.
- Trace storage (LangSmith / Langfuse / Braintrust / in-house) so bad runs are debuggable.
- Regression eval sets versioned alongside prompts and skills — treat agent-behavior changes like code changes.
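The position-bias mitigation (swap order, only count consistent wins) combined with pairwise comparison can be sketched as follows. The `judge` callable stands in for an LLM call and is assumed to return the string "first" or "second"; that interface is hypothetical.

```python
from typing import Callable, Optional

# (prompt, first_response, second_response) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, prompt: str, a: str, b: str) -> Optional[str]:
    """Ask the judge twice with the order swapped.

    Returns "a" or "b" only when both verdicts agree; returns None otherwise,
    so position-dependent verdicts are not counted as wins.
    """
    first_verdict = judge(prompt, a, b)
    second_verdict = judge(prompt, b, a)
    pick_1 = "a" if first_verdict == "first" else "b"
    pick_2 = "b" if second_verdict == "first" else "a"
    return pick_1 if pick_1 == pick_2 else None
```

A judge that always prefers whichever response comes first yields `None` every time, which is exactly the signal that its verdicts should be discarded.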
Feedback flywheel instrumentation
- Build in week 1 of the pilot, not after. "Without instrumentation up front, the data is gone." See → Patterns · feedback flywheel.
- Signals to capture per deployment:
- PR-level: merge vs discard · human edit distance · time-to-merge · reverts within 7 days.
- Review agent: comment resolved vs dismissed · human edit on agent comment.
- Plan mode: plan accepted as-is vs revised vs rejected — direct proxy for plan quality.
- Eng-bot: thread closed vs escalated · follow-up question rate.
- These signals don't just measure — they become the next eval set.
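A sketch of the kind of structured log line that makes these signals recoverable later. The field names (`run_id`, `surface`, `signal`) and the helper itself are assumptions; the point is one JSON record per signal, tagged with the run that produced it.

```python
import json
from datetime import datetime, timezone


def feedback_event(run_id: str, surface: str, signal: str, **fields) -> str:
    """Serialize one feedback signal as a JSON log line tagged with its run id."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,    # ties the signal back to the stored trace
        "surface": surface,  # e.g. "pr", "review-agent", "plan-mode", "eng-bot"
        "signal": signal,    # e.g. "merged", "discarded", "comment_dismissed"
        **fields,
    }
    return json.dumps(record, sort_keys=True)
```

Usage: `feedback_event("run-42", "pr", "merged", edit_distance=0.12)` produces one line that can be joined against the trace for run-42 when building the next eval set.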
Cost hygiene
- Prompt-cache audit is the highest-ROI quick win on most platforms — teams routinely burn 3–10× more than they need to. Cache-hit-rate becomes a top-tier observability metric.
- Batch off-peak — Anthropic Batch API and similar give ~50% off for non-real-time work (evals, codemods, doc backfill).
- Seat-based plans for bursty workloads (Claude Code Max, Cursor Pro+/Ultra) — much lower TCO than raw per-token at heavy use.
- Semantic caching: avoid in agent loops. False-match risk is silent and high. Safe only against item IDs or constrained inputs. See → Patterns · caching.
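Cache-hit rate can be computed from per-request usage records. This sketch assumes usage dictionaries shaped like the `usage` object in Anthropic's Messages API (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); other providers report caching differently, so treat the field names as an assumption to verify.

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Share of prompt tokens served from cache across a set of requests."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    uncached = sum(u.get("input_tokens", 0) for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0
```

A persistently low value on a loop with a large, stable system prompt is usually the first thing to look at in a prompt-cache audit.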
Enablement & curriculum
- Workshops by role (engineers, tech leads, EMs, platform team).
- Agent champions program: one power user per team owns local prompts/skills, feeds learnings back.
- Office hours and a shared library of patterns that worked / failed.
- "Jaggedness" vocabulary training — get teams to name the phenomenon so they don't over-trust benchmark performance. See → Patterns · jaggedness.
- Stochastic tool mastery curriculum — Zed's coinage; directing non-deterministic systems is itself an engineering skill. Sellable as a formal program. See → Patterns · interwoven workflow.
- Plan-mode coaching — teach tech leads to push back when plans get skipped.
→ How we package engagements #
1 · Assessment (2–4 weeks) — deliverables
- Interviews, tool inventory, value-map of where AI plausibly moves the needle.
- Three-phase mapping (Zed): which SDLC stages are still purely deterministic, which leapt to stochastic, where is the interweaving messy? That map is the pilot-priority list.
- Eval-gap audit — what does the team measure today, what's missing for the proposed loops.
- Defensive-UX gap audit — for each existing agent surface, score against the five principles.
- Risk-review checklist — incident dossier mapped onto the client's stack; surfaces where input guardrails / tool-tiering / MCP credentials would currently fail.
- Context-engineering audit — for existing agents, W/S/C/I review and context-rot diagnostic.
2 · Pilot (4–8 weeks) — ships with its supporting infra, not after
- 1–2 high-leverage loops (e.g. PR-review pair + bounded issue→PR for 3–5 ticket categories).
- Working eval suite with labeled error categories — before the loop runs autonomously.
- Minimum guardrails — input + output, per the four-layer model.
- Defensive-UX layer for the agent surface (badges, dismissal, attribution, dry-run).
- Feedback instrumentation from week 1 — structured logs tagged with run-id.
- Plan-mode norm installed; compliance measured.
3 · Scale & enable (ongoing)
- Platform work: harness gateway, sandbox spec, MCP discipline, audit log.
- Governance: prompt-file review process, eval regression gates, budget caps.
- Curriculum: jaggedness vocabulary, stochastic tool mastery, plan-mode coaching.
- Additional loops following the same install-eval-guardrails-UX-feedback shape.