Catalog — Areas & Solutions

Source: IDEAS.md Scope: software development only

01 IDE & local developer workflow #

Harness selection & house style

Help the team pick a coding-agent harness (Claude Code, Cursor, Codex, Aider, Continue, Cline, Zed AI). See the harness landscape.
Define defaults: model tier, permission mode, sandbox settings, MCP servers enabled, hook setup.
Onboarding kit: one-pager + screen recording so a new hire is productive on day one.

Project prompt files (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`)

Author and maintain per-repo prompt files: project overview, architecture, build/test/lint commands, do-not-touch zones, naming conventions, review expectations.
Prompt-file linter / analyzer — flags bloat, stale references, "rule density", and coverage gaps.
Versioning + review process for prompt files. Treat them as code, not lore.
Reference template. The viral Karpathy CLAUDE.md (Karpathy's observations distilled by Forrest Chang, 110k+ stars) is a fine starting boilerplate. Strip the unverified "65→94%" claim and tailor to the client's stack and voice. → Patterns · our position.
MEMORY.md / ERRORS.md artifacts. Persistent decision and failure logs across sessions — the concrete instantiation of the Write operation in context engineering. Most under-installed thing in mature Claude Code deployments.

Plan-mode adoption as a team norm

"No code without an approved plan" installed as a hard team norm during pilots — not just a tool feature.
Measure compliance: % of agent PRs where the plan was approved before code was written.
Coaching for tech leads on how to push back when team members skip planning. See → Patterns · plan mode.

Skills / commands library

Per-repo skills for repeatable tasks: "scaffold endpoint", "add migration", "add feature flag", "write integration test in our style", "ship a hotfix".
Shared cross-repo skills: security review, dependency upgrade, changelog generation.
Catalog with usage metrics so unused skills get pruned.

Sub-agent recipes

Pre-built sub-agents: code reviewer, security reviewer, test generator, doc generator, refactor planner.

02 Agent loops in the SDLC #

Agent loops mapped to SDLC stages. Each is independently shippable as a pilot.

Issue / bug-tracker loops

New bug → draft PR. Agent pulls the issue, reproduces, writes a PR, requests human review.
Pilot scope guidance. Start with 3–5 of the most repetitive ticket categories (CVE bumps, typo fixes, lint cleanup, small endpoint additions). Avoid "X% of all PRs" framing. Measure: time-to-merge, reviewer-edit count, post-merge incident rate. See production · issue→PR.
Triage loop. Incoming issues labeled, prioritized, deduped, linked to related code/tickets.
Stale-issue loop. Pings owners, summarizes status, closes truly dead items.

Support ↔ engineering loops

Ticket investigation. Pulls logs, traces, recent deploys, similar past tickets → posts structured handoff or files a bug.
Recurring-issue detector. Clusters support tickets, surfaces product/code root causes.

PR / code review loops

Automated first-pass review on every PR (style, security, test gaps, regressions, perf). See tools in production.
House recommendation: layer two reviewers. High-recall (Greptile or Sourcegraph Amp) on PR open, low-noise (CodeRabbit) summary at ready-for-review. Track human-edit distance on agent comments to measure usefulness.
Add a verifier / critic pass over agent-authored PRs before they reach human review — a separate (often cheaper) model whose job is to find faults. Measurably reduces noise. See → Patterns · verifier.
Description / changelog drafting from the diff.
Reviewer assignment based on CODEOWNERS + recent activity.

CI / test loops

Flaky-test loop. Detect flake → quarantine → agent investigates and proposes fix.
Failed-build triage. On red CI, agent classifies failure (infra vs test vs real bug) and pings the right person.
Coverage backfill. Writes tests for low-coverage modules touched recently; humans review.

Dependency & security loops

Dependabot-style upgrades with agent-written changelogs, test runs, risk notes, and PRs.
CVE response. New CVE → agent checks affected paths → proposes patch or mitigation.
SBOM / license-drift monitor.

Incident / on-call loops

Page fires → agent pulls dashboards, recent deploys, similar incidents → drafts summary and proposed first steps.
Post-incident: drafts the postmortem from chat transcript + timeline + commits.
Highest-leverage agent intro at risk-averse clients — read-only, high value, bounded downside. Vendors: Cleric, Resolve, Parity. See production · on-call.

Release / deploy loops

Auto-generated release notes (user-facing + internal).
Pre-deploy risk summary: what changed, who owns it, what to watch.
Canary watcher. Observes metrics during canary; recommends promote / rollback.

Multi-agent shape (when a loop fans out)

Default: single-threaded with subagent reads — many agents gather/critique in parallel; one trunk agent owns the write.
Multi-agent writes only when work is genuinely parallel and write-conflict-free. Always run a cost ratio first — Anthropic's multi-agent research system costs ~15× a single agent.
See → Patterns · multi-agent debate and reads fan out.

03 Agent surface UX (Defensive UX) #

Engineering clients under-invest here because their users are other engineers — "they'll figure it out." They won't; they'll just stop using the agent. 1–2 week deliverable that lifts adoption more than another month of prompt tuning. See → Patterns · Defensive UX.

Eng-bot. Show confidence; cite which files / commits backed the answer; trivially closeable, especially on mobile. Don't reply unless asked.
PR-review agent. Never approve, only comment. Require human "resolve" on each thread. Collapse low-confidence comments by default. Make agent-authorship visible at a glance.
Issue → PR loop. Open a draft PR with a clearly-marked "agent-authored, plan attached" badge; provide an explicit "looks wrong, please revise" button.
Production access (MCP). Surface every action as "this will do X" before execution. Default dry-run. One-click rollback. Cite source records.

Universal principles (from Yan, Microsoft HAI, Google PAIR, Apple HIG): set right expectations · enable efficient dismissal · provide attribution · anchor on familiarity · collect feedback in-flow.

04 Production access for dev agents #

MCP server (or CLI) for the product. Typed, auditable surface for production reads + sandboxed actions. Avoids agents "shelling into prod."
Tool tiering. Read-only / reversible / destructive — destructive requires explicit human approval.
Read-replica or sanitized prod mirrors for agent investigation.
Per-agent identities with scoped tokens, full audit log, default dry-run.
Secret broker so agents never see raw credentials.

Security note. The Supabase MCP / Cursor incident showed what happens when an MCP server holds credentials more privileged than the user the agent represents. See incident dossier.

05 Code quality, refactors & migrations #

Large-scale codemod campaigns: framework upgrades, monorepo splits/merges, deprecated-API removal, language-version bumps.
Dead-code detection and removal with safety checks.
Test backfill for legacy modules; doc backfill (docstrings, ADRs, runbooks) from code + git history.
"Architecture-drift" reports — where the code has diverged from the documented design.

06 Developer-facing chat assistant #

Eng-bot in Slack/Teams/Discord with a secure sandbox and scoped repo, docs, and ticketing access. Use cases:
- "Who owns this service?" / "What changed in prod yesterday?" / "Find the code that handles X."
- "Summarize this thread into a ticket." / "Open a PR to fix the typo in foo.ts."
- "Why did build #1234 fail?"
Persona separation when multiple bots share infra (e.g. read-only repo-bot vs write-capable PR-bot).

07 Internal engineering knowledge #

Connect engineering knowledge surfaces (repos, ADRs, runbooks, RFC archive, design docs, eng Slack) into a retrieval layer.
Stale-docs agent. Flags docs that contradict current code or are untouched since the code they describe was rewritten.
Onboarding agent. Answers from internal docs, pings humans when uncertain, tracks where new hires struggle.

08 Cross-cutting · Platform, security, governance #

Harness & provider strategy

Matrix of available harnesses with selection criteria. See landscape.
Provider independence. Route through a gateway (LiteLLM, OpenRouter, Bedrock, Vertex, in-house) so the team isn't locked to one vendor.
Avoid paying API-token prices. Enterprise/seat plans, batch tiers, aggressive caching, self-hosted open-weights for bulk non-sensitive work.

Sandboxing & permissions

Standard agent sandbox spec: filesystem boundary, network allowlist, secret broker, time/budget limits.
Per-tool risk tiering with escalating approval flows.
Audit log everything; weekly review of what agents actually did.

Guardrails (distinct from sandboxing & hooks)

Hooks are deterministic and fire on tool actions. Guardrails are probabilistic and fire on text. Most clients conflate them; we don't. See → Patterns · guardrails.
Four layers, in order of preference:
1. Structural guidance — constrain generation to a valid format (Microsoft Guidance, OpenAI structured outputs, JSON-mode). Beats post-hoc validation.
2. Syntactic — valid JSON, parseable SQL, value-in-range, diff applies cleanly, tests pass.
3. Semantic — second LLM checks for hallucination / alignment with source.
4. Safety — bad-word lists for easy cases; LLM evaluator for nuanced (toxicity, PII, prompt-injection echoes). Apply to inputs too — the entire incident dossier is what happens otherwise.
Sellable as a single audit deliverable: "your agent's defense-in-depth map, with the gaps named."

Context engineering audit

Use the Write / Select / Compress / Isolate taxonomy (Martin) as the diagnostic frame for any agent that's "behaving badly."
Diagnose context-rot failure modes: Poisoning · Distraction · Confusion · Clash (Breunig).
Most clients have no visibility into what was in the window when the agent failed. Trace-storage + context-window dumps are a small but high-ROI install.

Security & compliance

Threat model for dev-agent deployments: prompt injection, exfiltration, tool abuse, supply-chain risk in MCP servers and skills.
Code-leak prevention: which repos can be sent to which provider; PII/secret scanning before model calls.
Eval & red-team suite for each deployed agent.
Incident risk-slide pack for every engagement, drawn from the public incident dossier — Claude Code CVEs, "Comment & Control", Supabase/Cursor leak, Claudy Day, Cursor indirect injection.

Evaluation & observability (the prerequisite, not the polish)

Eval suite is the first deliverable in nearly every engagement — before any autonomous loop ships. See → Patterns · evals.
Match metric to task. Classification → recall / precision / PRAUC. Reference-bearing → BLEU / ROUGE / BERTScore (cite, don't trust). Open-ended → LLM-as-evaluator.
LLM-as-judge bias mitigations to install from day one — clients won't think of them:
- Position bias → swap order, only count consistent wins.
- Verbosity bias → equalize response length.
- Self-enhancement bias → never judge with the same model family that generated.
- Score noise → ask for pairwise comparisons, not numeric scores.
Agent-level metrics: PR acceptance rate, human-edit distance, time-to-resolution, escalation rate, cost per task.
Trace storage (LangSmith / Langfuse / Braintrust / in-house) so bad runs are debuggable.
Regression eval sets versioned alongside prompts and skills — treat agent-behavior changes like code changes.

Feedback flywheel instrumentation

Build in week 1 of the pilot, not after. "Without instrumentation up front, the data is gone." See → Patterns · feedback flywheel.
Signals to capture per deployment:
- PR-level: merge vs discard · human edit distance · time-to-merge · reverts within 7 days.
- Review agent: comment resolved vs dismissed · human edit on agent comment.
- Plan mode: plan accepted as-is vs revised vs rejected — direct proxy for plan quality.
- Eng-bot: thread closed vs escalated · follow-up question rate.
These signals don't just measure — they become the next eval set.

Cost hygiene

Prompt-cache audit is the highest-ROI quick win on most platforms — teams routinely burn 3–10× more than they need to. Cache-hit-rate becomes a top-tier observability metric.
Batch off-peak — Anthropic Batch API and similar give ~50% off for non-real-time work (evals, codemods, doc backfill).
Seat-based plans for bursty workloads (Claude Code Max, Cursor Pro+/Ultra) — much lower TCO than raw per-token at heavy use.
Semantic caching: avoid in agent loops. False-match risk is silent and high. Safe only against item IDs or constrained inputs. See → Patterns · caching.

Enablement & curriculum

Workshops by role (engineers, tech leads, EMs, platform team).
Agent champions program: one power user per team owns local prompts/skills, feeds learnings back.
Office hours and a shared library of patterns that worked / failed.
"Jaggedness" vocabulary training — get teams to name the phenomenon so they don't over-trust benchmark performance. See → Patterns · jaggedness.
Stochastic tool mastery curriculum — Zed's coinage; directing non-deterministic systems is itself an engineering skill. Sellable as a formal program. See → Patterns · interwoven workflow.
Plan-mode coaching — teach tech leads to push back when plans get skipped.

→ How we package engagements #

1 · Assessment (2–4 weeks) — deliverables

Interviews, tool inventory, value-map of where AI plausibly moves the needle.
Three-phase mapping (Zed): which SDLC stages are still purely deterministic, which leapt to stochastic, where is the interweaving messy? That map is the pilot-priority list.
Eval-gap audit — what does the team measure today, what's missing for the proposed loops.
Defensive-UX gap audit — for each existing agent surface, score against the five principles.
Risk-review checklist — incident dossier mapped onto the client's stack; surfaces where input guardrails / tool-tiering / MCP credentials would currently fail.
Context-engineering audit — for existing agents, W/S/C/I review and context-rot diagnostic.

2 · Pilot (4–8 weeks) — ships with its supporting infra, not after

1–2 high-leverage loops (e.g. PR-review pair + bounded issue→PR for 3–5 ticket categories).
Working eval suite with labeled error categories — before the loop runs autonomously.
Minimum guardrails — input + output, per the four-layer model.
Defensive-UX layer for the agent surface (badges, dismissal, attribution, dry-run).
Feedback instrumentation from week 1 — structured logs tagged with run-id.
Plan-mode norm installed; compliance measured.

3 · Scale & enable (ongoing)

Platform work: harness gateway, sandbox spec, MCP discipline, audit log.
Governance: prompt-file review process, eval regression gates, budget caps.
Curriculum: jaggedness vocabulary, stochastic tool mastery, plan-mode coaching.
Additional loops following the same install-eval-guardrails-UX-feedback shape.

Areas of a software org we can audit

01 IDE & local developer workflow #

Harness selection & house style

Project prompt files (CLAUDE.md, AGENTS.md, .cursorrules)

Plan-mode adoption as a team norm

Skills / commands library

Sub-agent recipes

02 Agent loops in the SDLC #

Issue / bug-tracker loops

Support ↔ engineering loops

PR / code review loops

CI / test loops

Dependency & security loops

Incident / on-call loops

Release / deploy loops

Multi-agent shape (when a loop fans out)

03 Agent surface UX (Defensive UX) #

04 Production access for dev agents #

05 Code quality, refactors & migrations #

06 Developer-facing chat assistant #

07 Internal engineering knowledge #

08 Cross-cutting · Platform, security, governance #

Harness & provider strategy

Sandboxing & permissions

Guardrails (distinct from sandboxing & hooks)

Context engineering audit

Security & compliance

Evaluation & observability (the prerequisite, not the polish)

Feedback flywheel instrumentation

Cost hygiene

Enablement & curriculum

→ How we package engagements #

1 · Assessment (2–4 weeks) — deliverables

2 · Pilot (4–8 weeks) — ships with its supporting infra, not after

3 · Scale & enable (ongoing)

Project prompt files (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`)