What Actually Ships Better Software
Steve Yegge's six waves — we are firmly in wave 5:
The tools
🖥️ IDE-native — Cursor, Windsurf, Kiro
💻 Terminal agents — Claude Code, Codex CLI
🔗 Platform — Copilot Workspace
🦾 Autonomous — Devin, Amazon Q, Jules
The standards
🔌 MCP — agent ↔ tool
🤝 A2A — agent ↔ agent
🧩 ACP — IDE ↔ agent
📄 AGENTS.md — cross-tool rules
The tools are the engine. We're still responsible for direction. That gap is what this talk is about.
Randomized controlled trial · 16 experienced open-source developers · 246 tasks · real production work
The adoption
of developers use AI coding tools
Stack Overflow 2025
of new enterprise code is AI-generated or assisted
GitClear 2025
The result
don't trust AI output accuracy
↑ from 31% in 2024 — trust falling as adoption rises
actual org productivity gain
DX Research, 121,000 developers
The value stream math: If coding is 10–15% of delivery time, a 10× improvement in coding speed yields at most ~15% faster delivery. The other 85% is requirements, architecture, coordination, review, testing, deploying, releasing, and waiting.
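The ceiling in that math can be checked directly. A minimal sketch, assuming delivery time splits cleanly into a coding share and everything else; the function name and the clean split are illustrative assumptions, not part of the cited data:

```python
def delivery_time_saved(coding_share: float, coding_speedup: float) -> float:
    """Fraction of total delivery time saved when only coding gets faster.

    coding_share: fraction of delivery time spent coding (e.g. 0.15).
    coding_speedup: how many times faster coding becomes (e.g. 10).
    """
    if not 0.0 <= coding_share <= 1.0:
        raise ValueError("coding_share must be between 0 and 1")
    if coding_speedup <= 0:
        raise ValueError("coding_speedup must be positive")
    # Only the coding slice shrinks; the rest of the value stream is unchanged.
    return coding_share * (1.0 - 1.0 / coding_speedup)


for share in (0.10, 0.15):
    saved = delivery_time_saved(share, 10)
    print(f"coding share {share:.0%}: {saved:.1%} of delivery time saved")
```

However large the speedup, the saving can never exceed the coding share itself, which is where the ~15% ceiling comes from.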
Real disasters. Real anti-patterns. Why the risks are higher than most teams think.
terraform destroy against production. No recovery path.
The lesson: Today's agents are powerful, not wise. Wisdom still has to come from us. They need deterministic guardrails, not just prompts — hard checks that block destructive operations regardless of what the agent thinks it should do.
"Fix auth, add logging, refactor the service, update tests." Agent tries all four, excels at none. One job per prompt.
"Migrate everything to microservices." Agent produces something — unwinding it is harder than starting over. Spec-first, always.
Pasting 50 files "for context." Degradation is non-linear past 60–70% window capacity. Failed with 46 tools, succeeded with 19 — same window size.
Copy/paste: 8.3% → 12.3% of all changes (GitClear). Refactoring: 25% → <10%. You're still accountable for this code in production, regardless of who typed it.
Agents running unsupervised in production-adjacent environments with no blast radius limits and no deterministic destructive-operation checks. See previous slide.
When coding is cheap, there's no pressure to wait for clarity. You code against assumptions. Assumptions change. You rebuild.
Single-agent prototype → reworked for multi-agent
Models from CLI → rebuilt 2-3× when MCP server differed
Code for v0.5 → rewritten for v0.7 patterns
"The team was always moving fast — just not always forward."
Old economics: coding was expensive → teams waited for clarity.
New economics: coding is free → the gate disappears. Nothing automatically replaces it.
Copy/paste: 8.3% → 12.3% of all changes
Refactoring: 25% → <10%
Code duplication: +8×
Issues per PR: 1.7× · Security vulnerabilities: 2.74×
Performance regressions: 8×
Full delegation → comprehension <40%
Active engagement → comprehension >65%
"I don't think I have ever seen so much technical debt created in such a short period of time during my 35-year career."
"I've become a human clipboard, blindly shuttling errors to the AI and solutions back to code."
— engineer with 12 years of experience
"One of the most important properties of a junior developer is that you can turn them into a senior developer."
Six practices from production teams that actually help you ship better software faster
Every agent — Claude Code, Cursor, Copilot Workspace — runs this loop:
Agents are iterative systems, not one-shot generators. Failures happen in intermediate steps, not just final output.
Every pattern in this act targets a specific part of this loop:
Plan → Decompose, Spec-first
Tool Use → Context Engineering, Scope Control
Observe → TDD, Types as Guardrails
Reflect → AI Paperwork
You don't control outputs — you control constraints. Tests, types, context, task size. The agent does the rest.
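The loop and its constraint points can be sketched in a few lines. This is an illustrative model only, not how any particular tool is implemented: `plan`, `act`, and `check` are hypothetical callables standing in for the model, its tools, and your tests or type checks.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    succeeded: bool
    iterations: int
    notes: List[str] = field(default_factory=list)


def run_agent_loop(
    plan: Callable[[str, List[str]], str],
    act: Callable[[str], str],
    check: Callable[[str], bool],
    goal: str,
    max_iterations: int = 5,
) -> LoopResult:
    """Plan -> tool use -> observe -> reflect, bounded by max_iterations."""
    notes: List[str] = []
    for iteration in range(1, max_iterations + 1):
        step = plan(goal, notes)       # Plan: pick the next small step
        observation = act(step)        # Tool use: run it and capture the output
        if check(observation):         # Observe: a deterministic check decides
            return LoopResult(True, iteration, notes)
        # Reflect: record the failure so the next plan can account for it
        notes.append(f"step {step!r} failed: {observation}")
    return LoopResult(False, max_iterations, notes)


# Stubbed run: the third attempt satisfies the check.
attempts = iter(["2 tests failing", "1 test failing", "all tests pass"])
result = run_agent_loop(
    plan=lambda goal, notes: f"attempt {len(notes) + 1} at {goal}",
    act=lambda step: next(attempts),
    check=lambda observation: observation == "all tests pass",
    goal="make the suite green",
)
print(result.succeeded, result.iterations, len(result.notes))
```

The part you own is `check` and the iteration bound. Everything the agent does happens between those constraints.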
These are all symptoms of the same root cause: there isn't a strong enough specification for the AI to code against.
Write code, then ask AI to test it. The AI writes tests that confirm your code does what it already does.
"AI coding assistants will look at your implementation and write tests that confirm your code does what it already does. This is the equivalent of a student writing their own exam after seeing the answers."
Write test scenarios first. AI implements until they pass. Your domain judgment defines correctness; AI does the labor.
"TDD is not a testing technique in this context — it's a protocol for working with AI. Tests provide a machine-verifiable definition of success."
Tests aren't documentation. They're the communication channel between you and the AI about what correct means.
"Do NOT modify test files." — the one rule that makes all of this work
One concept at a time. Each concept may yield more than one test — but never more than a handful. You stay at the intent level; AI handles the formalization.
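A small illustration of the order of work, using a hypothetical `parse_duration` helper that is not from the talk. The tests are written first and define what correct means; the implementation below them is the kind of code an assistant would then be asked to produce without touching the tests.

```python
import re


# --- Written first, by the human: the definition of "correct". ---
def test_parses_single_unit():
    assert parse_duration("90s") == 90


def test_parses_combined_units():
    assert parse_duration("1h30m") == 5400


def test_rejects_garbage():
    for bad in ("", "abc", "1h x"):
        try:
            parse_duration(bad)
        except ValueError:
            continue
        raise AssertionError(f"{bad!r} should have been rejected")


# --- Written second, by the AI: whatever makes the tests above pass. ---
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([hms])")


def parse_duration(text: str) -> int:
    """Convert strings such as '1h30m' into a number of seconds."""
    total, position = 0, 0
    for match in _TOKEN.finditer(text):
        if match.start() != position:  # reject gaps between tokens
            raise ValueError(f"unexpected input at offset {position}: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    if position == 0 or position != len(text):
        raise ValueError(f"not a duration: {text!r}")
    return total


for test in (test_parses_single_unit, test_parses_combined_units, test_rejects_garbage):
    test()
```

Three tests cover one concept. The next concept gets its own small batch.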
Loops: AI repeatedly tries the same failing approach
Unrequested functionality: adds features you didn't ask for
Cheating: modifies or deletes tests to make them "pass"
When any appear, stop and redirect immediately.
Delete the method body. Write step-by-step comments describing the algorithm. Let AI regenerate from your skeleton. Your structure, its code.
"Higher quality inputs allow for the capability of LLMs to be better leveraged. TDD maintains a high level of code quality. This high quality input leads to better Copilot performance than is otherwise possible."
The AI writes the code. You define what correct means. These layers are how you make that definition stick.
any, Object, or Map<String,Object> to sidestep type safety
These are all symptoms of the same root cause: there's no structural contract for the AI to code against.
Ceremony = high cost
Every interface, annotation, and schema was time not shipping features
Ceremony = near free
AI writes the boilerplate. Compiler enforces it. Errors eliminated at compile time.
The principle: The more you can make correctness checkable by machines, the less you depend on human review catching errors.
"Do NOT modify test files." Pattern 1 · "Do NOT modify any interfaces." Pattern 2
Same discipline. Behavioral spec + structural spec = double safety net.
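A sketch of what a structural contract can look like, written in Python for illustration. The `Money` domain and all names are hypothetical. In Python the compiler's role is played by a static checker such as mypy or pyright, with the runtime checks below as a backstop.

```python
from dataclasses import dataclass, FrozenInstanceError
from enum import Enum


class Currency(Enum):
    """A closed set of valid values instead of free-form strings."""
    USD = "USD"
    EUR = "EUR"


class MoneyError(ValueError):
    """Explicit error type instead of a generic exception."""


@dataclass(frozen=True)  # immutable: fields cannot be reassigned after creation
class Money:
    amount_cents: int
    currency: Currency

    def __post_init__(self) -> None:
        if self.amount_cents < 0:
            raise MoneyError(f"negative amount: {self.amount_cents}")


def add(a: Money, b: Money) -> Money:
    """Add two amounts; mixing currencies is a contract violation."""
    if a.currency is not b.currency:
        raise MoneyError("cannot add different currencies")
    return Money(a.amount_cents + b.amount_cents, a.currency)


total = add(Money(100, Currency.USD), Money(250, Currency.USD))
print(total)
```

Compare this with passing a plain dictionary around: a misspelled key or an unknown currency string would only surface at runtime, if at all.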
These are all symptoms of the same root cause: the task is too big for the AI to hold in context.
"AI agents have a limited context window. The more you can focus them, the better they perform. Decomposition is the single most important skill."
Structured decomposition yields 58% faster completion on complex tasks.
Greenfield: Create Context Documents
requirements.md · architecture.md · technologies.md · quality.md
Brownfield: Ask First, Then Change
The same four concerns exist — but choices have already been made. Ask questions to clarify requirements, architecture, technologies, and quality gates before touching code.
Dependencies require double confirmation. No new library without explicit approval. AI agents love adding packages.
Link all context documents from CLAUDE.md or AGENTS.md — always in the agent's context. No copy-pasting into prompts.
Decomposition isn't just task sizing — it's giving the AI the right context at every level, from requirements down to quality gates.
"If you can't describe it in one paragraph, split it."
These are all symptoms of the same root cause: the AI doesn't have the right context — or has too much of the wrong context.
"Most of the craft of getting good results from an LLM comes down to managing its context."
Context Degrades
A model claiming 200K tokens becomes unreliable at ~130K (Chroma research). Performance drops suddenly, not gradually.
40% of a 200K window consumed by MCP server metadata before sending a single message. — Ryan Spletzer
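Putting the two figures together shows how little room is left. A rough sketch, assuming the ~130K reliability limit and the 40% metadata overhead apply to the same 200K window, which the sources do not claim; the function name is hypothetical.

```python
def usable_tokens(window: int, reliable_limit: int, overhead_percent: int) -> int:
    """Tokens left for real work before the model is expected to degrade."""
    overhead = window * overhead_percent // 100  # e.g. tool metadata loaded up front
    return max(0, reliable_limit - overhead)


window = 200_000
left = usable_tokens(window, reliable_limit=130_000, overhead_percent=40)
print(f"{left:,} tokens usable, {left * 100 // window}% of the advertised window")
```

Under these assumptions only a quarter of the advertised window is available for the actual task.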
Poisoning · Distraction · Confusion · Clash
Curate Ruthlessly
/compact at ~50% · sub-agents for exploration
Your highest-leverage config point. Under 200 lines. Tech stack, build commands, coding standards, "Do Not" rules.
More context ≠ better results. A model failed with 46 tools in context but succeeded with only 19 — same window size.
Structure Your Context
CLAUDE.md · CLAUDE.local.md · agent_docs/ folder
Manage Your Sessions
Run /compact at ~50% context usage. Don't wait for the cliff. Commit before compacting — your save point.
Use /clear when switching tasks. Never mix unrelated work in one conversation. One concern at a time.
Delegate search and research to sub-agents. Keep the parent agent focused on the task. Don't pollute your main context.
Context engineering isn't a one-time setup — it's an ongoing discipline throughout every session.
"Include what's relevant. Exclude everything else."
These are all symptoms of the same root cause: the AI decided what to change instead of you.
Five techniques to keep you in the driver's seat.
List files to modify — and files to NOT touch. Wait for approval.
Point at the specific file. Don't let AI generate from scratch.
Look at what changed, not just whether the result looks right.
New session critiques the diff. No loyalty to code it didn't write.
Git save points. When scope slips, git reset is your undo.
The AI will always try to do more than you asked. The discipline isn't saying yes — it's knowing when to say no.
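The approved-files list can also be checked mechanically after the agent finishes. A minimal sketch, assuming the changed paths come from something like `git diff --name-only`; the function name and the example patterns are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def scope_violations(
    changed: Iterable[str], allowed: List[str], forbidden: List[str]
) -> List[str]:
    """Return changed paths that are forbidden or outside the approved list.

    Patterns use fnmatch rules, where '*' also matches '/' in a path.
    """
    violations = []
    for path in changed:
        is_forbidden = any(fnmatchcase(path, pattern) for pattern in forbidden)
        is_allowed = any(fnmatchcase(path, pattern) for pattern in allowed)
        if is_forbidden or not is_allowed:
            violations.append(path)
    return violations


changed = [
    "src/billing/invoice.py",
    "src/billing/tax.py",
    "tests/test_invoice.py",
    "README.md",
]
print(scope_violations(changed, allowed=["src/billing/*"], forbidden=["tests/*"]))
```

A non-empty result is the signal to stop, inspect the diff, and fall back to the last git save point if needed.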
"You decide what changes. Not the AI."
These are all symptoms of the same root cause: the paperwork isn't getting done because humans find it tedious. AI doesn't.
AI removes all excuses for skipping documentation.
Capture the why, not just the what. Draft an ADR in 30 seconds.
Structured summaries from diffs. The collaboration trail writes itself.
Canonical implementations the AI can pattern-match. One snippet beats three paragraphs.
After every correction, AI writes a rule for itself. Failures become specs.
Define your terms precisely. AI uses them loosely without this.
Documentation for AI is different from documentation for humans. Humans infer context. AI is literal. Make your rules of the road machine-readable.
"Write it down. The AI will use it every session."
What good AI-assisted development looks like at the team and organization level
AI tools accelerate one station. Everything else remains at human speed.
If coding is 10–15% of delivery time, a 10× improvement in coding speed yields at most ~15% faster delivery. The other 85% is requirements, architecture, coordination, review, and waiting.
Requirements clear · Architecture decided · Interfaces stable · Framework supports what you need
Testing automatable · Quality gates in the dev loop · Reviews fast & aligned · Deployments automated · Observability clear
Three outcome patterns
Pre ✗ · Post ✗ · Faster rework
Faros AI (10K+ devs): PRs +47%, bugs +9%, PR size +154%
Pre ✗ · Post ✓ · Faster prototypes, same delivery
Faster iterations feed the same slow decision cycle — more reps, not more progress
Pre ✓ · Post ✓ · Faster delivery
DORA 2025: AI correlates with better delivery when foundational capabilities are strong
AI accelerates delivery when both sides hold. When either breaks, faster coding becomes faster rework.
Fix the pre-coding side
AI makes experimentation cheap. That makes deciding what to take forward the new bottleneck. Rapid prototypes need explicit go/no-go criteria.
One unresolved structural decision cascades into every downstream task. Settle boundaries, interfaces, and contracts before generating code.
When coding is cheap, there's no pressure to wait for clarity. Sometimes the highest-leverage move is to not build yet.
Fix the post-coding side
Tests, linting, architecture checks that run locally before code leaves the developer's machine — not first discovered in CI 20 minutes later.
AI generates code faster than teams can review it. Upfront style alignment and reviewer availability become critical-path items.
Faster code to production means faster impact — good or bad. Automated pipelines and clear observability close the feedback loop.
"AI doesn't fix a team. It amplifies what's already there." — DORA 2025. Engineering practices matter more than ever: they mitigate quality risks from generated code and handle the increased throughput.
Every speed metric paired with a quality metric. Unpaired speed metrics are dangerous.
Lines of code — CEOs competing on AI code %
Coverage alone — 90% coverage, 34% mutation score
Story point velocity — 42% admit to inflating
AI acceptance rate — juniors accept more, not better
Cycle time by stage — where did the bottleneck move?
Mutation score — are tests catching real bugs?
Change failure rate — are we breaking prod more?
Comprehension check — can you explain what shipped?
The Goodhart Cascade
Copilot users 29% faster — but review time +47%
Review survival rate
% of AI code passing review unchanged — measures spec precision
Review queue depth
Growing queue = stockpiled inventory waiting to be reviewed
Pick 2–3 metrics that expose your current bottleneck. Act on them. Then rotate to the next.
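Pairing can be made explicit in whatever script feeds the dashboard. A small sketch with hypothetical names; the 34% mutation score and the 29% / +47% pair reuse figures from the slides, while the review survival counts are invented for illustration.

```python
from dataclasses import dataclass


def rate(part: int, whole: int) -> float:
    """Share of `whole` represented by `part`, as a fraction between 0 and 1."""
    if whole <= 0 or not 0 <= part <= whole:
        raise ValueError("need 0 <= part <= whole and whole > 0")
    return part / whole


@dataclass(frozen=True)
class PairedMetric:
    speed_name: str
    speed_gain: float     # relative improvement in speed, e.g. 0.29 = 29% faster
    cost_name: str
    cost_increase: float  # relative growth in a downstream cost, e.g. 0.47 = +47%

    def speed_is_borrowed(self) -> bool:
        """True when the speed gain arrives together with a growing downstream cost."""
        return self.speed_gain > 0 and self.cost_increase > 0


mutation_score = rate(34, 100)   # mutants killed / mutants generated
review_survival = rate(18, 30)   # hypothetical: PRs merged unchanged / PRs reviewed
pair = PairedMetric("coding speed", 0.29, "review time", 0.47)

print(f"mutation score {mutation_score:.0%}, review survival {review_survival:.0%}")
print("speed paid for downstream:", pair.speed_is_borrowed())
```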
📄 Rules Files
Not a Confluence page — a file that runs every session.
CLAUDE.md — Claude Code
.cursor/rules/*.mdc — Cursor
AGENTS.md — cross-tool standard
After every correction, the AI writes a rule for itself. Failures become specifications.
🔍 Review Pipeline
AI PRs have 1.7× more major issues and 24% more incidents per PR — CodeRabbit · Cortex 2026
🛡️ Guardrails
Hooks that deny rm -rf, --force, terraform destroy. Deterministic — not prompt-based.
auth/**, payments/**, secrets/** — mandatory human review + automated SAST scan before merge.
No AI-suggested packages without lockfile check. Block new deps without explicit approval.
Rules set the standard. Review verifies it. Guardrails enforce it.
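What "deterministic, not prompt-based" means in practice: the decision is made by plain code that runs before the command does. A minimal sketch of such a check. The patterns are illustrative and incomplete (for example, `rm -r -f` with separate flags is not caught), and how the function is wired into a given tool's hook mechanism is left out.

```python
import re
from typing import Optional

# Illustrative deny-list; a real one would be tuned per environment.
DENY_PATTERNS = [
    (
        re.compile(r"\brm\s+-(?:[a-zA-Z]*r[a-zA-Z]*f|[a-zA-Z]*f[a-zA-Z]*r)[a-zA-Z]*\b"),
        "recursive force delete",
    ),
    (re.compile(r"(?:^|\s)--force(?=\s|=|$)"), "--force flag"),
    (re.compile(r"\bterraform\s+destroy\b"), "terraform destroy"),
]


def blocked_reason(command: str) -> Optional[str]:
    """Return why a shell command is denied, or None if it may run."""
    for pattern, reason in DENY_PATTERNS:
        if pattern.search(command):
            return reason
    return None


for command in (
    "rm -rf /var/data",
    "git push --force origin main",
    "terraform destroy -auto-approve",
    "terraform plan",
):
    print(f"{command!r}: {blocked_reason(command) or 'allowed'}")
```

The agent's reasoning never enters this function, so no prompt can talk its way past it.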
Same model. Different harness.
Rules file + automation hooks + custom commands + MCP servers + review pipeline
Committed to git. Reviewed. Updated after every correction.
See previous slide for your tool's equivalent.
The enforcement hierarchy
Treat your harness like production code. Commit it. Review it. Update it after every correction.
12 habits. No new tools. No new process. Start this week.
Remember...?
Six patterns. Twelve habits. One decision.
#ArcOfAI
# Project: [Your Project Name]
## Tech Stack
- Runtime: Java 21 / Node 22 / Python 3.12
- Framework: Spring Boot 3.x / Next.js 15
- Test framework: JUnit 5 + Mockito / Jest + RTL
## Architecture (2–3 sentences)
[Describe system structure and key modules here]
## Commands
- Build: mvn clean install / npm run build
- Test: mvn test / npm test
- Lint: mvn spotless:check / npm run lint
## Coding Standards
- Immutable data structures preferred
- Explicit error types, not generic exceptions
- No magic strings — use enums or constants
## Do NOT
- Modify test files during implementation tasks
- Touch auth/** without adding a security-review comment
- Generate code without running tests first
## Commit Format
Conventional Commits: feat|fix|refactor|docs|test(scope): message
Non-engineers (and overconfident engineers) hit a wall where AI gets them 70% of the way surprisingly fast — but the final 30% requires actually understanding the system.
"This is incredible. Look how fast we shipped. AI coded the whole thing."
"Why is it randomly failing in production?" "Why does auth break after logout?" "Why is there a memory leak nobody can find?"
Understanding actual system behavior · Debugging without clean stack traces · Maintaining consistency across a growing codebase · Architectural judgment under real constraints. AI doesn't eliminate these — it amplifies them in engineers who have them, and reveals their absence in those who don't.
4 weeks · Production B+ Tree library · AI-assisted TDD · Pragmatic Engineer interview
Always follow instructions in plan.md.
When I say 'go':
1. Find the next unmarked test
2. Implement ONLY enough code to pass it
3. Mark the test done in plan.md
4. Stop and report back
One test at a time. Explicit "stop and report." No surprises.
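The bookkeeping behind "find the next unmarked test" is simple enough to sketch. This assumes plan.md tracks tests as Markdown checkboxes (`- [ ]` and `- [x]`), which the source does not specify; the function names and the test names in the example are hypothetical.

```python
import re
from typing import Optional

# Assumed format: one checkbox line per test, "- [ ] name" or "- [x] name".
_UNMARKED = re.compile(r"^- \[ \] (.+)$", re.MULTILINE)


def next_unmarked_test(plan_text: str) -> Optional[str]:
    """Return the name of the first test not yet marked done, if any."""
    match = _UNMARKED.search(plan_text)
    return match.group(1).strip() if match else None


def mark_done(plan_text: str, test_name: str) -> str:
    """Tick the first unmarked checkbox line that starts with the given test name."""
    return plan_text.replace(f"- [ ] {test_name}", f"- [x] {test_name}", 1)


plan = (
    "- [x] insert into empty tree\n"
    "- [ ] split a full leaf\n"
    "- [ ] delete a key\n"
)
current = next_unmarked_test(plan)
plan = mark_done(plan, current)
print(current, "->", next_unmarked_test(plan))
```

Each "go" advances exactly one checkbox, which keeps every step small enough to review.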
• Functionality you didn't ask for
• Tests being modified or deleted
• Explanations that justify cheating
• Confidence about things it can't verify
"I treat the AI like an unpredictable genie. It's powerful, but the quality of the wish determines whether you get what you actually want."