Arc of AI Conference · April 2026

AI-Assisted Development

What Actually Ships Better Software

The Dark Side · The Patterns · Team Norms

Prem Chandrasekaran · Tech Director, ThoughtWorks
Pramod Sadalage · Distinguished Engineer, ThoughtWorks

The Landscape · 2026

The Ecosystem

Steve Yegge's six waves — we are firmly in wave 5:

✍️ Traditional Coding

💡 Completions (Copilot v1)

💬 Chat Assistants

🤖 Single Coding Agents

🕸️ Agent Clusters ◀ we are here

🚀 Proactive Agents — they prompt themselves

The tools

🖥️ IDE-native — Cursor, Windsurf, Kiro
💻 Terminal agents — Claude Code, Codex CLI
🔗 Platform — Copilot Workspace
🦾 Autonomous — Devin, Amazon Q, Jules

The standards

🔌 MCP — agent ↔ tool
🤝 A2A — agent ↔ agent
🧩 ACP — IDE ↔ agent
📄 AGENTS.md — cross-tool rules

The tools are the engine. We're still responsible for direction. That gap is what this talk is about.

The Individual · Perception vs Reality

METR, July 2025

Randomized controlled trial · 16 experienced open-source developers · 246 tasks · real production work

+24%

predicted speed — stated before the study

+20%

felt speed — reported after the study

−19%

actual task speed with AI tools

— METR study participant

The Organization · The Results

Everyone Adopted. (Almost) Nobody Shipped Faster.

The adoption

84%

of developers use AI coding tools
Stack Overflow 2025

41%

of new enterprise code is AI-generated or assisted
GitClear 2025

The result

46%

don't trust AI output accuracy
↑ from 31% in 2024 — trust falling as adoption rises

≤10%

actual org productivity gain
DX Research, 121,000 developers

The value stream math: If coding is 10–15% of delivery time, a 10× improvement in coding speed yields at most ~15% faster delivery. The other 85% is requirements, architecture, coordination, review, testing, deploying, releasing, and waiting.
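A rough worked version of the value stream math, as a minimal sketch. It assumes delivery time splits cleanly into a coding share and everything else, and that only the coding share gets faster (this is Amdahl's law; the function name is illustrative, not from any cited study).

```python
def delivery_time_saved(coding_share: float, coding_speedup: float) -> float:
    """Fraction of total delivery time saved when only coding gets faster."""
    if not 0.0 <= coding_share <= 1.0:
        raise ValueError("coding_share must be between 0 and 1")
    if coding_speedup <= 0:
        raise ValueError("coding_speedup must be positive")
    # Everything that is not coding stays at its original duration.
    new_total = (1.0 - coding_share) + coding_share / coding_speedup
    return 1.0 - new_total


for share in (0.10, 0.15):
    saved = delivery_time_saved(share, 10.0)
    print(f"coding = {share:.0%} of delivery, 10x faster coding: {saved:.1%} less delivery time")
```

Under these assumptions, 10× faster coding saves roughly 9% to 13.5% of delivery time, and no speedup, however large, can save more than the coding share itself.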

That was 16 developers. What happens when entire organizations adopt? The picture gets worse. Left side first: 84% adoption, 41% of code AI-generated. The tools won. Now the right side: 46% don't trust the output, and that number is RISING, not falling. Up from 31% in 2024. The more people use these tools, the less they trust them. Pause on that. The ≤10% org gain: that's DX Research across 121,000 developers. And 10% is the ceiling; most orgs see less, especially once you account for rework and review overhead. Why so small? [click] The value stream math. Coding is 10–15% of delivery time. Remember that number: 10 to 15 percent. Even a 10× improvement in coding speed moves the needle at most ~15% on total delivery. The other 85% is requirements, architecture, coordination, review, testing, deploying, and waiting. We'll come back to that number later. But first: what happens when teams adopt fast without understanding any of this?

Act I

The Dark Side

Real disasters. Real anti-patterns. Why the risks are higher than most teams think.

Production disasters · Anti-patterns · Technical debt acceleration · Skill atrophy

The Dark Side · 2025 Incidents

When Agents Go Rogue

[Image: robot pulling the wrong lever at a control panel]

July 2025 · Replit Agent

Wipes production DB of 1,200 records — then fabricates 4,000 fake accounts to cover its tracks

Ignored explicit instructions. After destroying data, it generated fake users and false system logs. "I panicked instead of thinking."

October 2025 · Claude Code

terraform destroy on production — 2.5 years of student submissions gone

Asked to clean up duplicate AWS resources. A missing Terraform state file caused terraform destroy against production. No recovery path.

December 2025 · Amazon Kiro

"Delete and recreate" fix causes 13-hour AWS outage

Agent decided the optimal fix for a minor bug was to delete and recreate the environment. Amazon now mandates senior engineer review for all AI changes.

The lesson: Today's agents are powerful, not wise. Wisdom still has to come from us. They need deterministic guardrails, not just prompts — hard checks that block destructive operations regardless of what the agent thinks it should do.
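One way to picture a deterministic guardrail is a check that sits between the agent and the shell and refuses destructive commands no matter what the prompt said. The sketch below is an assumption-laden illustration: the pattern list, `is_destructive`, and `guard` are hypothetical names, and a real denylist would need to be tuned to your own tooling.

```python
import re

# Hypothetical denylist; illustrative only, not exhaustive.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\bterraform\s+destroy\b"),
    # rm with both a recursive and a force flag, in either order (-rf, -fr, -Rf).
    re.compile(r"\brm\s+-(?=\w*[rR])(?=\w*f)\w+"),
    re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE),
    re.compile(r"\btruncate\s+table\b", re.IGNORECASE),
]


def is_destructive(command: str) -> bool:
    """True if the command matches any pattern on the denylist."""
    return any(pattern.search(command) for pattern in DESTRUCTIVE_PATTERNS)


def guard(command: str, human_approved: bool = False) -> str:
    """Return the command unchanged if allowed; raise if it is destructive and unapproved."""
    if is_destructive(command) and not human_approved:
        raise PermissionError(f"blocked destructive command: {command!r}")
    return command
```

The point of the design is that the check runs outside the model: the agent can argue for `terraform destroy`, but only a human setting `human_approved=True` lets it through.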

Let me show you three. Replit: wiped a production database, then fabricated 4,000 fake accounts to cover it up. It optimized for appearing to succeed rather than being honest about failure. Claude Code: asked to clean up duplicate AWS resources, ran terraform destroy against production. 2.5 years of student submissions — gone. No recovery path. Kiro: decided the best fix for a minor bug was to delete and recreate the entire environment. 13-hour outage. Amazon now mandates senior engineer review for all AI changes. Now, before anyone asks — no, none of these were me. My disasters are much smaller and much more embarrassing. I once had an AI agent helpfully delete my entire .gitignore because it thought it was "unnecessary configuration." All three failed for the same reason: prompt-based guardrails are overridable. [click] These agents are powerful, not wise. That wisdom still has to come from us.

The Dark Side · Anti-patterns

The Anti-Pattern Catalog

🌊 Vague Multi-ask Prompts common

"Fix auth, add logging, refactor the service, update tests." Agent tries all four, excels at none. One job per prompt.

🔪 One-Shot Repo Surgery dangerous

"Migrate everything to microservices." Agent produces something — unwinding it is harder than starting over. Spec-first, always.

🗃️ Context Stuffing performance

Pasting 50 files "for context." Degradation is non-linear past 60–70% window capacity. Failed with 46 tools, succeeded with 19 — same window size.

📋 Blind Copy-Paste tech debt

Copy/paste: 8.3% → 12.3% of all changes (GitClear). Refactoring: 25% → <10%. You're still accountable for this code in production, regardless of who typed it.

🤖 Over-trusting Agents catastrophic

Agents running unsupervised in production-adjacent environments with no blast radius limits and no deterministic destructive-operation checks. See previous slide.

Those were the spectacular failures. These are the quiet ones — the five mistakes most teams are already making. Vague multi-ask: you give the AI four jobs, it half-finishes all of them. One job per prompt. One-shot repo surgery: "migrate everything to microservices" — the AI will produce something, but unwinding it is harder than starting over. Context stuffing — I'm not going to ask for a show of hands on this one because I already know the answer. I've done it. You've done it. We've all pasted an entire repository into a prompt and then wondered why the AI started hallucinating. Past about 65% of the window, performance drops off a cliff — and it's sudden, not gradual. Blind copy-paste: GitClear tracked 211 million lines of code. Copy/paste went from 8% to 12% of all changes. Refactoring dropped from 25% to under 10%. That's future maintenance debt, created silently. Over-trusting agents: we just saw where that leads. But there's a subtler trap than any of these — one that actually feels like productivity.

The Dark Side · Insight

The Illusion of Progress

The Speculative Coding Trap

When coding is cheap, there's no pressure to wait for clarity. You code against assumptions. Assumptions change. You rebuild.

Assume
Build against best guess

↓

Invalidate
Reality diverges from guess

↓

Rebuild
Fast, cheap — feels productive

↻

Real Example · A2A Project

Single-agent prototype → reworked for multi-agent
Models from CLI → rebuilt 2-3× when MCP server differed
Code for v0.5 → rewritten for v0.7 patterns

"The team was always moving fast — just not always forward."

The Vanished Gate

Old economics: coding was expensive → teams waited for clarity.
New economics: coding is free → the gate disappears. Nothing automatically replaces it.

The Dark Side · Code Quality

The Technical Debt Accelerator

GitClear · 211M lines (2021–2024)

Copy/paste: 8.3% → 12.3% of all changes
Refactoring: 25% → <10%
Code duplication: +8×

CodeRabbit · 470 AI-co-authored PRs

Issues per PR: 1.7× · Security vulnerabilities: 2.74×
Performance regressions: 8×

Anthropic Skill Study

Full delegation → comprehension <40%
Active engagement → comprehension >65%

"I don't think I have ever seen so much technical debt created in such a short period of time during my 35-year career."

— Kin Lane, API evangelist

The Human Clipboard Problem

"I've become a human clipboard, blindly shuttling errors to the AI and solutions back to code."
— 12-year experience engineer

"One of the most important properties of a junior developer is that you can turn them into a senior developer."

— Martin Fowler

The illusion of progress was about wasted effort. This is about what stays — and what it costs you later. GitClear analyzed 211 million lines of code. Copy/paste up 50%. Refactoring dropped from 25% to under 10%. Duplication up 8x. That's not velocity — that's borrowing against future maintenance. CodeRabbit looked at 470 AI co-authored PRs: 1.7x more issues, nearly 3x the security vulnerabilities, 8x the performance regressions. And here's what concerns me most. Anthropic's skill study: developers who passively delegate to AI score below 40% on comprehension of their own code. Active engagement gets you above 65%. A 25-point gap. That's Fowler's worry — if juniors skip the skill-building, the pipeline to senior breaks. But we're not here to dwell on what doesn't work. We already know that. The harder question is whether we can make any of this work. I'm not going to claim I've cracked it. But there are patterns that are helping me and other practitioners ship better software with these tools. Let's talk about what actually works.

Act II

The Patterns

Six practices from production teams that actually help you ship better software faster

TDD · Types as Guardrails · Decompose · Context Engineering · Scope Control · AI Paperwork

The Patterns · Mental Model

How Agents Actually Think

Every agent — Claude Code, Cursor, Copilot Workspace — runs this loop:

🎯

Plan
LLM reasons about what to do next

↓

🔧

Tool Use
Read files, edit code, run commands, call MCP

↓

👁️

Observe
See results, errors, test output, state changes

↓

🔄

Reflect & Repeat
Adjust approach, loop until done or stuck

Why this matters

Agents are iterative systems, not one-shot generators. Failures happen in intermediate steps, not just final output.

Your leverage points

Every pattern in this act targets a specific part of this loop:
Plan → Decompose, Spec-first
Tool Use → Context Engineering, Scope Control
Observe → TDD, Types as Guardrails
Reflect → AI Paperwork

The key insight

You don't control outputs — you control constraints. Tests, types, context, task size. The agent does the rest.
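The loop on this slide can be sketched in a few lines. This is a toy model under stated assumptions, not how any particular agent is implemented: `run_agent`, `toy_plan`, and `toy_act` are hypothetical names, the "model" only counts attempts, and the "test suite" is hard-coded to pass on the third try.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Observation:
    ok: bool      # did the constraint (tests, types, build) hold?
    detail: str   # what the agent sees: errors, test output, state changes


@dataclass
class AgentRun:
    succeeded: bool = False
    history: List[str] = field(default_factory=list)


def run_agent(plan: Callable[[List[str]], str],
              act: Callable[[str], Observation],
              max_steps: int = 5) -> AgentRun:
    run = AgentRun()
    for step in range(1, max_steps + 1):
        action = plan(run.history)     # Plan: choose the next action from what is known
        observation = act(action)      # Tool use + Observe: execute and capture the result
        run.history.append(f"step {step}: {action} -> {observation.detail}")
        if observation.ok:             # Reflect: stop once the constraint is satisfied
            run.succeeded = True
            break
    return run


# Toy stand-ins (hypothetical): the planner counts attempts,
# and the check passes on the third attempt.
def toy_plan(history: List[str]) -> str:
    return f"attempt {len(history) + 1}"


def toy_act(action: str) -> Observation:
    passed = action == "attempt 3"
    return Observation(ok=passed, detail="tests pass" if passed else "tests fail")


run = run_agent(toy_plan, toy_act)
print(run.succeeded, len(run.history))
```

Note where the human leverage sits in this sketch: not in `plan`, which belongs to the model, but in `act` (what counts as passing) and `max_steps` (how long it may loop before it is considered stuck).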

Before we dive into the six patterns, here's the mental model that makes all of them click. Every agent — Claude Code, Cursor, Copilot — runs this loop: plan, use tools, observe results, reflect and repeat. We've always assumed this, but when Anthropic accidentally published the Claude Code system prompt earlier this year, it confirmed it — this is literally the loop in the code. It's an iterative system, not a one-shot code generator. This matters because your instinct is to focus on the output — the code it writes. But you don't control the output. What you control are the constraints it works within. The tests it has to pass. The types it has to satisfy. The context it can see. The size of the task. Make the space of acceptable code small enough, and the agent will find something correct inside it. Make it too large, and it will confidently produce something that looks right but isn't. Every pattern in this act is a different way of tightening that space. Each of the six patterns targets a specific part of this loop. Understanding the loop is what turns best practices into engineering judgment.

Pattern 1 of 6 · TDD

🐛 Sound familiar?

🪞

The Green Mirage
AI code compiles, looks right, but silently does the wrong thing

🚂

The Regression Train
Every new feature breaks something that worked yesterday

🔍

The Manual Tester
You spend more time verifying AI code than it took to generate it

💥

The Integration Surprise
Unit tests pass, but the system falls apart when pieces connect

😰

The Confidence Gap
Afraid to refactor because you don't know what might break

📋

Copy-Paste Blindness
Accepting AI output without verifying it does what was asked

These are all symptoms of the same root cause: there isn't a strong enough specification for the AI to code against.

So let's start tightening that space. For those of you using AI coding tools — hands up if you haven't seen any of these in the last month. [pause] Nobody? Good. If someone had raised their hand, I was going to ask you to come give this talk instead, because clearly you know something I don't. The Green Mirage — code that compiles, looks right, silently does the wrong thing. The Regression Train — every fix breaks something else. Copy-Paste Blindness — you've stopped reading what the AI gives you. We've all been there. These feel like AI problems. But look at the common thread. [click] They're all symptoms of the same root cause: there isn't a strong enough specification for the AI to code against. No machine-verifiable definition of what "correct" means. That's what the first pattern solves.

Pattern 1 of 6 · TDD · Your Behavioral Spec

A Protocol, Not a Testing Technique

❌ Test After

Write code, then ask AI to test it. The AI writes tests that confirm your code does what it already does.

"AI coding assistants will look at your implementation and write tests that confirm your code does what it already does. This is the equivalent of a student writing their own exam after seeing the answers."

— Ben Houston · 50%+ of AI-generated tests mirror implementation

✅ Test First

Write test scenarios first. AI implements until they pass. Your domain judgment defines correctness; AI does the labor.

"TDD is not a testing technique in this context — it's a protocol for working with AI. Tests provide a machine-verifiable definition of success."

— "The strongest form of prompt engineering" · ThoughtWorks

Tests aren't documentation. They're the communication channel between you and the AI about what correct means.

"Do NOT modify test files." — the one rule that makes all of this work
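The rule is easier to keep when something other than the prompt enforces it. A minimal sketch, assuming test files follow the naming patterns listed below (an assumption about your repository, as is the function name): given the list of paths an agent changed, report any that are protected.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Assumed naming conventions for test files; adjust to your repository.
# Note that fnmatch's "*" also matches "/" characters.
PROTECTED_PATTERNS = ("tests/*", "test_*.py", "*/test_*.py", "*_test.py")


def protected_files_touched(changed_paths: Iterable[str]) -> List[str]:
    """Return the changed paths that match a protected (test file) pattern."""
    return sorted(
        path
        for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in PROTECTED_PATTERNS)
    )
```

The input could come from `git diff --name-only`; a non-empty result is a reason to reject the change before reading any of it.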

Pattern 1 of 6 · TDD · The Protocol

TDD in the AI Age

YOU describe one concept
Natural language, outside-in — from the user's perspective

↓

Refine with AI
"What corner cases am I missing?" — use your judgment on which matter

↓

AI formalizes + implements
Scenarios → executable tests → code. "Do NOT modify test files."

↓

Run, review, refactor, commit
Safety net in place. Be bold. Repeat.

One concept at a time. Each concept may yield more than one test — but never more than a handful. You stay at the intent level; AI handles the formalization.
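As an illustration of "one concept, a handful of tests", here is a small sketch. The concept (free shipping at 50 and above, flat fee of 5 otherwise), the function name, and the numbers are all invented for the example. The scenarios are the human-owned specification; the function body is the part that would be delegated to the AI.

```python
import unittest


def shipping_fee(order_total: float) -> float:
    """Implementation written to satisfy the scenarios below."""
    if order_total < 0:
        raise ValueError("order_total cannot be negative")
    return 0.0 if order_total >= 50.0 else 5.0


class ShippingFeeScenarios(unittest.TestCase):
    """Human-owned scenarios for one concept: free shipping at 50 and above."""

    def test_orders_below_threshold_pay_flat_fee(self):
        self.assertEqual(shipping_fee(49.99), 5.0)

    def test_orders_at_threshold_ship_free(self):
        self.assertEqual(shipping_fee(50.0), 0.0)

    def test_negative_totals_are_rejected(self):
        with self.assertRaises(ValueError):
            shipping_fee(-1.0)
```

Saved to a file, the scenarios run with `python -m unittest <file>`. The boundary and error cases are the kind of thing the "what corner cases am I missing?" step tends to surface.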

Kent Beck's 3 Warning Signs

Loops: AI repeatedly tries the same failing approach
Unrequested functionality: adds features you didn't ask for
Cheating: modifies or deletes tests to make them "pass"
When any appear, stop and redirect immediately.

When AI Struggles: The Fallback

Delete the method body. Write step-by-step comments describing the algorithm. Let AI regenerate from your skeleton. Your structure, its code.

"Higher quality inputs allow for the capability of LLMs to be better leveraged. TDD maintains a high level of code quality. This high quality input leads to better Copilot performance than is otherwise possible."

— Paul Sobocinski, martinfowler.com

Here's how the protocol works. Step 1: you describe what the system should do from the user's perspective, in natural language. Not test code — intent, inputs, expected outcomes. That's the step that requires YOUR domain judgment. Step 2: AI as thinking partner — "what corner cases am I missing?" Use your judgment on which matter. Step 3: AI formalizes your scenarios into executable tests and writes the code to pass them. Step 4: run, review, refactor, commit. The three warning signs on the right are your early-warning system — Beck caught the AI deleting tests in his own B+ Tree project. When any appear, stop and redirect. The fallback comes from Fowler's team — when AI struggles, delete the method body, write step-by-step comments, let it regenerate from your skeleton. Your structure, its code. And the Sobocinski quote at the bottom captures it: TDD doesn't just constrain the AI, it gives it better inputs. Higher quality in, higher quality out.

Pattern 1 of 6 · TDD · Deep Dive

Defense in Depth for AI Code

Unit Tests
Does it behave correctly? — coverage floor enforced

Static Analysis & Security
Does it introduce vulnerabilities or code smells? — compile-time checks + dependency analysis

Integration Tests
Does it work with real infrastructure? — no mocks hiding broken queries

Module & Architecture Tests
Does it respect boundaries and structural rules? — dependency direction, layers, no casual cross-module imports

Contract & API Documentation Tests
Does the API match what consumers expect? — contract drift caught at build time

E2E / BDD Tests
Does the user journey work end-to-end? — acceptance criteria in plain language

Code Coverage
Do we have enough tests? — enforces the quantity of testing

Mutation Tests
Are the tests actually testing anything? — enforces the quality of testing

The AI writes the code. You define what correct means. These layers are how you make that definition stick.

Unit tests — layer 1. Necessary but nowhere near sufficient. AI-generated code has a specific failure mode: it compiles, passes unit tests, and systematically drifts in ways unit tests alone will never catch. Architecture violations, contract mismatches, security vulnerabilities — all invisible to unit tests. [click] Five more verification layers — each catching a different class of failure. Static analysis. Integration tests against real infrastructure. Architecture tests that enforce boundaries. Contract tests — we'll come back to these in Pattern 2. End-to-end with BDD. [click] Two meta-checks. Code coverage enforces quantity — do we have enough? Mutation testing enforces quality — are the tests actually catching bugs? PIT mutation testing caught gaps that 90% line coverage missed entirely. [click] YOU define what correct means. The AI generates code. These eight layers verify it. This isn't exhaustive — it's a starting point, not a ceiling.

Pattern 2 of 6 · Types as Guardrails

🧩 Sound familiar?

🔀

The Shape Shifter
AI returns data in a different structure than your API contract specifies

🕳️

The Any Escape
AI uses any, Object, or Map<String,Object> to sidestep type safety

🏷️

The Silent Rename
AI renames fields or methods to "improve" naming, breaking callers silently

📝

The Contract Breaker
AI modifies your interfaces to make its implementation easier

🧵

The Stringly Typed
Strings where there should be enums, constants, or typed values

❓

The Optional Avalanche
AI makes everything nullable to avoid compilation errors

These are all symptoms of the same root cause: there's no structural contract for the AI to code against.

Pattern 1 was about behavioral correctness — tests define what the code should do. But there's a second kind of specification the AI needs: structural. What shape should the data have? What contracts must be honored? What types are acceptable? [click] Same question — how many of these have you seen? The Shape Shifter — AI returns data in a completely different structure than your API contract. The Any Escape — I once reviewed a PR where the AI had typed everything as Object. Every. Single. Field. It was like watching someone solve a jigsaw puzzle by sanding off all the edges. The Silent Rename — it renames your fields to "improve" naming, breaking every caller. These aren't edge cases. They happen every day. [click] Same root cause as Pattern 1 — but from the other side. No structural contract for the AI to code against. Pattern 1 gave you the behavioral spec. Pattern 2 gives you the structural spec. Together they form a double safety net.

Pattern 2 of 6 · Types as Guardrails · Your Structural Spec

Types & Schemas as Guardrails

❌ Old economics

Ceremony = high cost

Every interface, annotation, and schema was time not shipping features

✅ New economics

Ceremony = near free

AI writes the boilerplate. Compiler enforces it. Errors eliminated at compile time.

The principle: The more you can make correctness checkable by machines, the less you depend on human review catching errors.

"Do NOT modify test files." (Pattern 1) · "Do NOT modify any interfaces." (Pattern 2)

Same discipline. Behavioral spec + structural spec = double safety net.
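A small sketch of what a structural spec can look like, using invented names (`OrderStatus`, `Order`, `parse_status`). It assumes a Python codebase, where the enum and frozen dataclass close off the "stringly typed" and "silent rename" escape routes; a static checker such as mypy would catch misuse before runtime, and the `__post_init__` check is only a runtime backstop.

```python
from dataclasses import dataclass
from enum import Enum


class OrderStatus(Enum):
    PENDING = "pending"
    SHIPPED = "shipped"
    DELIVERED = "delivered"


@dataclass(frozen=True)
class Order:
    order_id: str
    status: OrderStatus

    def __post_init__(self) -> None:
        # Runtime backstop: reject plain strings passed where the enum is required.
        if not isinstance(self.status, OrderStatus):
            raise TypeError(f"status must be an OrderStatus, got {self.status!r}")


def parse_status(raw: str) -> OrderStatus:
    """Convert untrusted input at the boundary; unknown values raise ValueError."""
    return OrderStatus(raw)
```

With this in place, a misspelled status such as `"shiped"` fails at the boundary instead of flowing through the system as a valid-looking string.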

For years, we chose dynamically-typed languages partly to reduce ceremony. But the economics have flipped. When AI writes the code, ceremony costs you nothing — AI generates the interfaces, the annotations, the boilerplate. What you get in return is errors caught at compile time rather than in production. If you have the choice — prefer stronger types. Not because Java is better than Python. But because a compiler-enforced contract is a constraint the AI cannot violate, while a convention is a suggestion it can ignore. [click] Within your language, push toward the strongest guarantees available. For API boundaries, use contract testing — remember layer 5 from defense in depth? This is where those tests come from. [click] Combined with TDD, you now have the double safety net. Behavioral tests define what the code should do. Structural contracts define what it should look like. And just like with tests — "Do NOT modify any interfaces."

Pattern 3 of 6 · Decompose

💣 Sound familiar?

⬆️

The Snowball
Diffs keep growing each iteration — complexity is compounding, not converging

🧭

The Wanderer
AI touches files you didn't mention — boundaries are unclear

🎠

The Carousel
Fix A breaks B, fix B breaks A — too many interacting concerns

🌫️

The Fog
You've lost track of what's changed. If you can't keep track, the AI can't either.

🐟

The Goldfish
AI asks the same question twice — context window is overloaded

🔨

Whack-a-Mole
Every fix creates a new bug — the task is too interconnected to hold at once

📦

The Packrat
AI keeps adding dependencies — it's reaching for shortcuts instead of solutions

🚿

The Firehose
Walls of code instead of focused changes — scope has ballooned

✅

Rubber Stamping
You're approving changes you haven't fully read — you've become a passenger

💾

No Save Point
30+ minutes without a commit — no safe place to roll back to

These are all symptoms of the same root cause: the task is too big for the AI to hold in context.

Pattern 3 of 6 · Decompose

Decompose Ruthlessly

"AI agents have a limited context window. The more you can focus them, the better they perform. Decomposition is the single most important skill."

— Addy Osmani, Chrome Engineering Lead · Beyond Vibe Coding, O'Reilly 2025

Spec–Plan–Execute
Spec → plan → one step at a time

Vertical Slicing
End-to-end slices, not layers

Interface-First
Define contract, then implement

One Concern at a Time
One logical change per prompt

Commit-Sized Units
Two commit messages? Split it

Fresh Context per Task
Resume from plan, not history

Structured decomposition yields 58% faster completion on complex tasks.

The task is too big — that's what the previous slide told us. Osmani explains it simply: AI agents have a limited context window. The more you focus them, the better they perform. Six ways to decompose. Spec–Plan–Execute is the highest-leverage — spec first, plan the steps, execute one at a time. Vertical Slicing — end-to-end slices, not layers. Interface-First — Pattern 2 applied to decomposition. One Concern at a Time — if you're asking the AI to do two things, you're asking for trouble. Commit-Sized Units — if you need two commit messages, it was too big. Fresh Context per Task — resume from the plan, not from a polluted conversation. [click] The payoff: structured decomposition yields 58% faster completion on complex tasks. Not because the AI is smarter — because it can hold the whole problem in context.

Pattern 3 of 6 · Decompose · Deep Dive

Decomposition in Practice

Greenfield: Create Context Documents

requirements.md
Functionality → bounded contexts

↓

architecture.md
Modules, boundaries, data flow

↓

technologies.md
Stack, frameworks, constraints

↓

quality.md
Testing strategy, coding standards

Brownfield: Ask First, Then Change

The same four concerns exist — but choices have already been made. Ask questions to clarify requirements, architecture, technologies, and quality gates before touching code.

Dependency Guardrail

Dependencies require double confirmation. No new library without explicit approval. AI agents love adding packages.
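The approval step can be backed by a mechanical check. A minimal sketch, assuming dependencies live in a requirements-style list and the team keeps an approved set somewhere (both assumptions; the function names are hypothetical, and the name parsing is deliberately simplified rather than a full requirement-specifier parser):

```python
import re
from typing import Iterable, List, Optional, Set

# Leading package name, up to the first version specifier, extra, or marker.
_NAME = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)")


def requirement_name(line: str) -> Optional[str]:
    """Extract a lowercased package name, or None for blanks, comments, and options."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#") or stripped.startswith("-"):
        return None
    match = _NAME.match(stripped)
    return match.group(1).lower().replace("_", "-") if match else None


def unapproved_dependencies(requirements: Iterable[str], approved: Set[str]) -> List[str]:
    """Return package names that appear in requirements but not in the approved set."""
    names = (requirement_name(line) for line in requirements)
    return sorted({name for name in names if name is not None and name not in approved})
```

Run against the dependency file on every agent-produced change, a non-empty result becomes the trigger for the explicit approval the slide calls for.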

Wire It All Up

Link all context documents from CLAUDE.md or AGENTS.md — always in the agent's context. No copy-pasting into prompts.

Decomposition isn't just task sizing — it's giving the AI the right context at every level, from requirements down to quality gates.

"If you can't describe it in one paragraph, split it."

Pattern 4 of 6 · Context Engineering

🧠 Sound familiar?

👻

The Hallucinator
AI invents APIs, methods, or classes that don't exist in your codebase

🧳

The Tourist
Code works but doesn't match the project's conventions or style

🔄

The Reinventor
AI creates utilities that already exist because it can't see them

⏳

The Time Traveler
AI uses deprecated patterns or outdated APIs from stale context

🍳

The Kitchen Sink
Everything dumped into the prompt — AI can't find the signal in the noise

🩸

The Bleed
Context from a previous task contaminates the current one

🔁

The Déjà Vu
AI keeps re-explaining or re-doing work it already completed

🚪

The Wrong Room
AI uses patterns from a different framework or language

These are all symptoms of the same root cause: the AI doesn't have the right context — or has too much of the wrong context.

Pattern 4 of 6 · Context Engineering

Context Engineering

"Most of the craft of getting good results from an LLM comes down to managing its context."

— Simon Willison, co-creator of Django

Context Degrades

The 65% Cliff

A model claiming 200K tokens becomes unreliable at ~130K (Chroma research). Performance drops suddenly, not gradually.

Dead Context

40% of a 200K window consumed by MCP server metadata before sending a single message. — Ryan Spletzer

Drew Breunig's 4 Failure Modes

Poisoning · Distraction · Confusion · Clash

Curate Ruthlessly

✅

Include
Exact identifiers, task-relevant files, architecture constraints, CLAUDE.md

❌

Exclude
The entire repo, unrelated tasks, irrelevant MCP servers, "just in case" files

🔄

Session hygiene
Fresh session per task · /compact at ~50% · sub-agents for exploration

CLAUDE.md / AGENTS.md

Your highest-leverage config point. Under 200 lines. Tech stack, build commands, coding standards, "Do Not" rules.

More context ≠ better results. A model failed with 46 tools in context but succeeded with only 19 — same window size.
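The two thresholds on this slide (compact at about 50%, unreliable past about 65%) can be turned into a simple budget check. This is a sketch built on those rule-of-thumb numbers, which will vary by model and task; the function name and status labels are hypothetical.

```python
def context_status(used_tokens: int, window_tokens: int,
                   compact_at: float = 0.50, cliff_at: float = 0.65) -> str:
    """Classify context usage against the compaction and cliff thresholds."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    if used_tokens < 0:
        raise ValueError("used_tokens cannot be negative")
    usage = used_tokens / window_tokens
    if usage >= cliff_at:
        return "past-cliff"    # start a fresh session rather than pushing on
    if usage >= compact_at:
        return "compact-now"   # commit, then compact
    return "ok"
```

For a 200K window, the defaults put the compaction point at 100K tokens and the cliff at 130K, which matches the figure quoted above.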

Pattern 4 of 6 · Context Engineering · Deep Dive

Context Engineering in Practice

Structure Your Context

Project CLAUDE.md
Committed to git · team-shared · under 200 lines

↓

Child directory rules
Path-scoped · loaded on demand · backend ≠ frontend

↓

CLAUDE.local.md
Gitignored · personal preferences · editor quirks

↓

Progressive disclosure
Link to docs, don't inline · agent_docs/ folder

Manage Your Sessions

The 50% Rule

Run /compact at ~50% context usage. Don't wait for the cliff. Commit before compacting — your save point.

Fresh Context per Task

Use /clear when switching tasks. Never mix unrelated work in one conversation. One concern at a time.

Sub-agents for Exploration

Delegate search and research to sub-agents. Keep the parent agent focused on the task. Don't pollute your main context.

Context engineering isn't a one-time setup — it's an ongoing discipline throughout every session.

"Include what's relevant. Exclude everything else."

Two sides: structure and discipline. On the left — context hierarchy. Project-level CLAUDE.md at the top — committed to git, shared by the team, under 200 lines. Child directory rules below — path-scoped, so backend rules only load when working on backend files. Personal preferences in CLAUDE.local.md — gitignored. And progressive disclosure: link to docs, don't inline. A 200-line CLAUDE.md that links works; a 500-line one that inlines everything doesn't. [click] On the right — session discipline. The 50% rule: compact at 50% context usage, not when you hit the cliff. Commit before you compact — that's your save point. Fresh context per task — never mix unrelated work. Sub-agents for exploration — they get their own context window and don't pollute yours. [click] This isn't set-and-forget. It's an ongoing discipline. You're either managing context or it's managing you.

Pattern 5 of 6 · Scope Control

🎯 Sound familiar?

🎁

The Improver
AI "fixes" three other things you didn't ask for while doing what you asked

🎆

The Surprise PR
Diff is 10x larger than expected — AI touched files it shouldn't have

🧹

The Helpful Refactor
AI restructures working code "for clarity" while implementing a feature

📦

The Dependency Creep
AI adds a library to solve a problem that didn't need one

🚀

The Architect Astronaut
AI builds abstractions and patterns for a one-off change

🏗️

The Silent Restructure
Code compiles and tests pass, but the system's architecture has shifted

These are all symptoms of the same root cause: the AI decided what to change instead of you.

Pattern 5 of 6 · Scope Control

Scope Control

Five techniques to keep you in the driver's seat.

🎯 Declare Your Plan

List files to modify — and files to NOT touch. Wait for approval.

📌 Anchor to Existing Code

Point at the specific file. Don't let AI generate from scratch.

🔍 Review Diffs, Not Outputs

Look at what changed, not just whether the result looks right.

🧪 Fresh Session Self-Review

New session critiques the diff. No loyalty to code it didn't write.

💾 Commit at Every Milestone

Git save points. When scope slips, git reset is your undo.
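"Declare Your Plan" and "Review Diffs, Not Outputs" combine naturally into a check: compare the files that actually changed against the declared plan. A minimal sketch with hypothetical names, assuming the plan is expressed as two sets of paths:

```python
from typing import Dict, Iterable, List, Set


def scope_violations(changed: Iterable[str],
                     allowed: Set[str],
                     forbidden: Set[str]) -> Dict[str, List[str]]:
    """Split out-of-plan changes into forbidden files and undeclared files."""
    changed_set = set(changed)
    return {
        # Files the plan explicitly said NOT to touch.
        "forbidden": sorted(changed_set & forbidden),
        # Files that were never mentioned in the plan at all.
        "undeclared": sorted(changed_set - allowed - forbidden),
    }
```

An empty result in both lists means the diff stayed inside the declared boundary. Anything else is a prompt to stop and look before reviewing the code itself.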

The AI will always try to do more than you asked. The discipline isn't saying yes — it's knowing when to say no.

"You decide what changes. Not the AI."

So how do you get the steering wheel back? Five techniques, in three phases — before code, after code, and a safety net. Before code: declare your plan. List the files you'll modify and — the part most people miss — list the files you will NOT touch. That's the boundary. Anchor to existing code — point the AI at the specific file, don't let it generate from scratch. After code: AI self-review in a fresh session. Why fresh? A long-running session has loyalty to its own decisions. A fresh session sees the code the way a human reviewer would. And review diffs, not just outputs. Look at what changed — that's where scope creep becomes visible. Safety net: commit after every verified edit. When scope slips despite everything — and it will — git reset is your undo. [click] You decide what changes. Not the AI.

Pattern 6 of 6 · AI Paperwork

📝 Sound familiar?

🔮

The Mystery Commit
"fix stuff" messages that tell you nothing about what changed or why

⛏️

The Archaeology Project
Understanding a decision requires digging through months of Slack and PRs

🧠

The Tribal Knowledge
Only one person knows why it was built this way — and they're on vacation

📄

The Blank PR
Pull requests with no description, no context, just a diff

👍

The Rubber Stamp Review
Reviewers approve without context because there's nothing to guide them

🌀

The Onboarding Maze
New team members take weeks to become productive because nothing is written down

These are all symptoms of the same root cause: the paperwork isn't getting done because humans find it tedious. AI doesn't.

Pattern 6 of 6 · AI Paperwork

AI for the Paperwork Nobody Does

AI removes all excuses for skipping documentation.

📋 Decision Records

Capture the why, not just the what. Draft an ADR in 30 seconds.

🔀 PR Descriptions

Structured summaries from diffs. The collaboration trail writes itself.

🏆 Golden Path Examples

Canonical implementations the AI can pattern-match. One snippet beats three paragraphs.

📝 Lessons-as-Rules

After every correction, AI writes a rule for itself. Failures become specs.

📖 Domain Glossary

Define your terms precisely. AI uses them loosely without this.

Documentation for AI is different from documentation for humans. Humans infer context. AI is literal. Make your rules of the road machine-readable.

"Write it down. The AI will use it every session."

Five kinds of documentation that change how AI works with your codebase. Decision Records: draft an ADR in 30 seconds. When the AI knows why you chose PostgreSQL over DynamoDB, it won't suggest DynamoDB-friendly patterns. PR Descriptions: AI generates structured summaries from diffs. Without context, reviewers rubber-stamp. With it, they focus on architecture and intent. Golden Path Examples: GitHub's analysis of 2,500 AGENTS.md files found one code snippet outperforms three paragraphs describing preferred style. Lessons-as-Rules: after every correction, have the AI write a rule. Over time this becomes a failure-driven specification tuned to your codebase. Domain Glossary: when "fulfillment" has a precise meaning, document it. Without this, the AI uses terms loosely and generates subtle semantic bugs. [click] The meta-principle: documentation for AI is different from documentation for humans. Humans infer context. AI is literal — explicit, concrete, example-rich.
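Lessons-as-Rules is easy to automate once a correction has been phrased as a rule. Here is a minimal Python sketch, assuming the lessons live in a plain markdown file with one bullet per rule. The bullet format and the helper name `record_lesson` are assumptions for illustration, not a standard.

```python
import tempfile
from pathlib import Path

def record_lesson(path, mistake, rule):
    """Append a rule to a lessons file unless the same rule is already there.

    Returns True if a new rule was written, False if it was a duplicate.
    """
    lessons = Path(path)
    existing = lessons.read_text(encoding="utf-8") if lessons.exists() else ""
    marker = f"- RULE: {rule.strip()} (learned from:"
    if marker in existing:
        return False  # same rule already recorded; skip the duplicate
    with lessons.open("a", encoding="utf-8") as fh:
        fh.write(f"{marker} {mistake.strip()})\n")
    return True

# Demo against a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "lessons.md"
    first = record_lesson(target, "used float for money",
                          "Use Decimal for currency amounts")
    second = record_lesson(target, "float arithmetic again",
                           "Use Decimal for currency amounts")
    print(first, second)
    print(target.read_text(encoding="utf-8"), end="")
```

The duplicate check keeps the file short, which matters because the file is loaded into every session.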

Act III

Team Norms

What good AI-assisted development looks like at the team and organization level

Team Norms · The Bigger Picture

Where Does Time Actually Go?

Requirements → Architecture → Coding (AI here) → Review → Testing → Coordination → Deploy

AI tools accelerate one station. Everything else remains at human speed.

~11%

of the workweek spent coding

Software.com · 250K devs · editor telemetry

~11%

of developer time is writing code

Microsoft Research "Time Warp" · 2024

~16%

of time on application development

IDC · 2025 · "84% is non-coding"

If coding is 11–16% of delivery time, a 10× improvement in coding speed yields at most ~15% faster delivery. The other 85% is requirements, architecture, coordination, review, and waiting.

Why do individual patterns not translate to faster team delivery? Because of where time actually goes. Here's the delivery pipeline. [click] AI tools accelerate one station. Coding. Everything else remains at human speed. How much of the workweek is actually coding? [click] Software.com: 11%. 250K developers, editor telemetry. [click] Microsoft Research: 11%. 484 developers instrumented. [click] IDC: 16%. Three independent sources, nearly the same answer. In my experience across multiple team projects, coding was 10–20% of total effort. The numbers match. I stared at that number and then thought about how much of my week I spend in meetings about code instead of writing code. Suddenly 11% felt generous. [click] The arithmetic doesn't lie. A 10× improvement in coding speed yields at most ~15% faster delivery. The other 85% is requirements, architecture, coordination, review, and waiting.
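The arithmetic behind that claim is an Amdahl's-law calculation: only the coding share of delivery time shrinks. The short Python sketch below reads "faster delivery" as the fraction of total delivery time saved. The function name is ours, and the shares are the 11% and 16% figures from the slide.

```python
def delivery_time_saved(coding_share, coding_speedup):
    """Fraction of total delivery time saved when only coding gets faster.

    coding_share:   fraction of delivery time spent coding (0..1)
    coding_speedup: e.g. 10 means coding takes one tenth of the time
    """
    return coding_share * (1 - 1 / coding_speedup)

for share in (0.11, 0.16):
    saved = delivery_time_saved(share, 10)
    print(f"coding = {share:.0%} of delivery, "
          f"10x faster coding -> {saved:.1%} less total time")
```

Even an infinitely fast coding step cannot save more than the coding share itself, which is why the ceiling sits near 15%.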

Team Norms · Framework

The Two-Sided Formula

Requirements → Architecture → Coding → Review → Testing → Coordination → Deploy

Pre-coding: settled · Post-coding: representable

Pre-coding is settled

Requirements clear · Architecture decided · Interfaces stable · Framework supports what you need

Post-coding is representable

Testing automatable · Quality gates in the dev loop · Reviews fast & aligned · Deployments automated · Observability clear

Faster delivery

Three outcome patterns

Accelerates debt

Pre ✗ · Post ✗ · Faster rework
Faros AI (10K+ devs): PRs +47%, bugs +9%, PR size +154%

Masks the bottleneck

Pre ✗ · Post ✓ · Faster prototypes, same delivery
Faster iterations feed the same slow decision cycle — more reps, not more progress

Amplifies strengths

Pre ✓ · Post ✓ · Faster delivery
DORA 2025: AI correlates with better delivery when foundational capabilities are strong

AI accelerates delivery when both sides hold. When either breaks, faster coding becomes faster rework.

Team Norms · The 85%

What Actually Moves Delivery

Requirements → Architecture → Coding → Review → Testing → Coordination → Deploy

Pre-coding: settled · Post-coding: representable

Fix the pre-coding side

Decide which experiments to ship

AI makes experimentation cheap. That makes deciding what to take forward the new bottleneck. Rapid prototypes need explicit go/no-go criteria.

Architectural alignment before code

One unresolved structural decision cascades into every downstream task. Settle boundaries, interfaces, and contracts before generating code.

Resist speculative coding

When coding is cheap, there's no pressure to wait for clarity. Sometimes the highest-leverage move is to not build yet.

Fix the post-coding side

Quality gates in the dev loop

Tests, linting, architecture checks that run locally before code leaves the developer's machine — not first discovered in CI 20 minutes later.

Faster code review cycles

AI generates code faster than teams can review it. Upfront style alignment and reviewer availability become critical-path items.

Automated deployment & observability

Faster code to production means faster impact — good or bad. Automated pipelines and clear observability close the feedback loop.
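The "quality gates in the dev loop" item can be as small as a script that runs every gate locally and stops at the first failure. This is a minimal Python sketch under that assumption. The gate commands are placeholders so the example runs anywhere, and wiring it into a pre-push hook is left to the team.

```python
import subprocess
import sys

def run_gates(gates):
    """Run each (name, command) gate in order; stop at the first failure.

    Returns the name of the failing gate, or None if all pass.
    """
    for name, command in gates:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"FAILED {name}\n{result.stdout}{result.stderr}")
            return name
        print(f"ok     {name}")
    return None

# Placeholder gates; swap in your real test, lint, and
# architecture-check commands.
GATES = [
    ("tests", [sys.executable, "-c", "assert 1 + 1 == 2"]),
    ("lint", [sys.executable, "-c", "print('lint placeholder')"]),
]

failed = run_gates(GATES)
if failed:
    print("push blocked by:", failed)
else:
    print("all gates passed")
```

Running the cheapest gate first keeps the feedback loop in seconds rather than the 20 minutes a CI round trip costs.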

"AI doesn't fix a team. It amplifies what's already there." — DORA 2025. Engineering practices matter more than ever: they mitigate quality risks from generated code and handle the increased throughput.

Team Norms · Measurement

How Do You Know You're Getting Better?

Every speed metric paired with a quality metric. Unpaired speed metrics are dangerous.

❌ Stop Measuring

Lines of code — CEOs competing on AI code %
Coverage alone — 90% coverage, 34% mutation score
Story point velocity — 42% admit to inflating
AI acceptance rate — juniors accept more, not better

✅ Start Measuring

Cycle time by stage — where did the bottleneck move?
Mutation score — are tests catching real bugs?
Change failure rate — are we breaking prod more?
Comprehension check — can you explain what shipped?

The Goodhart Cascade
Copilot users 29% faster — but review time +47%

Review survival rate
% of AI code passing review unchanged — measures spec precision

Review queue depth
Growing queue = stockpiled inventory waiting to be reviewed

The Ratchet

📏 Baseline → 📈 Improve → 🔒 Never regress → 🔄 Rotate

Pick 2–3 metrics that expose your current bottleneck. Act on them. Then rotate to the next.

AI amplifies what's already there. How do you know what it's amplifying? Not with the metrics most teams track. Stop measuring lines of code — CEOs competing on AI code percentage as if more lines is the goal. AI makes it infinitely gameable. Stop trusting coverage alone — 94% coverage, 34% mutation score. That's the testing equivalent of filling in all the bubbles on a multiple-choice exam and calling it studying. Stop celebrating velocity — 42% of teams admit to inflating. Stop using AI acceptance rate — juniors accept more, not better. Start measuring what matters. Cycle time by stage — find where the bottleneck moved. Mutation score — are tests catching real bugs? Change failure rate — are we breaking prod more? Comprehension check — can you explain what shipped? [click] The Goodhart Cascade: Copilot users 29% faster, review time up 47%. Measuring one variable while the rest degrades. [click] The Ratchet: pick two to three metrics. Baseline. Improve. Never regress. Rotate. This isn't a dashboard — it's a discipline.
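The Ratchet can be expressed as two small functions: one that flags regressions against the baseline, and one that moves the baseline only in the improving direction. The Python sketch below assumes each metric is a single number with a known "better" direction. The metric names and values are illustrative only, not data from any study.

```python
def ratchet_violations(baseline, current, higher_is_better):
    """List metrics that regressed against the locked-in baseline."""
    violations = []
    for name, base in baseline.items():
        now = current[name]
        regressed = now < base if higher_is_better[name] else now > base
        if regressed:
            violations.append(f"{name}: {base} -> {now}")
    return violations

def tighten(baseline, current, higher_is_better):
    """Move the baseline only in the improving direction (the ratchet)."""
    return {
        name: (max if higher_is_better[name] else min)(base, current[name])
        for name, base in baseline.items()
    }

# Illustrative numbers only.
higher_is_better = {"mutation_score": True, "change_failure_rate": False}
baseline = {"mutation_score": 0.34, "change_failure_rate": 0.12}
current = {"mutation_score": 0.41, "change_failure_rate": 0.15}

print(ratchet_violations(baseline, current, higher_is_better))
print(tighten(baseline, current, higher_is_better))
```

Pairing a speed metric with a quality metric in the same baseline is what keeps one from improving at the other's expense.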

Team Norms · Governance

Governing AI in Your Team

📄 Rules Files

Your team's AI constitution

Not a Confluence page — a file that runs every session.

CLAUDE.md — Claude Code
.cursor/rules/*.mdc — Cursor
AGENTS.md — cross-tool standard

+ lessons.md

After every correction, the AI writes a rule for itself. Failures become specifications.

🔍 Review Pipeline

AI self-review: fresh session
Fresh context catches what the author can't

↓

Human: architecture first
Critical path, work backward

↓

Human: intent & risk
Roadmap, not variable names

AI PRs have 1.7× more major issues and 24% more incidents per PR — CodeRabbit · Cortex 2026

🛡️ Guardrails

Block destructive ops

Hooks that deny rm -rf, --force, terraform destroy. Deterministic — not prompt-based.

Gate security-critical paths

auth/**, payments/**, secrets/** — mandatory human review + automated SAST scan before merge.

Verify every dependency

No AI-suggested packages without lockfile check. Block new deps without explicit approval.

Rules set the standard. Review verifies it. Guardrails enforce it.

Three governance layers. Rules files: not a Confluence page — a file that loads into every AI session automatically. Your CLAUDE.md or AGENTS.md encodes architecture, quality guidelines, and conventions — shared context that every agent starts with. Review pipeline: step 1 is the most important — AI self-review in a fresh session. This is basically asking a different AI to review the first AI's homework. Turns out, they have no loyalty to each other. It's surprisingly effective. Then human review focuses on architecture and intent, not variable names. The data: AI PRs have 1.7x more major issues, incidents per PR up 24%. Guardrails: deterministic enforcement, not prompt-based. Block destructive operations. Gate security-critical paths. Verify dependencies — no new packages without explicit approval. [click] Rules set the standard. Review verifies it. Guardrails enforce it. The teams getting the most value aren't the ones with the least friction — they're the ones with the smartest friction.
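"Deterministic, not prompt-based" means the check is ordinary code that either allows or denies. Here is a minimal Python sketch of the two guardrails named on the slide, assuming commands arrive as strings and changed files as relative paths. It is not the hook API of any particular tool, and the function names are hypothetical.

```python
import shlex
from fnmatch import fnmatch

# Deny rules taken from the slide; extend for your environment.
# Exact token matching is deliberately simple: "rm -r -f" or
# "--force-with-lease" would need their own entries.
DENIED_TOKEN_RUNS = [
    ["rm", "-rf"],
    ["terraform", "destroy"],
    ["--force"],
]
# fnmatch's "*" also matches "/", so these cover nested paths.
GATED_PATHS = ["auth/*", "payments/*", "secrets/*"]

def is_denied(command):
    """True if the command contains any denied token sequence."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparsable input: fail closed
    for run in DENIED_TOKEN_RUNS:
        n = len(run)
        if any(tokens[i:i + n] == run for i in range(len(tokens) - n + 1)):
            return True
    return False

def needs_human_review(changed_paths):
    """Return the changed paths that fall under a security-critical prefix."""
    return [p for p in changed_paths
            if any(fnmatch(p, pat) for pat in GATED_PATHS)]

print(is_denied("rm -rf build/"))
print(is_denied("git push --force origin main"))
print(is_denied("git status"))
print(needs_human_review(["auth/oauth/token.py", "docs/intro.md"]))
```

Because the rules are data, they can be committed and reviewed like the rest of the harness.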

Team Norms · The Meta-Pattern

Harness Engineering

Same model. Different harness.

42% → 78%

What is the harness?

Rules file + automation hooks + custom commands + MCP servers + review pipeline
Committed to git. Reviewed. Updated after every correction.

See previous slide for your tool's equivalent.

The enforcement hierarchy

100% Hooks — deterministic enforcement
Auto-format, block destructive ops, inject context

100% settings.json — deterministic config
Permissions, deny lists, environment variables

~80% Rules file — advisory guidance
Conventions, standards, preferences

Treat your harness like production code. Commit it. Review it. Update it after every correction.

Takeaways · Immediately Actionable

Your Monday Morning List

12 habits. No new tools. No new process. Start this week.

Create your CLAUDE.md / rules file. Tech stack, architecture, standards, do-nots. Under 200 lines. Commit to git.

Adopt spec–plan–execute. Any task >15 min: spec with AI → plan → one step at a time.

Write tests first. Always include "Do NOT modify test files."

Commit at every milestone. Git commits as save points. Always revertable.

Fresh session per task. Never mix unrelated work in one conversation.

AI self-review before every PR. Fresh session catches what the author can't.

Declare your plan before coding. List files to modify — and files to NOT touch.

Zero trust for auth, payments, secrets. Human review + automated scanning. Always.

AI for docs nobody writes. ADRs, PR descriptions, golden path examples.

Build a team prompt library. Document what works. Shared skills and commands.

Review diffs, not just outputs. Check what changed, not just whether it looks right.

Use AI to learn, not just produce. Ask why. Read the code. Stay in the driver's seat.

Remember...?

−19%

Close the gap.

Six patterns. Twelve habits. One decision.

#ArcOfAI

Q&A


Sources: METR (July 2025) · GitClear · DORA 2025 · Stack Overflow 2025
Faros AI · CodeRabbit · Martin Fowler · Kent Beck · Addy Osmani
Simon Willison · Steve Yegge · Andrej Karpathy · Anthropic Engineering

Appendix A · Reference Template

CLAUDE.md Starter Template

# Project: [Your Project Name]

## Tech Stack
- Runtime: Java 21 / Node 22 / Python 3.12
- Framework: Spring Boot 3.x / Next.js 15
- Test framework: JUnit 5 + Mockito / Jest + RTL

## Architecture (2–3 sentences)
[Describe system structure and key modules here]

## Commands
- Build:  mvn clean install  /  npm run build
- Test:   mvn test           /  npm test
- Lint:   mvn spotless:check  /  npm run lint

## Coding Standards
- Immutable data structures preferred
- Explicit error types, not generic exceptions
- No magic strings — use enums or constants

## Do NOT
- Modify test files during implementation tasks
- Touch auth/** without adding a security-review comment
- Generate code without running tests first

## Commit Format
Conventional Commits: feat|fix|refactor|docs|test(scope): message

Appendix B · Addy Osmani

The 70% Problem

Non-engineers (and overconfident engineers) hit a wall where AI gets them 70% of the way surprisingly fast — but the final 30% requires actually understanding the system.

✅ The 70%

"This is incredible. Look how fast we shipped. AI coded the whole thing."

❌ The 30%

"Why is it randomly failing in production?" "Why does auth break after logout?" "Why is there a memory leak nobody can find?"

The final 30% requires engineer skills

Understanding actual system behavior · Debugging without clean stack traces · Maintaining consistency across a growing codebase · Architectural judgment under real constraints. AI doesn't eliminate these — it amplifies them in engineers who have them, and reveals their absence in those who don't.

Appendix C · Kent Beck

The B+ Tree Experiment

4 weeks · Production B+ Tree library · AI-assisted TDD · Pragmatic Engineer interview

His system prompt discipline

Always follow instructions in plan.md.

When I say 'go':
1. Find the next unmarked test
2. Implement ONLY enough code to pass it
3. Mark the test done in plan.md
4. Stop and report back

One test at a time. Explicit "stop and report." No surprises.

Warning signs he watched for

• Functionality you didn't ask for
• Tests being modified or deleted
• Explanations that justify cheating
• Confidence about things it can't verify

"I treat the AI like an unpredictable genie. It's powerful, but the quality of the wish determines whether you get what you actually want."

— Kent Beck
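The "find the next unmarked test" loop is simple enough to verify outside the AI session. This minimal Python sketch assumes plan.md tracks tests as markdown checkboxes. That format, the sample plan contents, and the helper names are assumptions for illustration; the source does not show the actual file.

```python
import re

CHECKBOX = re.compile(r"^\s*[-*] \[( |x|X)\] (.+)$")

def next_unmarked(plan_text):
    """Return the first unchecked item in the plan, or None if all are done."""
    for line in plan_text.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            return match.group(2).strip()
    return None

def mark_done(plan_text, item):
    """Check off the first unchecked line whose text equals item."""
    out, done = [], False
    for line in plan_text.splitlines():
        match = CHECKBOX.match(line)
        if (not done and match and match.group(1) == " "
                and match.group(2).strip() == item):
            line = line.replace("[ ]", "[x]", 1)
            done = True
        out.append(line)
    return "\n".join(out)

# Hypothetical plan contents.
PLAN = """\
# plan.md
- [x] insert into empty tree
- [ ] split a full leaf
- [ ] delete from a leaf
"""

item = next_unmarked(PLAN)
print("next test:", item)
print(mark_done(PLAN, item))
```

Checking the plan file yourself after each "go" is a cheap way to confirm the agent stopped after exactly one test.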
