Phase 3: Implementation

How a team delegates to agents, executes, and validates code work.

A developer and an agent can stand up a working feature in an afternoon. That's the easy part. The hard part is production software a team understands, trusts, and keeps building on. The Implementation Phase is how a team turns ready specs into exactly that, as fast as agents now allow.

Even when an agent writes most of the code, developers still own the result. They keep that ownership by overseeing the plan that goes in, the validation that comes out, and the final call on whether the work is safe to merge. Two habits make this possible. The developer shapes the work before handing it off, and the developer stays close to the product as the code comes together. Together those habits let the team keep ownership while agents do most of the typing.

Some implementation work is simple enough for a short brief: fix a bug, make a small refactor, or adjust existing behavior. The agent can work while the project's normal checks catch regressions: tests, review agents, and developer review. Other work needs shaping first. If the task depends on a product, design, data, or architecture decision the team has not already made, an agent should not be the one to make it. Skipping that step moves the decision into an agent session with less context than the team has. This chapter focuses on complex tasks, where the developer has to shape the work before handing it off.

What happens in this phase

Because writing code is no longer the bottleneck, a complex task is usually broader than a single development task would have been before agents. Each one moves the system toward an updated specification. The default workflow below gives that work a shape. It establishes the foundations early, creates clear points to merge along the way, and ends by explaining, validating, and reconciling what got built, so the team's understanding keeps pace with how fast the agents generate code.

The workflow below is a starting point, not a fixed procedure. Teams should adapt it as they learn. The core pattern holds regardless. Make the important decisions early, then scale Implementation against them.

Shaping → Building → Validating →

Shape

Establish foundations

Key design decisions: schemas, APIs, module boundaries

Solo

Critique

Pressure-test the design

Walk through trade-offs with a critique prompt or a pair before scaling up

SoloCollab

Build

Build the next bounded part

Implement against the plan; pause for human judgment

Solo

Harden

Check and commit

Run review agents, get tests green, address gaps, commit what is ready

Solo

Explain

Walk through what changed

Agent walks through the work; the pair builds the shared picture the stress-test will need

SoloCollab

Validate

Stress-test the whole

Exploratory testing, lateral review, NFR check, catch what earlier passes missed

SoloCollab

Reconcile and deliver

Reconcile with spec, back-port intentional deviations, capture follow-ups, merge

Solo

The workflow moves between the two ways of working with agents from the Overview. In agent delegation , the developer hands off well-defined work and then checks the result. In agent partnership , the developer works alongside the agent to inspect, critique, and pressure-test a direction. Shape, Build, and Harden run mostly on delegation. Critique, Explain, and Validate run on partnership.

The first two stages set up everything that follows. Shape begins from the effort, the execution brief written during Planning, where the approach, risks, and plan are already on the page. From that brief the developer drafts the foundational choices the rest of the work depends on, such as schemas, interfaces, and module boundaries. Critique then pressure-tests those choices with a pair or a review agent before the team scales Implementation against them.

Build and Harden run as a loop. The developer builds a bounded part, checks it, reads what failed, fixes the gaps, and stops at the next review point. Over time, teams can automate more of that loop so agents keep going without a developer typing every prompt. That only works when the agent has reliable tests and review checks to read, limits that prevent it from spinning forever, and a clear handoff when a human decision is needed.

One common variation changes how much Harden carries. Some teams ship each finished chunk as a stacked pull request, a chain of small PRs where each one depends on the last. In that mode Harden does more. The developer still leans on agents for tests, review, cleanup, and commit prep, but the developer also shapes each checkpoint into a reviewable increment, one with clear scope, passing checks, a useful description, and an explicit note about what later PRs will depend on.

Once building wraps, the work shifts back to partnership. Explain comes first. The agent walks through what changed, including its design choices, where it deviated from the plan, and any side effects it noticed, so the developer holds a clear picture before stress-testing begins. Validate comes next and exercises the feature as a whole. The developer runs exploratory testing across the full surface, thinks through what building in pieces might have missed, and checks the result against the non-functional requirements and testing strategy set during Planning. Anything Validate turns up reopens Build rather than landing on a follow-up list. The start and end of a complex task each demand more human attention than the middle does.

A second developer earns a place at Critique, Explain, and Validate. Pull one in when a decision will shape everything after it, such as a schema or API boundary, security-sensitive logic, an unfamiliar integration, a silent failure mode, or any work the lead developer can't yet explain clearly. Build and Harden stay one-developer work, because the agent context shifts too fast there to share.

The developer as scheduler

Once the developer starts delegating, the developer is tracking several things at once. What the feature should become, what the agent is doing right now, what CI and review are reporting, and which decisions still need a human. Carrying that load, the developer has to guard enough focus to keep making good calls. In practice that means keeping one priority path easy to reload, clearing the next blocker as it appears, and taking on side work only when it won't make the main task hard to get back into. Use worktrees or secondary checkouts when running work in parallel, and keep work-in-progress low.

As a rule, keep one complex task as the priority path you carry to completion. Pulling in a second one can make sense when the first has long agent runs that leave you waiting. Simpler tasks usually make better company, because they produce progress without competing for much attention. Finishing, committing, and merging already-reviewed work counts as priority work too, so don't let it sit while you start something new. Priorities also move as friction, follow-ups, or newly discovered work surface mid-execution.

Closing a complex task

Close reconciles the finished work with the spec, back-ports any intentional deviations, captures follow-ups, and merges. A complex task is done when all of the following hold.

The spec and Implementation agree.
The task is validated with automated tests. Where tests alone aren't enough, the developer has verified the behavior by running the software.
Important findings are fixed or captured as follow-up work.
The developer can explain the design, the checks, the risks, and the merge and release judgment.

Example: Building the Invite Members effort

The Invite Members effort arrives from Planning with its plan, its checkpoints, and one risk already flagged. Two admins might invite into the last open seat at the same moment. The plan says what to build. It doesn't settle how the underlying structures should be shaped. One developer directs agents through that shaping, holding onto the decisions that would be expensive to undo if an agent guessed wrong.

Running example · Invite Members effort

Shape (delegation). The developer starts by loading context the plan and specs don't carry, namely the architecture the developer is leaning toward and the non-functional requirements the design has to survive. The seat count has to stay correct as people are invited, promoted, and removed. The limit has to hold even when two admins act at the same moment. Email has to stay behind the sender interface, so real delivery can drop in later without touching the invite code. With those constraints stated, the developer has an agent lay out options for the core structures, covering how to represent invites, seats, and memberships, and how to track the seat count against the limit. The agent weighs the trade-offs and produces a first pass, an initial data model with the scaffolding around it. This is delegated work. A rich brief goes in, a first cut comes out, and nobody expects it to be right yet.

Critique (partnership). A delegated first pass rarely hits the goal on the first try, and Critique is where the developer and the agent work it out together. The developer runs the agent hard against its own design, asking where the design misses the requirements from Shape and where it breaks under load. The agent's first data model tracked the seat count as a stored number that each new grant increments. Pushed to stress that choice, the agent finds the hole itself. The stored number drifts when two grants land at once, so at the limit two admins could each invite into the last seat and both succeed, pushing the organization past its paid count. Seeing that concrete failure is what prompts the developer to act. The developer makes a fix the spec never spelled out. Instead of storing the count, the system now derives it from the records, counting active editors plus pending editor invites, and each grant becomes a single all-or-nothing transaction, so the second simultaneous attempt fails cleanly. The data model changes here, before any UI is built on top of it.

Build and harden (delegation, looped). With the data model corrected, the developer delegates the build one chunk at a time against the pretend sender, starting with the invite dialog. The agent works from the Figma design and writes its tests first, so every build is checked against those tests from the start. Hardening asks for more than green tests, though. The agent renders the dialog, compares it against the design, and iterates on the visual gaps it spots. Where the agent misses one, the developer points at the specific flaw until it's fixed, and each fix becomes a new regression test so the flaw can't return. Once the dialog holds up against both the design and the tests, the developer commits it and moves through seat reservation, the admin-only checks, and the invite table's Pending row the same way.

Explain (partnership). Before stress-testing, the developer has the agent evaluate the finished change cold, with fresh context, and walk through it. Together they review the architecture the agent settled on, how the pieces fit, and the Implementation details that matter most. The developer asks questions wherever the picture is thin and pushes past what changed to why it changed, such as why the seat count is derived rather than stored and why each grant is a single transaction. The goal is a developer who understands the design well enough to stand behind it.

Validate (partnership). Validate is Critique again, this time aimed at the whole delivered feature. The developer has the agent check the change for what tends to drift as a build grows, including the team's style guide, the architecture rules, and the non-functional requirements the feature was meant to hold. The two of them brainstorm where it could still go wrong. Then the developer uses the feature the way a real admin would, inviting a viewer and watching no seat move, inviting an editor and watching one disappear, hitting the limit, resending, revoking. The developer is feeling for whether the feature hangs together and makes sense to the person who would actually use it. Whatever that surfaces - a confusing empty state, or a seat that doesn't come back on revoke until the page reloads - feeds one more short build-and-validate pass to tighten the feature before merge.

Output. A merged increment whose seat count can't drift and whose limit holds under the race condition, fully tested and reviewed, handed over by a developer who can explain the design and the risks it survived rather than just point at passing checks.