Workflow

The 70/30 AI/human review loop, and why it only works when the unit is one prompt

The ratio is the easy part: AI does most of the writing; you do most of the reviewing. The hard part is keeping the human side at 30% of the wall clock instead of 90%. That only happens when each AI turn is small enough to glance at, which means the unit of review has to be one prompt, not one feature.

Matthew Diakonov · 7 min read

Direct answer (verified 2026-05-13)

The 70/30 AI/human review loop is a working ratio where AI handles roughly 70% of the work (drafting, generating, iterating on changes) and a human handles roughly 30% (reading the diff, accepting it, redirecting if needed). The split only holds when each AI unit is small enough for the 30% to keep pace; in practice that means committing per turn, so the reviewer sees one prompt-sized diff at a time, not a wall of changes.

The closest formal framing of the gap behind it is Werner Vogels' "verification debt", surfaced in Sonar's January 2026 State of Code report (96% of devs do not fully trust AI code, yet only 48% always verify before commit). Karpathy's vibe coding writeups describe the opposite end of the same spectrum.

The math, in two numbers

70%

AI

Drafting, generating UI, iterating, fixing the typo it just made.

30%

Human

Reading the diff, deciding to keep it, deciding to roll back, redirecting.

The split is intuitive. The trap is treating it as a discipline ("just review more carefully") instead of a structural property of the tool. Generation runs at model speed; review runs at human speed. At 70/30, a turn that takes the model three minutes leaves roughly 77 seconds of review budget (0.3/0.7 × 180 s), which a glance at one prompt-sized diff fits and a feature-sized batch does not. If the smallest reviewable unit the tool produces is a feature's worth of changes, the 30% silently expands into either rubber-stamping or doing the work over by hand.

One turn of the loop, traced

Here is what happens between you typing a prompt and the diff being available for review. The sequence below is the actual order of events inside the product, with the source line that owns each step.

One agent turn → one commit → one review unit

  1. You → chat route: prompt: 'add a priority field'
  2. Chat route → agent (in VM): agent run (writes files)
  3. Agent → git in /app: git add -A && git commit
  4. git → chat route: new SHA (line 1880)
  5. Chat route → you: version chip + diff
  6. You → git: undo if wrong (line 1943)

Source: src/app/api/chat/route.ts line 1112-1120 (commit on turn end), src/core/e2b.ts line 1847-1900 (commitTurn), line 1943-1975 (undoTurn / redoTurn).

The unit of review is your prompt

When the agent finishes a turn, the chat route does this:

// src/app/api/chat/route.ts (line 1112-1117)
const msg = (typeof prompt === "string" ? prompt : "")
  .split("\n")[0]
  .trim()
  .slice(0, 120) || "Agent turn";
const sha = await commitTurn(sessionKey, msg);
if (sha) send({ type: "version", sha, message: msg });

That six-line block is the entire structural change. The first line of the user prompt becomes the commit message, sliced to 120 characters so noisy prompts do not blow up the log. commitTurn in src/core/e2b.ts wraps the actual git work:

// src/core/e2b.ts (line 1855-1863)
const script = [
  "set -e",
  "export HOME=${HOME:-/root}",
  "cd /app",
  "git add -A",
  "if git diff --cached --quiet; then echo NOCHANGE; exit 0; fi",
  `git commit -q -m '${safeMsg}'`,
  "git rev-parse HEAD",
].join(" && ");

The skip-on-noop guard matters more than it looks: a turn that produced no file changes does not pollute the history with empty commits, so the historyStack stays one-to-one with prompts that actually moved code. The new SHA gets pushed onto session.historyStack and the active index advances. That stack is the review queue.
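For orientation, here is a minimal sketch of that bookkeeping. The session shape is assumed from the names the article cites (historyStack, activeIndex); it is not copied from src/core/e2b.ts:

// sketch only: session shape assumed, not the real e2b.ts code
type Session = {
  historyStack: string[]; // one SHA per prompt that actually changed files
  activeIndex: number;    // which SHA the working tree currently reflects
};

function recordTurn(session: Session, sha: string | null) {
  if (!sha) return; // NOCHANGE turn: nothing to review, nothing to stack
  // standard undo/redo semantics (an assumption): a new turn drops any redo tail
  session.historyStack = session.historyStack.slice(0, session.activeIndex + 1);
  session.historyStack.push(sha);
  session.activeIndex = session.historyStack.length - 1;
}

Read this way, each stack entry is one prompt-sized diff, and undo/redo is just moving the index.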

The same five prompts, two different review surfaces

Without per-turn commits, five prompts produce a sprawling working tree of changes. The diff is one giant blob; you cannot tell which prompt caused which line. To roll back a single bad change you either git reset to a vague checkpoint and lose the good edits, or you read the whole tree and patch by hand. Most people pick door three: trust the AI and ship.

  • One review unit = five prompts of mixed changes
  • No way to point at which prompt broke what
  • Roll back is destructive or manual

With per-turn commits, the same five prompts are five SHAs. Each diff carries the prompt that caused it as its message, and a bad turn reverts without touching the other four.

  • One review unit = one prompt's changes
  • The commit message names the prompt that caused it
  • Roll back is one targeted revert
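To make the second surface concrete, here is what `git log --oneline` inside /app might print after five turns. The SHAs and prompts are hypothetical; the shape follows directly from the prompt-as-message rule above:

# inside the sandbox, /app (output illustrative, not from a real session)
$ git log --oneline
7a6b5c4 make the header sticky
e0d9c8b add a clear-completed button
5c4b3a2 fix the date picker on mobile
9f8e7d6 sort tasks by due date
a1b2c3d add a priority field

Each line is one prompt; `git show 5c4b3a2` is exactly the diff the date-picker prompt produced, and reverting it leaves the other four turns alone.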

The agent does part of the human's job too

Smaller diffs are half of keeping the 30% honest. The other half is shrinking what counts as "needs human eyes" in the first place. The CLAUDE.md the agent reads when it boots inside the VM hardcodes a five-step browser check it must run before saying done:

# src/core/vm-claude-md.ts (around line 273-283)

## Browser Testing

After UI changes:
1. Navigate to http://localhost:5173 via Playwright MCP
2. Take a snapshot to verify the DOM rendered correctly
3. Check browser_console_messages for runtime errors
4. If the page is blank, verify the component is imported in App.tsx
5. Do not report completion until the browser shows the expected result

Step five is the load-bearing one. It moves the "did this actually work" pass from the human's 30% into the agent's 70%, so by the time you see the diff the obvious failures (blank screen, console error, missing import) are already filtered out. What remains is the question only you can answer: is this the change I wanted?

What this fixes, and what it does not

Honest limits, because nothing here is a free lunch:

  • 120-character message cap. Long, thoughtful prompts get truncated. For most sessions the first line is enough, but a multi-paragraph instruction loses the body in the log (see the sketch after this list).
  • One prompt is the smallest unit, not the line. If one prompt asks for five distinct things and the agent does all of them, the diff bundles them. You can review per turn, but you cannot bisect inside the turn.
  • The browser self-check only catches visible failures. A scheduled worker, an API endpoint with no caller yet, or a logic-only change all pass the five-step protocol vacuously. Those still owe the diff a real read.
  • The structural unit makes review cheap; it does not make it mandatory. If you skip the glance, you are back to vibe coding. The mechanism makes the right thing easy, not impossible to skip.
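The first limit is easy to see in isolation. This is the same slice the route applies, run against a hypothetical multi-paragraph prompt:

// the cap from route.ts, applied to a hypothetical long prompt
const prompt = [
  "Add a priority field to tasks and sort the list by it.",
  "Context: priorities are P0-P3, default P2, editable inline.",
].join("\n");
const msg = prompt.split("\n")[0].trim().slice(0, 120) || "Agent turn";
// msg === "Add a priority field to tasks and sort the list by it."
// the second line (the actual spec) never reaches the git log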

Frequently asked questions

What is the 70/30 AI/human review loop, in one paragraph?

It is a working ratio where AI handles roughly 70% of the work (drafting code, generating UI, iterating on small changes) and a human handles roughly 30% (reading the diff, accepting it, redirecting if it went sideways). The split is a target, not a guarantee. It only actually holds when each AI unit of work is small enough that the human's 30% can keep pace. If the AI ships a five-file rewrite as one unit, the human cannot review it in a third of the time it took to generate, so the 30% silently dilates into 'I trust it' and the loop collapses into either rubber-stamping or rewriting from scratch.

Where does the 70/30 framing come from?

It is a community shorthand rather than a single canonical source. The two sides of it have been named separately: Werner Vogels has talked about 'verification debt', the gap between how much code AI writes and how much of it humans actually verify, which is the framing behind Sonar's January 2026 State of Code report (their survey found 96% of developers do not fully trust AI code, yet only 48% always verify before commit). Andrej Karpathy described 'vibe coding' as the opposite end, the practice of accepting and shipping without reading. The 70/30 ratio is how teams talk about the in-between: AI does most, but a human still owns the merge.

Why does the 30% review side usually break first?

Because review is human-rate work and generation is model-rate work. The model can keep producing diffs faster than a person can read them, and the natural diff size most builders ship is a turn's worth of changes (often dozens of file edits). Review at that granularity is a long context switch: open the diff, page through, build a mental model of what changed and why, decide if it is the right change. If you do that for every prompt, you spend more time reviewing than prompting, and the 70/30 inverts. The fix is structural, not motivational. Make each AI unit small enough that 'review one diff' is a 20-second task, not a 5-minute task.

How does mk0r make the unit small enough?

Each successful agent turn in mk0r writes exactly one git commit inside the E2B sandbox at /app. The chat route picks the message as the first line of the user's prompt, trimmed and sliced to 120 characters (src/app/api/chat/route.ts line 1112-1116). It calls commitTurn in src/core/e2b.ts (line 1847), which runs `git add -A`, skips the commit if the diff is empty, otherwise `git commit -q -m '<safeMsg>'` and returns the new SHA. The SHA gets pushed onto a per-session historyStack (line 1882-1891). Result: every prompt produces one diff with that prompt sitting on top of it as the message. The review unit is one prompt.

What does the review actually look like in practice?

Three options, all backed by the same per-turn commit. One, in the chat UI itself: each turn surfaces a version chip with the SHA and the prompt-as-message; you see what changed by glancing at the diff. Two, in a terminal inside the sandbox: `git log --oneline` reads like the chat history because the messages are the prompts; `git show <sha>` is the per-prompt diff. Three, undo and redo: `undoTurn` and `redoTurn` (src/core/e2b.ts line 1943-1975) walk the activeIndex backward or forward through historyStack and apply the target SHA's tree as a new revert commit, so a bad turn is one click to roll back. None of those flows requires reading more than one prompt's worth of changes at once.
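As a rough sketch of the third flow, under the same caveats as before (session shape assumed, and the strategy taken from the "apply the target SHA's tree as a new revert commit" description rather than from the source):

// sketch only: the revert-commit strategy described above, not the real undoTurn
async function undoTurnSketch(
  session: { historyStack: string[]; activeIndex: number },
  run: (script: string) => Promise<string>, // executes shell inside the sandbox
) {
  if (session.activeIndex <= 0) return null; // nothing earlier to go back to
  const target = session.historyStack[session.activeIndex - 1];
  const script = [
    "set -e",
    "cd /app",
    `git checkout ${target} -- .`, // restore that turn's tree into the working copy
    "git add -A",
    `git commit -q -m 'undo: back to ${target.slice(0, 7)}'`, // history only moves forward
    "git rev-parse HEAD",
  ].join(" && ");
  const sha = await run(script); // simplified: files created after the target survive this checkout
  session.activeIndex -= 1; // redoTurn would walk the index forward again
  return sha;
}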

Does the AI also do part of the human's 30%?

Yes, intentionally. The CLAUDE.md the agent gets when it boots inside the VM (src/core/vm-claude-md.ts around line 273-283) hardcodes a five-step browser check the agent must run after any UI change: navigate to http://localhost:5173, snapshot the DOM, check console for runtime errors, verify the component import if the page is blank, and 'do not report completion until the browser shows the expected result'. That last line is the load-bearing one. It moves the 'did this actually work' check from human review time into agent generation time, which shrinks the human's 30% to the parts that genuinely need a human (does this change match what I wanted, is this the right architecture).

Doesn't this mean the human can just rubber-stamp?

They can, and that is a real failure mode. The structural unit only changes the cost of review; it does not force review to happen. The honest version of the 70/30 loop is: small commits make it cheap enough that a glance is enough most turns, so the few turns that genuinely need careful reading get the attention. If you skip even the glance, you are back to vibe coding (90% AI, 10% human, ship and pray). The mechanism here makes the right thing easy; it does not make the wrong thing impossible. That is true of every dev workflow.

Where does this loop break for mk0r specifically?

Three places. First, the prompt-as-commit-message is sliced to 120 characters; long thoughtful prompts get truncated and the message becomes a worse summary than something the agent could write itself. Second, the per-turn unit is fine for review but coarse for blame: if one prompt produced five logically distinct edits, you cannot bisect inside the turn. Third, the in-VM browser check helps a lot for visible UI regressions but does nothing for logic that has no UI surface (a worker that runs on a schedule, an API endpoint with no caller yet). For those you still owe the diff a real read.

Is the 70/30 split a hard target or just a vibe?

Just a vibe, and that is fine. The actual ratio for any given session depends on how ambitious the prompts are, how good the model is, and how strict the reviewer is. The point of the framing is the shape, not the number: AI does most, human still owns the merge, and the loop only stays sustainable when each handoff is small. If your sessions land closer to 80/20 some days and 50/50 on debugging-heavy days, that is healthy. The thing to avoid is the implicit 100/0 where you stop reading the output entirely.

How is this different from a normal pull request review?

PR review batches a whole feature into one review pass and asks the reviewer to hold the entire mental model at once. The 70/30 turn-loop inverts that: each prompt is its own review unit, accepted or reverted in seconds, and the cumulative effect is a sequence of small approved deltas instead of one big approved bundle. PRs are still useful at the boundary between a vibe-coded prototype and a production codebase, but inside one builder session the small unit is a better fit for the cadence of AI generation.

Want to see the per-turn loop on your own idea?

A 20-minute call. Bring an app idea; we will run it through mk0r live and look at the resulting git log together.
