The review skill bottleneck of vibe coding
The audience for vibe coding is mostly non-coders. The audience for code review is mostly coders. Those are different people. That is the quiet bottleneck nobody names, and it does not go away by working faster.
The review skill bottleneck is the gap between who vibe coding attracts (non-coders, indie builders, product people) and what shipping code actually rewards (somebody who can read the diff and notice what changed). It is a skill mismatch, not a speed mismatch, and it sits on the wrong side of the workflow. You shrink it not by asking people to learn to read code faster, but by changing what the review surface is. In mk0r the review surface is the one-line prompt you typed plus the live preview, not the diff under it.
The premise of vibe coding cuts against its own review step
Vibe coding, as Andrej Karpathy named it in early 2025, is the practice of describing what you want in plain English and accepting the output without reading much of it. The pitch is openly aimed at people who do not write code: kids, founders without a technical co-founder, product managers between roles, anyone with an idea and no willingness to learn TypeScript first. That is the user. That is the person the product surface is designed to delight.
Now look at the standard advice for shipping AI-generated code. Read the diff. Run the tests. Audit before you commit. Werner Vogels has been writing all year about verification debt. The Sonar 2026 State of Code survey put a number on it. Every responsible post on the topic ends with some version of: trust but verify, and verify means read.
Both halves are coherent on their own. Together they describe a workflow whose generation step and review step are aimed at different people. The generation step works for anybody with a sentence. The review step works for somebody who can read a unified diff. If the person at the keyboard is the first kind, the review step quietly becomes a no-op.
“96 percent of developers do not fully trust that AI-generated code is functionally correct, but only 48 percent always verify their AI-assisted code before committing it.”
Sonar, State of Code Developer Survey 2026, surveyed 1,100+ developers
The number is striking because it is for developers, people who can read diffs. For non-coders running a vibe coding tool, the always-verify rate is closer to zero by construction. Source: sonarsource.com press release, January 8, 2026.
Faster does not fix it
The default reflex is to make review cheaper: better diff UI, smaller changes, inline annotations, AI-generated review comments on AI-generated code. All of those help if the reviewer can read code. If the reviewer cannot, none of them help. A non-coder looking at an annotated diff is still a non-coder looking at an annotated diff. The mismatch survives every speed improvement, because the bottleneck is not how long review takes. It is what review is.
That is why this is a skill bottleneck, not a workflow bottleneck. You cannot scale around it from the generation side. You have to move the review surface to something the reviewer can actually parse.
The move: review what the user already wrote
The thing a non-coder can read is the sentence they just typed. They wrote it. They understand it. They can compare it against the screen in front of them. That comparison is a real review unit. It catches a large fraction of bad turns: the AI did the wrong thing, the AI did part of the right thing, the AI did nothing visible at all.
In mk0r the implementation of this is mechanical, not aspirational. Every successful agent turn ends with one git commit inside the sandbox. The commit message is the first line of your prompt, trimmed and sliced to 120 characters. Here is the relevant code, lifted verbatim from src/app/api/chat/route.ts:
const msg = (typeof prompt === "string" ? prompt : "")
  .split("\n")[0]
  .trim()
  .slice(0, 120) || "Agent turn";
const sha = await commitTurn(sessionKey, msg);
if (sha) send({ type: "version", sha, message: msg });

That `message: msg` is what later renders in the Version History panel. Each prior turn shows up as a small card, and the card's primary content is the prompt itself (src/components/version-history.tsx line 119):
<div className="text-sm text-zinc-700 truncate">
  {e.message || "(no message)"}
</div>

The Revert button right next to it calls /api/chat/revert, which checks out the target tree as a new commit (revertToSha at src/core/e2b.ts line 1951). The reviewer never reads code. They read their own English, look at the preview, hit Revert if the two do not agree. That is the whole loop.
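For readers who want the sandbox side of that loop, here is a minimal sketch of what the commit-and-revert pair could look like. It is an assumption-heavy illustration, not the shipped code: the real commitTurn and revertToSha live in src/core/e2b.ts and run against the E2B sandbox, so the Exec helper, the quoting, and the exact git invocations below are stand-ins.

// Sketch only: a generic exec helper standing in for the sandbox's command runner.
type Exec = (cmd: string) => Promise<{ stdout: string; exitCode: number }>;

async function commitTurnSketch(exec: Exec, message: string): Promise<string | null> {
  await exec("git add -A");
  // --allow-empty keeps the turn in history even if the agent changed nothing;
  // JSON.stringify is simplistic shell quoting, good enough for a sketch.
  const commit = await exec(`git commit --allow-empty -m ${JSON.stringify(message)}`);
  if (commit.exitCode !== 0) return null;
  const { stdout } = await exec("git rev-parse HEAD");
  return stdout.trim();
}

async function revertToShaSketch(exec: Exec, sha: string): Promise<string | null> {
  // Restore the target commit's tree into the working directory without moving HEAD,
  // then commit the result, so the revert is itself a new history entry and nothing
  // is rewritten. (Files added after the target turn are not removed in this
  // simplified version.)
  await exec(`git checkout ${sha} -- .`);
  return commitTurnSketch(exec, `Revert to ${sha.slice(0, 7)}`);
}

The design point the sketch preserves is append-only history: a revert never rewrites earlier turns, it adds one more card the reviewer can also revert away from.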
The agent picks up the part the reviewer can't
There is a class of review work that even a careful non-coder cannot do from the preview alone: did the page actually load, did the console throw, did the imports resolve. That work is mechanical and legible to the agent. So it lives there.
The agent boots inside the VM with a CLAUDE.md that hardcodes a five-step browser check it must complete before reporting any UI change as done (src/core/vm-claude-md.ts lines 273-283):
1. Navigate to http://localhost:5173 via Playwright MCP.
2. Take a snapshot to verify the DOM rendered correctly.
3. Check browser_console_messages for runtime errors.
4. If the page is blank, verify the component is imported in App.tsx.
5. Do not report completion until the browser shows the expected result.
Line five is the one that matters. It moves the cheap, mechanical part of review (did it render, did it throw) out of human review time and into agent generation time. What is left for the human is the part that always needed a human anyway: does this match what I actually wanted. That part the non-coder can do.
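The agent performs this check through Playwright MCP tool calls rather than a script, but the same five steps can be sketched as a standalone Playwright run. Everything here beyond the URL is an assumption for illustration: the function name, the networkidle wait, and the expectedText parameter are made up, and step four (is the component imported in App.tsx?) is a file-system check that only gets a comment.

// Illustrative stand-in for the five-step check; not how the agent itself runs it.
import { chromium } from "playwright";

async function verifyUiTurn(expectedText: string): Promise<boolean> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const consoleErrors: string[] = [];

  // Step 3: collect runtime errors as the page runs.
  page.on("console", (msg) => {
    if (msg.type() === "error") consoleErrors.push(msg.text());
  });

  try {
    // Step 1: navigate to the dev server.
    await page.goto("http://localhost:5173", { waitUntil: "networkidle" });

    // Step 2: snapshot the rendered DOM and make sure it is not blank.
    const body = (await page.locator("body").innerText()).trim();
    if (body.length === 0) {
      // Step 4 would happen here: check that the component is imported in App.tsx.
      return false;
    }

    // Step 5: only report success if the expected result is visible and the
    // console stayed clean.
    return body.includes(expectedText) && consoleErrors.length === 0;
  } finally {
    await browser.close();
  }
}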
What this honestly does not solve
The structural workaround buys you the prototype layer. It does not buy you the production layer. Three places where it goes thin, in decreasing order of how often they bite:
- Off-screen logic. A scheduled worker, an API endpoint with no caller yet, a piece of state that never renders. The browser check sees a green page; the broken behavior lives elsewhere. There is no escape from reading that diff if you want to be sure.
- Coarse turns. The unit is one turn, not one change. If a single prompt produced five logically distinct edits, you cannot bisect inside the turn. You can revert the whole thing or accept the whole thing.
- Quietly bad data. The screen renders. The numbers underneath are wrong. The form validates and stores the wrong field. No amount of preview-watching catches this. Eventually you owe somebody who can read code a real pass.
That is the boundary. Inside it, a non-coder gets a real review loop for the first time. Outside it, the original bottleneck is still waiting.
One thing to take away
When the next vibe coding tool you try shows you a session-wide diff and asks if it looks good, notice that you have just been handed a review unit you cannot read. The fix is not to learn to read it. The fix is to insist on a review surface that maps to what you actually know: the sentence you wrote, the screen in front of you, a one-click way back if those two disagree. Anything past that line is real engineering, and it still costs real review skill. Pretending otherwise is how prototypes ship with credentials in localStorage.
What did you build?
mk0r is open at mk0r.com. No account, no signup, no review skill required to start.
Want to walk through this on a call?
Bring an app idea, see the per-turn review loop in action, ask whatever you want about how the sandbox is wired.
Frequently asked questions
What is the review skill bottleneck in vibe coding, in one paragraph?
It is the gap between who vibe coding attracts and what shipping code rewards. The audience is mostly non-coders, indie builders, and product people who want to skip writing JavaScript. Shipping code, in the traditional sense, rewards somebody who can read a diff and notice that the AI quietly stored API keys in localStorage, swallowed an error, or rewrote half the file when the prompt asked for one bug fix. If you cannot read code, you cannot do that review. So the audience that benefits most from generation is the one least equipped to review it. The bottleneck is not speed; it is skill, and it sits on the wrong side of the workflow.
How is this different from the 70/30 review balance or Sonar's verification gap?
Those frames are about volume. The 70/30 split (AI does 70 percent of the writing, human does 30 percent of the reviewing) assumes the human is capable of doing the 30 percent. Sonar's January 2026 State of Code Developer Survey, which produced the often-cited 96 percent and 48 percent numbers, surveyed more than 1,100 working developers, people who do read code. The review skill bottleneck is the prior step: what if the person on the receiving end of the AI output is not a developer at all? Most consumer-grade vibe coding flows quietly fail at this step and call it a feature.
Why does the existing advice not solve this?
Most advice on AI-coding review boils down to: read the diff, run the tests, audit before you ship. That advice is correct and useless if you cannot read the diff. You can hand a non-coder a 200-line unified diff and they will scroll through it the way I scroll through a Stripe receipt. The fix is not to lecture them. The fix is to give them a review surface that maps to something they actually understand. For most non-coders that something is the live screen and the sentence they typed five seconds ago.
How does mk0r change what the reviewer is looking at?
Every successful agent turn writes one git commit inside the sandbox. The commit message is the first line of your prompt, trimmed and sliced to 120 characters (src/app/api/chat/route.ts lines 1112-1116, calling commitTurn at src/core/e2b.ts line 1895). The Version History UI then renders one card per turn, and the card's primary content is that prompt (src/components/version-history.tsx line 119, `<div className="text-sm text-zinc-700 truncate">{e.message || "(no message)"}</div>`). The reviewer reads their own English, not the diff. If the screen does not match the sentence, they hit Revert. The review skill required is reading your own prompt, which everybody has.
Where does the agent do part of the human's job?
Inside the VM, the CLAUDE.md the agent boots with (src/core/vm-claude-md.ts lines 273-283) tells it to do a five-step browser check after any UI change: navigate to http://localhost:5173 via Playwright MCP, take a snapshot to verify the DOM rendered, check the browser console for runtime errors, verify the component import if the page is blank, and do not report completion until the browser shows the expected result. That last line is the load-bearing one. It moves the cheap mechanical part of review (did it render at all, did it throw) from human time into the agent's own turn, so the part left for the human is the only part a non-coder can do anyway: does the screen match what I asked for.
Doesn't this just hide bugs behind a clean screen?
Yes, for a class of bugs. A UI can render fine while the data is wrong, the storage is insecure, or the math is silently rounding the wrong way. If you ship a serious app this way, you will eventually need real code review, the kind that requires reading code. The structural workaround here only buys you the prototype layer honestly: you get a working preview, a per-prompt rollback, and an agent that has at least run the page once. It does not promise production correctness. That is the next layer up, and it still costs review skill.
What about long prompts that get truncated to 120 characters?
They become worse commit messages than something the agent could have written. The slice at src/app/api/chat/route.ts line 1115 (`.slice(0, 120)`) optimizes for the Version History card fitting on a phone screen, not for archive readability. If you write a 600-character paragraph, the card shows the first sentence and forgets the rest. The trade-off is real. The honest framing: the message is good enough to find a turn again from the history, not to reconstruct what you wanted three weeks later.
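To make the trade-off concrete, here is the same transform applied to a made-up long prompt (the prompt text is purely illustrative):

// Illustrative only: what survives the first-line + 120-character cut.
const prompt =
  "Add a pricing page with three tiers, an annual/monthly toggle, testimonials, a competitor comparison table, and a FAQ below the fold.\n" +
  "Also wire every CTA button into the existing signup flow.";

const msg = prompt.split("\n")[0].trim().slice(0, 120) || "Agent turn";
// msg keeps only the first line, cut at 120 characters: the end of that line and
// the entire second line never reach the Version History card.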
Can a non-coder actually catch a bad turn this way?
For visible regressions, yes, and quickly. The Revert button on each Version History card calls revertToSha (src/core/e2b.ts line 1951), which checks out the target tree as a new commit, so a bad turn is one click to roll back. The reviewer's loop is: read the prompt, look at the preview, hit Revert if it went sideways. For invisible regressions (something the new turn broke off-screen, a future feature that quietly stopped working) you still need real review skill or a real test. Honest version of the story: the bottleneck is shrunk for the visible 80 percent and unchanged for the invisible 20.
Is this just generic 'AI does its own QA' wrapping?
No, the load-bearing piece is the unit. Most builders show you a final preview after a multi-turn session and call that the review surface. Here the review surface is per-turn: one card, one prompt, one diff under the hood. The CLAUDE.md browser check is supporting infrastructure, not the headline. The headline is that the review label is the sentence you wrote, so somebody who cannot read JavaScript still has a real review unit.
Where does this break for mk0r specifically?
Three places worth naming. First, the in-VM browser check helps for UI changes but does nothing for backend logic with no UI yet (a worker that runs on a schedule, an endpoint with no caller). Second, the prompt-as-message gets coarse when one prompt produces five logically distinct edits; you cannot bisect inside the turn. Third, anything below the visible surface (data integrity, auth, secret storage) still demands code review, the real kind. The structural workaround buys you the prototype loop, not the production loop.
Keep reading
The 70/30 AI human review loop
Why the math only works if the AI unit of work is small enough for the human side to keep pace.
The vibe coding verification gap
Sonar Jan 2026: 96 percent of developers do not fully trust AI code, only 48 percent always verify it.
Vibe coding human review balance
How review effort scales when the AI is doing most of the writing, and where the balance breaks.