Guide

The vibe coding verification gap, and one concrete way to close it

Sonar put a number on the gap in January 2026: 96% of developers do not fully trust AI-generated code, but only 48% always verify it before commit. Everybody quotes the survey. Almost nobody points at a concrete mechanism that closes the gap for one specific layer. Here is the one mk0r ships.

Matthew Diakonov
6 min

Direct answer (verified 2026-05-11)

The vibe coding verification gap is the distance between how much code AI is generating and how much of it humans are actually checking. The phrase comes from Sonar's 2026 State of Code Developer Survey, published January 8, 2026, polling more than 1,100 developers globally.

The headline numbers: 96% of developers do not fully trust that AI-generated code is functionally correct, but only 48% state they always check their AI-assisted code before committing it. Werner Vogels calls the same phenomenon "verification debt." That is the gap.

Source: sonarsource.com press release, 2026-01-08.

AI has created a critical trust gap between output and deployment.
Tariq Shaukat
CEO, Sonar (announcing the 2026 State of Code Developer Survey, January 8, 2026)

Why this gap is hard to close from outside the loop

Most advice for the verification gap lives outside the generation loop. Run a linter. Add a SAST pass. Stand up a security review before merge. All of these help, but they share the same shape: a reviewer or a checker arrives after the model has finished, with only the final artifact in hand. By that point the model has lost the context that produced the change, and the diff has bundled up two or three intents into one blob. The cheap moment to verify, the moment the model still knows what it just tried to do, has already passed.

The other shape that helps is making the model itself the first verifier. Not in a vague "think step by step" sense, but in the literal sense: give it a tool that runs the code, a checklist to follow, and a rule that says it cannot declare the turn done until the checklist passes. The cheapest possible reviewer is the one already holding the prompt.
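Stripped of any one product's details, that shape is easy to state in code. A minimal sketch in TypeScript, with every name hypothetical (this is the shape of the idea, not mk0r's implementation):

  // Hypothetical sketch: "done" is gated on a checklist the model must pass.
  // Runner stands in for whatever invokes the model; nothing here is a real
  // mk0r or SDK API.
  type Runner = (prompt: string) => Promise<string>;
  type Check = { name: string; run: () => Promise<boolean> };

  async function verifiedTurn(run: Runner, prompt: string, checks: Check[]): Promise<string> {
    for (let attempt = 0; attempt < 3; attempt++) {
      const result = await run(prompt);
      const failed: string[] = [];
      for (const check of checks) {
        if (!(await check.run())) failed.push(check.name);
      }
      if (failed.length === 0) return result; // only now may the turn end
      // Feed failures back so the model verifies its own work on the next pass.
      prompt = `These checks failed: ${failed.join(", ")}. Fix and re-verify.`;
    }
    throw new Error("turn never passed its own checklist");
  }

The loop is the whole trick: the checker runs inside the generation loop, while the model still holds the context, not after it.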

The five lines that try to close the gap for one product

In mk0r, the agent runs inside an E2B sandbox with a real Chromium on Chrome DevTools Protocol port 9222, wired to Playwright MCP. The rule that turns that capability into a habit is hardcoded into the agent's CLAUDE.md, the global instruction file the model reads at the start of every session.
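The capability half of that sentence is ordinary plumbing. mk0r's version lives in buildMcpServersConfig in src/core/e2b.ts; a sketch of the general shape, assuming the public @playwright/mcp package and its --cdp-endpoint flag (assumptions, not confirmed details of mk0r's build):

  // Sketch of an MCP servers config pointing Playwright MCP at a Chromium
  // that is already listening on CDP port 9222. Assumes the public
  // @playwright/mcp CLI; not mk0r's exact code.
  const mcpServers = {
    playwright: {
      command: "npx",
      args: ["@playwright/mcp", "--cdp-endpoint", "http://localhost:9222"],
    },
  };

Any MCP-speaking agent handed a config like this gets browser tools of the kind the protocol below relies on, browser_snapshot and browser_console_messages among them.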

The section spans src/core/vm-claude-md.ts lines 273-283. The header is "Browser Testing" and the body is five short steps, reproduced verbatim below:

From src/core/vm-claude-md.ts (lines 279-283)

  • Navigate to http://localhost:5173 via Playwright MCP.
  • Take a browser_snapshot to verify the DOM rendered.
  • Check browser_console_messages for runtime errors.
  • If the page is blank, verify the component is imported in App.tsx.
  • Do not report completion until the browser shows the expected result.

Step 5 is the closer. The previous four steps are a procedure; step 5 is a precondition on the word "done." The agent cannot finish a turn until the snapshot agrees with the prompt.

Line 283

Do not report completion until the browser shows the expected result.

src/core/vm-claude-md.ts line 283 (the in-VM agent's CLAUDE.md)
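For readers who want to reproduce the check by hand, outside the agent loop, plain Playwright gets close. A standalone sketch, assuming Chromium is exposed on CDP port 9222 and Vite is serving on 5173, and approximating the MCP browser_snapshot tool with an innerText read:

  import { chromium } from "playwright";

  // Manual approximation of the five-step check.
  async function browserCheck(url = "http://localhost:5173"): Promise<void> {
    const browser = await chromium.connectOverCDP("http://localhost:9222");
    const context = browser.contexts()[0] ?? (await browser.newContext());
    const page = await context.newPage();

    // Step 3's data source: collect console errors as they happen.
    const consoleErrors: string[] = [];
    page.on("console", (msg) => {
      if (msg.type() === "error") consoleErrors.push(msg.text());
    });

    await page.goto(url);                      // step 1: navigate
    await page.waitForLoadState("networkidle");

    const body = await page.innerText("body"); // step 2: did the DOM render?
    if (body.trim().length === 0) {
      // step 4's hint: a blank page usually means App.tsx never imported it
      throw new Error("blank page: check the import in App.tsx");
    }
    if (consoleErrors.length > 0) {            // step 3: runtime errors
      throw new Error(`console errors: ${consoleErrors.join("; ")}`);
    }
    await browser.close();
    // Step 5 belongs to the caller: do not say "done" unless this resolved.
  }

The agent's version differs only in who runs it and when: same browser, same port, same pass-or-keep-working contract.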

What the protocol catches, and what it misses

A snapshot plus a console-message check is not the same thing as test coverage. It catches a specific, very common class of bug: code that parses fine, type-checks fine, even runs without throwing, but does not produce the result the user asked for. The page is blank because a component never got imported into App.tsx. The click handler is wired to the wrong state. The fetch goes to a path that does not exist. The snapshot is loud about the blank page, the console is loud about the broken fetch, and the dead click handler surfaces the moment the agent exercises the page it just rendered.

It does not catch the classes of bug that vibe-coding security audits keep finding. A SQL injection vector renders fine. A missing auth check produces no console error. A race condition under load will not show up in one navigation. The pages on this site about security pitfalls and missing API auth are the honest counterweight to this one. The browser check is necessary for closing the everyday rendered-result gap; it is not sufficient for closing the security and concurrency gaps. Those need a different layer.

The other limit worth naming: an instruction-file rule is a strong nudge, not a hard gate. The model can skip a step when it is confident, and the rule does not enforce itself. What makes it stick in practice is that the same VM is wired so the agent has the tool the rule asks for, in the same loop, with no extra permissions. Friction is what kills checklists. There is none here.

Why the protocol lives in CLAUDE.md and not in the system prompt

The actual system prompt mk0r sends to the agent is one short paragraph at the top of src/core/e2b.ts (the DEFAULT_APP_BUILDER_SYSTEM_PROMPT constant). It establishes identity, project layout, and the fact that Playwright MCP is available. Everything else, including the five-step browser check, lives in CLAUDE.md files inside the VM filesystem.

That separation matters. The system prompt is a contract between the platform and the model. CLAUDE.md is a contract between the user's context (their project, their preferences, their memory) and the model. Putting verification rules in CLAUDE.md means a project owner can edit them per-project, audit them as code, and treat them like any other repo file. The closer for the verification gap is not magic; it is a file you can read and change.
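Concretely: a project that serves its dev build on a different port can rewrite step 1 in its own CLAUDE.md, and the agent's verification target moves with it. A hypothetical override (illustrative, not shipped text):

  • Navigate to http://localhost:4321 via Playwright MCP.

The other four steps stay as they are, and the precondition on "done" travels with the file.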

Want me to walk through your vibe-coded prototype with you?

Twenty minutes, your project, an honest look at what your verification surface is catching and what is leaking through. I will show you the same line of CLAUDE.md while we do it.

Frequently asked questions

What is the vibe coding verification gap, in plain words?

It is the gap between how much code AI is producing and how much of it is actually being checked. Sonar's January 2026 State of Code survey of more than 1,100 developers found that 96% do not fully trust that AI-generated code is functionally correct, but only 48% always verify their AI-assisted code before committing it. Werner Vogels framed the same idea as 'verification debt': the surge in output has moved the bottleneck from writing code to reviewing it. The label 'verification gap' is the everyday name for that distance.

Where did the term 'verification gap' come from?

It was named publicly in a January 8, 2026 Sonar press release reporting the 2026 State of Code Developer Survey. The CEO line that travelled the most was Tariq Shaukat saying 'AI has created a critical trust gap between output and deployment.' The same release tied the framing to Werner Vogels and the idea of verification debt. From there it spread through dev-news roundups and security-audit coverage of Lovable, Replit, and other vibe-coding platforms.

Why is the gap bigger for vibe coding specifically than for general AI assistance?

Two reasons. First, vibe coding skews to people who do not read the generated code at all; the practice, as Karpathy described it, is to describe and accept. Second, the typical builder UI shows you a final preview, not a series of small diffs, so even a willing reviewer cannot cheaply spot what changed. The gap is not just 'people are lazy'; it is also 'the tools do not give you a small enough unit to verify'. mk0r's per-turn git commits address the second half; the in-VM browser check addresses the first.

What does the mk0r agent actually do to verify a turn before saying it is done?

There is a 'Browser Testing' section in the agent's CLAUDE.md at src/core/vm-claude-md.ts lines 273-283. It tells the in-VM Claude Code agent that after any UI change it must navigate to http://localhost:5173 via Playwright MCP, take a snapshot, check browser_console_messages for runtime errors, confirm the component is imported in App.tsx if the page is blank, and 'not report completion until the browser shows the expected result.' The Playwright MCP server is wired up in src/core/e2b.ts (the buildMcpServersConfig function) against a real Chromium running on CDP port 9222 inside the sandbox. So the agent has both the instruction and the tool to run it.

Is a five-step CLAUDE.md instruction actually enough to close the gap?

Honestly, no, not on its own. An instruction-file rule is a strong nudge, not a hard gate. The model can skip a step under pressure, and a snapshot only catches what the snapshot covers (rendered DOM, console errors, blank-page failures). What it does close is the most common class of vibe-coding bug: the code looks plausible, compiles, and ships, but the actual rendered page is broken or the click handler never fires. The harder classes of bug (missing auth checks, race conditions, security holes) are not closed by this. Those still need a human reviewer or a real test suite.

Why not just generate tests, the way Sonar's coverage suggests?

Auto-generated tests help, and mk0r encourages the agent to write Playwright specs alongside features. But tests answer 'does this behave as the model imagined it should', which is the wrong frame for the verification gap. The gap is about 'does this behave as the user actually wanted'. The browser-check protocol bridges that because the model has to look at the rendered result, in a real Chromium, with the user's prompt in mind, before it can claim done. The two layers stack: the protocol catches the silly stuff; generated tests catch regressions on the next turn.
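For concreteness, the regression layer looks like any other Playwright spec. An illustrative example (a hypothetical todo app, not generated output):

  import { test, expect } from "@playwright/test";

  // Illustrative spec of the kind the agent is encouraged to write.
  test("add-todo flow survives the next turn", async ({ page }) => {
    await page.goto("http://localhost:5173");
    await page.getByPlaceholder("New todo").fill("buy milk");
    await page.getByRole("button", { name: "Add" }).click();
    await expect(page.getByText("buy milk")).toBeVisible();
  });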

How does the gap show up in production when nobody closes it?

Two visible patterns. One: the security-audit shape, where a recent audit of public Lovable apps found a meaningful slice with critical vulnerabilities (SQL injection, path traversal, privilege escalation). Two: the silent-broken-feature shape, where the demo runs but the button does nothing, or the form posts to a nonexistent endpoint, and the user only finds out after sharing the link. The first is what security teams worry about; the second is what indie hackers actually hit on weekend projects. The browser-check protocol mostly helps with the second.

What if I don't trust the agent to self-verify, can I see what it saw?

Yes. mk0r streams a live JPEG screencast of the in-VM browser to the chat panel at roughly 15 frames per second, so the agent's navigation and clicks are visible while they happen. Every successful turn also commits the resulting tree to a real per-turn git history (src/app/api/chat/route.ts line 1008, src/core/e2b.ts line 1773), so a 'git show' of any SHA tells you what changed for that prompt. You can either watch in real time, or read the diff after the fact. The reviewer-side gap closes from both ends.
