Argument

Why AI debugs better than it generates

Generation is open-ended. Debugging has a target. The same model writes the buggy first draft and fixes the bug on the next turn, because the second turn has something the first did not: an error to point at. The interesting part is what a tool puts in the gap between those two turns.

Matthew Diakonov
7 min
Direct answer (verified 2026-05-04)

AI debugs better than it generates because debugging has a verification target (an error, a stack trace, a blank screen, a failed snapshot) and generation does not. The model is the same on both turns. The difference is that on the debug turn the model knows whether it succeeded. On the generation turn it does not. Production AI coding tools encode this by wrapping every generation inside a closed-loop verification step that reads the result of the edit before declaring the turn done.

The asymmetry is named in Microsoft Research’s debug-gym writeup. The version of the loop wired into mk0r lives in src/core/vm-claude-md.ts lines 278 to 283.

The verification gap, in one sentence

Generation has no error message. There is no signal the model can read that says “the function you just wrote is wrong.” At best it has the user’s prompt, its own training, and an opinion about whether the output looks plausible. Plausible is not a target. So generation is fundamentally a forward search against a fuzzy criterion, and the model has to imagine its way into a working state from zero.

Debugging is the opposite shape. There is a working state minus one observable failure: a thrown exception, a failed assertion, a missing element in the rendered DOM, a console.error in the previous run’s log buffer. The criterion is concrete. The search space collapses from “all possible programs that match this prompt” to “the smallest edit that makes this one error stop happening.” That is a different problem and the model is much better at it.

This is why the same model can ship a buggy first draft and then fix the bug confidently on the next turn. It is also why the gap widens with the model’s size. Bigger models do not just generate better; they make better use of an error message when one is available. So the production lever is not always “use a bigger model.” It is “put more error messages in front of the model you already have.”

How mk0r encodes the asymmetry, in five rules

Every session in mk0r runs Claude Code inside an E2B sandbox. The agent gets a project-level CLAUDE.md at /root/.claude/CLAUDE.md, compiled from src/core/vm-claude-md.ts. Lines 278 to 283 of that file inject five sentences directly into the agent’s instructions, under a heading called ## Browser Testing. Here they are, lifted verbatim:

After UI changes

  1. Navigate to http://localhost:5173 via Playwright MCP
  2. Take a snapshot to verify the DOM rendered correctly
  3. Check browser_console_messages for runtime errors
  4. If the page is blank, verify the component is imported in App.tsx
  5. Do not report completion until the browser shows the expected result

The fifth sentence is the one that changes behavior. Without it, the agent’s exit condition is “I edited a file.” Edited is not done. With it, the exit condition becomes “the browser shows the expected result.” That promotes the verification step from optional to load-bearing. The first four rules tell the agent how to perform the verification. The fifth tells it that skipping the verification means the turn is not over.

Five sentences, fewer than 60 words. They turn one-shot generation into a closed loop because they tell the model where to find the target it does not have at the start of the turn.

The error channels the agent reads

The five rules only work if there is something for browser_console_messages to return. Two pieces of source make sure there is.

At src/core/vm-scripts.ts line 254 the VM declares a rolling logBuffer capped by LOG_BUF_MAX. Line 268 then reassigns console.error to a function that pushes every error string into the buffer before forwarding to the original handler. So every error the previewed app prints, from any source, is still in the buffer when the agent looks at it on the next turn.
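In sketch form, the capture looks something like this. A minimal illustration, not the actual vm-scripts.ts source: logBuffer and LOG_BUF_MAX are the names cited above, while the cap value and everything else are assumed.

```ts
// Minimal sketch of the console.error capture. The real vm-scripts.ts
// differs in detail; only the logBuffer / LOG_BUF_MAX names come from
// the description above.
const LOG_BUF_MAX = 500; // assumed cap, not the real value
const logBuffer: string[] = [];

const originalError = console.error;
console.error = (...args: unknown[]) => {
  // Keep the most recent errors so the agent's next turn can read them
  // via browser_console_messages.
  logBuffer.push(args.map(String).join(" "));
  if (logBuffer.length > LOG_BUF_MAX) logBuffer.shift();
  originalError.apply(console, args); // still surface the error normally
};
```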

At lines 1577 to 1580 in the same file, the iframe bridge listens for Vite’s vite:error event and posts a hmr:error message up to the parent page via window.parent.postMessage. That gives the parent UI an immediate signal about HMR failures, separately from anything the user sees on screen.
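A plausible shape for that bridge, assuming Vite's standard import.meta.hot client API. The hmr:error message name is the one described above; the payload handling is illustrative.

```ts
// Iframe-side sketch: forward Vite's HMR error event to the parent page
// so the UI hears about build failures even when nothing shows on screen.
if (import.meta.hot) {
  import.meta.hot.on("vite:error", (payload) => {
    window.parent.postMessage(
      { type: "hmr:error", message: payload?.err?.message ?? "HMR error" },
      "*"
    );
  });
}
```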

Both channels exist before the agent ever runs a verification step. They are not built when needed; they are sitting there waiting. The five rules in CLAUDE.md just tell the agent the channels are there and which one to read when.

Same generation. Different turn structure.

Without the verification block, the turn is open-loop. The agent edits files in response to the prompt, sees its own diff, and never sees the running app. It reports completion the moment the edit is written. If the edit broke something, the user finds out next, not the agent, and the next turn becomes a debug turn driven by a human-typed complaint, at the cost of one full round trip.

  • Exit condition is 'I wrote the diff'
  • No verification target available to the model
  • Failure is reported by the user on the next turn
  • Costs at least one extra round trip on misses

The five-rule block inverts each of these: the exit condition becomes the browser-confirmed result, the verification target is the snapshot plus the console buffer, and a miss is caught by the agent inside the same turn instead of by the user on the next one.

The blank-page rule, on its own line, on purpose

One of the five rules looks like a footnote and is doing the most work: If the page is blank, verify the component is imported in App.tsx. It exists because blank is the silent failure mode and Vite does not throw for it. The most common shape of a broken vibe-coded app is: the agent created a new component file in src/components/, wrote the JSX, exported it, and forgot to add the import to App.tsx. The bundler is happy. The console is quiet. The page is an empty div.

Without the rule, the agent reads an empty snapshot, looks at the console (clean), and burns a turn searching for a runtime error that does not exist. With the rule, “empty snapshot plus clean console” collapses immediately to “edit App.tsx imports.” HMR fires, the snapshot returns content, the turn ends.
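For concreteness, a hypothetical App.tsx; PricingCard is an invented name. The failure mode is this file without the import line: the component exists on disk, the bundler is happy, the console is quiet, and the page is an empty div.

```tsx
// src/App.tsx (hypothetical example). Deleting the import below reproduces
// the blank-page failure: nothing throws, nothing renders.
import { PricingCard } from "./components/PricingCard";

export default function App() {
  return (
    <main>
      <PricingCard />
    </main>
  );
}
```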

The rule is short on purpose. Models follow short, declarative, imperative-mood instructions placed near the role description more reliably than they follow paragraph-level guidance later in the prompt. So the rule is one sentence, in active voice, with one trigger and one action.


Interactive debugging involves generating actions at each step that trigger feedback from the environment, with this feedback helping the agent make new decisions and requiring dense data like the problem description and the sequence of actions leading to the solution.

debug-gym, Microsoft Research, 2025

The counter-case: when the asymmetry breaks

The asymmetry only holds when debugging has a target. There is one important class of failure where it does not: code that runs cleanly and does the wrong thing. A sort that returns the wrong order. A form that submits to the wrong endpoint. A calculation that is off by one. None of those throw. The snapshot looks reasonable. The console is quiet. The agent reports completion correctly under the five rules and the bug ships.

That is a real limit and worth being honest about. The snapshot rule catches a chunk of it (a button labelled “Continue” instead of “Submit” shows up in the snapshot text), but a logic bug below the visual surface does not. The remediation in our experience is two-layered. Layer one: the rules above keep the agent honest about what it can already verify cheaply. Layer two: the per-turn commit graph (every prompt is its own commit, sliced on fork at src/core/e2b.ts line 1676) makes a logic bug that ships in turn N undoable in turn N+1 with byte-exact precision. Verification handles the loud failures. Undo handles the quiet ones.

The framing “AI debugs better than it generates” is not a claim that AI debugs perfectly. It is a claim about which side of the loop has a target. Where there is a target, the loop works. Where there is not, the tool needs a different affordance, like a clean undo, to handle the residual.

The minimum-viable version for any agent harness

You do not need a sandboxed VM and a browser bridge to capture the asymmetry. The shape transfers. Three pieces, in order, with a code sketch after the list.

  1. Capture stderr. Redirect the runtime error stream into a buffer the agent can read between turns. Any agent framework supports this. Without it, the model has no target on turn two either, and the asymmetry collapses.
  2. Block exit on edit-only. Treat “I wrote the diff” as not-done. Require the agent to run something (the build, the test suite, an endpoint request, a smoke check) before reporting completion. The blocking step is what forces the verification step to happen.
  3. Write the rule in plain English. Embed the verification step as a literal sentence in the agent’s instructions, in imperative mood, near the top of the prompt. “Do not report completion until the test passes” works better than three paragraphs of guidance three layers deep. This sounds soft and it is the lever that actually changes behavior.
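A minimal sketch of the three pieces in TypeScript for Node.js. Every concrete name here (the test command, the agent-turn shape) is an assumption for illustration, not a reference to any particular tool.

```ts
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);
const errorBuffer: string[] = []; // piece 1: the target the next turn reads

async function verify(): Promise<boolean> {
  try {
    await run("npm", ["test", "--silent"]); // piece 2: run something real
    return true;
  } catch (err: any) {
    // Capture stderr so turn two has an error to point at.
    errorBuffer.push(String(err.stderr ?? err.message));
    return false;
  }
}

async function agentTurn(editFiles: () => Promise<void>): Promise<string> {
  await editFiles(); // generation: the open-ended half
  const ok = await verify(); // debugging target: pass/fail plus stderr
  // Piece 3 lives in the prompt, not in code:
  // "Do not report completion until the test passes."
  return ok ? "done" : `not done; last error:\n${errorBuffer.at(-1)}`;
}
```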

That is the whole asymmetry, productized. Generation gets one shot. Debugging gets a loop. The tool’s job is to put the loop in the right place.

Want to see the verification loop in action?

A short call where we walk through the five-rule block, the snapshot return, and the console buffer the agent actually reads between turns.

Frequently asked questions

Why does the same model that wrote the bug fix it on the next turn?

Because the second turn has something the first turn did not: an error to point at. Generation starts from a sentence and has to imagine the working state. Debugging starts from a working state minus one observable failure (a stack trace, a blank screen, a failed snapshot, a console.error) and only has to close that one gap. The model is the same. The substrate is different. On turn one it is guessing. On turn two it has a target.

Is this a property of the model or the harness around it?

Both. The model is genuinely better at constrained tasks than open-ended ones; that holds on every benchmark from SWE-bench to HumanEval. But the gap widens or shrinks depending on whether the harness around the model captures error signals and feeds them back. A pure chat completion has no harness, so you lose half the advantage. An agent that can run the code, read the console, snapshot the DOM, and re-edit the file gets the full asymmetry. The five-rule post-edit block in mk0r (vm-claude-md.ts lines 278-283) is the cheapest version of that harness for a UI app.

What is in those five rules, exactly?

Five lines, copy-paste from the source: navigate to http://localhost:5173 via Playwright MCP, take a snapshot to verify the DOM rendered correctly, check browser_console_messages for runtime errors, if the page is blank verify the component is imported in App.tsx, and do not report completion until the browser shows the expected result. The last sentence is the load-bearing one. It forbids the agent from saying 'done' on the basis of having edited a file. Edited is not done. Browser-confirmed is done.

Why is the 'page is blank, check App.tsx' rule on its own line?

Because blank is the silent failure mode and Vite does not throw for it. When the agent generates a new component and forgets to import it from App.tsx, the bundler is happy, the console is quiet, and the page is just an empty div. With no rule, the agent burns a turn looking for a runtime error that does not exist. The single sentence at vm-claude-md.ts line 282 turns that into a one-shot fix: see blank, edit App.tsx imports, HMR fires, snapshot returns content, done. It is the rule that costs nothing to read and saves the most turns.

How are errors actually captured for the agent to read?

Two channels. The VM startup script at src/core/vm-scripts.ts line 268 reassigns console.error to a function that pushes every error string into a rolling logBuffer (declared at line 254, capped by LOG_BUF_MAX). Independently, the iframe bridge at lines 1577 to 1580 listens for Vite's vite:error event and posts an hmr:error message up to the parent page. When the agent's next turn calls browser_console_messages via Playwright MCP, what it reads is the contents of that buffer. Errors that happened during the previous render are still there.

What about generation failures that do not produce an error?

Two answers. First, those are the cases where the asymmetry breaks down. If the generation looks correct and runs cleanly but does the wrong thing, debugging has no target either, and the agent is back to guessing from words. Second, the snapshot rule (line 280) catches a chunk of those. A DOM snapshot is not just 'did the page boot'. It is what elements are there, with what text and structure. A button that should say Submit and says Continue shows up in the snapshot. A form that should have three fields and has two shows up too. Snapshot diff against intent is a soft target, but it is more of a target than nothing.

Does this mean every generation should be wrapped in a debug loop?

For UI work in a sandboxed app, yes. The cost is one extra round trip to the browser and one Playwright snapshot per turn. The benefit is that 'I edited the file' stops being the agent's exit condition. For pure backend code without a runnable target, the loop has to substitute something else (run the test suite, hit the endpoint, diff the output) but the principle is the same: the agent is not done when it edited, the agent is done when something it can read confirms the edit landed.

Why is HMR_WAIT_MS = 800 the heartbeat number?

Because it is the deadline the parent page gives the iframe to repaint after a code change. At src/components/phone-preview.tsx line 24 the constant is set to 800. The parent posts a refresh nonce, starts a timer, and waits. If the iframe fires hmr:after within 800 milliseconds, the timer is cancelled and HMR survived. If 800 milliseconds pass without an hmr:after, the parent gives up on HMR and hard-reloads the iframe. That number is the system's tolerance for 'I generated and the change did not visibly take.' Anything past it gets treated as broken and recovered. Anything inside it gets treated as fine. The number itself is the verification target for the previous edit.
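As a sketch, the parent-side deadline might look like this. HMR_WAIT_MS, the refresh nonce, and the hmr:after message are the pieces described above; everything else, including the hard-reload fallback, is assumed.

```ts
const HMR_WAIT_MS = 800; // the deadline described above

function waitForHmr(iframe: HTMLIFrameElement, nonce: string): void {
  const timer = window.setTimeout(() => {
    // No hmr:after inside the deadline: treat the change as not taken
    // and hard-reload the preview.
    iframe.contentWindow?.location.reload();
  }, HMR_WAIT_MS);

  const onMessage = (event: MessageEvent) => {
    if (event.data?.type === "hmr:after" && event.data?.nonce === nonce) {
      window.clearTimeout(timer); // HMR survived; keep the iframe as is
      window.removeEventListener("message", onMessage);
    }
  };
  window.addEventListener("message", onMessage);

  // Ask the iframe to confirm the repaint for this specific change.
  iframe.contentWindow?.postMessage({ type: "refresh", nonce }, "*");
}
```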

How many turns does this typically save?

We do not have a public benchmark we can quote. Anecdotally, the most common pattern we see in PostHog session replays is: turn one writes the new component, turn two would have been a manual user complaint ('it does not show up') except the agent caught the missing App.tsx import inside the same turn because the snapshot returned an empty body. So one rule that fits in one line collapses what would have been a two-turn round trip into a single turn. The compounding effect over a long session is large, but the single-instance saving is one whole turn.

Does this work without sandbox-style execution?

Less well. The verification gap shrinks the moment the agent loses the ability to actually run the thing it generated. A pure chat-with-codegen has no execution, so there is no target on turn two either, and you are back to guessing. The reason vibe coding feels different from a chat completion is that the harness exists. mk0r runs every session inside an E2B sandbox, which is the cheapest place to put a runnable preview the agent can poke at. The sandbox is what turns 'AI debugs better than it generates' from a property of the model into a property of the product.

What is the smallest version of this loop a non-sandbox tool can adopt?

Three pieces. First, capture stderr and console errors into a buffer the agent can read between turns (any agent framework supports this). Second, treat 'I edited the file' as not-done and require the agent to run the build or the test before reporting completion. Third, write the verification step as a literal sentence in the agent's project-level instructions, in plain English, in imperative mood. That third piece sounds soft and it is the one that actually changes behavior. Models follow short, declarative rules embedded near the role description more reliably than they follow paragraph-level guidance later in the prompt.
