Current AI app makers are great at first drafts. Iteration is where most fall apart.
The first prompt in any modern builder lands beautifully. The honest evaluation is the fifth prompt, and the tenth. Here is the structural test that separates the builders that actually iterate from the ones that re-roll, plus what mk0r's source says about its own answer.
First-draft generation is mostly solved across the field. Iteration is where most current AI app makers fall apart. The dividing line is whether the builder edits files on disk with byte-exact recovery (good iteration) or re-runs prompts and hopes the bytes line up (lossy iteration). Many tools that look great on the first prompt are quietly the second.
Verified by reading the public route tables of major AI app makers and the appmaker source at src/core/e2b.ts (lines 1759, 1800, 1815, 1855) and src/app/api/chat (route.ts plus seven sibling endpoints, no /regenerate).
Why the first draft is misleading
The first prompt in any modern AI app maker has a tailwind. The system has a clean slate, an empty repository, fresh context, and exactly one chance to be impressive. The model wasn't handed any prior state to preserve. It wasn't asked to leave anything alone. It was asked to write the whole thing. Of course it lands. That demo, the one in the launch tweet, was always going to look great.
What you don't see in the demo is the fifth follow-up. By prompt five the model is being asked to keep the rest of the app stable while changing one specific thing. The cost of getting it wrong is not just the bad change, it's losing the four previous turns' worth of work. That's a different problem from generating from scratch, and it's the one most builders haven't fully solved.
The iteration evaluation has to look at what happens after that first impressive turn. Specifically: can you back up cleanly? Does undo produce the actual previous bytes, or just something close? When you change one thing, does it change exactly that thing, or does it quietly restyle three others? Does the agent remember what it built two turns ago, or is it re-reading everything from scratch every time?
The structural test: edit-on-disk vs re-roll
There's one architectural choice that determines whether a builder is good at iteration. Where does the source of truth live? On disk, as a real working tree the agent edits in place? Or in a chain of prompts the system re-runs to reproduce the current state?
Two ways an AI app maker can be wrong on prompt 5
Each turn, the system replays the prompt list against the model. To go back, it asks the model to produce a version close to the prior turn. To change one thing, it appends a new prompt to the list and lets the model regenerate. Tiny phrasing differences in the prompt cascade into different file structures, and 'undo' is approximate. A short sketch after the list below makes the replay loop concrete.
- Undo is reproduce-from-prompts, not byte-exact
- Cross-turn drift compounds: the model 'forgets' details from earlier turns
- A 'small' change can quietly restyle three things you didn't mention
- /regenerate is a first-class route because it has to be
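To make the replay model concrete, here is a minimal sketch of that loop. None of this is any particular product's code; generateApp, ReplaySession, and the helper names are hypothetical stand-ins, and the only point is that both "change one thing" and "undo" route through full regeneration from the prompt list.

type FileTree = Map<string, string>;                      // path -> file contents
type Generate = (prompts: string[]) => Promise<FileTree>;

interface ReplaySession {
  prompts: string[];                                      // the source of truth is this list, not the files
}

// "Change one thing": append a prompt and regenerate the whole app.
async function changeOneThing(s: ReplaySession, prompt: string, generateApp: Generate): Promise<FileTree> {
  s.prompts.push(prompt);
  return generateApp(s.prompts);                          // every prior file gets re-rolled too
}

// "Undo": regenerate from one fewer prompt and hope the bytes line up.
async function undo(s: ReplaySession, generateApp: Generate): Promise<FileTree> {
  s.prompts.pop();
  return generateApp(s.prompts);                          // close to the prior turn, not byte-exact
}

Everything downstream of generateApp is non-deterministic, which is why the four failure modes above fall out of the structure itself rather than from any one model's sloppiness.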
The route-table test you can run in 30 seconds
The fastest way to grade a builder on iteration is to read its public API. Open the chat or session endpoints. The presence of certain routes tells you which model the builder is operating on. The absence of others is louder.
What the chat API of an AI app maker looks like, by iteration model

Re-roll (the prompt-replay model):

src/app/api/chat/
  route.ts      # the only edit path (one prompt, one turn)
  regenerate/   # rebuild from the prompt list
  restart/      # throw away history, start over
  reroll/       # one more shot at the first draft
  # undo? redo? revert? history? not in this directory.
  # the source of truth is the chain of prompts, not the bytes on disk.

Edit-on-disk (mk0r today):

src/app/api/chat/
  route.ts      # one prompt, one turn, one commit
  undo/
  redo/
  revert/
  history/
  cancel/
  mode/
  model/
  # no regenerate, restart, or reroll in this directory.
  # the source of truth is the bytes on disk, not the chain of prompts.

The second listing is the actual route layout of mk0r's chat API today, in src/app/api/chat. There is no /regenerate, /restart, or /reroll. There are seven sibling endpoints, and four of them (undo, redo, revert, history) exist purely for moving around in time. The product was rebuilt away from one-shot, and the route table records the decision.
When you evaluate any other AI app maker, do the same thing. Open the docs, find the API surface, and check what's present and what's conspicuously missing. A builder that calls itself good at iteration but only ships /generate and /regenerate is grading itself on a softer rubric than this one.
“Two turns was way too tight, users were getting blocked before they finished their first prompt cycle.”
src/app/api/chat/route.ts:21-22 (the comment that justifies ANON_TURN_LIMIT = 6)
What "good at iteration" actually requires under the hood
In mk0r, the iteration loop has four parts, each written down by file path. Per-turn git commits live in commitTurn at src/core/e2b.ts line 1759: every successful turn runs 'git add -A' and 'git commit -q -m <your prompt's first 120 chars>' inside the VM, and pushes the resulting SHA onto a session-scoped history stack.
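A minimal sketch of what that step can look like. The function name, the git commands, and the 120-character message cap are from the description above; the exec helper, the parameter shapes, and everything else here are stand-ins rather than the real e2b.ts.

async function commitTurn(
  vm: { exec(cmd: string): Promise<{ stdout: string }> },   // shell inside the sandbox VM
  session: { historyStack: string[] },                       // session-scoped history stack
  prompt: string
): Promise<string> {
  const message = prompt.slice(0, 120);                      // first 120 chars of your prompt
  await vm.exec('git add -A');                               // stage everything the agent touched
  await vm.exec(`git commit -q -m ${JSON.stringify(message)}`);
  const sha = (await vm.exec('git rev-parse HEAD')).stdout.trim();
  session.historyStack.push(sha);                            // stack bookkeeping simplified; fork-on-undo is sketched below
  return sha;
}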
The undo path is revertToSha at line 1815. It runs 'git checkout <previous-sha> -- .', stages the result, and creates a fresh commit with the message 'Undo to <short-sha>'. The undo is itself a commit, which is why undo of undo, then a redo, then a different forward step all behave exactly like a real editor would.
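The same kind of sketch for the undo path. The git commands and the 'Undo to <short-sha>' message are from the description above; the helper shape is again a stand-in.

async function revertToSha(
  vm: { exec(cmd: string): Promise<{ stdout: string }> },
  targetSha: string
): Promise<string> {
  const short = targetSha.slice(0, 7);
  await vm.exec(`git checkout ${targetSha} -- .`);           // restore the prior turn's bytes into the working tree
  await vm.exec('git add -A');                               // stage the restored state
  await vm.exec(`git commit -q -m "Undo to ${short}"`);      // the undo is itself a new commit
  return (await vm.exec('git rev-parse HEAD')).stdout.trim();
}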
History forks on line 1800. The slice 'historyStack.slice(0, activeIndex + 1)' drops everything past the active pointer when a new turn lands mid-history. The future you abandoned is gone, and the new turn becomes the head. Without that one line, 'undo and try again' would either fail (because the future leaks back in) or keep both branches alive with no clear story for which is current.
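A sketch of the stack bookkeeping around that slice, with hypothetical helper names (undoOneTurn, recordTurn); only the slice itself and the active-index idea come from the source.

interface TurnHistory { historyStack: string[]; activeIndex: number; }

// Undo only moves the pointer; the bytes are restored by the revert commit above.
function undoOneTurn(h: TurnHistory): string {
  h.activeIndex = Math.max(0, h.activeIndex - 1);
  return h.historyStack[h.activeIndex];                      // the SHA to restore to
}

// A new turn landing mid-history drops the abandoned future: the one-line slice.
function recordTurn(h: TurnHistory, newSha: string): void {
  h.historyStack = h.historyStack.slice(0, h.activeIndex + 1);
  h.historyStack.push(newSha);
  h.activeIndex = h.historyStack.length - 1;                 // the new turn becomes the head
}

The slice in recordTurn is what keeps the stack a single straight line after you back up and go a different way.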
The agent context is the fourth piece. mk0r runs against a single long-lived ACP session that's created once on prewarm and reused for every prompt for the rest of the sandbox's life. The agent's conversation buffer plus the working directory it's been editing are what state looks like across turns, so it doesn't re-read the codebase from scratch on every prompt unless it chooses to.
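A sketch of that shape, with placeholder names (AcpSession, connect, runTurn) rather than the real ACP client API.

interface AcpSession { prompt(text: string): Promise<void>; }

class SandboxAgent {
  private session: AcpSession | null = null;

  constructor(private readonly connect: () => Promise<AcpSession>) {}

  // created once, when the sandbox is prewarmed
  async prewarm(): Promise<void> {
    this.session = await this.connect();
  }

  // every later turn reuses the same session, so the agent keeps its
  // conversation buffer and the working tree it has been editing
  async runTurn(text: string): Promise<void> {
    if (!this.session) this.session = await this.connect();
    await this.session.prompt(text);
  }
}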
The honest counterpoint
The architecture above is necessary but not sufficient. A builder can have all four pieces in place and still be mediocre at iteration if the agent itself is bad at writing surgical diffs, or if the prompt budget is tight enough that you run out before the loop closes. mk0r raised the anonymous trial from 2 turns to 6 specifically because telemetry showed the loop wasn't closing at 2. Your mileage on any builder still depends on the underlying model's ability to read the current state and edit it precisely. That part is hard, and not every builder gets it right even when the substrate is right.
The other counterpoint: for genuinely throwaway prototypes, the whole iteration story is overkill. If you want a tip calculator and you're going to ship one screen and never touch it again, the Quick Haiku mode in mk0r exists for that case, and so do most competing one-shot HTML builders. The grading scale changes only when you keep coming back to the same project past the first session. That's when iteration becomes the actual product.
“The thing nobody tells you about AI app makers is that the first prompt is the easy demo. What broke me was prompt 7, when the agent quietly restyled the navigation while changing a single button color. I needed to go back, exactly, and I couldn't.”
The five-minute test you can run on any builder
- Generate a small app from one sentence. Save the result.
- Submit four follow-up prompts that each change one thing: rename a button, change one color, add one field, remove one screen.
- After each turn, ask the builder if it can show you a diff against the previous turn. A real diff with file paths and line numbers means the source of truth is on disk, and iteration will scale.
- Try to undo to turn 2 and then submit a different turn 3. If the abandoned future cleanly disappears and the new turn 3 lands on top, the builder has thought about fork-on-undo. If the old future leaks back in, it hasn't. (A small sketch of the diff and undo checks follows this list.)
- Look at the response time on turns 2 through 5. If they're all roughly the same as turn 1, the agent is re-reading the project from scratch every prompt. If turns 2 onward are noticeably faster than turn 1, the agent has a persistent session and is editing on top of state.
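If the builder exposes the project as a real git repository, or lets you export one, the diff check and the byte-exact undo check above reduce to two git calls. A minimal sketch assuming a Node environment and a local checkout of the generated project; the turn SHA is whatever the builder's history surface reports.

import { execSync } from 'node:child_process';

// Show what the latest turn actually changed relative to the turn before it.
function diffAgainstPreviousTurn(projectDir: string): string {
  return execSync('git diff HEAD~1 HEAD', { cwd: projectDir }).toString();
}

// After undoing back to a given turn, an empty diff against that turn's commit
// means the restore was byte-exact rather than approximately regenerated.
function undoIsByteExact(projectDir: string, turnSha: string): boolean {
  const out = execSync(`git diff ${turnSha} -- .`, { cwd: projectDir }).toString();
  return out.trim().length === 0;
}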
You can run this on mk0r right now. No account required. The bytes you generate are real, the diffs are real, and the undo is byte-exact.
Want to see the iteration loop in action on a real app idea?
Bring a prototype you keep getting stuck on past the first draft. We'll run the route-table test on whatever tool you're using now, then walk the same idea through the mk0r loop.
Frequently asked questions
Are current AI app makers actually bad at iteration, or is that overblown?
First-draft generation in 2026 is genuinely impressive across the field. You type a sentence, you get a working mobile-shaped app in under 30 seconds. That bar is now table stakes. Iteration is the part that did not catch up at the same speed. The honest read is that most builders are fine for the first 1 to 3 turns, then start drifting around turn 4 or 5: a button you didn't mention gets restyled, a feature you added two prompts ago quietly disappears, an undo brings you back to something close to the prior version but not exactly it. Whether you call that 'mid' or 'fine' depends on how long your iteration loop runs before you ship. For a one-screen weekend prototype, fine. For anything you keep tweaking past the first session, it shows.
What is the actual mechanism that separates good iteration from bad iteration?
Whether the builder treats your app as a stable artifact on disk that the model edits, or as a prompt history the model re-runs each turn. In the first model, the source of truth is the bytes on disk after the last accepted turn. The model reads the current state, writes a diff, the diff lands, you keep going. Undo means walking the bytes backward, exactly. In the second model, the source of truth is the chain of prompts. To go back, the system has to ask the model to produce a version close to where you were. Tiny phrasing differences cascade into different file structures, and there is no way to be byte-exact. If the second-model builder calls itself 'good at iteration,' it is grading on a softer rubric than the first.
How does mk0r handle iteration past the first prompt?
Every successful agent turn ends by running 'git add -A' and 'git commit -q -m <your prompt's first line>' inside the sandbox VM, then pushes the resulting SHA onto a per-session history stack. The function is commitTurn at src/core/e2b.ts line 1759. Undo is revertToSha at line 1815, which runs 'git checkout <previous-sha> -- .', stages the result, and creates a brand new commit on top. The undo is itself a commit, so undo of undo also works, and the original timeline is preserved in 'git log' regardless of which way you walked it. The history forks at line 1800: if you submit a new prompt mid-history, the slice 'historyStack.slice(0, activeIndex + 1)' drops the abandoned future. That's the small detail that lets you actually try a different idea after backing up.
What does the chat API directory look like, and why is that the test?
Open src/app/api/chat in the appmaker repo. The folder contains route.ts (the main one-prompt-one-turn-one-commit endpoint) and seven subfolders: /undo, /redo, /revert, /history, /cancel, /mode, /model. There is no /regenerate, /generate, /restart, or /reroll. Once you have committed a turn, the cheapest path the API offers is to iterate on top of it. There is no one-click 'throw it all away and start fresh' route because the product was rebuilt away from that pattern. The route table is the test because it's the public surface area: a builder that ships /regenerate as a first-class endpoint is telling you, structurally, what its iteration story is.
Why was the anonymous trial bumped from 2 turns to 6?
Because telemetry showed users at 2 turns were getting blocked before they finished their first prompt cycle. The constant lives at src/app/api/chat/route.ts line 23 as ANON_TURN_LIMIT = 6, with the comment on lines 21-22: 'Two turns was way too tight, users were getting blocked before they finished their first prompt cycle.' Six is the floor for a useful evaluation: one prompt to seed, two or three to refine, one to undo a misstep, one to retry. Below 6, you're effectively grading the first draft alone, which for any non-trivial app is the wrong sample size. This is also a small piece of evidence that the team takes the iteration loop seriously; the first version was a one-shot trial, and it failed the actual users.
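In code, that corner of route.ts plausibly reduces to something like the following; the constant name, its value, and the comment are quoted from the source, while the guard around it is an assumption added for illustration.

// Two turns was way too tight, users were getting blocked before they
// finished their first prompt cycle.
const ANON_TURN_LIMIT = 6;

// Hypothetical guard: reject the turn once an anonymous session hits the limit.
function assertTurnAllowed(anonTurnsUsed: number, isSignedIn: boolean): void {
  if (!isSignedIn && anonTurnsUsed >= ANON_TURN_LIMIT) {
    throw new Error('Anonymous turn limit reached; sign in to keep iterating.');
  }
}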
Are some current AI app makers genuinely good at iteration?
Some are getting closer. Tools that keep a real filesystem under the hood and have invested in surface area for going backward and forward in time are in the better camp. Lovable's branching version history is the closest cousin in spirit to what mk0r does. Replit Agent's multi-step planning before code generation reduces the 'it changed three things I didn't ask it to change' problem on a single turn, even though it doesn't fully solve cross-turn drift. The thing many builders still keep vague, that mk0r writes down by file path, is the mechanism: per-turn git commits, an explicit active-index pointer into a history stack, fork-on-undo via a one-line slice, and a single long-lived agent session that holds context across turns. If a builder can't tell you how those four pieces work, it's not actually good at iteration, it's good at the first draft.
What about a really long, detailed first prompt? Is that a one-shot path that wins?
Sometimes. A 400-word first prompt is closer to a tightly scoped iteration than a true one-shot, because you're doing the iteration in your head before sending. The catch is that long prompts hit the same drift problem as long context: the model loses track of details from the front of the message by the time it reaches the back. In practice, a 200-word prompt followed by three 20-word follow-ups outperforms a single 400-word prompt, because each follow-up gets the model's full attention on a small change. This is also why a builder that's only good at the first draft can feel impressive for a few minutes and then disappoint: the demo prompt was always going to land. The fifth follow-up wasn't.
What's the fastest way to find out if a given AI app maker is good at iteration?
Run this five-minute test. (1) Generate a small app from one sentence. (2) Submit four follow-up prompts that each change one thing: rename a button, change one color, add one field, remove one screen. (3) After each turn, ask the builder if it can show you a diff against the previous turn. If it can show a real diff with file paths and line numbers, the source of truth is on disk; iteration will scale. If it can only re-show the prompt history, or if 'undo' produces something that looks similar but not identical, the source of truth is the prompts, and iteration is going to drift the longer you stay in the session. (4) Try to undo to turn 2 and then submit a different turn 3. If the future you abandoned cleanly disappears and the new turn 3 lands cleanly on top, the builder has thought about fork-on-undo. If the abandoned future leaks back in, it hasn't.
Does any of this matter for a one-screen disposable prototype?
Not really. If the app you want is small enough to describe in one sentence and you do not care if v2 diverges from v1, every modern builder will do fine. The Quick Haiku mode in mk0r exists for that exact case, fast and disposable. The iteration story matters when the app has multiple screens, when you have a brand voice that has to stay consistent across turns, when state has to survive a refresh, or when you're going to come back tomorrow and keep working. That's where 'good at first drafts' starts to feel like 'mid at the actual job.'
More on the part of AI app making everyone underestimates
Keep going
AI App Builder Iteration Past The First Generation: How It Actually Works
The mechanism in detail: per-turn git commits, fork-on-undo via a one-line slice, persistent agent session, the file paths.
Vibe Coding: Iteration Over One-Shot (Why The API Has No /regenerate)
The position, defended structurally. Six-turn anonymous trial, no regenerate route, what the source says.
The Vibe Coding Iteration Wall: Four Failure Modes
Regenerate-not-edit, no durable state, approximate undo, stale context. Where AI app makers actually break.