AI app builder vs screenshot to code: one is a transcriber, one is a constructor.
The two get listed side by side in nearly every roundup, as if they were doing the same job with a different first input. They are not. Screenshot-to-code is a single vision-model inference that ends the moment the HTML is on screen. An AI app builder is an open chat against a running sandbox. mk0r is unusual because it accepts a screenshot as one input among many in that chat.
Direct answer, verified 2026-05-04
Screenshot-to-code converts one image to a static HTML file in a single inference, with no follow-up turn and no runtime. An AI app builder is a conversational session against a running dev server that wires up state, data, and a public URL. The canonical screenshot-to-code reference is github.com/abi/screenshot-to-code. mk0r is in the second category but accepts screenshots as input (see src/app/api/chat/route.ts lines 268-340 in the repo).
What screenshot-to-code is doing under the hood
The shape of these tools is small and easy to read. The reference implementation, the one most others copied, is abi/screenshot-to-code. The frontend is React + Vite, the backend is FastAPI, and the interesting line is the call to a vision-capable model with the uploaded image in the prompt. The model returns code. The code lands in a panel. You can pick HTML/Tailwind, React, Vue, or Bootstrap as the output flavor. That is the whole loop.
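To make the one-inference shape concrete, here is a minimal TypeScript sketch of that loop. The actual abi/screenshot-to-code backend is Python/FastAPI; this version assumes the OpenAI Node SDK, so treat the model name, prompt text, and function name as illustrative.

```ts
// Minimal sketch of a one-shot screenshot-to-code call. Illustrative only:
// the real abi/screenshot-to-code backend is Python/FastAPI.
import OpenAI from "openai";
import { readFileSync } from "node:fs";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function screenshotToCode(imagePath: string): Promise<string> {
  const b64 = readFileSync(imagePath).toString("base64");
  const res = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Return a single HTML file, styled with Tailwind, that reproduces this screenshot." },
          { type: "image_url", image_url: { url: `data:image/png;base64,${b64}` } },
        ],
      },
    ],
  });
  // One inference, one string of markup. The image is not retained anywhere.
  return res.choices[0].message.content ?? "";
}
```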
What it does well: it gets the layout. Spacing, hierarchy, color blocks, button shapes, basic typography. If the input is a clean mockup with labels, the output will look strikingly close on the first try.
What it does not do: persist anything across turns. Run the generated code. Provision a database. Wire up email. Give you a URL. Read a follow-up message. Edit the file in place. The image you upload is consumed by one inference and is not available again. If you want a second pass, you start over.
What an AI app builder is doing under the hood
An app builder is not a single model call. It is an orchestration layer in front of a chat-style agent that can edit files, run commands, and watch the result. mk0r's shape: the browser sends a message to /api/chat. That route resumes (or creates) an E2B VM, talks to an agent running inside the VM via a JSON-RPC bridge, and streams the agent's tool calls back to the UI. The agent is editing a real Vite + React + TypeScript project on port 5173. The same VM has a wildcard-DNS public hostname. Your URL exists from the first turn.
That is structurally different from a vision-model call. It has state across turns, it has filesystem persistence, it has a process that can run, it has an HTTPS endpoint, and the agent can do things with all of those.
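A hedged sketch of that orchestration shape, as a Next.js route handler. Every helper name here is invented for illustration; this is the structure described above, not mk0r's actual code.

```ts
// The shape of an app-builder chat route. resumeOrCreateVm and agentRpc are
// hypothetical helpers, declared only so the sketch type-checks.
import { NextRequest } from "next/server";

declare function resumeOrCreateVm(sessionId: string): Promise<unknown>;
declare function agentRpc(
  vm: unknown,
  turn: { message: string; attachments: unknown[] },
): AsyncIterable<unknown>;

export async function POST(req: NextRequest) {
  const { sessionId, message, attachments } = await req.json();

  // 1. Resume the session's VM if it exists, otherwise boot a fresh one.
  //    (In mk0r this is an E2B sandbox.)
  const vm = await resumeOrCreateVm(sessionId);

  // 2. Forward the turn to the agent inside the VM over the JSON-RPC bridge,
  //    streaming its tool calls back to the browser as they happen.
  const stream = new ReadableStream({
    async start(controller) {
      for await (const event of agentRpc(vm, { message, attachments })) {
        controller.enqueue(new TextEncoder().encode(JSON.stringify(event) + "\n"));
      }
      controller.close();
    },
  });

  // 3. The app itself is already live: the VM's Vite dev server on port 5173
  //    sits behind a wildcard-DNS public hostname from the first turn.
  return new Response(stream, { headers: { "content-type": "application/x-ndjson" } });
}
```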
The trade is the boot cost on turn one. A new VM has to come up. The startup script has to run. That is seconds you do not pay in a single-inference tool. After that, every turn is a delta on the running app, not a regeneration of the whole thing.
Anchor fact
When you drop a screenshot into mk0r, the same file goes into two places: the vision context of the agent (as inline base64) AND the VM filesystem at /app/uploads/<name>.
See src/app/api/chat/route.ts lines 303 to 313: an { type: "image", data, mimeType } prompt block goes to the agent for vision, then a second { type: "text" } block tells the agent the image was saved at a path it can also Read or copy into public/. Screenshot-to-code tools do only the first half. The image is gone after one turn.
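A sketch of that per-attachment loop, reconstructed from the description above rather than copied from the repo. The block shapes ({ type: "image" } then { type: "text" }) and the 20 MB limit match what the route does; writeToVm, the variable names, and the handling of oversize files are assumptions.

```ts
// Reconstructed sketch of the dual-destination attachment loop.
const MAX_IMAGE_BYTES = 20 * 1024 * 1024; // the 20 MB per-attachment limit

type PromptBlock =
  | { type: "image"; data: string; mimeType: string }
  | { type: "text"; text: string };

async function attachImages(
  attachments: { name: string; mimeType: string; data: string /* base64 */ }[],
  writeToVm: (path: string, base64: string) => Promise<void>, // hypothetical
): Promise<PromptBlock[]> {
  const blocks: PromptBlock[] = [];
  for (const a of attachments) {
    // Skipping oversize files is an assumption about how the limit is enforced.
    if (Buffer.from(a.data, "base64").byteLength > MAX_IMAGE_BYTES) continue;

    // Destination one: the VM filesystem, so the agent can Read the file or
    // copy it into public/ on this turn or any later one.
    const path = `/app/uploads/${a.name}`;
    await writeToVm(path, a.data);

    // Destination two: the agent's vision context, as an inline base64 block,
    // followed by a text block that tells the agent where the file landed.
    blocks.push({ type: "image", data: a.data, mimeType: a.mimeType });
    blocks.push({ type: "text", text: `The image was also saved to ${path}.` });
  }
  return blocks;
}
```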
The two flows, drawn out
Sequence one is a screenshot-to-code call. One image, one model response, one file. Sequence two is a single turn inside an mk0r session, after the VM is already booted. Same first input, different machinery.
Sequence one: screenshot-to-code, one inference.
Sequence two: mk0r, one chat turn with an attached image.
Where the categories really diverge
I find this easier to think about with the comparison broken into independent dimensions, not as one ordered ranking. Each tool wins cleanly on different axes.
First-input shape
Screenshot-to-code: one image is the contract. mk0r: text, image, or both, in the same turn. The image is optional and stackable with prose.
What you take away
Screenshot-to-code: a static HTML file. mk0r: a running app at <vmId>.mk0r.com plus a GitHub repo you can clone.
Iteration model
Screenshot-to-code: re-upload the image, re-run. mk0r: send a new chat message; the agent edits the running project in place.
Layout fidelity, first turn
This is where screenshot-to-code is genuinely good. A vision model trained on UI is hard to beat on raw spacing-and-color reproduction.
Behavior, beyond turn one
Outside the screenshot-to-code category by definition. Forms that submit, data that persists, emails that send, auth that gates: those are sessions, not inferences. The sketch below shows the kind of code a follow-up turn produces.
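To make "behavior" concrete, here is a hedged sketch of what a follow-up turn like "now wire the form to Resend" produces. The Resend call follows the public Node SDK; the handler shape, addresses, and environment variable name are illustrative assumptions, not mk0r's actual scaffold.

```ts
// What "behavior beyond turn one" looks like in code: a form submission that
// actually sends email. Handler shape and addresses are illustrative.
import { Resend } from "resend";

const resend = new Resend(process.env.RESEND_API_KEY);

export async function handleContactForm(form: { email: string; message: string }) {
  // A screenshot-to-code output can render this form; it cannot do this part.
  await resend.emails.send({
    from: "forms@example.com",
    to: "you@example.com",
    subject: `New message from ${form.email}`,
    text: form.message,
  });
}
```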
How mk0r treats a screenshot as a piece of context, not a contract
Picture a single chat turn. You drag a Figma export into the input, type two sentences, and hit send. Three things happen before the agent reads the prompt: the file lands at /app/uploads/<name> on the VM, an inline base64 image block enters the agent's vision context, and a text block records the on-disk path. The agent gets the prompt with all three already in place.
One image, two destinations
The dual destination is the whole point. The agent can see the layout (vision) and also reach the file by path (Read, Bash, copy into public/), so the image becomes both a design reference and a deployable asset. None of that exists in a screenshot-to-code call.
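What "reach the file by path" amounts to on a later turn: the agent promotes the upload to a served asset. A two-line sketch, with the filename invented and the project root assumed to be /app (the uploads path is the one documented above).

```ts
// Hypothetical follow-up turn: the agent copies a previously uploaded image
// into the Vite project's public/ folder so the running app can serve it.
import { copyFileSync } from "node:fs";

copyFileSync("/app/uploads/hero-mockup.png", "/app/public/hero-mockup.png");
// The dev server on port 5173 now serves it at /hero-mockup.png — no
// re-upload, because the file persisted on the VM across turns.
```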
What you can paste in, and what mk0r does with it
A Figma export, a competitor screenshot, a hand drawing: all of these are valid first inputs in the chat. The pattern is the same regardless of which one you choose: the file goes to vision and to disk, then the chat keeps going.
Concrete numbers, on the dual-destination pipeline
The relevant constants live in the same handler file. None of these are guesses; they are the literal values in src/app/api/chat/route.ts.
- Max inline image size per attachment: 20 MB
- Prompt blocks pushed per image: 2 (the image block, then the path block)
- File path on disk: /app/uploads/<name>
- Re-uploads required for follow-up turns: 0
- Vite port the agent is editing into: 5173
- Public proxy port behind <vmId>.mk0r.com: (value in route.ts)
When screenshot-to-code is genuinely the right pick
I am not going to pretend the category should not exist. There are real cases where the one-shot, no-runtime shape is the right fit. Three I keep running into:
- You have an exact mockup and you want the corresponding HTML to paste into a marketing site that already has its own framework and deploy. The static output drops in cleanly.
- You are doing a one-page email template and want the markup as a starting point. No state, no auth, no domain.
- You are studying how a vision model handles a tricky layout. The minimal surface area lets you read the prompt, swap models, and compare outputs without an orchestration layer in the way.
Anything past those, and you are asking for something the category does not include. That is when the answer rotates to an app builder.
When the right pick is mk0r, specifically
The reason I am writing this from inside mk0r and not a generic comparison post: most app builders do not accept image input as a first-class thing. They accept a prompt and start coding. mk0r accepts the screenshot, saves it, looks at it, and keeps the chat open. If your starting point is "I have this mockup, but I also need the form to actually email me", you do not want to split the work across two tools.
The honest limit: mk0r's vision is whatever the underlying agent model can do, which today is Claude. It is not specialized on UI reproduction the way a vision model fine-tuned on screenshot pairs would be. If pixel-perfect reproduction of a complex dashboard is the entire job, a focused screenshot-to-code tool may still beat the generalist on the first turn. After turn one, the comparison stops being meaningful, because there is no turn two.
The verifiable parts, if you want to read along
- src/app/api/chat/route.ts lines 159 to 166: the request shape, including attachments.
- src/app/api/chat/route.ts lines 268 to 340: the per-attachment loop, the 20 MB image limit, the dual destination (vision context + filesystem).
- src/app/api/chat/route.ts lines 303 to 313: the two prompt blocks pushed per image ({ type: "image" } then { type: "text" }).
- github.com/abi/screenshot-to-code: the canonical reference for the screenshot-to-code category. Read backend/prompts/__init__.py for the actual prompt wrapper around the image.
- The same src/app/api/chat/route.ts lives in the mk0r repo; the line numbers above are stable as of 2026-05-04.
Drop a screenshot, type one sentence about how you want it to behave. The chat stays open after the first reply.
Open mk0r
Want to see the dual-destination flow live?
Book 20 minutes. We'll drop a screenshot in front of you, watch the file land at /app/uploads/, and iterate on the running app from the same chat.
Frequently asked questions
What is screenshot-to-code, exactly?
Screenshot-to-code is a category of tools where you upload one image (a Figma export, a competitor screenshot, a hand drawing) and a vision model returns an HTML/Tailwind, React, or Vue file that visually resembles the input. The canonical open-source example is github.com/abi/screenshot-to-code, which uses a Claude or GPT-class vision model. The output is static markup, not a running app, and there is no follow-up turn unless you start over.
What is an AI app builder?
An AI app builder is a chat session that builds and serves an actual application. You describe what you want, the agent writes code, a runtime serves it, and you iterate by sending more messages. The output is not a file you copy out; it is a process running somewhere with a URL you can share. mk0r lives in this category. So do tools like Lovable, Bolt, v0, and Replit Agent.
Does mk0r accept screenshots as input, or only text prompts?
Both. The chat input on mk0r.com supports image attachments. Internally, src/app/api/chat/route.ts (lines 268-340) handles attachments by writing the file to /app/uploads/<name> inside the VM AND pushing an inline base64 image block into the prompt sent to the agent. So you get the vision-model behavior of screenshot-to-code, plus the iteration loop of an app builder, in the same turn.
If I drop a screenshot into mk0r, what is different from running screenshot-to-code on the same image?
Three things. One, the agent has more than HTML to write into; it can edit a Vite + React + TypeScript project that is already running on port 5173 with Postgres, Resend, and PostHog wired in. Two, the screenshot persists at /app/uploads/<name>, so the agent can re-read it on later turns or copy it into the public folder as an asset. Three, you get follow-up messages: "now wire the form to Resend," "now persist the entries to Postgres," without re-uploading anything.
When is screenshot-to-code the right tool?
When you have one specific image, you want a static-HTML approximation, and you do not need behavior. Marketing pages, email templates, a one-screen mockup you will hand to a designer. The output is fine for those. The moment you need state, data persistence, or a hosted URL, you have left the category and you want an app builder.
Can I just paste a screenshot of a competitor app and have mk0r build the whole thing?
Visually, you can get a credible first pass; the vision model will reproduce the layout. The behavior will not be there, because the agent has not seen what your competitor's API does or what their data model looks like. You will spend the iteration turns describing the missing logic. That is the work, and that is why an app builder is not a one-shot tool.
How do mk0r and screenshot-to-code differ as projects?
screenshot-to-code (github.com/abi/screenshot-to-code) is a small tool: a React + Vite frontend, a FastAPI backend, and a vision-model call that returns code. mk0r is a Next.js orchestrator that talks to an E2B sandbox running a real Vite + React dev server, with Postgres, Resend, and PostHog provisioned per session. The orchestration layer is most of the surface area in mk0r; the model call is most of the surface area in screenshot-to-code.
How fast is the first response, in each category?
Screenshot-to-code returns its single HTML file in one model inference, usually within a few seconds. mk0r's first turn is slower because it boots an E2B sandbox, runs the startup script (Xvfb, Chromium, Vite, the agent bridge), and only then begins generating. You are paying the boot cost once per session in exchange for an actually-running app at the end of it.