Guide

Vibe coding model comparison: why Haiku wins most loops in 2026

The honest answer, up front: for the loop vibe coding actually runs (one sentence in, app out, talk to refine) Claude Haiku 4.5 is the right default. Sonnet 4.6 takes over when state has to cross screens. Opus 4.7 only earns its keep on a planning turn or a cross-file debug. Every other comparison ranks these three on benchmarks that grade a different loop. This one grades the loop you are in.

Matthew Diakonov
9 min read
4 labels

In the mk0r picker, Haiku is labeled 'Scary' and Opus is labeled 'Smart'. The labels are an opinion about the loop, not the leaderboard.


Direct answer (verified)

For vibe coding (one prompt, watch the app build, refine with words), pick the cheapest fast model that can hold the work in its head. That is Haiku 4.5 for single-screen apps, Sonnet 4.6 once state crosses screens, Opus 4.7 only on planning and hard debug turns. Per-token pricing verified against the official Claude API pricing table on May 7, 2026.

  • Haiku 4.5: $1 input, $5 output per million. Sub-second first-token latency. Right default for first drafts and small edits.
  • Sonnet 4.6: $3 input, $15 output per million. 1M-token context. Right when state crosses files or screens.
  • Opus 4.7: $5 input, $25 output per million. New tokenizer can use up to 35% more tokens for the same text. Right for one planning turn, then drop back.

The four labels in the picker, and what they hide

Open the model picker in the mk0r header. Four entries. The labels read like adjectives chosen by someone who used the product, not by Anthropic's marketing. They are. The map lives at src/components/header.tsx and looks like this:

src/components/header.tsx
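A minimal sketch of that map, assuming a plain string-to-string record; the keys and labels are the ones the picker exposes, everything else is an approximation of what the file actually contains:

```ts
// Sketch of the four-entry label map; the exact code in header.tsx may differ.
const MODEL_LABELS: Record<string, string> = {
  default: 'Fast',        // whatever the agent picks when nothing is forced
  haiku: 'Scary',         // Claude Haiku 4.5, the anonymous/free-tier model
  'sonnet[1m]': 'Fast+',  // Claude Sonnet 4.6, 1M-token context variant
  'opus[1m]': 'Smart',    // Claude Opus 4.7, 1M-token context variant
};
```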

That four-entry map is the entire opinion the product has about model choice. Each label is a working hypothesis, refined by watching anonymous users who came in for one app and stayed long enough to bump into the second screen problem.

  • default, labeled 'Fast': The model the agent picks when nothing is forced. Same family as Haiku for anonymous sessions, can fall to Sonnet on a paid session.
  • haiku, labeled 'Scary': Claude Haiku 4.5. The label is dramatic on purpose. First HTML token lands in the preview iframe in under a second on a warm sandbox.
  • sonnet[1m], labeled 'Fast+': Claude Sonnet 4.6 with the 1M-token context variant. The picker only unlocks this for an authenticated session that is entitled or routing inference through a connected Claude.com plan.
  • opus[1m], labeled 'Smart': Claude Opus 4.7 with the 1M-token context variant. Same gating as Sonnet. The label is 'Smart' because Opus plans before it writes; that planning costs latency and dollars.

The numbers behind the labels

Per-token pricing from the Anthropic pricing page, May 7, 2026. Output rates matter more than input rates for vibe coding because the agent writes a lot more HTML than the user types prompts.

  • Claude Haiku 4.5: $1.00 / MTok input, $5.00 / MTok output
  • Claude Sonnet 4.6: $3.00 / MTok input, $15.00 / MTok output
  • Claude Opus 4.7: $5.00 / MTok input, $25.00 / MTok output

The [1m] variants include the 1M-token context window at standard per-token rates. Cache reads bill at 0.1x the base input rate, so a long conversation history costs one tenth of a fresh prompt. Source: platform.claude.com/docs/en/about-claude/pricing.
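To make the cache-read multiplier concrete, here is a back-of-envelope sketch; the Sonnet input rate comes from the table above, and the 120,000-token history is a made-up round number:

```ts
// Back-of-envelope: what one turn of a long conversation costs on Sonnet 4.6.
// Rates from the table above; the 120k-token history is an illustrative number.
const INPUT_PER_MTOK = 3.0;        // $ per million input tokens
const CACHE_READ_MULTIPLIER = 0.1; // cache reads bill at 0.1x the base input rate

const historyTokens = 120_000;     // system prompt + conversation resent each turn

const freshCost = (historyTokens / 1_000_000) * INPUT_PER_MTOK;                          // $0.36
const cachedCost = (historyTokens / 1_000_000) * INPUT_PER_MTOK * CACHE_READ_MULTIPLIER; // $0.036

console.log({ freshCost, cachedCost }); // the cached turn is one tenth the price
```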

  • 5x cheaper on input and output, Haiku vs Opus
  • 1M-token context on the [1m] variants
  • 35% more tokens from the Opus 4.7 tokenizer overhead
  • 10x cheaper on cache reads (0.1x base)

Why the leaderboard is wrong for this loop

Terminal-Bench, GDPval, SWE-Bench. Every public ranking grades the same shape: a fixed task, a single shot, a graded result. Opus tops every one because it spends more compute reasoning before it writes, and that compute pays off when the task does not move.

Vibe coding is not that shape. The user types a sentence. HTML streams. The user reads. The user types another sentence. The artifact moves on every prompt. The thing that decides whether the loop converges is iteration count per unit of human attention. A model that produces a 7-out-of-10 in one second beats a model that produces a 9-out-of-10 in fifteen, because the human gets four shots at the seven in the time the nine takes, and the fourth shot is usually closer to what they wanted.

The leaderboard does not measure that. The picker labels do. ‘Scary’ is a label that only makes sense if you have watched a non-developer's face when an HTML page shows up before they have finished reading the prompt they typed.

The free model is not a budget choice. It is a design choice.

The free tier on mk0r is pinned to Haiku, but not because Haiku is the cheapest option that meets the bar. Haiku is the model that makes the loop work for someone who has never coded. The pricing math is downstream of that decision. The gate is one constant and one if-block:

src/app/api/chat/model/route.ts
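A minimal sketch of that gate, assuming a helper that resolves the model per request; resolveModel, FREE_TIER_MODEL, and session.entitled are hypothetical names, not the identifiers in the repo:

```ts
// Sketch of the gate described above: one constant, one if-block.
// The real src/app/api/chat/model/route.ts may name things differently.
const FREE_TIER_MODEL = 'haiku';

export function resolveModel(
  requestedModel: string,
  session: { entitled: boolean } | null,
): string {
  // Anonymous and free-tier sessions are pinned to Haiku regardless of the picker.
  if (!session || !session.entitled) {
    return FREE_TIER_MODEL;
  }
  // Paid or Claude.com-connected sessions keep whatever the picker selected.
  return requestedModel;
}
```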

An anonymous visitor on mk0r runs Haiku against the host's shared ANTHROPIC_API_KEY and the host absorbs the per-token cost. A signed-in visitor on a paid mk0r session can flip to Sonnet or Opus. A signed-in visitor with a connected Claude.com OAuth token routes inference through their own plan, which means Sonnet and Opus turns count against the rolling 7-day cap on Pro or Max instead of an mk0r line item. Three different billing paths, one model picker, four labels.

How to actually pick, in four steps

  1. Sentence one to draft one: Haiku 4.5. The first prompt should land HTML in the iframe before you have decided what the second prompt is.

  2. Fix one screen, change one color, swap one library: stay on Haiku. The change is local; the model does not need a plan, it needs to type quickly.

  3. Add a second screen with state that crosses files: switch to Sonnet 4.6 (Fast+). This is the threshold where multi-file coordination starts beating raw speed.

  4. Refactor a working app, plan a feature, debug a behavior across files: switch to Opus 4.7 (Smart). Use it for the planning turn, then drop back to Haiku once the plan is in the agent's context. A condensed sketch of this four-step heuristic follows below.
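One way to write the four steps down as code, purely as an illustration; the Turn flags and pickModel are hypothetical, not mk0r code:

```ts
// Hypothetical condensation of the four steps above; not mk0r code.
type Turn = {
  isFirstDraft: boolean;          // step 1: sentence one to draft one
  crossesScreensOrFiles: boolean; // step 3: state that has to survive a route change
  isPlanningOrHardDebug: boolean; // step 4: refactor plan or cross-file debugging
};

function pickModel(turn: Turn): 'haiku' | 'sonnet[1m]' | 'opus[1m]' {
  if (turn.isPlanningOrHardDebug) return 'opus[1m]';   // one planning turn, then drop back
  if (turn.crossesScreensOrFiles) return 'sonnet[1m]'; // multi-file coordination beats raw speed
  return 'haiku';                                      // first drafts and local edits (steps 1 and 2)
}
```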

Heuristics worth keeping

Eight one-liners that survived contact with real builds.

  • Speed beats horsepower for first drafts
  • Horsepower beats speed for refactors
  • Iteration count beats per-turn quality
  • The cheap model lets you fail more
  • Failing fast is the loop
  • Plan turn on Opus, write turns on Haiku
  • Long context fixes 'forgotten earlier' bugs
  • Streaming HTML is a UX feature, not a model feature

The Opus tokenizer footnote nobody is reading

One paragraph of detail you will not find in another model comparison. The Anthropic pricing page notes, in a small box near the model table, that Opus 4.7 ships with a new tokenizer that may use up to 35% more tokens for the same fixed text compared to previous Claude models. The headline rate dropped (Opus 4.7 is $5 input and $25 output per million, versus $15 and $75 for Opus 4.1 and Opus 4), so the per-text cost still fell meaningfully, but the multiplier the rate card implies is not the multiplier you actually pay.

For a vibe coding session that resends a system prompt plus the conversation history on every turn, the practical effect is roughly a 2x cost reduction over Opus 4.1, not the 3x the rate card suggests. With prompt caching active and most of the input hitting the 0.1x cache-read rate, the gap closes again because both models read from cache at the same multiplier. Net: Opus 4.7 is a real upgrade on quality and a real price drop, just not as dramatic as a naive read of the table implies.
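If you want the arithmetic spelled out, here is the rough calculation behind 'roughly 2x, not 3x', using the input rates above and the 35% worst-case overhead from the pricing note; the same ratio holds for output, since both rates dropped 3x on paper:

```ts
// Rough math behind "roughly 2x, not 3x" for an uncached, input-heavy session.
// Uses the 35% worst-case tokenizer overhead from Anthropic's pricing note.
const OPUS_41_INPUT = 15; // $ per MTok, previous generation
const OPUS_47_INPUT = 5;  // $ per MTok, current generation
const TOKENIZER_OVERHEAD = 1.35; // the same text may become up to 35% more tokens

const rateCardRatio = OPUS_41_INPUT / OPUS_47_INPUT;                          // 3x on paper
const effectiveRatio = OPUS_41_INPUT / (OPUS_47_INPUT * TOKENIZER_OVERHEAD);  // ~2.2x in practice

console.log({ rateCardRatio, effectiveRatio }); // caching narrows the gap further
```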

The real comparison axis: streaming HTML

One thing none of the public comparisons measure: how a model behaves when its output is being rendered live. Haiku writes well-formed HTML in roughly the order a browser wants to render it. The first <header> shows up in the preview iframe before the model has finished writing the closing </body>. That is the perceptual experience the ‘Scary’ label is pointing at.
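For a sense of what that looks like mechanically, here is a minimal sketch of streaming a response into a preview iframe; the /api/chat endpoint, the raw-HTML payload, and the document.write approach are assumptions for illustration, not mk0r's actual preview code:

```ts
// Minimal sketch: progressively write a streamed HTML response into an iframe.
// Endpoint, payload shape, and document.write approach are illustrative only.
async function streamIntoPreview(prompt: string, iframe: HTMLIFrameElement) {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });
  if (!res.body) return;

  const doc = iframe.contentDocument!;
  doc.open(); // start a fresh document so partial HTML renders as it arrives

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    doc.write(decoder.decode(value, { stream: true })); // browser renders what it has so far
  }
  doc.close();
}
```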

Opus generates better HTML on average. It also pauses to think mid-stream more often, which means the iframe sits half-rendered for a beat while a non-developer wonders if the page is broken. Sonnet is in the middle. Same model family, very different feel inside an iframe that is updating token by token.

For a builder where the preview is the product, this perceptual axis matters more than the benchmark axis. The benchmark grades the final string. The user grades the experience of watching it appear.

Want to see the picker in motion?

Book 15 minutes, build one app on Haiku, then redo the same prompt on Sonnet and Opus. The labels make sense after one sitting.

FAQ

Frequently asked questions

Which AI model is best for vibe coding?

For the first 80% of vibe coding (one sentence in, mobile app out, refine by talking) Claude Haiku 4.5 is the right default. It is fast enough that the first HTML token lands in the preview iframe inside a second on a warm sandbox, and it is cheap enough at $1 input and $5 output per million tokens that the loop tolerates a lot of throwaway turns. Sonnet 4.6 takes over when the app crosses two screens and state has to live somewhere, because the multi-file coordination Sonnet does outweighs the speed Haiku gives up. Opus 4.7 only earns its keep when the agent has to plan a refactor or debug a behavior across multiple files; on a single-screen build it slows the loop without adding output the user can tell apart.

What does the model picker in mk0r actually expose, and what do the labels mean?

src/components/header.tsx defines a four-entry MODEL_LABELS map. 'default' shows up as 'Fast', 'haiku' as 'Scary', 'sonnet[1m]' as 'Fast+', and 'opus[1m]' as 'Smart'. The 'haiku' label is dramatic on purpose: it is the only model an anonymous or free-tier user can run, and the landing copy advertises sub-30-second builds against it. The '[1m]' suffix on Sonnet and Opus selects the 1M-token long-context variant. Anthropic includes the 1M context window at standard per-token rates on Sonnet 4.6, Opus 4.6, and Opus 4.7, so the 'Smart' label does not carry a context premium; it carries a per-token premium and a latency premium.

Why does Haiku win for vibe coding when benchmarks rank Opus higher?

Because the benchmarks grade a different loop. Terminal-Bench, GDPval, and SWE-Bench grade single-shot agentic execution against a fixed task. Vibe coding is not single-shot; it is a back-and-forth where the human types a sentence, watches output stream, types another sentence, watches output stream, and decides the app is done when it looks right. The thing that decides whether the loop converges is iteration count per unit of attention. Haiku at 5x cheaper on both input and output, with first-token latency well under a second, lets the human take more shots before they get bored. The output of any single shot is usually worse than what Opus would have written, but the human gets four shots for the price of one, and the fourth shot is usually closer to right than the one Opus produced. The benchmarks do not measure that.
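A quick sanity check on 'four shots for the price of one', using the output rates above and a made-up 4,000-token draft as the unit of work:

```ts
// Sanity check: four Haiku drafts vs one Opus draft, output tokens only.
// The 4,000-token draft size is an illustrative round number.
const HAIKU_OUTPUT = 5;  // $ per MTok
const OPUS_OUTPUT = 25;  // $ per MTok
const draftTokens = 4_000;

const oneOpusDraft = (draftTokens / 1_000_000) * OPUS_OUTPUT;         // $0.10
const fourHaikuDrafts = 4 * (draftTokens / 1_000_000) * HAIKU_OUTPUT; // $0.08

console.log({ oneOpusDraft, fourHaikuDrafts }); // four cheap shots still cost less than one expensive one
```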

When should I switch from Haiku to Sonnet?

Switch when the change you are asking for has to coordinate across files, or when state has to survive across screens. Concretely: 'add a settings screen and remember the toggle on the home screen' is a Sonnet prompt; 'change the home screen header to teal' is a Haiku prompt. Sonnet 4.6 at $3 input and $15 output per million tokens is 3x the per-token cost, but the work is qualitatively different because Sonnet plans the cross-file edits in its head before writing them. Asking Haiku to do the same thing produces output that compiles but loses the toggle on a route change.

When does Opus actually pay off?

Two cases. First, the planning turn before a real refactor: 'we have three screens, the data is in localStorage, I want to move it to a server component with a Postgres backend, write me a plan and a list of files to change'. Opus is good at producing the plan because it spends more compute reasoning before it writes. Once the plan is in the conversation, you usually drop back to Haiku or Sonnet to do the actual edits, because the plan is the bottleneck, not the typing. Second, hard debugging: 'this animation drops a frame on iOS Safari but not Chrome, what is happening'. Opus is meaningfully better at the kind of cross-context inference where the answer is not in any single file. On a fresh single-screen build, neither of those cases applies, so Opus is paying for compute the user cannot tell apart from Haiku.

Does the new Opus 4.7 tokenizer change anything practical?

Yes, by a non-trivial amount. Anthropic's pricing page notes that Opus 4.7 uses a new tokenizer that may use up to 35% more tokens for the same fixed text compared to previous models. The headline per-token rate dropped (Opus 4.7 is $5 input and $25 output per million versus Opus 4.1 and Opus 4 at $15 and $75) so the net effect is still a real price drop, but the multiplier is not the full 3x you would estimate from the rate card alone. For a vibe coding session where you are passing a system prompt plus conversation history on every turn, the practical effect is that an Opus 4.7 session is roughly 2x cheaper than the same session on Opus 4.1, not 3x.

What happens to model choice if I connect my Claude.com Pro plan to mk0r via OAuth?

The model gate at src/app/api/chat/model/route.ts unlocks the same way as for an mk0r-billed paid session, but the inference for every turn is billed against your Claude Pro plan instead of mk0r's shared API key. The 'Fast', 'Scary', 'Fast+', and 'Smart' labels all still apply; what changes is who pays the per-token cost. With the OAuth path, Sonnet and Opus turns count against the rolling 7-day cap on your Pro or Max plan, not against an mk0r usage line. Heavy users who already pay Anthropic for Claude Code typically pick this path because it makes builder turns flat-rate.

Why does the picker label Haiku 'Scary' and Opus 'Smart'? That is the opposite of how everyone else ranks them.

The labels are an opinion about the loop, not the leaderboard. Haiku is 'Scary' because it streams an HTML page before you finish reading the prompt you wrote, which is genuinely unsettling the first time a non-developer sees it; the speed crosses a threshold where the artifact feels less like 'I asked an AI for code' and more like 'I asked a thing for an app'. Opus is 'Smart' because it produces the kind of output you would have to read through to evaluate; the win is reasoning quality, not perceptual speed. The picker is telling you a true thing about what each tier feels like to use, not what each tier scores on a benchmark.

Is there a vibe coding workflow that uses three models in one session?

Yes, and it is the workflow most heavy mk0r users settle into. Use Haiku for the first three to five prompts, until you have a working prototype that compiles and renders. Switch to Sonnet for the screen-2-and-beyond work where state has to live somewhere. Switch to Opus for one prompt: a planning turn that reads the conversation history and writes a plan for the next chunk of work. Then drop back to Haiku to execute that plan. The pattern works because the expensive part of vibe coding is figuring out what to ask for; once you know what to ask for, Haiku can usually write it.

Try the picker yourself

No account, no setup. Type a sentence. Watch the model render an app. Switch the label and watch how the same prompt feels different.

Open mk0r