Guide

Vibe coding model switching cost: the cache tax nobody counts

Every comparison of Haiku, Sonnet, and Opus prices the per-token rate. The cost that actually shows up on the bill when you flip models mid-session is different. Anthropic's prompt cache is keyed on (model, prefix). Switch the model and the cache for that prefix is dead. The very next turn pays to re-encode every token you had cached, at full price. The session keeps going, the VM does not restart, no signup screen appears. The cache just quietly evaporates and you pay for the rebuild.

Matthew Diakonov
7 min read

Direct answer (verified)

The rate-card delta is the small number. The big number is prompt-cache invalidation. Per the Anthropic API pricing page, cache reads bill at 0.1x base and cache writes at 1.25x base. Switching models flips the cache key, so:

  • The next turn reads zero tokens from cache: the new model has no entry for this prefix, so nothing bills at the 0.1x cache-read rate.
  • The same turn re-encodes the entire prior conversation as a fresh cache prefix for the new model at the 1.25x cache-write rate. (The cache-write rate replaces the 1x base rate for those tokens; it does not stack on top of it.)
  • Net: a 100k-token cached conversation costs roughly $0.01 to read on Haiku and 100k × $3/MTok × 1.25 = $0.375 to rebuild on Sonnet, before the new model writes a single output byte.

That is the "switching cost." It is one-time per switch, lands as a spike on the very first turn after the flip, and scales linearly with how long the conversation already is.

0.1x → 1x

When the picker label flips, the VM, the sessionId, the file state, the agent's plan, all of it survives. Only the (model, prefix) tuple changes. That is enough to kill the cache.

src/core/e2b.ts:1953-1966 (setSessionModel)

What the switch actually does

In mk0r, switching models is one POST request. The handler at src/app/api/chat/model/route.ts authenticates, gates non-Haiku models behind a subscription check, then calls setSessionModel(sessionKey, modelId), which forwards to the ACP subprocess running inside the user's VM. The implementation is the entire reason switching is cheap to do at the wire level and expensive to do at the cache level:

src/core/e2b.ts
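The block below is a minimal sketch of what that function plausibly looks like, not the actual mk0r source. Only the function name, its signature, and the POST /session/set_model endpoint come from this article; the lookup helper and the response shape are assumptions for illustration.

```typescript
// Hypothetical sketch of setSessionModel. getVmForSession and the
// response shape are assumed; the endpoint and signature are not.
declare function getVmForSession(
  sessionKey: string,
): Promise<{ acpBaseUrl: string; sessionId: string }>;

export async function setSessionModel(
  sessionKey: string,
  modelId: string,
): Promise<{ ok: boolean }> {
  // Reuse the live VM for this session. No new VM, no new sessionId.
  const vm = await getVmForSession(sessionKey);

  // Forward the flip to the ACP subprocess inside that VM. The same
  // conversation continues; only the model that receives the next
  // prompt changes.
  const res = await fetch(`${vm.acpBaseUrl}/session/set_model`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ sessionId: vm.sessionId, modelId }),
  });
  if (!res.ok) throw new Error(`set_model failed: ${res.status}`);

  // Cheap at the wire level: one POST, no state moved. The expensive
  // part happens later, on Anthropic's side, when the new
  // (model, prefix) cache key misses on the next turn.
  return res.json() as Promise<{ ok: boolean }>;
}
```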

No new VM. No new sessionId. The agent keeps every file it has written, every tool call in its history, every part of the plan it is mid-execution on. From the agent's perspective, only one thing changed: the model that gets the next prompt. From Anthropic's billing perspective, the new model has never seen this conversation before, so it has nothing to read from cache.

The switch in sequence

One model flip, end to end

1. UI → api/chat/model: POST /api/chat/model { modelId: 'sonnet[1m]' }
2. api/chat/model → VM: setSessionModel(sessionKey, 'sonnet[1m]')
3. VM → ACP: POST /session/set_model (same sessionId)
4. ACP → VM: 200 OK, same VM, same conversation history
5. VM → api/chat/model: ok: true
6. UI: label flips from 'Scary' to 'Fast+'
7. UI → new model: next prompt, 'add a settings screen'
8. Anthropic: first input read is a cache MISS (new model key)

The last line is the part nobody draws. The ACP reports OK, the UI label flips, the next prompt lands, and only then does Anthropic charge the rebuild.

The four numbers that make the bill

For a worked example, assume the conversation has 100,000 tokens of input history already cached on Haiku, and the user flips the picker to sonnet[1m] before sending the next prompt. The rates are pulled from Anthropic's pricing page as of May 11, 2026.

  Line item                                                                          Tokens     Charge
  Cached input read (Haiku, before switch)                                           100,000    $0.01 ($1 / MTok × 0.1 cache-read)
  Fresh input at base rate (for reference, if the new prefix were not re-cached)     100,000    $0.30 ($3 / MTok × 1.0)
  Cache write to seed the new prefix (replaces the 1.0x charge for those tokens)     100,000    $0.375 ($3 / MTok × 1.25 cache-write)
  Net tax over a hypothetical 'no switch' Haiku turn                                            +$0.365 on one turn ($0.375 − $0.01), before the new model writes a single byte

Over a third of a dollar on a single turn before the new model has written a byte of output. On a typical session, that one switch can equal dozens of cached Haiku turns' worth of input cost. It is one-time per switch, but each switch pays it again from scratch.
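The arithmetic is simple enough to sanity-check in a few lines. A back-of-envelope helper (not mk0r code; the rates are the $/MTok figures quoted above):

```typescript
// Back-of-envelope switch-tax model using the rates quoted above ($/MTok).
const MTOK = 1_000_000;

function switchTax(historyTokens: number, oldBase: number, newBase: number) {
  const stay = (historyTokens / MTOK) * oldBase * 0.1;     // 0.1x cache read on the old model
  const rebuild = (historyTokens / MTOK) * newBase * 1.25; // 1.25x cache write on the new model
  return { stay, rebuild, tax: rebuild - stay };
}

// Haiku ($1/MTok) -> Sonnet ($3/MTok), 100k tokens of cached history:
console.log(switchTax(100_000, 1, 3));
// -> { stay: 0.01, rebuild: 0.375, tax: 0.365 }
```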

The hidden multipliers

  • 10x — input cost premium on the first turn after a switch (a 1x cache miss vs a 0.1x cache hit)
  • 1.25x — cache-write multiplier you pay to seed the new model's prefix
  • Millions — tokens of cached history that get invalidated in long-context mode when the (model, prefix) key changes
  • 0 refunds — Anthropic does not refund the dead cache; the prior cache writes are sunk cost

Each of those numbers is mechanical, not contested. Cache reads bill at 0.1x base, so a cache miss is by definition a 10x premium over a hit. Cache writes are 1.25x, which is the price of seeding the new prefix so the following turn on the new model can hit cache. The session retains millions of tokens of state in long-context mode, but none of that state is portable across models. And Anthropic does not refund the cache writes you already paid for on the abandoned model; those are sunk cost the moment you flip the picker.
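For reference, this is roughly how the keying surfaces in the Messages API. A minimal sketch using the @anthropic-ai/sdk TypeScript client (the model id and prefix are placeholder examples, not mk0r's configuration); the three usage fields are where the 0.1x, 1x, and 1.25x rates show up:

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY

async function cachedTurn(longPrefix: string, userPrompt: string) {
  const msg = await client.messages.create({
    model: 'claude-haiku-4-5', // example model id
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: longPrefix,
        // Marks everything up to here as a cacheable prefix. The entry
        // is keyed to (model, prefix): send the same prefix to a
        // different model and cache_read_input_tokens comes back 0.
        cache_control: { type: 'ephemeral' },
      },
    ],
    messages: [{ role: 'user', content: userPrompt }],
  });

  // usage.cache_creation_input_tokens -> billed at 1.25x base (seeding)
  // usage.cache_read_input_tokens     -> billed at 0.1x base (hits)
  // usage.input_tokens                -> billed at 1x base (uncached remainder)
  console.log(msg.usage);
}
```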

The latency tax (the other thing you pay)

Cached input does not just bill at 0.1x. It also serves faster, because the prefix is pre-encoded server-side. On a 100k-token prefix, the time-to-first-token gap between a warm cache and a cold cache is typically several hundred milliseconds, sometimes a full second or more on the largest prefixes. The streaming preview in mk0r renders HTML as the model emits it; the iframe needs the first bytes fast for the loop to feel like a loop instead of like a compilation step.

When the picker label reads "Scary" for Haiku, that label is specifically pointing at the warm-cache, sub-second first-token experience. Flip the model, lose the cache, lose half of what made that label scary. The next turn on the new model has to encode the prefix from cold, which is exactly the work the cache existed to skip. The user does not see a different number; they see a slower first byte. The label still says "Fast+" or "Smart," but the felt speed on that one turn is not what the label implies.
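If you want to see the latency side rather than take it on faith, a rough probe is to time the first streamed token on the same prefix before and after a switch. A sketch, again with placeholder model ids; note that the first call on any model is itself a cold cache write, so the warm number needs a prior turn on that model:

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Time-to-first-token for one turn over a given prefix. The 'text'
// event fires on the first content delta; abort as soon as it arrives.
async function ttft(model: string, longPrefix: string): Promise<number> {
  const start = performance.now();
  const stream = client.messages.stream({
    model,
    max_tokens: 64,
    system: [
      { type: 'text', text: longPrefix, cache_control: { type: 'ephemeral' } },
    ],
    messages: [{ role: 'user', content: 'continue' }],
  });
  await new Promise<void>((resolve) => stream.once('text', () => resolve()));
  stream.abort(); // only the first byte matters here
  return performance.now() - start;
}

// Run a warm turn on the model that owns the cache, then the same
// prefix on a different model: the second number includes the full
// prefix encode.
// await ttft('claude-haiku-4-5', prefix);  // warm (after a prior cached turn)
// await ttft('claude-sonnet-4-5', prefix); // cold: new (model, prefix) key
```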

When the tax is worth paying

The point is not to never switch. It is to switch deliberately, at the moments where the new model adds more value than the rebuild costs. Four heuristics that survive contact with real builds:

  Scenario                                          Tax          Why
  Switch once at sentence one                       Zero         The cache hasn't formed yet. Pick the model you want and let the first cache fill on that model.
  Switch after a 5-prompt warm-up                   Modest       Conversation is maybe 30k tokens. Rebuild cost is small dollars. Worth paying if the next turn really needs the upgrade.
  Switch on every prompt to A/B models              Worst case   Every switch invalidates. You're paying the full rebuild on the entire growing history every other turn. Just don't.
  Switch to Opus for a planning turn, switch back   Pay twice    Cache invalidates on the way out and on the way back. Plan once, write on the original model, do not flap.

The worst pattern is the "A/B same prompt across models" loop, because every flip pays the rebuild from scratch and none of the cached prefixes ever earn back their cache-write premium. The best pattern is the opposite: Haiku for warm-up, one deliberate flip to Sonnet at the second-screen threshold, one deliberate flip to Opus for a planning turn, then back to Sonnet (not Haiku) to execute the plan on a cache that has already been paid for once.

The output drift cost (the one you cannot see in dollars)

The fourth cost is qualitative and gets ignored in every comparison because it does not show up on the invoice. When the new model reads the conversation history fresh, it interprets your prior prompts through its own lens. Haiku and Sonnet often reach the same conclusion. Opus sometimes re-plans your earlier work and suggests refactors you did not ask for, because that is what Opus does: it spends compute reasoning before writing.

On a vibe coding loop the user has been steering by feel, that re-interpretation can feel like the agent "forgot" what you were going for. It did not forget; it has the entire history. It is just running that history through a different brain that weights the constraints differently. The fix is not to avoid the upgrade; it is to know that the first prompt on the new model is load-bearing. Spend that prompt re-anchoring the goal, not pushing the next feature.

Want to see the cache invalidation live?

Book 15 minutes. Build a 5-prompt app on Haiku, flip the picker, watch the next turn's first-token latency jump and the input-token count spike. The behavior is more obvious than any spreadsheet.


Frequently asked questions

What is the literal cost of switching AI models in a vibe coding session?

The headline cost is the per-token rate difference (Haiku to Sonnet is roughly 3x input and 3x output, Sonnet to Opus is another 1.67x). The cost almost nobody counts is the prompt-cache invalidation. Anthropic's prompt cache is keyed on the tuple (model, prefix). Switching models flips the key, so the next turn reads zero tokens from cache and re-encodes the entire history as a brand-new prefix at the cache-write rate (1.25x base, which replaces the 1x base charge for those tokens). For a 100k-token conversation history, that rebuild alone is the difference between $0.01 and $0.375 of input on a single turn. The rebuild is one-time per switch, but it lands as a discrete spike on the very first turn after you flip the label.

Why is the prompt cache per-model and not per-session?

Because the cache stores model-specific activations, not the raw prompt text. When Anthropic serves a cache-read at 0.1x base, it is skipping the prefix-encoding work for that specific model on that specific prefix. A different model has a different tokenizer (Opus 4.7 ships a new tokenizer that can use up to 35% more tokens for the same fixed text), different weights, and different activation shapes. There is nothing in the Sonnet cache that Haiku can reuse and vice versa. So the cache must be partitioned by model. The session, the conversation history, and the file state all survive the switch in mk0r (the VM and ACP subprocess are reused, only the modelId flips), but the cache is gone.

How does mk0r actually switch models? Does it boot a new session?

No. It calls POST /api/chat/model with a sessionKey and modelId. That handler calls setSessionModel in src/core/e2b.ts, which sends POST /session/set_model to the ACP subprocess inside the same VM. The same sessionId continues. The same git repo, the same files on disk, the same conversation history. Only the model name flips. From the user's perspective, the picker label updates and the next prompt routes through the new model. From a billing perspective, the new model has never seen this prefix before, so it has nothing cached.

How much cached history does a normal vibe coding session accumulate?

More than people expect. The system prompt for an mk0r agent is roughly 8k to 12k tokens out of the gate (tool definitions, file-write protocol, model-specific instructions). Every turn appends the user prompt, the agent's tool calls, the file contents read or written, and the agent message. By the third or fourth prompt, the conversation often sits at 30k to 60k cached input tokens. By the time someone has built a working two-screen app, 100k is a normal range. Once the [1m] long-context variant is on, sessions can run into several hundred thousand tokens of cached history before any compaction. The cache invalidation tax scales linearly with that number.

Is the latency cost real, or is the cache hit only about money?

It is both. Cached input does not just bill at 0.1x; it also serves faster, because the prefix is already encoded server-side. On a 100k-token prefix the time-to-first-token (TTFT) difference between a warm cache and a cold cache is typically several hundred milliseconds, sometimes over a second on the largest prefixes. For a builder where the user is watching HTML stream into an iframe, that gap is felt. The 'Scary' label in the mk0r picker is literally pointing at sub-second first-token latency on Haiku with a warm cache. Switch models, lose the cache, lose half of what made that label scary.

Can I avoid the switching cost by routing some turns through one model and some through another in parallel?

Not really, and not in mk0r as it ships today. The conversation history is one linear context. The agent maintains its plan, its file map, and its in-progress edits in that context. Sending one turn to Haiku and the next to Sonnet does not give you two parallel sessions; it gives you one session that flaps between models, with the cache dying on every flap. The pattern most heavy users settle into is the opposite: pick the model that fits the next 5 to 10 prompts, stay on it, let the cache earn its keep. Switch deliberately, not opportunistically.

Does the cache rebuild make the cheaper model net more expensive than the expensive one?

It can, for a short window. Imagine you have 80k cached tokens on Sonnet and you switch to Haiku for one quick edit. The headline savings on the edit itself is real (Haiku is 3x cheaper input, 3x cheaper output). But the cache rebuild for that one Haiku turn bills 80k tokens at the Haiku cache-write rate (1.25x of $1/MTok, about $0.10) just to seed the new cache. If the edit turn itself is only 2k output tokens (around $0.01 on Haiku), the rebuild dwarfs the work by an order of magnitude. The cheap model is still cheaper at steady state, but the switch eats most of the first-turn savings.
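Plugging the reversed direction into the same back-of-envelope helper from the worked example above:

```typescript
// Sonnet ($3/MTok) -> Haiku ($1/MTok), 80k tokens of cached history:
console.log(switchTax(80_000, 3, 1));
// -> { stay: 0.024, rebuild: 0.1, tax: 0.076 }
```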

Does the cost survive an OAuth Pro plan? My turns are flat-rate.

The dollars are absorbed by Anthropic, but the rate-limit accounting still ticks. When a Claude.com Pro or Max plan is connected via OAuth, every turn counts against your rolling 7-day weekly cap. A cache miss counts the full re-encoded prefix against that cap, where a cache hit counts a fraction of it. Switching models mid-session therefore burns through the cap faster than running a stable session, even though no per-token line items show up on an invoice. The optimization for the OAuth path is the same as the metered path: pick the model that fits the upcoming chunk of work, let the cache form, stay on it.

Is there a model-switching workflow that pays for itself anyway?

Yes, one. Use Haiku for the first three to five prompts, until you have a working prototype that compiles. Switch ONCE to Sonnet when state has to cross screens. Switch ONCE to Opus for a planning turn, then switch back to Sonnet (not Haiku, you've already paid for the Sonnet cache) for the execution. Three switches in a session, each one earning more value than the rebuild cost. The anti-pattern is the same loop run with five or six switches because the user wanted to A/B the same prompt across models. That run pays the rebuild tax five or six times and gets no compounding value, because each new prefix is dead the moment the next switch happens.

Try the picker yourself

No account, no setup. Pick a label, ship a prompt, watch the iframe render. Flip the label after 5 prompts and watch what happens to the next turn.

Open mk0r
mk0r.AI app builder
© 2026 mk0r. All rights reserved.