AI velocity in legacy codebases: the curve inverts
Everyone selling an AI coding tool charts velocity going up. The one published randomized trial on the question charts it going down. On the mature codebases they knew best, experienced developers using those tools got slower, not faster, and they did not notice. The good news is that the curve is not a mystery once you name the three taxes that bend it, and there is one workflow move that resets it.
Direct answer (verified)
Does AI speed up coding in legacy codebases? Sometimes. Often not. The curve flips negative as a codebase ages, grows, and becomes familiar to its maintainers.
The strongest evidence is a July 2025 randomized controlled trial from METR. Sixteen experienced developers worked through 246 tasks in open-source repos with 22k+ stars and over a million lines of code each, on which they had an average of five years of prior contribution history. With AI tools allowed (mostly Cursor Pro on Claude 3.5/3.7 Sonnet, the frontier at the time), they took 19% longer to finish the same work than without.
The same developers predicted a 24% speedup beforehand and reported a 20% speedup afterward. The felt experience and the measured experience pointed in opposite directions. That gap is the actual problem statement of this whole guide.
“When AI is allowed, developers take 19% longer to complete issues, a significant slowdown that goes against developer beliefs and expert forecasts.”
METR, 'Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity', July 10, 2025
The three taxes that bend the curve
The METR paper does not assign blame to one mechanism. The breakdown that lines up with what people actually report is three taxes that compound. None of them are dramatic alone. Stacked, they account for the slowdown.
The context tax
Every turn, the model has to re-read existing files, types, call sites, and conventions to figure out where the new code should land. On a 1M-line repo with deep imports, a meaningful fraction of every prompt's token budget goes into orienting the model, not producing output. Cursor and Copilot mitigate this with codebase indexing, but the index is a coarse summary; when the model needs the exact shape of a function five files away, it still has to read it. The developer pays the wall-clock cost of every one of those reads, every turn.
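If you want the proportion concrete, here is a back-of-envelope sketch. Every number in it is an illustrative assumption, not a measurement from the METR paper:

```ts
// Illustrative only: rough per-turn orientation overhead in a large repo.
// Every number here is an assumption, not a figure from the study.
const contextWindow = 200_000;   // tokens available to the model per turn
const filesReadToOrient = 12;    // files opened just to locate the edit site
const avgTokensPerFile = 1_500;  // a mid-sized source file

const orientationTokens = filesReadToOrient * avgTokensPerFile; // 18,000
const orientationShare = orientationTokens / contextWindow;     // 0.09

// Paid again on every turn: a 20-turn session re-reads roughly
// 360,000 tokens of existing code before producing anything new.
console.log(`~${(orientationShare * 100).toFixed(0)}% of each turn spent orienting`);
```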
The drift tax
Mature codebases carry conventions that are not written down: the way errors propagate, where state lives, which folder owns which concept, how the team names things, the lint rules that are aspirational vs the ones that actually run in CI. AI tools produce plausible code that drifts from the convention because the convention is not in the training data and is only partially in the immediate context. The developer catches the drift on review and fixes it. Each catch is small. Across 200 edits in a session, the cumulative friction is large. In the METR study, developers accepted fewer than 44% of AI generations, and the rejected ones still cost time to evaluate.
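Here is what drift looks like in miniature. Everything in this sketch is hypothetical; the point is that both versions are reasonable TypeScript, and only one matches the repo:

```ts
// Hypothetical sketch of convention drift; none of these names come from a real repo.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

// The repo's unwritten convention: data loaders return a Result and never throw.
export async function loadUser(id: string): Promise<Result<unknown, string>> {
  const res = await fetch(`/api/users/${id}`);
  if (!res.ok) return { ok: false, error: `HTTP ${res.status}` };
  return { ok: true, value: await res.json() };
}

// A typical model proposal: valid, idiomatic in general, wrong for this repo.
// It throws, so every caller written against Result now needs a try/catch,
// and a reviewer has to catch that before merge.
export async function loadUserDrifted(id: string): Promise<unknown> {
  const res = await fetch(`/api/users/${id}`);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
}
```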
The refactor tax
When the right change touches three or four sites and the model only edits the two it can see, the result is a half-refactor. The code compiles. The tests sometimes pass. The third site breaks two weeks later when something downstream calls it. The human now has to chase the rest, and the chase is the worst kind of work: low-novelty, high-care, hard to delegate back to the same AI that produced the half-change. This is the tax that is least visible at the moment it accrues and the most painful when it lands.
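A compressed, hypothetical example; in a real repo the call sites live in separate files, and the third is the one the model never loaded:

```ts
// Hypothetical half-refactor, compressed into one file. In a real repo these
// call sites live in separate files, and the model only had two in context.

// The change: formatPrice gains a required currency parameter.
function formatPrice(amount: number, currency: string): string {
  return new Intl.NumberFormat("en-US", { style: "currency", currency }).format(amount);
}

// The two call sites the model could see, updated correctly:
console.log(formatPrice(42.5, "USD")); // $42.50
console.log(formatPrice(99.0, "EUR")); // €99.00

// The call site five files away, untouched. Strict TypeScript would catch it
// at build time; with an any-typed wrapper or a plain-JS consumer it compiles
// and fails weeks later:
// const legacyTotal = formatPrice(12.0); // still the old one-argument shape
```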
The maximum-velocity point of any project's life
If the three taxes grow with codebase age, the cheapest place to spend a coding turn is in a codebase that has none of them. Zero existing files to re-read. Zero implicit conventions to drift from. Zero foreign call sites to break with a half-refactor. Every token the model produces is new code, not orientation overhead.
In mk0r that point is literal, not metaphorical. The very first thing every session does, before the user has typed a single prompt, is run an init script that ends with one empty commit:
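(A minimal sketch reconstructed from this guide's own description, not the verbatim script; the real implementation lives in src/core/e2b.ts.)

```ts
// Sketch of the session init, reconstructed from this guide's description.
// The real implementation is in src/core/e2b.ts and may differ in detail.
import { execSync } from "node:child_process";

const run = (cmd: string) => execSync(cmd, { cwd: "/app", stdio: "inherit" });

// /app already holds the Vite + React + TypeScript + Tailwind v4 template;
// a .gitignore is also written at this point (contents omitted here).
run("git init");
run('git config user.name "mk0r agent"');
run('git config user.email "agent@mk0r.com"');
// One empty commit, no remote: the entire prior history of every session.
run('git commit --allow-empty -m "Initial template"');
```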
That commit message, 'Initial template', is the starting point of every app a user builds. The author is mk0r agent <agent@mk0r.com>. The tree is a Vite + React + TypeScript + Tailwind v4 project. No remote is configured; the VM's CLAUDE.md spells this out: 'Shell: Debian with bash, Node 20, npm, git (no remote configured)'. There is no "attach a legacy repo" path because the entire product is built on the premise that the velocity you want lives on the other side of this commit, not before it.
The triage that matters
The mistake to avoid is treating "use AI" and "rebuild greenfield" as opposed positions. They are the same position at different points on the curve. The skill is knowing which slice of work is sitting where; the sketch after the three buckets below makes the split concrete.
Load-bearing legacy. Core domain logic, anything with a real database schema, anything regulated, anything paying users hit. Build it in place. Use an AI assistant as a reviewer, not an author. Plan for the 19% tax. This is the work where the slowdown is the price of not blowing up production.
Isolated UI surface. A settings panel, a marketing micro-site, an internal tool that wraps an API you already own, a customer-facing demo of a feature your real app does not have yet. Build it greenfield. Ship it as a standalone page or embed it as an iframe. The legacy taxes you would have paid to make this live in the main app are real money. Spend it elsewhere.
Speculative. Work where you do not know yet whether it deserves to live. Build it greenfield because cheap greenfield prototypes are how you find out whether the idea is good. The cost of porting a working prototype back into the main tree later is almost always less than the cost of the legacy taxes you would have paid to build the same exploration in place, only to throw it away when it turns out the idea was wrong.
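If it helps to see the triage as code, here is a hypothetical decision function. The field names are invented for illustration; only the bucket order comes from this guide:

```ts
// Hypothetical triage helper; the field names are invented, but the branch
// order mirrors the three buckets above.
type Bucket = "build-in-place" | "greenfield-slice" | "greenfield-prototype";

interface Slice {
  touchesSchemaOrRegulated: boolean; // real database schema, billing, compliance
  servesPayingUsers: boolean;        // production traffic hits it today
  standsAloneAsUI: boolean;          // could ship as its own page or iframe
  valueIsUnproven: boolean;          // we do not yet know if it deserves to live
}

function triage(s: Slice): Bucket {
  // Load-bearing legacy: build in place, AI as reviewer, budget the ~19% tax.
  if (s.touchesSchemaOrRegulated || s.servesPayingUsers) return "build-in-place";
  // Speculative: cheap greenfield prototype first; port back only if it earns it.
  if (s.valueIsUnproven) return "greenfield-prototype";
  // Isolated UI surface: greenfield, ship standalone or embed as an iframe.
  if (s.standsAloneAsUI) return "greenfield-slice";
  return "build-in-place"; // when in doubt, treat it as load-bearing
}

// Example: a customer-facing demo of a feature the real app does not have yet.
console.log(triage({
  touchesSchemaOrRegulated: false,
  servesPayingUsers: false,
  standsAloneAsUI: true,
  valueIsUnproven: false,
})); // "greenfield-slice"
```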
The hour you have before the slice becomes legacy
The honest limit on the greenfield reset is that greenfield is only greenfield for a while. The first few hundred lines fit comfortably in the model's working context. The conventions start to crystallize on prompt three or four. By the time you have a working two-screen app, you have built a small codebase, and the same three taxes start accruing inside it. The curve does not stay flat forever. It just resets when you start a new tree.
That is fine. The reset buys you roughly an hour of compounding velocity, which is more than enough to ship a real self-contained slice. Use it to ship the slice, not to attempt the whole product in one go. The discipline is to keep the greenfield slice scoped tight enough that you finish it before the curve bends back.
What this is not
This is not an argument that traditional engineering is dead, or that legacy code should be deleted on sight. Most software that matters in production is old code, written by humans who knew exactly what they were doing, accreted layer by layer because each layer solved a real problem. AI tools work fine inside that code; they just do not deliver the speedup chart the demos imply.
The argument is narrower: for the subset of work that does not need to be load-bearing in the legacy tree, the greenfield starting point is dramatically less taxed, and the product category that exposes that starting point as a one-click thing is worth using deliberately. mk0r is one of those tools. Bolt, Lovable, Replit Agent, and Claude Artifacts are others. The right mental model is not "a no-code competitor." It is "a velocity reset button for the work that qualifies."
Want to talk through which slice of your stack qualifies?
Fifteen minutes. Show me what you are building. I will tell you honestly which parts are good candidates for a greenfield reset and which parts should stay where they are.
Frequently asked questions
Does AI actually speed up coding in legacy codebases?
Sometimes. Often not. A July 2025 randomized controlled trial from METR put 16 experienced open-source developers on 246 tasks across repos with 22k+ stars and over 1M lines of code they had contributed to for an average of 5 years. With AI tools allowed (mostly Cursor Pro on Claude 3.5/3.7 Sonnet, the frontier at the time), they were 19% slower than without. Developers predicted a 24% speedup beforehand and still reported a 20% speedup afterward, which is the most uncomfortable finding in the paper. AI helps with onboarding, unfamiliar code, and documentation gaps. It taxes you on familiar mature code where you already have the file map cached in your head.
What are the three taxes the model pays in a legacy codebase?
Context tax: every turn, the model spends a large fraction of its token budget re-reading existing files, type definitions, and call sites to figure out where to land its edit. None of those tokens turn into new code. Drift tax: the existing code has implicit conventions (naming, error handling, layering, lint rules, custom hooks) that are not written down. The model proposes plausible code that drifts from the convention, and the human has to spot and correct the drift turn after turn. Refactor tax: when the right change touches three or four sites and the model only sees two, it ships a half-refactor. The human now has to chase the rest. Each tax is small in isolation. Stacked, they invert the curve.
Why did developers in the METR study still think AI made them faster?
Three reasons that compound. One, the moments where AI is dramatically helpful (a fast scaffold, a perfect autocomplete on a tedious function) are vivid and easy to recall, while the moments where AI cost time (the majority of generations that got rejected after a review pass, the silent half-refactor that needed cleanup) are diffuse and forgotten. Two, AI changes the felt texture of work (less typing, more reviewing), which feels less effortful and gets mistaken for being faster. Three, developers report what they expect to feel as much as what actually happened. The 20% post-task self-report against a measured 19% slowdown is the entire problem statement: we cannot tell when we are losing time to AI just by trusting our own sense of pace.
Is there a class of legacy work where AI velocity does deliver?
Yes, a few. Reading a codebase you do not own and need to navigate quickly: AI's summarization and call-graph tracing genuinely earn their keep. Writing the first draft of tests for code that lacks them: the existing implementation gives the model a clear contract to copy. Translating between languages or frameworks where the semantics map cleanly (Python script to a TypeScript Node port, for example). Generating migration boilerplate where the pattern is repetitive and the model has thousands of public examples to lean on. The shared property: the task is bounded, the existing code constrains the model in a useful way, and the human is unfamiliar enough that even a 60% solution is a step up.
What does mk0r do differently that resets the velocity curve?
Every session starts with a literal git commit whose message is 'Initial template'. The code is at src/core/e2b.ts lines 776 to 787. The agent boots into a fresh /app directory holding a Vite + React + TypeScript + Tailwind CSS v4 project, runs git init, writes a .gitignore, makes one empty commit, and that is the entire prior context. There is no legacy to carry. There is no remote (the VM CLAUDE.md spells this out on line 319: 'git (no remote configured)'). The context tax is roughly zero. The drift tax is zero because there are no conventions to drift from yet. The refactor tax is zero because there is nothing to refactor. The model's tokens go almost entirely into producing new code. This is the maximum-velocity point of any project's life, and mk0r restarts every session there on purpose.
So is the right move always to throw the legacy code away?
No. The right move is to know which slice of work is sitting on the wrong side of the curve. A new screen with no shared state, a settings panel, a marketing micro-site, an internal tool that wraps an existing API, a sandbox to test a UI idea, a customer-facing demo of a feature your real app does not have yet: these are all candidates for a greenfield rebuild. They are slices that benefit from the maximum-velocity starting point and do not need to inherit the legacy plumbing. Core domain logic, anything with a real database schema, anything regulated, anything that already has paying users hitting it stays put and gets the slow, careful AI treatment, with the 19% tax factored in. The skill is the triage, not picking a side once and for all.
How long does the velocity advantage last on a greenfield slice?
Roughly as long as the codebase is small enough to fit in the model's working context with room to spare. In practice that is the first several hundred lines, then a clear knee in the curve. After that the same taxes start accruing: the model has to re-read files, the conventions start to crystallize, the refactor cost climbs. The mk0r answer is not to pretend the knee does not exist; it is to keep the slice scoped tight enough that you ship it before the knee. The product gives you about an hour of compounding velocity before the slice itself becomes a small legacy codebase. That hour is more useful than people expect.
What is the practical workflow for a real product team?
Triage incoming work into three buckets. Bucket one, edits to load-bearing legacy: do them in your IDE with an AI assistant, but treat the AI as a reviewer, not an author, and assume a slowdown. Bucket two, isolated UI work that could live on its own: build it on mk0r as a fresh slice, ship it as a standalone tool or embed it as an iframe, and do not let it touch the legacy tree. Bucket three, exploratory or speculative work where you do not know yet whether it deserves to live: build it greenfield on mk0r, then decide after seeing the prototype whether to port it back or kill it. The cost of porting back is usually less than the cost of the legacy taxes you would pay to build it in place.
More on the loop that makes AI velocity actually pay off.
Related guides
Vibe coding model switching cost
Switching models mid-session does not just bill the rate-card delta. It invalidates the prompt cache, so the next turn pays full input rate on every prior cached token.
AI code accuracy vs prototype velocity
Accuracy and velocity pull in opposite directions. The right model picker labels them honestly instead of pretending one tool is the best at both.
Vibe coding throwaway prototypes
The disposable first draft is not a failure mode. It is the move that lets you learn what the real version should be, before paying the legacy taxes to build it.
Try the reset button yourself
No account, no setup. Describe a slice in one sentence, watch the agent ship it from a clean Vite + React + Tailwind tree. The first commit is already there waiting for you.
Open mk0r