Format proposal

A vibe coding productivity tournament should score iteration density, not first drafts

Existing tournaments hand a judging panel an eighteen-hour build and rate it on innovation, functionality, and presentation. That measures the wrong unit. The thing vibe coding is actually good at is iteration, and you can score it cleanly if the tool keeps a real commit per prompt.

Matthew Diakonov · 8 min read
Direct answer (verified 2026-05-04)

How to run a vibe coding productivity tournament: pick one brief, start every contestant on the same default template, run for one hour, and score on iteration density (kept commits per minute on the active timeline at the buzzer). Forbid imported boilerplate, do not cap prompt count, exclude no-op turns automatically, and break ties by prompt yield (kept commits divided by prompts attempted). Audit the result by pulling each contestant’s git repo at the buzzer and verifying the commit graph.

The unit is grounded in the per-turn commit graph at src/core/e2b.ts lines 1635-1755. Tournament context cross-checked against the November 2025 LLM-vs-human paper arXiv:2511.20613 and the Rally vibe coding competition format.

The unit every existing tournament misses

The most-cited vibe coding tournaments in 2025 all measured the same thing in slightly different wrappers. Rally Innovation’s competition gave teams 18 hours, allowed AI assistants like ChatGPT, Cursor, and Copilot, and judged on innovation, functionality, code quality, execution, and presentation. Vibe Awards ran a panel arena. The Vibe Code Contest livestream gave contestants one hour to build a web app in a vibe coding tool of their choice and judged the result.

Every one of those formats collapses into a panel rating the final artifact. Two builders can hand in the same first draft, and a panel will not see that one of them got there in three turns and the other in twelve. The skill that vibe coding is specifically good at (the rapid back-and-forth of prompt, look, keep, undo, prompt again) is invisible to a final-artifact judging panel.

A productivity tournament should put that loop on the scoreboard. The unit is iteration density: kept commits per minute on the active timeline at the buzzer. Everything else is a derivative metric or a tiebreaker.

Why “kept” needs a precise definition

The word doing the work in “kept commits per minute” is “kept”. Without a definition, a contestant could spam thirty turns, undo most of them, and still claim a high count. The definition has to live in the tool, not the rulebook.

In mk0r the kept set is a sliced array on the session, computed every time a turn commits. The slice is one line of code and it is the rule the entire scoreboard rests on:

The slice-on-fork rule
// src/core/e2b.ts, inside commitTurn
// If the user has undone past the tip, the new commit forks the timeline:
// drop every SHA after the active one before appending.
if (
  typeof session.activeIndex === "number" &&
  session.activeIndex >= 0 &&
  session.activeIndex < session.historyStack.length - 1
) {
  session.historyStack =
    session.historyStack.slice(0, session.activeIndex + 1);
}
session.historyStack.push(sha); // the new turn becomes the tip
session.activeIndex = session.historyStack.length - 1; // and the active commit

If the contestant has rolled back two turns and then issued a new prompt, the two abandoned commits drop off historyStack the moment the new commit lands. They are still in git’s reflog, but they have left the surface the scoreboard reads from. At the buzzer, historyStack.length minus one (for the initial template commit) is the kept count. No bookkeeping, no judgment call.

Source: src/core/e2b.ts lines 1670-1679. The companion check at line 1648, git diff --cached --quiet, stops empty turns from inflating the count.
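
The buzzer-time read is then a one-liner over that same array. A minimal sketch, assuming a session shape like the one the snippet above implies (the type and function names here are illustrative, not mk0r exports):

// Illustrative shape; mk0r's real session object may carry more fields.
interface Session {
  historyStack: string[]; // SHAs on the active timeline, oldest first
  activeIndex: number;    // index of the commit the app is currently serving
}

// Kept turns at the buzzer: everything on the active timeline
// except the initial template commit at index 0.
function keptTurns(session: Session): number {
  return Math.max(0, session.historyStack.length - 1);
}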

The scoreboard you can build from this

Once kept_turns is well-defined, the rest of the scoreboard falls out of three raw numbers the tool already tracks, plus two ratios derived from them. Treat the five fields below as the minimum a productivity tournament needs to publish next to a contestant's name.

Field | Where it comes from | What it tells you
kept_turns | historyStack.length at buzzer, minus 1 | How many iterations survived
minutes | Wall clock from first prompt to buzzer | Denominator for density
iteration_density | kept_turns / minutes | Primary score
prompt_count | PostHog turn_started events | Total attempts including cancelled
prompt_yield | kept_turns / prompt_count | Tiebreaker

Iteration density is the thing on the leaderboard. Prompt yield is what you read after the round to understand the style: high yield means the contestant mostly kept what they asked for, low yield means they explored a lot. Neither is automatically better, but both are visible.
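
Wiring those five fields together is mechanical. A hedged sketch of the scoring arithmetic (the names are mine, not mk0r's):

interface RoundStats {
  keptTurns: number;   // historyStack.length - 1 at the buzzer
  minutes: number;     // wall clock, first prompt to buzzer
  promptCount: number; // turn_started events, cancelled turns included
}

function score(stats: RoundStats) {
  return {
    // Primary score: kept commits per minute on the active timeline.
    iterationDensity: stats.keptTurns / stats.minutes,
    // Tiebreaker: how much of what was asked for survived.
    promptYield: stats.promptCount > 0 ? stats.keptTurns / stats.promptCount : 0,
  };
}

// 4 kept turns in 8 prompts over 60 minutes -> density 0.067, yield 0.5
console.log(score({ keptTurns: 4, minutes: 60, promptCount: 8 }));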

Cancel-mid-stream is a play, not a foul

Most productivity formats penalize give-ups. This one should not. Cancelling a turn early when you can already see the agent going wrong is the cheapest correction in vibe coding. It costs nothing and it shows real taste.

In mk0r, /api/chat/cancel posts to the in-VM ACP cancel endpoint. The agent stops streaming. When commitTurn fires at the end of the turn, the git diff --cached --quiet check at line 1648 passes (the agent never got far enough to change a file), so the function prints NOCHANGE and exits without creating a commit. The cancelled prompt is tracked in PostHog as an attempt, but it never enters historyStack.

In the scoreboard above, that means cancelled turns lower prompt_yield and leave kept_turns untouched. A contestant who cancels four bad turns, then lands four good ones, will read as 4 kept / 8 attempts (yield 0.5) and 4 kept / 60 minutes (density 0.067). A contestant who let all four bad turns commit, then undid them, then landed four good turns reads the same way at the buzzer, but their commit timestamps will tell you they wasted minutes the first contestant did not. Both moves work. The sliced timeline keeps the unit honest either way.
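
To see why the two lines converge, here is a toy replay of both sequences against the slice-on-fork rule (a simulation for illustration, not mk0r code; the SHAs are stand-ins):

type Timeline = { stack: string[]; active: number };

// Cancelled turns never touch the stack. Committed turns push. Undo moves
// the active index back. A commit after an undo slices the abandoned fork.
function commit(t: Timeline, sha: string): void {
  if (t.active < t.stack.length - 1) {
    t.stack = t.stack.slice(0, t.active + 1); // drop the abandoned fork
  }
  t.stack.push(sha);
  t.active = t.stack.length - 1;
}

const undo = (t: Timeline): void => { t.active = Math.max(0, t.active - 1); };

// Contestant one: four cancels (no-ops on the stack), then four keeps.
const a: Timeline = { stack: ["template"], active: 0 };
["k1", "k2", "k3", "k4"].forEach((sha) => commit(a, sha));

// Contestant two: four bad commits, four undos, then four keeps.
const b: Timeline = { stack: ["template"], active: 0 };
["bad1", "bad2", "bad3", "bad4"].forEach((sha) => commit(b, sha));
for (let i = 0; i < 4; i++) undo(b);
["k1", "k2", "k3", "k4"].forEach((sha) => commit(b, sha));

// Both read 4 kept at the buzzer: [template, k1, k2, k3, k4].
console.log(a.stack.length - 1, b.stack.length - 1); // 4 4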

What changes when you swap the unit

The incumbent format: eighteen hours, free choice of any AI tool, judged on innovation, functionality, code quality, execution, and presentation. The judge sees the final artifact only. Two contestants who shipped the same UI score the same, regardless of how many prompts each one burned to get there. Iteration is invisible.

  • Score is opinion, not measurement
  • First-draft luck dominates
  • Cancel and undo are penalized as wasted time
  • Cannot tell a 3-turn build from a 12-turn build

A sample format you can run this weekend

Below is the smallest version of the format that produces a clean number. Two to ten contestants is the sweet spot. Keep the brief tight: a UI with one feature and one stretch goal works.

  1. Set the brief. One paragraph, posted at the start tick. Example: “A mobile-first habit tracker that asks one question per day, shows a streak count, and lets the user reset the streak with a long press.”
  2. Lock the template. Every contestant starts from whatever the tool ships by default. No imported boilerplate, no warm-started agent contexts, no copy-paste from another repo. The initial commit is the floor.
  3. Run sixty minutes. First prompt at start tick, hands off at buzzer. The session stays open after the buzzer for audit, but no further commits count.
  4. Smoke test. Each contestant’s app at the active SHA must boot and load the home screen. Anything that does not is disqualified from the density round (still eligible for a separate showcase round if you are running both).
  5. Pull the graph. An organizer pulls the in-VM /app directory and runs git log --oneline | wc -l. Subtract one for the initial template commit. That is kept_turns. Divide by 60 for iteration_density. (A scripted version of this step follows the list.)
  6. Publish all four numbers next to each name: kept_turns, prompt_count, density, yield. The leaderboard sorts by density. Ties broken by yield.
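
A sketch of the scripted audit from step 5, in Node/TypeScript. It assumes the organizer has already pulled the contestant's /app repo to a local path; the git command is the one the step names, everything else is illustrative:

import { execSync } from "node:child_process";

// Count commits reachable from HEAD on the pulled repo and derive the score.
function auditDensity(repoPath: string, minutes = 60) {
  const log = execSync("git log --oneline", { cwd: repoPath, encoding: "utf8" });
  const commits = log.trim().split("\n").filter(Boolean).length;
  const keptTurns = commits - 1; // subtract the initial template commit
  return { keptTurns, iterationDensity: keptTurns / minutes };
}

console.log(auditDensity("./contestant-app"));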

The top 5 spots are consistently won by human-coded agents, and the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines.

Can Vibe Coding Beat Graduate CS Students?, arXiv:2511.20613, November 2025

The result you should expect

The arXiv tournament above ran 12 double round-robin tournaments and almost 40,000 matches between 40 LLM-coded agents and 17 human-coded agents on a market-driven strategic-planning task. The top five finishers were all human-coded, and 33 of 40 LLM agents lost to simple baselines. That is a real result, and it tells you something specific: vibe coding is currently weak at long-horizon strategy.

A productivity tournament is the opposite environment. The brief is bounded, the artifact is small, the loop is fast. That is the regime where iteration density rewards what vibe coding is genuinely good at: getting from a sentence to a working UI in fewer turns than typing it would take.

A useful tournament does not pretend the format generalizes. It picks one regime and measures it cleanly. Density is clean because the substrate already produces it (one prompt, one commit, one number) and any third party can audit it from the commit graph alone.

Frequently asked questions

What is a vibe coding productivity tournament, in one sentence?

A timed competition where two or more builders work the same brief in their own tool, and the winner is the one who finishes the most kept iterations per minute. 'Kept' means the commit is still on the active timeline at the buzzer (not undone, not abandoned in a fork). The point is to measure the act of vibe coding, which is iteration, not the first draft a prompt happens to spit out.

Why iteration density and not 'best app at the end'?

Best app at the end is what every existing tournament rates, and it collapses into vibes the moment a judging panel weighs 'innovation' against 'functionality'. Two builders can hand in the same first draft, and the one who got there in three turns has shown more vibe coding skill than the one who got there in twelve. Iteration density makes the difference visible. It rewards taste (knowing when to stop), prompt phrasing (getting it in fewer shots), and rollback discipline (cutting losses fast).

What does 'kept commit' mean exactly, and how do you measure it?

A kept commit is one that is still reachable from the active SHA at the buzzer. If you undo back two steps and prompt off in a new direction, the two abandoned commits stop being kept. In mk0r the active timeline is a sliced array on the session, not the full git reflog: src/core/e2b.ts line 1676 does historyStack = historyStack.slice(0, activeIndex + 1) right before pushing the new SHA, so the abandoned forks fall off the surface the moment you commit to a different path. The kept count is just historyStack.length at the buzzer, minus the initial template commit.

Doesn't this incentivize spam? Just commit garbage as fast as possible?

Two safeguards. First, commitTurn (src/core/e2b.ts line 1635) short-circuits with NOCHANGE when git diff --cached --quiet passes, so a turn that produced no real change leaves no trace. You cannot inflate the count by spamming empty prompts. Second, every brief should require a working endpoint or a passing in-VM smoke test at the buzzer; a stack of 30 broken commits scores zero. The unit rewards iterations that survive both 'this is a real diff' and 'this still works'.
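
A sketch of that first guard, based on the description of the check at line 1648 (an illustration of the pattern, not mk0r's actual function):

import { execSync } from "node:child_process";

// `git diff --cached --quiet` exits 0 when the staged tree matches HEAD
// and 1 when there is a real diff; execSync throws on the non-zero exit.
function hasRealDiff(repoPath: string): boolean {
  try {
    execSync("git diff --cached --quiet", { cwd: repoPath });
    return false; // no staged change: commitTurn prints NOCHANGE and bails
  } catch {
    return true; // staged diff exists: the turn earns a commit
  }
}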

How long should a round be?

One hour is the right unit for a single brief, by analogy with Vibe Code Contest's 2025 livestream format. Long enough that one strong prompt and three rollbacks can fit. Short enough that the rate of kept turns per minute reads as a real signal and not a rounding error. Multi-hour, Rally-style formats (the 2025 Indianapolis competition gave teams 18 hours) are testing planning, sleep tolerance, and team dynamics, which are different sports.

What about pre-built starter templates? Should they be allowed?

Allow only the template the tool itself ships with by default, identical for every contestant. mk0r initializes a Vite-plus-React template at /app on session boot via the script in src/core/e2b.ts lines 614-626, captured as the initial commit. That commit is the floor of the timeline and is excluded from the kept count. Imported boilerplate, copy-pasted clones from another repo, and warm-started agent contexts all break the unit because they smuggle work into the count that no prompt produced.

Should you cap the number of prompts?

No. Capping prompts measures the wrong thing again. A builder who burns six prompts to land one good turn has worse density than one who lands three turns in three prompts, and the score should already reflect that. The natural metric is kept_turns / prompt_count (call it 'prompt yield') as a tiebreaker between contestants with similar density. A high yield means you are mostly keeping what you ask for. A low yield means you are exploring a lot, which is sometimes the right move and sometimes a sign of unclear intent.

Where does cancel-mid-stream fit in?

Cancel is the highest-leverage move and a tournament should reward it, not penalize it. /api/chat/cancel posts to the in-VM ACP cancel endpoint. The agent stops, commitTurn short-circuits because git diff --cached is empty, no commit lands, and the active SHA stays where it was. Cancel costs zero kept turns and zero pollution. A scoreboard that lists prompts attempted but only counts kept commits naturally rewards the builder who recognized a bad turn early.

How does this differ from Rally's vibe coding competition or vibeawards.org?

Rally Innovation's 2025 vibe coding competition gave teams 18 hours and judged on innovation, functionality, code quality, execution, and presentation. Vibe Awards runs a similar panel-rated arena. Both are valid event formats, but the unit they score is 'how good is the final artifact', which collapses into the panel's taste. A productivity tournament scores the rate at which a builder produces kept work. You can run both at the same event: panel rating for the showcase round, density rating for a separate speedrun round.

What about the academic finding that LLM agents lose to humans in tournaments?

The November 2025 paper 'Can Vibe Coding Beat Graduate CS Students?' (arXiv:2511.20613) ran 12 double round-robin tournaments and roughly 40,000 matches between 40 LLM-coded agents and 17 human-coded agents on a strategic-planning auction-and-delivery problem. The top five spots were all human-coded, and 33 of 40 LLM agents lost to simple baselines. That paper measures match wins, not iteration density. The two findings are compatible: vibe coding is currently weak at long-horizon strategy yet very strong at fast iteration on bounded UI work. A productivity tournament is the right format to surface the second skill.

Could this work as a self-paced solo benchmark instead of a head to head?

Yes, and that is the easier place to start. Pick three briefs of comparable scope. Time yourself. Track kept_turns, prompt_count, and the timestamps of each commit. Plot kept turns on the y-axis and minutes on the x-axis. Your slope is your iteration density on this brief, in this tool, on this day. Run the same three briefs in a different tool a week later and compare the slopes. That is the cheapest version of the format and the one that actually produces useful data about your own workflow.
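
A sketch of the slope step, assuming you export each kept commit's Unix timestamp (for example via git log --reverse --format=%ct on the active timeline after the run); the function name and input shape are mine:

// Least-squares slope of cumulative kept turns (y) against minutes (x).
// `timestamps` holds Unix seconds per kept commit, oldest first, with
// timestamps[0] being the initial template commit (zero kept turns).
function iterationSlope(timestamps: number[]): number {
  const t0 = timestamps[0];
  const xs = timestamps.map((t) => (t - t0) / 60); // minutes since start
  const ys = timestamps.map((_, i) => i);          // kept turns so far
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;
  const my = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    den += (xs[i] - mx) ** 2;
  }
  return den === 0 ? 0 : num / den; // kept turns per minute
}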

What does mk0r record that lets a tournament organizer audit results?

Three things. The historyStack of SHAs is persisted to Firestore on every commit (src/core/e2b.ts line 1681 calls persistSession). PostHog captures version_undo_clicked, version_redo_clicked, and version_revert_clicked events with active and target SHAs, so the rollback pattern is reconstructable from the event stream. And the in-VM /app directory is a real git repo, so an organizer can pull it after the buzzer and verify the commit graph matches the claimed score.
