Production reality

Google Cloud Claude demo vs reality: five gates a live product has to ship that a Vertex AI showcase cuts

The Vertex AI announcement for Opus 4.7 says “prototype to production faster.” The reality of running a Claude product on Google Cloud is five concrete gates with named files and line numbers. Each one is the code a live product had to write because the showcase did not need to.

Matthew Diakonov, Written with AI

Published May 21, 2026 · 7 min

Direct answer (verified 2026-05-21)

A live Claude product on Google Cloud has to ship five gates that showcase notebooks never need: a TTFT watchdog at 15 seconds, a per-turn ceiling under Cloud Run's documented 3600 second synchronous request maximum, a flushServer() call before every response close (because Cloud Run throttles CPU to near zero the moment the response closes), a six-class SDK error classifier, and a stale-session retry that fires exactly once and only when no text has streamed.

Verified against Cloud Run request timeout docs, the Vertex AI Opus 4.7 announcement (April 15, 2026), and mk0r's open source at src/app/api/chat/route.ts (lines 25, 48, 202, 539 to 551, 621 to 653) and cloudbuild.yaml line 41.

3600s

“Cloud Run's documented synchronous request maximum is 3600 seconds. mk0r's per-turn cap is 800. The 2800 second gap is where the rest of the request lifecycle lives: auth, persistence, telemetry flush, and a margin for the stale-session retry. A demo skips all of that.”

cloudbuild.yaml line 41 and src/app/api/chat/route.ts line 25

The frame: a Vertex AI demo is a steady-state slide, a live Claude product is a request lifecycle

Most of the Google Cloud marketing for Claude lives in three places: the Vertex AI announcement pages (Opus 4.7, Mythos Preview, Opus 4.6), the partner pitch at claude.com's Vertex AI page, and the customer logos. They are all good documents. They are also documents about a steady-state surface: prompt caching, provisioned throughput, multi-region endpoints, dedicated capacity, Security Command Center integration, Model Armor. None of that is wrong. It is just a slice of the surface.

The slice you do not see in those documents is the request lifecycle. What does the handler do at second 14 when no token has arrived? What does it do at second 800 when the turn is too long? What does it do at the moment the response closes and Cloud Run takes the CPU away? What does it do when the SDK returns the seventh error class you did not plan for?

The five gates below are mk0r's answers to those questions. All five live in less than 200 lines of code in one file: src/app/api/chat/route.ts.

Gate one. The 15 second time-to-first-token watchdog.

When the user submits a prompt, the most common live failure is not a wrong answer. It is no answer at all. The ACP subprocess accepts the request, then never emits a notification. From the browser the stream is open but empty. The tab is staring at a blinking cursor.

At src/app/api/chat/route.ts line 539 a setTimeout fires 15 seconds after the request begins. If the loop has not seen any notification by then, the code evicts the session, fires the chat_ttft_timeout PostHog event, and sends the user the literal string “Agent did not respond within 15s. Please retry.” That string does not appear in any Vertex AI showcase, because no showcase ever waits 15 seconds for a token that never comes. It exists because in production sometimes the upstream is queued or the local subprocess is wedged, and refusing to time out is worse than admitting it.
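A minimal sketch of that watchdog pattern, assuming a plain timer that is disarmed by the first notification. This is not mk0r's actual code: `startTtftWatchdog`, `markFirstNotification`, and `cancel` are hypothetical names introduced for illustration. Only the 15 second budget and the user-facing message come from the text above.

```typescript
// Hypothetical sketch of a time-to-first-token watchdog.
// All names here are illustrative, not taken from mk0r's route.ts.
const TTFT_TIMEOUT_MS = 15_000; // the 15 second budget described above
const TTFT_MESSAGE = "Agent did not respond within 15s. Please retry.";

type WatchdogHandle = {
  markFirstNotification: () => void; // call when the agent emits anything
  cancel: () => void; // call on any other termination path
};

function startTtftWatchdog(
  timeoutMs: number,
  onTimeout: () => void,
): WatchdogHandle {
  let seen = false;
  const timer = setTimeout(() => {
    // Fire only if no notification arrived inside the budget.
    if (!seen) onTimeout();
  }, timeoutMs);
  return {
    markFirstNotification: () => {
      seen = true;
      clearTimeout(timer);
    },
    cancel: () => clearTimeout(timer),
  };
}
```

In a handler, the `onTimeout` callback would be the place to evict the session, record the timeout event, and send `TTFT_MESSAGE` to the browser, with `startTtftWatchdog(TTFT_TIMEOUT_MS, ...)` armed when the request begins.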

What the watchdog actually does

Flow diagram: Browser (POST /api/chat) → Cloud Run (edge accepts, streams) → ACP subprocess (forwards to Claude) → 15s watchdog (no notification? evict) → Browser ("please retry" rendered)

Gate two. The 800 second per-turn ceiling, inside Cloud Run's 3600 second sync max.

Cloud Run documents the maximum synchronous request timeout at 3600 seconds (60 minutes). The deploy step in cloudbuild.yaml passes --timeout=3600 at line 41 explicitly. That is the ceiling. The per-turn cap lives below it, at maxDuration = 800 on src/app/api/chat/route.ts line 25. 800 seconds is 13 minutes 20 seconds.

Two reasons for the gap between 800 and 3600. The first is that the rest of the request lifecycle (auth checks, Firestore writes, PostHog flush, retry margin) needs headroom under the Cloud Run ceiling. The second is product judgment: anything a user will sit through synchronously caps out around 13 minutes. Longer than that and the work belongs on a scheduler, not on a chat handler. mk0r's sandbox has one at /opt/scheduler-mcp.js for that case.
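The arithmetic and the cap can be written down directly. The sketch below is an illustration under stated assumptions: the two constants come from the text, while `formatMinSec` and `withTurnDeadline` are hypothetical helpers, and the deadline here is enforced with a plain timer rather than whatever mechanism the real handler uses.

```typescript
// Constants from the text; helper names are hypothetical.
const CLOUD_RUN_MAX_S = 3600; // Cloud Run synchronous request maximum
const MAX_TURN_S = 800; // per-turn cap

// Headroom left for auth, persistence, telemetry flush, and retry margin.
const headroomS = CLOUD_RUN_MAX_S - MAX_TURN_S; // 2800

function formatMinSec(totalS: number): string {
  const minutes = Math.floor(totalS / 60);
  const seconds = totalS % 60;
  return `${minutes} min ${seconds} s`;
}

// Rejects if `work` has not settled within `capMs`.
// Note: this stops waiting, it does not cancel the underlying work.
async function withTurnDeadline<T>(work: Promise<T>, capMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("turn exceeded cap")), capMs);
  });
  try {
    return await Promise.race([work, deadline]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer on either outcome
  }
}
```

Here `formatMinSec(MAX_TURN_S)` gives "13 min 20 s", and a turn would be wrapped as `withTurnDeadline(turnPromise, MAX_TURN_S * 1000)`.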

A Vertex AI showcase that shows “an autonomous agent runs for an hour” is either on a longer-lived transport (no browser tab waiting), batched, or cut. Production on Cloud Run cannot have it both ways: synchronous chat tops out at 60 minutes, period.

Gate three. flushServer() before every response close.

This one is the most Cloud Run-specific gate of the five. The comment at src/app/api/chat/route.ts line 202 says it directly: “Cloud Run throttles CPU to ~0 after the response closes, so posthog-node's queued HTTP POST never fires unless we wait for the network call here.”

Translation: on a normal server, libraries like posthog-node queue events and flush them in the background. Background CPU exists. On Cloud Run with default CPU allocation, the moment your response closes, CPU drops to almost nothing for that instance. The flush task that posthog-node scheduled never actually runs. Your telemetry silently disappears.

The fix is one line: await flushServer(); before every return Response.json(...) or every controller.close(). It delays the response by a few hundred milliseconds in the worst case. The alternative is gates one and four going dark in production because they fire telemetry events that never land.
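One way to keep the flush on every termination path is to put it in a `finally` block. The sketch below is an assumption-laden illustration, not the article's code: the `Telemetry` interface, `withFlush`, and `InMemoryTelemetry` are hypothetical stand-ins for whatever `flushServer()` wraps.

```typescript
// Hypothetical telemetry interface; stands in for the real client.
interface Telemetry {
  capture(event: string): void;
  flush(): Promise<void>;
}

// Runs the handler and always awaits the flush before the result (or the
// error) leaves the function, so queued events are sent while CPU is still
// allocated to the request.
async function withFlush<T>(
  telemetry: Telemetry,
  handler: () => Promise<T>,
): Promise<T> {
  try {
    return await handler();
  } finally {
    await telemetry.flush();
  }
}

// In-memory stand-in used to show the behaviour; not a real SDK.
class InMemoryTelemetry implements Telemetry {
  queued: string[] = [];
  sent: string[] = [];
  capture(event: string): void {
    this.queued.push(event);
  }
  async flush(): Promise<void> {
    this.sent.push(...this.queued);
    this.queued = [];
  }
}
```

For a streamed response the handler returns before the stream ends, so the same idea would apply at the point where the stream is closed rather than at the function's return.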

You will not see this in any Vertex AI showcase notebook because a notebook is not a request handler. The kernel keeps running. CPU is whatever you provisioned. This gotcha comes from Cloud Run's request-scoped CPU allocation, and it bites every team that migrates a working handler from a long-running server to serverless.

Gate four. A six-class SDK error classifier.

At src/app/api/chat/route.ts line 48 the type union is six names long:

type ErrorKind =
  | "credit_exhausted"
  | "auth_required"
  | "invalid_request"
  | "image_error"
  | "stale_session"
  | "generic";

The classifier at line 50 maps from the SDK's structured api_retry info (HTTP status, errorType) plus a regex fallback onto one of the six. Each class has a different next action:

credit_exhausted: top up. The SDK's reset-time message is surfaced verbatim so the user knows when service resumes.
auth_required: re-authenticate. Not retry the same prompt.
image_error: resize the image. Not refresh the page.
invalid_request: edit the prompt. The original error message is shown.
stale_session: the only kind the code automatically retries, and only before any text has streamed.
generic: the fallback for anything the classifier did not match.
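A classifier of this shape can be sketched as structured checks followed by regex fallbacks. Everything beyond the `ErrorKind` union is an assumption made for illustration: the `RetryInfo` field names, the specific status codes and patterns, and the wording of `NEXT_ACTION` are hypothetical, not mk0r's actual rules.

```typescript
type ErrorKind =
  | "credit_exhausted"
  | "auth_required"
  | "invalid_request"
  | "image_error"
  | "stale_session"
  | "generic";

// Hypothetical shape for the structured retry info (HTTP status, errorType).
interface RetryInfo {
  status?: number;
  errorType?: string;
}

// Illustrative rules only. Message-based checks run before the generic
// 400 check so that more specific kinds win.
function classifyError(message: string, info?: RetryInfo): ErrorKind {
  if (info?.status === 401 || info?.errorType === "authentication_error") {
    return "auth_required";
  }
  if (/credit|quota/i.test(message)) return "credit_exhausted";
  if (/image/i.test(message)) return "image_error";
  if (info?.status === 400 || info?.errorType === "invalid_request_error") {
    return "invalid_request";
  }
  if (/internal error/i.test(message)) return "stale_session";
  return "generic";
}

// One next action per kind; Record<ErrorKind, ...> makes the compiler
// reject a missing branch.
const NEXT_ACTION: Record<ErrorKind, string> = {
  credit_exhausted: "top up",
  auth_required: "re-authenticate",
  invalid_request: "edit the prompt",
  image_error: "resize the image",
  stale_session: "automatic retry, once, before any text streams",
  generic: "show the fallback message",
};
```

Typing the action table as `Record<ErrorKind, string>` is the design point: adding a seventh kind to the union fails compilation until the UI path for it exists.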

A page that collapses every failure into one banner pushes every user down the wrong recovery path. The cost of being honest about which kind of failure happened is six branches in the error handler and six rendered states in the UI. No demo pays this cost because no demo ever fails.

Gate five. Stale-session retry exactly once, only before text streams.

Internal errors from the ACP subprocess usually mean the SDK lost track of the session, often after a long idle. The code at src/app/api/chat/route.ts lines 621 to 653 handles this: restart the ACP subprocess, reload the session state, re-prompt once. The retry only fires if textDeltaCount === 0, meaning the user has not seen any output yet.

Once text is flowing, retry corrupts the turn (the user sees half an answer, then a different half an answer). So the retry is a narrow guardrail, not a panacea. The Vertex AI marketing line about “global endpoint with dynamic routing for maximum availability” is true at the transport layer. It does not absolve a Claude-backed product of having to ship its own retry policy at the application layer, because session state lives in the SDK, not in the global endpoint.
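The retry rule reduces to two conditions checked in one `catch` block. The sketch below assumes a turn runner that reports streamed chunks through a callback. `promptWithStaleRetry`, `runTurn`, `isStaleSession`, and `restartSession` are hypothetical names; only the `textDeltaCount === 0` condition and the single retry come from the text.

```typescript
interface TurnResult {
  text: string;
  retried: boolean;
}

// Hypothetical wrapper: runs a turn, and retries exactly once if the
// failure is a stale session and nothing has streamed to the user yet.
async function promptWithStaleRetry(
  runTurn: (onDelta: (chunk: string) => void) => Promise<void>,
  isStaleSession: (err: unknown) => boolean,
  restartSession: () => Promise<void>,
): Promise<TurnResult> {
  let textDeltaCount = 0;
  let text = "";
  const onDelta = (chunk: string): void => {
    textDeltaCount++;
    text += chunk;
  };
  try {
    await runTurn(onDelta);
    return { text, retried: false };
  } catch (err) {
    // Once any text has streamed, a retry would splice two answers together.
    if (!isStaleSession(err) || textDeltaCount > 0) throw err;
    await restartSession();
    await runTurn(onDelta); // a second failure propagates: no further retry
    return { text, retried: true };
  }
}
```

Because the second `runTurn` call sits outside any `try`, a repeat failure surfaces to the caller, which keeps the retry count at exactly one.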

All five gates, on a single timeline

Here are the same five gates ordered by where they fire in a single request. The order matters because gate three (the flushServer call) has to wrap every termination path, not just the happy one.

Gate order inside one chat request

Gate one: TTFT watchdog at 15 seconds

If the agent does not emit a single notification within 15 seconds, the session is evicted, the user sees "Agent did not respond within 15s. Please retry." and the chat_ttft_timeout telemetry event fires. Located at src/app/api/chat/route.ts line 539. A Vertex AI showcase notebook is allowed to wait. A browser tab is not.

Gate two: 800 second per-turn ceiling, inside Cloud Run's 3600 second sync max

maxDuration = 800 at src/app/api/chat/route.ts line 25 caps any single chat turn. The choice is anchored to Cloud Run's documented 3600 second synchronous request maximum (cloudbuild.yaml passes --timeout=3600 at line 41). Long agent runs belong on a scheduler, not a chat handler.

Gate three: flushServer() before every close

Cloud Run throttles CPU to near zero the moment the response closes. posthog-node's queued POST never fires unless the handler waits for the network call before returning. The comment at src/app/api/chat/route.ts line 202 spells it out. Without this call, every gate that depends on telemetry (TTFT, credit, retry, classify) goes dark in production.

Gate four: six-class SDK error classifier

ErrorKind = "credit_exhausted" | "auth_required" | "invalid_request" | "image_error" | "stale_session" | "generic" at src/app/api/chat/route.ts line 48. Each class has a separate UI path because the next action is different in each case. A generic "something went wrong" forces every failure mode through the wrong recovery.

Gate five: stale-session retry exactly once, only if no text yet

Internal errors from the ACP subprocess mean the SDK lost track of the session. The code at route.ts lines 621 to 653 restarts ACP and re-prompts once, but only if no text has streamed. Once text is flowing, retry would corrupt the user's turn. The retry is a guardrail, not a panacea.

The Vertex AI pitch, line by line, against what a live product still has to do

The Opus 4.7 announcement page lists six infrastructure benefits. Each one is true, and each one stops short of the gates above. The honest read:

Low latency and high throughput. True at the steady state. Still needs a TTFT watchdog at the application layer for the cases where the steady state is not yet reached.

Provisioned Throughput for reserved capacity. The honest indie alternative is PayGo + an in-app timeout. The showcase does not say that out loud.

Global endpoint with dynamic routing. Solves transport availability. Does not solve SDK-level session staleness. Still needs gate five.

U.S. and EU multi-region endpoints. Useful for compliance. Orthogonal to the per-turn ceiling and the CPU-throttle gotcha.

Model Armor for runtime threats. Security at the model layer. Does not change the six error classes the SDK can still throw.

Security Command Center integration. Observability for the platform. Does not deliver the application-layer telemetry that the flushServer call in gate three exists to protect.

Every bullet is real. Every bullet is also above the layer where the five gates live. A live Claude product on Google Cloud is the sum of both: the platform features Google ships, and the application code you have to ship on top.

The reality checklist for a Claude product on Google Cloud

A TTFT timeout in your handler, not in a load balancer rule
A per-turn ceiling below Cloud Run's 3600s synchronous maximum
A flushServer or equivalent before every response close
Distinct UI paths for credit, auth, image, invalid prompt, stale session, and unknown
A stale-session retry that only fires before any text has streamed
A scheduler for any work that legitimately runs longer than the per-turn cap
A rate limit for anonymous users, because anonymous compute is shared compute
A user-visible message when the session is paused or wedged, not a silent timeout

Frequently asked questions

Does mk0r call Claude through Vertex AI?

No, and that is the honest part of this comparison. mk0r calls the Anthropic API directly and runs on Google Cloud Run as the hosting layer. So what you read here is not a Vertex AI review. It is what happens when a Claude-backed product lives on Google Cloud, with all of Cloud Run's request lifecycle constraints stacked on top of the Anthropic SDK's failure modes. The argument generalizes to Vertex AI Claude too, because the Cloud Run gates are independent of which Anthropic-on-GCP path you pick.

Why is the per-turn cap set to exactly 800 seconds?

Two reasons. First, Cloud Run's documented synchronous request maximum is 3600 seconds (60 minutes). The cloudbuild.yaml deploy step at line 41 passes --timeout=3600 explicitly. The 800 second turn cap leaves headroom under that ceiling for the rest of the request lifecycle (auth, persistence, telemetry flush). Second, 13 minutes 20 seconds is the longest single turn a real user will sit through without thinking the tab is broken. Anything longer should be on a scheduler, not a chat handler.

What does CPU throttling after the response close actually break?

Anything fire-and-forget. PostHog's Node SDK queues events and flushes them on its own schedule. On a server with normal background CPU, the flush eventually runs. On Cloud Run, CPU drops to near zero the moment the response closes, so the queued POST never goes out. There is a comment at src/app/api/chat/route.ts line 202 that documents this: "Cloud Run throttles CPU to ~0 after the response closes, so posthog-node's queued HTTP POST never fires unless we wait for the network call here." The fix is a call to flushServer() before the response returns. Vertex AI showcases never mention this because they are not running a request handler, they are running a notebook.

Why classify six different error kinds? Couldn't the page show one?

Because the next action is different for each. credit_exhausted means top up, and the SDK's message contains the reset time. auth_required means re-login, not retry. image_error means resize, not refresh. invalid_request means edit the prompt. stale_session is the only kind where the code automatically retries, and only when no text has streamed yet. generic is the fallback. The ErrorKind union at src/app/api/chat/route.ts line 48 spells out all six. A page that collapses to a generic banner has not paid the production cost yet.

Where does Vertex AI's Provisioned Throughput pitch land for a small product?

Out of reach for indie scale, and that is fine. Provisioned Throughput is Google's pitch for eliminating the PayGo cold start and the saturated-capacity queue. For an early product, the honest path is PayGo and an in-app TTFT watchdog so the user is not staring at a blank cursor when the upstream is queuing. That is what the 15 second timeout at route.ts line 539 is for. It is the indie equivalent of provisioned capacity: you cannot prevent the wait, but you can refuse to hide it.

Is the 800 second turn limit a Claude limit or a Cloud Run limit?

Cloud Run, indirectly. Claude itself does not enforce a turn-length cap at this granularity. The 800 is a function of two things: the 3600 second synchronous request maximum on Cloud Run (documented), and the product judgment that a turn beyond 13 minutes 20 seconds should not be on a synchronous chat handler at all. Move it to a scheduler. mk0r has one mounted inside the sandbox at /opt/scheduler-mcp.js for exactly this case.

Can a Vertex AI demo accidentally lie about latency?

Not lie, but compress. The Vertex AI announcement says "low latency, high throughput, optimized infrastructure." That sentence is true for the steady-state warm case. It is silent on the cold start of a PayGo region or what happens when a global endpoint reroutes you mid-stream. A live product cannot be silent on those, because the user notices both. The demo author is allowed to wait for the warm path. The live product has to render the cold one.

Want to see all five gates in the actual source?

A 20 minute screenshare of mk0r running on Cloud Run with Claude, with every gate above firing live. Worth it if your team is about to migrate a Claude product to Vertex AI or to Cloud Run.

