Guide

Is Claude AI down? The per-request view a status page cannot give you

Almost every page that answers this question shows a screenshot of status.claude.com, a DownDetector graph, or a live news blog counting affected users. All useful. None of them tell you what Anthropic's API is actually returning while it is degraded, what the SDK does about it without asking, or how a tool that runs on top of Claude makes that visible inside the chat instead of leaving you staring at a hung spinner.

Matthew Diakonov
9 min read

Two questions that look like one

"Is Claude down" is a global question. Either the API is up for everyone, degraded for some, or down for everyone, and the answer lives at status.claude.com. That page lists incidents per surface (Claude.ai, the public API, Claude Code) and per model family (Opus 4.7, Sonnet 4.6, Haiku 4.5). It is the right place to look first and the only place where Anthropic itself confirms or denies an outage.

The question a running tool has to answer is different. It is my single request, right now, this attempt: did it land, did it fail, is the SDK already retrying it on my behalf, and how many seconds until I find out which way it goes. A status page cannot answer that. By the time an incident shows up there, your request has already been retried, succeeded, or terminally failed. The local question runs on a different clock.

Most of this page is about that local clock, because almost no other guide on this topic covers it.

The nine error types Anthropic actually returns

The Anthropic TypeScript SDK ships a strict union for everything its server can hand back. You can read it directly at node_modules/@anthropic-ai/sdk/resources/shared.d.mts line 19. There are nine values, and during a real outage you only see two of them most of the time.

  • invalid_request_error (400): malformed input. Tokens too long, image bytes corrupted, schema mismatch. UI: generic error pill; retry will not help, the request itself is wrong.
  • authentication_error (401): API key missing or invalid, or the cookie expired on the OAuth flow. UI: auth_required overlay; the user has to re-link the account.
  • permission_error (403): key valid but the org or model is not enabled for it. UI: auth_required overlay, distinct from billing.
  • not_found_error (404): model name typo or beta endpoint misspelled. UI: generic error; an implementation bug, not an outage.
  • rate_limit_error (429): per-minute or per-day quota hit; resetsAt is set. UI: credit_exhausted overlay. The SDK does not retry this.
  • timeout_error (408 / 504): the model started, then a gateway gave up. UI: api_retry overlay. The SDK retries this; it often resolves on attempt 2.
  • overloaded_error (529): Anthropic infrastructure saturated. Not your account, the whole pool. UI: api_retry overlay. The most common code during a real incident.
  • api_error (500 / 502 / 503): generic upstream error; could be the model, could be a gateway. UI: api_retry overlay. The SDK retries with backoff.
  • billing_error (402): account balance below the minimum; paid tier exhausted. UI: credit_exhausted overlay. The user has to top up.

The two you see during an actual incident are overloaded_error (HTTP 529) and api_error (5xx). Everything else either says you, the caller, did something wrong (400, 401, 403, 404), or that your account is at a billing wall (402, 429). Those six do not move with an Anthropic outage at all. If your tool sees them more often while the status page shows a red bar, that is coincidence; they were always going to fire on those specific requests.
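The retry column of that table collapses to a nine-entry map. A sketch of the behavior described above, not the SDK's internal logic:

```typescript
// Which of the nine ErrorType values the SDK retries automatically,
// per the table earlier in this article.
const sdkRetries: Record<string, boolean> = {
  invalid_request_error: false, // 400: the request itself is wrong
  authentication_error: false,  // 401: needs a new key or re-link
  permission_error: false,      // 403: org/model not enabled
  not_found_error: false,       // 404: implementation bug
  rate_limit_error: false,      // 429: quota, waits for resetsAt
  timeout_error: true,          // 408/504: gateway gave up mid-flight
  overloaded_error: true,       // 529: shared pool saturated
  api_error: true,              // 5xx: generic upstream failure
  billing_error: false,         // 402: user has to top up
};
```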

From @anthropic-ai/sdk/resources/shared.d.mts, line 19:

type ErrorType = 'invalid_request_error' | 'authentication_error' | 'permission_error' | 'not_found_error' | 'rate_limit_error' | 'timeout_error' | 'overloaded_error' | 'api_error' | 'billing_error';

What the SDK does without asking you

On a 529 or a 5xx the SDK does not throw immediately. It re-issues the same request a small number of times with exponential backoff, honoring the Retry-After header when the server provides one. Each retry emits a system event the agent host can subscribe to, with five fields the host can show the user.
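The backoff policy is easy to sketch. This is not the SDK's actual code, just the behavior described above, with an assumed 1-second base delay:

```typescript
// Which statuses get retried, per the table earlier in the article:
// timeouts (408) and the whole 5xx range, including 529.
function isRetryable(status: number): boolean {
  return status === 408 || (status >= 500 && status < 600);
}

// Delay before the next attempt. Retry-After (in seconds) wins when the
// server sends it; otherwise exponential backoff doubles per attempt.
// The 1000 ms base is an assumption, not the SDK's real constant.
function retryDelayMs(attempt: number, retryAfterHeader?: string): number {
  if (retryAfterHeader !== undefined) {
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  }
  return 1000 * 2 ** (attempt - 1); // attempt 1 -> 1s, attempt 2 -> 2s, ...
}
```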

The five fields are pinned in src/lib/chat-events.ts lines 85 to 91: httpStatus, errorType, attempt, maxRetries, and retryDelayMs. Those are the bytes that walk from Anthropic, through the SDK, through the agent process, into the chat overlay you actually look at while waiting.
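In TypeScript, those five fields and the overlay text they produce look like this. The interface mirrors the names pinned in chat-events.ts; the formatting function is an illustrative sketch, not mk0r's actual renderer:

```typescript
// The five fields from src/lib/chat-events.ts, as a host-side type.
interface ApiRetryEvent {
  httpStatus: number;   // e.g. 529
  errorType: string;    // e.g. 'overloaded_error'
  attempt: number;      // which retry this is
  maxRetries: number;   // SDK's retry budget
  retryDelayMs: number; // sleep before the next attempt
}

// Render the two overlay lines the way the article describes them.
function overlayLines(e: ApiRetryEvent): [string, string] {
  return [
    `Retrying request, attempt ${e.attempt}/${e.maxRetries}`,
    `${e.errorType} (HTTP ${e.httpStatus}) - next in ${e.retryDelayMs / 1000}s`,
  ];
}
```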

Where those bytes actually travel

The path is short and worth seeing in one frame. The hub is the patched ACP wrapper inside the in-VM agent process; without it, two of the four sources go silent.

api_retry forwarding path: Anthropic API → Claude SDK → ACP runtime (patched-acp wrapper) → ACP session → Bridge HTTP → /api/chat NDJSON → useChat store → ChatOverlay

The five fields cross four process boundaries: from Anthropic over HTTPS into the SDK; from the SDK into the agent runtime as a system event; from the runtime into the bridge as an ACP sessionUpdate; from the bridge into the browser as an NDJSON line on the live response stream. By the time they land, they are still the same five integers and one short string that left Anthropic.
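The last hop is plain NDJSON: one JSON object per line on the live response stream. A minimal encode/decode sketch; the payload envelope here is an assumption, not mk0r's exact wire shape:

```typescript
// One sessionUpdate as it might cross the bridge-to-browser boundary.
type SessionUpdate = {
  type: 'api_retry';
  httpStatus: number;
  errorType: string;
  attempt: number;
  maxRetries: number;
  retryDelayMs: number;
};

// NDJSON: one JSON document per line, newline-terminated.
function encodeNdjsonLine(u: SessionUpdate): string {
  return JSON.stringify(u) + '\n';
}

// Parse a received chunk back into updates, skipping blank lines.
function decodeNdjsonChunk(chunk: string): SessionUpdate[] {
  return chunk
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as SessionUpdate);
}
```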

One real outage, message by message

The shape below is what a single user prompt looks like during a mid-grade Anthropic incident: two consecutive 529s, the SDK's own backoff, then a clean stream on attempt three. The Bridge row is the line of code that surfaces each step to the user.

One prompt during a 529 wave (caller, Anthropic, bridge, user UI):

  1. Caller → Anthropic: POST /v1/messages
  2. Anthropic: 529 overloaded_error
  3. Bridge → UI: api_retry, attempt 1/3, retry after 4s
  4. Anthropic: 529 overloaded_error
  5. Bridge → UI: api_retry, attempt 2/3, retry after 8s
  6. Anthropic: 200, stream begins
  7. Bridge → UI: agent_message_chunk

The user's screen during this is not blank. The amber retry overlay (defined in src/app/(landing)/page.tsx around line 511) reads "Retrying request, attempt 2/3" on top, then "overloaded_error (HTTP 529) - next in 8s" below. When attempt 3 succeeds it disappears and text_delta chunks start flowing. The whole episode reads like a polite pause, not a crash.

Why the stock SDK silence is the worst outage UX

Without a host that listens to the SDK's system events, a 529 wave looks like nothing. The SDK swallows the error, sleeps, tries again, and eventually either succeeds (great, no one needed to know) or gives up (now you have a generic timeout message after half a minute of silence). The user's mental model in that window is wrong: they assume the model is thinking, when it is actually waiting.

The patched ACP entry at src/core/vm-scripts.ts (lines 18 onward, with the api_retry forwarder at lines 112 to 137) wraps the agent loop's system iterator and re-emits five event types the published claude-agent-acp drops on the floor: api_retry, rate_limit_event, compact_boundary, tool_progress, and task_notification. Each becomes a sessionUpdate the bridge can route. The reason this matters during a Claude incident: the difference between "model is thinking for thirty seconds" and attempt 2/3 - overloaded_error - next in 8s is the difference between users hammering the retry button (making things worse) and users waiting one beat (which usually works).

The patch is small. The user-experience delta during a real outage is large.
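The core of the patch can be sketched as an async-generator wrapper: pass every system event through unchanged, and additionally forward the five dropped types. Event shapes and the emit callback are assumptions here, not the real vm-scripts.ts signatures:

```typescript
// Minimal shape for a system event coming off the agent loop.
type SystemEvent = { subtype: string; [key: string]: unknown };

// The five event types the published adapter drops on the floor.
const FORWARDED = new Set([
  'api_retry',
  'rate_limit_event',
  'compact_boundary',
  'tool_progress',
  'task_notification',
]);

// Wrap the agent loop's iterator: every event still flows downstream,
// but the interesting ones are also surfaced as sessionUpdates.
async function* withForwarding(
  events: AsyncIterable<SystemEvent>,
  emitSessionUpdate: (e: SystemEvent) => void,
): AsyncIterable<SystemEvent> {
  for await (const event of events) {
    if (FORWARDED.has(event.subtype)) emitSessionUpdate(event);
    yield event; // always pass through unchanged
  }
}
```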

The 15 second watchdog that catches the SDK's blind spot

One more layer, because the SDK is not infallible. There is a class of failure where the agent process never receives the first chunk: the connection establishes, then the underlying socket stalls, then nothing. The SDK does not always notice in time. The host has to.

src/app/api/chat/route.ts line 421 sets ttftMs = 15_000. If the agent emits zero notifications in that window, the bridge evicts the session, sends an error event with the message "Agent did not respond within 15s. Please retry.", and captures a chat_ttft_timeout event for telemetry. The next request boots a fresh VM. This is the only path that produces a fast failure during a serious Claude incident, because the SDK's own retry budget plus backoff can chew up half a minute on its own.
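Stripped of the session plumbing, the watchdog is a Promise.race between the first agent notification and a timer. A sketch: only the ttftMs value and the error message come from the source, the rest is illustrative:

```typescript
const ttftMs = 15_000; // matches route.ts line 421

// Race the first notification against the watchdog. On timeout the
// bridge's fixed message is thrown; on success the timer is cleared.
async function raceFirstToken<T>(first: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const watchdog = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error('Agent did not respond within 15s. Please retry.')),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([first, watchdog]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Usage: `await raceFirstToken(firstNotification, ttftMs)` where firstNotification resolves when the agent's first chunk arrives.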

Numbers worth remembering

  • 9 error types: distinct values in Anthropic's SDK ErrorType union. Two show up during outages; most of the rest point at the caller or the account.
  • 15s watchdog: the time-to-first-token timeout. After it fires, the bridge evicts the session and lets the next prompt start fresh.
  • 529: the HTTP code Anthropic returns for overloaded_error. The one you actually see when status.claude.com goes red.

What this means if you are about to ship something on Claude

The thing that breaks user trust during a Claude incident is not the incident. Outages happen on every API; users understand that. What breaks trust is silence. A spinner that has been turning for twenty seconds with no explanation feels like a bug in your product, even when the actual fault is upstream.

Three things to put in your own app, if you are not using mk0r and are wiring Claude in directly:

  1. Subscribe to the SDK's retry events. The Anthropic SDK exposes them; do not throw the information away. Show attempt N of M and the next delay.
  2. Classify errorType; do not regex the message. The structured field is stable, the message string is not. mk0r's classifyPromptError function shows the shape: status plus type, falling back to regex only when the SDK omits both.
  3. Set a TTFT watchdog. Pick a number you can defend (mk0r picked 15 seconds because that is roughly twice the model's normal first-token latency under load). When it fires, fail loudly and quickly so the user can decide what to do next.

None of this stops a real outage. It changes how the outage feels to the user from this app is broken to this is a Claude wave, the app is handling it.

Where to actually look during an incident

For the global question, three sources, in order:

  • status.claude.com for official confirmation, per-surface and per-model. This is the record of truth.
  • downdetector.com and IsDown for user-report aggregations. Useful for spotting an incident before it's posted on the status page; noisier and less authoritative.
  • Hacker News and the Anthropic Discord for the developer-side explanation, often before the postmortem.

For the local question (whether your specific request is stuck), the answer should be inside your own app. If it is not, that is something to fix before the next incident, not during.

Want a teardown of how your app handles a Claude outage?

Fifteen minutes. Bring the live URL and we will walk through the retry path together, end to end.

Frequently asked questions

Where do I check whether Claude is actually down right now?

status.claude.com is the authoritative source. It lists incidents per surface (Claude.ai, the API, Claude Code, individual model families like Opus 4.7, Sonnet 4.6, Haiku 4.5) and timestamps each incident's start, mitigation, and resolution. DownDetector and IsDown aggregate user reports, which surface trouble faster but are noisier; they pick up local connectivity issues alongside real outages. For a builder, the status page is the right starting point and the wrong stopping point: it tells you whether the API is degraded, not whether your specific request just got rate-limited.

What does 'overloaded_error' mean and why do I see it during outages?

overloaded_error is one of nine error types Anthropic's SDK exposes (defined at @anthropic-ai/sdk/resources/shared.d.mts line 19). It maps to HTTP 529 and means the model server has more concurrent demand than it can serve right now. During the April 28 2026 outage, status.claude.com listed 'elevated errors' across Sonnet 4.5, Haiku 4.5, and the api.anthropic.com surface; underneath that wording, what callers were seeing was a mix of 529 overloaded_error and 5xx api_error responses with the SDK retrying on its own. The wait between attempts comes from the Retry-After header when present, otherwise from exponential backoff inside the SDK.

What is the difference between 429 rate_limit_error and 529 overloaded_error?

429 rate_limit_error means your account or organization hit a per-minute or per-day quota; the resetsAt timestamp tells you when capacity returns. 529 overloaded_error means the shared infrastructure is saturated, which has nothing to do with your quota. mk0r distinguishes them in src/app/api/chat/route.ts at the classifyPromptError function (lines 43 to 82): 402 and 429 collapse into 'credit_exhausted' for the UI, 401 and 403 collapse into 'auth_required', everything else falls through to a generic retry overlay. They look similar in a status page graph and they need different responses from the user.
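The status-code half of that classification fits in a few lines. A sketch of the shape only; the real classifyPromptError also inspects errorType and falls back to regexing the message when the SDK omits both:

```typescript
type PromptErrorClass = 'credit_exhausted' | 'auth_required' | 'generic';

// 402/429 collapse into a billing wall, 401/403 into an auth wall,
// everything else falls through to the generic retry overlay.
function classifyByStatus(status: number): PromptErrorClass {
  if (status === 402 || status === 429) return 'credit_exhausted';
  if (status === 401 || status === 403) return 'auth_required';
  return 'generic';
}
```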

Does mk0r have a fallback when Claude is down?

There is no fallback to a different model provider. mk0r's system prompt, the patched ACP wrapper, and the in-VM Claude Code subprocess are all tuned for Claude specifically; switching the LLM mid-session would not give you a working app, it would give you a different app. What mk0r does instead is surface every retry attempt by name, so you know whether the problem is transient (an attempt 2/3 with a five-second backoff, almost always recovers) or terminal (auth_required or billing_error, which need user action). Honest about the floor: when Anthropic's API is fully down, mk0r is fully down too. status.claude.com is the truth.

What does the live retry overlay actually show on screen?

Look at src/app/(landing)/page.tsx around line 511. When the SDK reports an api_retry event, the page renders an amber pulsing dot and two lines: 'Retrying request, attempt 2/3' on top, 'overloaded_error (HTTP 529) - next in 8s' below. The numbers come straight from the SDK's emitted attempt, max_retries, error_status, error, and retry_delay_ms fields. Once attempts exceed max_retries the bridge classifies the prompt_error and the overlay flips to a terminal error pill (credit_exhausted, auth_required, or generic).

Why does the stock SDK not show this?

The published claude-agent-acp entry receives the SDK's system events but does not forward several of them onto the ACP session: api_retry, rate_limit_event, compact_boundary, tool_progress, task_notification. The patched entry at src/core/vm-scripts.ts (lines 18 onwards) wraps the AsyncIterable that drives the agent loop and re-emits each of those as a sessionUpdate the host can read. Without the patch, a 529 looks like silence on the wire for up to 30 seconds. With the patch, you get attempt-by-attempt visibility while it happens.

What is the TTFT watchdog and when does it fire?

Time-to-first-token. src/app/api/chat/route.ts line 421 sets ttftMs to 15000. If the agent loop emits zero notifications in the first 15 seconds, the bridge evicts the session, captures a chat_ttft_timeout event, and sends 'Agent did not respond within 15s. Please retry.' This is a separate signal from api_retry, since api_retry fires when the SDK gets back a real HTTP error and decides to retry, while the TTFT watchdog fires when nothing comes back at all. During a Claude outage you typically see the TTFT watchdog after the SDK has exhausted its own retries silently.

How long do typical Claude outages last?

Looking at status.claude.com history for the past two weeks: most incidents marked 'Elevated errors' on a single model resolve in 30 to 60 minutes. The April 28 2026 Claude.ai outage that blocked logins on the web and Claude Code lasted from 17:34 to 18:52 UTC, about 78 minutes. The April 15 2026 widespread incident across the API and Claude Code was longer. The pattern: model-specific incidents are short; surface-wide incidents (Claude.ai login or api.anthropic.com itself) are the ones that last over an hour. Either way, the SDK's auto-retry usually masks the first wave of failures.

Can I keep using mk0r if a specific Claude model is down?

Quick mode and VM mode call different model surfaces. Quick mode uses Claude Haiku for the streaming HTML path. VM mode uses whatever model the in-VM Claude Code subprocess defaults to (typically Sonnet, configurable). When status.claude.com lists 'Elevated errors on Claude Haiku 4.5' but Sonnet is fine, Quick mode sees retries while VM mode keeps working, and vice versa. The retry overlay tells you which path is hurting; the status page tells you why.

What should I do when I see the retry overlay?

Wait one cycle. The SDK's default max_retries is small (typically 2 to 3) and the backoff delay is short. If you see attempt 1/3 with a 4-second delay, you are five to ten seconds away from a successful response or a final error. If the overlay flips to 'credit_exhausted' or 'auth_required', the action is on you (top up the API account or re-link the OAuth). If it flips to a generic error and status.claude.com shows an active incident, the action is patience. The thing not to do is mash retry; that just adds new attempt 1/3 timers next to the still-running ones.

mk0r.AI app builder
© 2026 mk0r. All rights reserved.