Troubleshooting overview

How to read a failure and where to look before you fix it: the surfaces that tell you what happened, the shape of recovery, and a map to the scenario-by-scenario guides.

A stopped run is not a lost run. When agent work stops, it tells you why: a failed run names the step that failed, a waiting run names what it is waiting for, and the work done before the stop is already saved. Almost every stop has one next move: resolve the cause, then retry or resume.

This page shows you where the product writes those signals. The scenario pages that follow break each common failure into three short parts: what you see, why it happens, and how to fix it.

Where to look

When a run stops, five surfaces hold the answer. Start with the run and its step timeline; you rarely need to go past the step transcript.

The run and its step timeline. Start here. Open the run. Its status tells you the shape of the problem at a glance, and the step timeline shows you exactly which step stopped. A failed run names the step that failed; a waiting run names what it is waiting for. This is the first place to look and usually the last.

Screenshot to capture

The Flow runs list with several runs in different states: one failed with a red status pill, one paused at a checkpoint, one blocked with a precondition note, and one completed, each row naming the Flow, who started it, and how long ago

save as: public/docs-media/flow-runs-list-statuses.png

Caption when added: The runs list shows each run's status at a glance. The pill on your stopped run is where recovery starts.

The step transcript. When you need the detail. Each step keeps a transcript of what the agent actually did: its file edits and writes, every tool call, the number of turns it took, and the cost in dollars. When a step fails, the transcript is the record of the work up to that point, and the error sits at the end of it. You are reading what happened, not guessing.

The environment-health panel. Some stops are not about the agent at all. They are the platform telling you the work cannot proceed safely until something outside the run is resolved: a missing credential for a repository, or an environment with an open issue. These show up as blockers with a plain reason, and the run waits rather than pressing on. Resolve the blocker and the run can continue from where it paused.

Screenshot to capture

The environment-health panel for a target environment showing one open blocker with a plain reason such as a pending schema migration, the affected environment named, and a control to mark it resolved

save as: public/docs-media/environment-health-blocker.png

Caption when added: An environment blocker holds the run with a plain reason. Resolve it and the run continues from where it paused.

The audit trail. For the question of who did what and when, the audit trail is the record. Every significant action writes an entry with the person, the change, and the time, and an AI-authored action is marked as such. It is the surface an administrator brings up to tie a result back to the action that produced it.
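
As a rough illustration of what an audit entry carries, the sketch below models the four facts named above: the person, the change, the time, and whether the action was AI-authored. The `AuditEntry` type, its field names, and the sample rows are assumptions made for this example, not the product's actual schema or export format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class AuditEntry:
    # Hypothetical field names; the real audit trail may label these differently.
    actor: str         # who acted
    change: str        # what was changed
    at: datetime       # when it happened
    ai_authored: bool  # True when an agent, not a person, made the change


def timeline(entries: List[AuditEntry]) -> List[AuditEntry]:
    """Return entries oldest first, the order you read them in to trace a result."""
    return sorted(entries, key=lambda e: e.at)


# Invented sample rows, for illustration only.
entries = [
    AuditEntry("agent", "opened pull request",
               datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc), True),
    AuditEntry("dana", "approved checkpoint",
               datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc), False),
]

for e in timeline(entries):
    marker = "AI" if e.ai_authored else "person"
    print(f"{e.at.isoformat()} [{marker}] {e.actor}: {e.change}")
```

Reading the entries in time order is what lets you tie a result back to the action that produced it.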

Notifications. You do not have to sit and watch. When a run reaches a checkpoint, hits a blocker, or fails, it can notify you, so the thing that needs attention comes to you rather than waiting to be found.

Screenshot to capture

A Flow run detail page in a failed state: the step timeline down the left with one step marked failed in red and the rest completed, and on the right the failed step expanded to show its error message, its summary, and the transcript of the agent actions that led up to it with turn count and cost

save as: public/docs-media/run-detail-failed-step.png

Caption when added: A stopped run shows which step stopped and why. The transcript is the record of what the agent did up to that point.

The shape of recovery

Almost every recovery is one of four moves. Which one applies depends on the state the run is in.

Move	Use it when	What it does
Retry	A run failed, was cancelled, or was interrupted by a platform update	Starts the run again from where it stopped, on a fresh sandbox if the old one cannot be reused.
Resume	A run is paused at a checkpoint, or blocked on a credential or environment	Continues the run once you have approved the checkpoint or cleared the blocker.
Cancel	A run is still going and you want it to stop	Stops the run and releases the sandbox it was using.
Continue in Chat	A background task is waiting for you	Picks the work up in a chat where you can steer it directly.

In short: if the run stopped on its own, retry. If it is waiting on you or on something outside it, clear that and resume. If it is still running and you want it to stop, cancel. If a background task is waiting for you, continue in Chat.

Each move is a button in the run's header, shown for the state the run is in, so you do not have to remember which verb applies where. A retry and a resume both pick up from the first step that did not finish, so neither repeats a step that already succeeded, and the run's earlier transcript stays on the record either way.
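
The state-to-move mapping can be sketched as a lookup table. This is a minimal sketch, not the product's API: the `RunState` names are hypothetical stand-ins (the statuses reference has the real state names), and only the four moves come from the table on this page.

```python
from enum import Enum


class RunState(Enum):
    # Hypothetical state names for illustration; see the statuses reference
    # for the names the product actually uses.
    FAILED = "failed"
    CANCELLED = "cancelled"
    INTERRUPTED = "interrupted"
    PAUSED_AT_CHECKPOINT = "paused_at_checkpoint"
    BLOCKED = "blocked"
    RUNNING = "running"
    TASK_WAITING = "task_waiting"


class Move(Enum):
    RETRY = "Retry"
    RESUME = "Resume"
    CANCEL = "Cancel"
    CONTINUE_IN_CHAT = "Continue in Chat"


# One move per state, following the recovery table.
MOVE_FOR_STATE = {
    RunState.FAILED: Move.RETRY,
    RunState.CANCELLED: Move.RETRY,
    RunState.INTERRUPTED: Move.RETRY,
    RunState.PAUSED_AT_CHECKPOINT: Move.RESUME,
    RunState.BLOCKED: Move.RESUME,
    RunState.RUNNING: Move.CANCEL,
    RunState.TASK_WAITING: Move.CONTINUE_IN_CHAT,
}


def next_move(state: RunState) -> Move:
    """Return the recovery move that applies to a run in the given state."""
    return MOVE_FOR_STATE[state]


print(next_move(RunState.BLOCKED).value)  # prints "Resume"
```

A table fits better than branching logic here because each state maps to exactly one move, which is also why the run's header can show a single button per state.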

Find your scenario

Each page below is a set of real failures with the cause and the fix for each. Start with the one that matches where the problem showed up.

Flows and runs

A run failed, a checkpoint is waiting, a run is blocked or was interrupted, or you hit a run or compute limit.

Sandboxes and hosts

A sandbox will not start or is unhealthy, a host you brought shows as disconnected, or a profile test run failed.

Connections and access

GitHub or Azure DevOps authorization failed, a repository could not be cloned, an MCP server will not connect, or a document import needs re-authorization.

Permissions and secrets

A "permission denied" message, or the agent cannot reach a file or a secret it needs.

Pull requests

A pull request could not be created, or a merge would not go through.

Still stopped after the fix

If you cleared the named cause and the run still will not move, the run itself holds what you need to get help. Every run carries its own identifier on its summary card; quote it so anyone looking with you lands on the same run. The statuses reference names every state and what each one means, which is the fastest way to confirm the run is where you think it is. If the state and the fix still do not line up, your account contact can pick it up from the run identifier and the step that stopped.

Why a stopped run is safe to leave

Troubleshooting here is mostly reading rather than rebuilding, by design. Agent work runs in discrete steps, each step records what it did, and the records an agent writes land in your work model as it goes, not in a buffer a failure can wipe. You resolve the cause and pick up from there.

When a run stops, everything before the stop is already saved to your work model. The run holds at the step that stopped, and a retry or resume picks up from there, never from the start.
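
The pick-up rule can be sketched in a few lines: find the first step that did not finish and continue from there, leaving earlier steps untouched. The `Step` type, the `succeeded` flag, and the step names are assumptions made for this example; the platform's own run model is not shown on this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    name: str
    succeeded: bool  # hypothetical flag: True once the step finished successfully


def resume_index(steps: List[Step]) -> Optional[int]:
    """Index of the first step that did not finish, or None if every step succeeded."""
    for i, step in enumerate(steps):
        if not step.succeeded:
            return i
    return None


def steps_to_run(steps: List[Step]) -> List[str]:
    """Names of the steps a retry or resume would execute, in order."""
    start = resume_index(steps)
    return [] if start is None else [s.name for s in steps[start:]]


# Invented step names, for illustration only.
run = [
    Step("plan", True),
    Step("implement", True),
    Step("test", False),     # the step that stopped
    Step("open_pr", False),  # never started
]

print(steps_to_run(run))  # prints ['test', 'open_pr']
```

The steps before the stop never appear in the list, which is the sense in which neither move repeats work that already succeeded.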

That is why every stop comes with a reason: a blocker says what it needs, a failed step has an error and a transcript, a checkpoint says what it is waiting for.

For a current user, the move is almost always the same: open the run, read the step that stopped, fix the named cause, and retry or resume. The scenario pages give you the named causes.

For a prospect evaluating the platform, the thing to notice is that failure is a first-class state here, not an afterthought. Work is recoverable because it was inspectable and saved along the way.

Sessions

Where runs, tasks, and chats live, and how to reopen them.

Statuses reference

Every run, step, and task state, and what each one means.