Troubleshooting overview
How to read a failure and where to look before you fix it. The surfaces that tell you what happened, the shape of recovery, and a map to the scenario-by-scenario guides.
A stopped run is not a lost run. When agent work stops, it tells you why: a failed run names the step that failed, a waiting run names what it is waiting for, and the work done before the stop is already saved. Almost every stop has one next move: resolve the cause, then retry or resume.
This page shows you where the product writes those signals. The scenario pages that follow turn each common failure into a short entry: what you see, why it happens, and how to fix it.
Where to look
When something stops, five surfaces hold the answer. You rarely need more than the first two.
The run and its step timeline. Start here. Open the run. Its status tells you the shape of the problem at a glance, and the step timeline shows exactly which step stopped or what the run is waiting for. This is the first place to look and usually the last.
The step transcript. When you need the detail. Each step keeps a transcript of what the agent actually did: its file edits and writes, every tool call, the number of turns it took, and the cost in dollars. When a step fails, the transcript is the record of the work up to that point, and the error sits at the end of it. You are reading what happened, not guessing.
The environment-health panel. Some stops are not about the agent at all. They are the platform telling you the work cannot proceed safely until something outside the run is resolved, such as a credential needed to reach a repository or an environment with an open issue. These show up as blockers with a plain reason, and the run waits rather than pressing on. Resolve the blocker and the run can continue from where it paused.
The audit trail. For the question of who did what and when, the audit trail is the record. Every significant action writes an entry with the person, the change, and the time, and an AI-authored action is marked as such. It is the surface an administrator brings up to tie a result back to the action that produced it.
Notifications. You do not have to sit and watch. When a run reaches a checkpoint, hits a blocker, or fails, it can notify you, so the thing that needs attention comes to you rather than waiting to be found.
The shape of recovery
Almost every recovery is one of four moves. Which one applies depends on the state the run is in.
| Move | Use it when | What it does |
|---|---|---|
| Retry | A run failed, was cancelled, or was interrupted by a platform update | Starts the run again from where it stopped, on a fresh sandbox if the old one cannot be reused. |
| Resume | A run is paused at a checkpoint, or blocked on a credential or environment | Continues the run once you have approved the checkpoint or cleared the blocker. |
| Cancel | A run is still going and you want it to stop | Stops the run and releases the sandbox it was using. |
| Continue in Chat | A background task is waiting for you | Picks the work up in a chat where you can steer it directly. |
In short: if the run stopped on its own or was cancelled, retry. If it is waiting on you or on something outside it, clear that and resume. If it is still running and you want it to stop, cancel. If a background task is waiting for you, continue in chat.
Each move is a button in the run's header, shown for the state the run is in, so you do not have to remember which verb applies where. A retry and a resume both pick up from the first step that did not finish, so neither repeats a step that already succeeded, and the run's earlier transcript stays on the record either way.
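The "first step that did not finish" rule can be pictured with a small sketch. It assumes a run is an ordered list of steps, each marked as succeeded or not; that representation and the function name `first_unfinished_step` are assumptions made for illustration, not how the product stores runs.

```python
from typing import Optional, Sequence


def first_unfinished_step(step_succeeded: Sequence[bool]) -> Optional[int]:
    """Return the index of the first step that did not finish.

    Returns None when every step succeeded, meaning there is nothing
    left to retry or resume.
    """
    for index, succeeded in enumerate(step_succeeded):
        if not succeeded:
            return index
    return None
```

With steps `[True, True, False, False]`, the function returns `2`: the two steps that already succeeded are skipped, and work picks up at the third.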
Find your scenario
Each page below is a set of real failures with the cause and the fix for each. Start with the one that matches where the problem showed up.
A run failed, a checkpoint is waiting, a run is blocked or was interrupted, or you hit a run or compute limit.
A sandbox will not start or is unhealthy, a host you brought shows as disconnected, or a profile test run failed.
GitHub or Azure DevOps authorization failed, a repository could not be cloned, an MCP server will not connect, or a document import needs re-authorization.
A "permission denied" message, or the agent cannot reach a file or a secret it needs.
A pull request could not be created, or a merge would not go through.
Still stopped after the fix
If you cleared the named cause and the run still will not move, the run itself holds what you need to get help. Every run carries its own identifier on its summary card; quote it so anyone looking with you lands on the same run. The statuses reference names every state and what each one means, which is the fastest way to confirm the run is where you think it is. If the state and the fix still do not line up, your account contact can pick it up from the run identifier and the step that stopped.
Why a stopped run is safe to leave
Troubleshooting here is mostly reading rather than rebuilding, by design. Agent work runs in discrete steps, each step records what it did, and the records an agent writes land in your work model as it goes, not in a buffer a failure can wipe. You resolve the cause and pick up from there.
That is why every stop comes with a reason: a blocker says what it needs, a failed step has an error and a transcript, a checkpoint says what it is waiting for.
For a current user, the move is almost always the same: open the run, read the step that stopped, fix the named cause, and retry or resume. The scenario pages give you the named causes.
For a prospect evaluating the platform, the thing to notice is that failure is a first-class state here, not an afterthought. Work is recoverable because it was inspectable and saved along the way.