Sandboxes and hosts

Fixing the compute under a run: a sandbox that will not start or is unhealthy, an environment that is degraded or blocked, a host you brought showing as disconnected, and a profile test run that failed.

When a run stops because of the sandbox or the host, not the work, this is the page. Every run executes in an isolated sandbox, and that sandbox runs either on the platform's managed compute or on a host you brought. When the compute underneath a run has a problem, the run cannot proceed, and the fix is here rather than in the run itself. If the run stopped for a reason you have not pinned to the sandbox, start at flows and runs.

Managed compute or a host you brought

Which of the scenarios below apply depends on the kind of compute under your run, so it helps to know which you are on. By default a sandbox runs on the platform's managed compute: you do not operate it, and the only states to read are the sandbox's own. If you registered a host of your own (a Kubernetes or Docker host), you also operate the part that checks in with the platform, and that host can show as disconnected on top of anything happening in the sandbox. The sandbox and environment scenarios apply to both kinds of compute; the disconnected-host scenario applies only to a host you brought. Either way, managed compute stays available as the default.

A sandbox will not start or is unhealthy

What you see. The sandbox sits in creating without reaching running, or it shows failed. A run that needed it cannot begin or stops partway.

Why it happens. A stall or a failure points at one of a few causes. A sandbox can be slow or unable to come up while its image is pulled or built. The agent's connection inside the sandbox may not have come up in time. A sandbox can run out of memory for the work it was given, which ends it as failed. And a sandbox whose definition drifted from what the platform expects is recreated rather than reused. The state tells you where it stopped: a sandbox moves through creating, running, paused, and either destroyed or failed.
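The lifecycle named above can be read as a small state machine. The sketch below is illustrative only: the state names come from this page, but the transition table is an assumption (for example, that a paused sandbox can resume), not the platform's definition.

```python
from enum import Enum


class SandboxState(Enum):
    CREATING = "creating"
    RUNNING = "running"
    PAUSED = "paused"
    DESTROYED = "destroyed"
    FAILED = "failed"


# Assumed transitions, for illustration only; the platform may allow others.
ASSUMED_TRANSITIONS = {
    SandboxState.CREATING: {SandboxState.RUNNING, SandboxState.FAILED},
    SandboxState.RUNNING: {
        SandboxState.PAUSED,
        SandboxState.DESTROYED,
        SandboxState.FAILED,
    },
    SandboxState.PAUSED: {
        SandboxState.RUNNING,
        SandboxState.DESTROYED,
        SandboxState.FAILED,
    },
    SandboxState.DESTROYED: set(),
    SandboxState.FAILED: set(),
}


def is_terminal(state):
    """A state with no outgoing transitions ends the sandbox's life."""
    return not ASSUMED_TRANSITIONS[state]


def can_move(current, target):
    """True if the assumed table allows moving from current to target."""
    return target in ASSUMED_TRANSITIONS[current]
```

Read this way, a sandbox stuck in creating never made its first transition, while failed is a terminal state: the run needs a new sandbox rather than a resumed one.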

How to fix it. Most start failures clear on a retry, since a slow image pull or a transient bring-up problem usually resolves on the second attempt. If it keeps failing, the cause is usually the profile: check that the sandbox profile names an image the platform can reach and the tools the work needs. If a sandbox is ending as failed partway through heavy work, it is likely out of memory, and the fix is to give the profile a larger resource class. The three classes are planning (1 CPU, 2 GiB), development (2 CPU, 4 GiB), and development-large (4 CPU, 8 GiB); moving up a class gives builds or tests more room. A profile can also set its CPU or memory directly when an admin has configured an override, so an unusual workload is not held to the three classes.
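As a rough aid for choosing a class, the three resource classes can be written down as data. This is a minimal sketch: the class names and sizes are the ones listed on this page, while the helper names `smallest_class_for` and `next_class_up` are hypothetical and not part of any platform API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceClass:
    name: str
    cpus: int
    memory_gib: int


# The three classes named on this page, smallest first.
RESOURCE_CLASSES = (
    ResourceClass("planning", 1, 2),
    ResourceClass("development", 2, 4),
    ResourceClass("development-large", 4, 8),
)


def smallest_class_for(cpus_needed, memory_gib_needed) -> Optional[ResourceClass]:
    """Return the smallest class that covers the need, or None if none does."""
    for resource_class in RESOURCE_CLASSES:
        if (resource_class.cpus >= cpus_needed
                and resource_class.memory_gib >= memory_gib_needed):
            return resource_class
    return None  # beyond the three classes: an admin-configured override


def next_class_up(current_name) -> Optional[ResourceClass]:
    """Return the class one step larger, or None if already at the largest."""
    names = [resource_class.name for resource_class in RESOURCE_CLASSES]
    index = names.index(current_name)  # raises ValueError for an unknown name
    if index + 1 < len(RESOURCE_CLASSES):
        return RESOURCE_CLASSES[index + 1]
    return None
```

For example, a build that needs about 3 GiB does not fit planning, so `smallest_class_for(1, 3)` returns the development class. A `None` result marks the case where only a direct CPU or memory override would help.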

Three things report their own state: the sandbox, a host you brought, and the environment the work targets. For each, the healthy state is green, a warning is amber, and the state that stops a run is red. Reading the right one tells you whether the trouble is the sandbox, the host, or the environment.

Screenshot to capture

A sandbox detail page showing a failed sandbox: a status pill reading failed, the profile it was launched from, the host it ran on, a last-activity timestamp, and a short reason line such as the sandbox ran out of memory, with a control to retry the run that used it

save as: public/docs-media/sandbox-detail-failed.png

Caption when added: A sandbox detail page names the state and the profile behind it. A failed sandbox is usually a retry or a larger resource class away.

An environment is degraded or blocked

What you see. A run targeting an environment will not proceed, and the environment shows blocked, or it shows degraded with a warning. This is separate from the sandbox's own state: the sandbox can be perfectly healthy while the environment it targets is not ready.

Why it happens. The platform watches the environment the work is headed for and raises a blocker when something has to be in place first, so the work does not run against an environment that cannot accept it safely. A blocking issue makes the environment blocked and parks the run; a warning makes it degraded, which flags the concern without stopping the run. A blocker carries a plain reason, like infrastructure that needs to be in place, a database schema migration, a pipeline environment, a missing secret identity, or a capability the environment does not yet have.
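The rule above (a blocking issue parks the run, a warning only flags it) can be sketched as a small function. This is an illustration under assumptions: the `Issue` type and function names are hypothetical, and "healthy" is used here as the name for an environment with no open issues.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    reason: str
    blocking: bool          # True for a blocker, False for a warning
    resolved: bool = False  # set once a re-check finds the cause cleared


def environment_state(issues):
    """Derive the environment state from its open issues."""
    open_issues = [issue for issue in issues if not issue.resolved]
    if any(issue.blocking for issue in open_issues):
        return "blocked"
    if open_issues:
        return "degraded"
    return "healthy"


def run_can_proceed(issues):
    """Only a blocked environment parks the run; degraded does not."""
    return environment_state(issues) != "blocked"
```

With one open blocker such as a pending schema migration, `environment_state` returns "blocked" and the run waits. Once the blocker is resolved and only a warning remains, the state drops to "degraded" and the run continues.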

How to fix it. Open the environment-health panel and read the open blocker. Resolve what it names, whether that means running the change it asks for or supplying the missing piece, then have the platform re-check; once nothing is still blocking, the run can continue. A degraded environment with only a warning does not stop the run, but the warning is worth clearing for the same reason it was raised. A blocked run waiting on an environment is covered from the run's side on flows and runs.

Screenshot to capture

The environment-health panel for a target environment in a blocked state: a status reading blocked, one open blocker with a plain reason such as a pending schema migration, the environment named, and a re-check control that clears the blocker once the cause is resolved

save as: public/docs-media/environment-detail-blocked.png

Caption when added: The environment-health panel names the open blocker and its reason. Resolve what it names, re-check, and the run continues.

A host you brought shows as disconnected

What you see. A host you registered, a Kubernetes or Docker host of your own, shows disconnected on its detail page, and new sandboxes will not launch on it.

Why it happens. A host stays connected by checking in with the platform. Its detail page shows when it was last seen, and when those check-ins stop for longer than expected, the host moves from registered to disconnected. The usual causes are the operator that drives the host not running, a network path between the operator and the platform that is down, or an operator running with a credential that is no longer valid.
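The check-in rule can be pictured as a comparison between the last-seen time and a timeout. The sketch below is hypothetical: this page only says "longer than expected", so the five-minute threshold and the function name are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed value for illustration; the platform's real threshold is not stated here.
ASSUMED_CHECK_IN_TIMEOUT = timedelta(minutes=5)


def host_status(last_seen, now, timeout=ASSUMED_CHECK_IN_TIMEOUT):
    """Return 'disconnected' once check-ins have been silent longer than timeout."""
    if now - last_seen > timeout:
        return "disconnected"
    return "connected"


if __name__ == "__main__":
    last_seen = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(host_status(last_seen, last_seen + timedelta(minutes=2)))
    print(host_status(last_seen, last_seen + timedelta(minutes=10)))
```

The point of the sketch is that the status is derived from the last-seen time alone, which is why the host recovers without further action once the operator checks in again.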

How to fix it. Confirm the operator is running where you installed it (a pod for a Kubernetes host, the local operator for a Docker host), and restart it if it is not. If it is running and the host is still disconnected, check that it can reach the platform over the network. If the operator is running and reachable but the host still does not come back, re-register it to issue a fresh credential and replace the operator bundle; the previous credential stops working once a new one is issued. When the operator checks in again, the host returns to connected on its own, and the last-seen time updates. Managed compute is always available as the default, so a run can fall back to it while you bring a host back.
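The order of those checks matters, since each step only makes sense once the previous one has been ruled out. A minimal sketch of that order follows; the function and its two inputs are hypothetical, and you would establish them by hand rather than through any platform call.

```python
def next_host_fix(operator_running, platform_reachable):
    """Return the next step for a disconnected host, in the order given above."""
    if not operator_running:
        return "restart the operator"
    if not platform_reachable:
        return "restore the network path from the operator to the platform"
    # Running and reachable but still disconnected: suspect the credential.
    return "re-register the host and replace the operator bundle"
```

Re-registering comes last on purpose: it invalidates the previous credential, so it is worth ruling out the cheaper causes first.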

Screenshot to capture

A registered host detail page showing a disconnected host: a status pill reading disconnected, a last-seen timestamp some minutes ago, the operator version, and guidance to restart or re-register the operator, with a Re-register control

save as: public/docs-media/host-detail-disconnected.png

Caption when added: A host detail page shows when it was last seen. A disconnected host is almost always the operator needing a restart or a re-register.

A profile test run failed

What you see. You ran a sandbox profile test and it came back failed, with a message and a set of checks, one of them marked as the one that failed.

Why it happens. A profile test launches a real sandbox from the profile and runs a sequence of checks against it, so a failure tells you exactly which part of the profile is not right before you depend on it for real work. The checks are: the sandbox provisioned, a basic command ran, the runtime tools the profile declares are present, the connected tools the profile uses are reachable, the model credential authenticates, and the capability bindings resolve. The test reports the check that failed and the output that explains it.
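The sequence of checks can be sketched as an ordered list that is walked until something fails. This is a sketch under stated assumptions, not the platform's implementation: the check identifiers are hypothetical, the order follows the list above, and stopping at the first failure is an assumption (the real test may run and report later checks too).

```python
from dataclasses import dataclass

# Hypothetical identifiers, in the order the checks are listed above.
CHECK_ORDER = (
    "provision",
    "exec",
    "runtime_tools",
    "connected_tools",
    "model_auth",
    "capability_bindings",
)


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str = ""


def run_profile_test(checks):
    """Run checks in order.

    `checks` maps each name in CHECK_ORDER to a zero-argument callable that
    returns a (passed, output) tuple. Returns (outcome, failed_check, results).
    """
    results = []
    for name in CHECK_ORDER:
        passed, output = checks[name]()
        results.append(CheckResult(name, passed, output))
        if not passed:
            break  # assumption: stop at the first failed check
    failed = next((result for result in results if not result.passed), None)
    outcome = "failed" if failed else "passed"
    return outcome, failed, results
```

Given stub checks where only the runtime-tools step fails, the function returns the outcome "failed" together with that one check and its output, which is the single place in the profile to look.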

A profile test runs these checks against a real sandbox and reports the one that failed, so the fix points at one part of the profile rather than the whole thing.

How to fix it. Read the failed check; each points at a specific part of the profile.

Provision or exec failed. The image or the basic setup is the issue. Check the profile's image and read the captured output for the error.
A runtime tool is missing. A tool the profile declares was not found in the sandbox. Correct the tool name, or use an image that includes it.
The model credential did not authenticate. The provider key the profile uses was rejected. Check the AI runtime configuration behind the profile.
A connected tool was not reachable. A connected server the profile uses could not be reached or needs sign-in. Reconnect it, or check its address, on connections and access.
A capability binding did not resolve. A credential the profile binds for the agent is not configured. Set up the binding on permissions and secrets, then test again.

Screenshot to capture

A sandbox profile test result panel: an overall outcome reading failed, a list of checks with pass and fail markers for provision, exec, runtime tools, model auth, connected servers, and capability bindings, the failed check highlighted, and a redacted output tail explaining the failure

save as: public/docs-media/profile-test-results-failed.png

Caption when added: A profile test reports which check failed and the output behind it, so the fix points at one part of the profile.

Confirming a profile fix, start to finish

A profile test on the node-build profile comes back failed. The check list shows provision and exec passed with green markers, then runtime tools is marked failed: the profile declares pnpm, and the sandbox did not have it.

The fix is in the profile, not the work. The image the profile names ships npm but not pnpm, so the declared tool was never going to be found. You point the profile at an image that includes pnpm, save, and run the test again.

This time provision, exec, and runtime tools pass, the connected tools resolve, the model credential authenticates, and the capability bindings bind. The overall result reads passed. The profile is safe to launch real work against, and you found that out from a test run rather than from a failed flow.

One test, one named check, one fix. You confirm the profile before a run ever depends on it.

Still failing after the fix

If you cleared what a state named and the sandbox, environment, or host still will not come up, the detail page holds what you need to get help. A sandbox carries its own identifier and the profile it launched from; a host carries its last-seen time and operator version. Quote those so anyone looking with you lands on the same compute. The statuses reference names every state and what each one means. If the state and the fix still do not line up, your account contact can pick it up from the identifier and the failed check or blocker.

Why the compute layer fails loudly

A sandbox, an environment, and a host each report their own state plainly, on purpose. The work an agent does is only as trustworthy as the place it runs, so the platform would rather stop and name a problem with the compute than run against an image that did not build, an environment that is not ready, or a host that is not really there. A failed sandbox, a blocked environment, and a disconnected host are not dead ends; each names its cause and each has a direct fix, and managed compute stays available underneath so you are never without a place to run.

For an administrator, the host and environment states are the ones to keep an eye on. A host's last-seen time and an environment's blockers are where a problem shows up first, and both clear the moment the underlying cause is resolved.

For a prospect, the takeaway is that the compute under a run is observable and recoverable. Nothing fails silently, and a managed default means a problem with a host you brought never leaves you unable to run.

Sandbox profiles

The image, tools, resources, and host a sandbox runs under.

Sandbox hosts

Managed compute and bringing your own host.

Connections and access

When the problem is reaching a repository, a tool, or a document source.