Test a sandbox profile
Before an agent runs real work on a profile, boot a throwaway sandbox from it and probe what matters. Test profile reports whether the image builds, the repos clone, the tools answer, the runtime authenticates, and the credentials resolve, so you find a broken profile in a click instead of mid-run.
A sandbox profile is configuration, and configuration can be wrong in ways you only find when an agent is already three steps into a run. A tool that was never installed. A repository the credential cannot reach. A runtime key that expired last week. Test profile is how you find that out first: it boots a one-off sandbox from the profile, runs a set of probes against it, and reports what works and what does not, before any real work depends on it.
This page is the operational how-to: where the action lives, what each probe checks, how to read a passed or failed result, and how the history builds up over time. For the concept of why profiles are testable at all, read the "test a profile" section of sandboxed execution; this page is the test surface itself.
Where the action lives
You run a test from a profile's Overview tab, under Platform → Sandbox Profiles. The Run test button sits at the top of the test panel, and the panel below it is where the result lands and the history accumulates. There is nothing to configure first; a test takes the profile exactly as it stands and tries it. A profile nobody has tested yet says so plainly, "not yet tested, run a test to verify this profile boots," so an untested profile is never quietly mistaken for a healthy one.
Running a test rides on sandbox-profile.use, the same lighter scope that lets you launch a sandbox from the profile in the first place. The reasoning is straightforward: anyone about to rely on a profile should be able to check it themselves, without holding full profile-management rights. Reading the past test runs rides on sandbox-profiles.read, so the trail of "did this profile boot last Tuesday" is visible to everyone who can see the profile.
What a test run checks
A test is not a single yes-or-no. It boots the sandbox and then runs a sequence of probes, each one answering a specific question about whether the profile will hold up under real work. The probes run in order, and two of them are the gate the overall result turns on.
- Provision boots the sandbox from the profile. If the image will not build or the host will not give you a container, the test stops here, because nothing else can be checked. This is the first gate.
- Smoke check runs a trivial command inside the container and waits for it to answer. It proves the sandbox is not just provisioned but alive and able to execute. This is the second gate: if the container cannot run a command, the test fails regardless of anything downstream, because a sandbox that cannot execute is a sandbox that cannot work.
- Runtime tools checks that each tool the profile declared is actually present on the path. A profile that promises Node and the cloud CLI is tested for Node and the cloud CLI, and a missing one is named.
- MCP reachability tries to reach each MCP server the profile wires in over the network, and reports whether it answered. This one is a signal, not a gate: a server that did not respond shows as a warning rather than failing the test, because MCP connections are established when a real run starts and a transient miss at test time is not a broken profile. Servers that run as a local process inside the sandbox are noted as verified-at-launch rather than probed here, since there is no agent process yet to host them.
- Runtime auth confirms the AI runtime is reachable and the key works, with a single minimal call. It runs for the runtimes that authenticate with a key the platform can resolve ahead of a run (Claude and Copilot today), and it is the difference between "the model is configured" and "the model will actually answer." A profile whose key is injected only at launch has nothing to check here and shows no auth row, which is correct, not a miss.
- Capability resolution does a dry run of the credentials and capabilities the profile expects, confirming each one resolves without using it for any real work. It is how you catch a database credential that was never set or a capability binding that points at nothing, before an agent reaches for it mid-task. When a profile is bound to an environment, the credentials that environment governs are part of what resolves here.
The result is passed or failed, one clear outcome, with the per-probe detail behind it. A profile passes when it provisions, executes, and resolves what an agent will genuinely need; the softer signals, like an MCP server that was briefly unreachable or a binding the platform can only confirm at launch, are reported without sinking the run.
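The way the rows roll up into one outcome can be sketched in a few lines. This is an illustration only, not the platform's implementation: the probe names, the status strings, and the `ProbeRow` and `overall_outcome` names are all hypothetical, and the rule it encodes (gates must answer, only a failed row sinks the run, notes and warnings do not) is read off the description above.

```python
from dataclasses import dataclass

# Hypothetical probe identifiers for the two gates described above.
GATES = ("provision", "smoke_check")


@dataclass
class ProbeRow:
    probe: str         # e.g. "provision", "runtime_tools", "mcp_reachability"
    status: str        # assumed values: "ok", "note", "warning", "failed"
    message: str = ""  # plain-language detail shown on the row


def overall_outcome(rows: list[ProbeRow]) -> str:
    """Return "passed" or "failed" for a finished run (illustrative rule)."""
    by_probe = {row.probe: row for row in rows}
    # A gate that never ran or did not answer fails the run outright.
    for gate in GATES:
        row = by_probe.get(gate)
        if row is None or row.status != "ok":
            return "failed"
    # Notes and warnings are soft signals; only a failed row sinks the run.
    if any(row.status == "failed" for row in rows):
        return "failed"
    return "passed"
```

Under this sketch, a run whose only blemish is an MCP warning still passes, while a single capability row that did not resolve turns the whole run red.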
A test is built to stay fast on a heavy profile, so the three list-style probes work through a bounded slice: the runtime-tool, MCP, and capability checks each cover up to a set number of entries, and when a profile declares more than that, the result says so plainly ("probed 10 of 24") rather than pretending it checked everything. The capability check also notes whose identity it ran as. A credential that resolves for the person running the test is not a guarantee it resolves for everyone, because some bindings authenticate as the individual launching the sandbox, so the result records the account it evaluated against and flags that a different launcher may see a different answer.
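The bounded slice is easy to picture as a small helper. The sketch below is hypothetical: the function name and the `limit` parameter are made up for illustration, and the page does not state what the real cap is (the 10 in the test of this sketch comes only from the "probed 10 of 24" example).

```python
def bounded_slice(entries: list[str], limit: int) -> tuple[list[str], str]:
    """Pick at most `limit` entries to probe and describe the coverage honestly."""
    probed = entries[:limit]
    if len(entries) > limit:
        # Say plainly that only part of the list was checked.
        summary = f"probed {len(probed)} of {len(entries)}"
    else:
        # Wording for the full-coverage case is an assumption of this sketch.
        summary = f"probed all {len(entries)}"
    return probed, summary
```

The point of returning the summary alongside the slice is that the coverage statement travels with the result, so a partial check is never displayed as a complete one.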
What a test leaves behind
The sandbox a test boots is a throwaway, and the platform treats it as one. It comes up under its own isolation key, separate from any real conversation or flow, runs the probes, and is torn down the moment the probing is done. There is nothing left for you to clean up, and because each test gets its own isolated sandbox, you can run one without disturbing anything else, including another test on the same profile.
What persists is the result, not the container. Each run is recorded with its outcome, its timing, who ran it, and the full per-probe detail, and that record is kept for thirty days. So the sandbox is disposable in exactly the way every sandbox is, and the evidence of what it found outlives it: you get the answer without inheriting the box.
Reading the result
A finished run opens into a detail view that lays out everything the probes found. The header carries the outcome, passed or failed, with the timing and who ran it.
Below it, each probe is its own row, and a row reads one of four ways. Green means it answered. A neutral note marks a check the platform can only confirm once a real run starts, like a local MCP server. A warning marks a soft signal that did not sink the run, like an MCP server that was briefly unreachable. A red row is a real failure, carrying a plain-language message: a failed runtime-tool row names the tool that was missing, a failed capability row names the credential that would not resolve. Only the last of the four is asking you to fix anything, which is what keeps a long result honest instead of alarming.
Two things in that view do more than report. A fix-link sits on a failed probe and takes you straight to the place you would correct it: a capability that did not resolve links to the secret or the MCP configuration behind it, a binding problem links to the profile's Runtime tab. You read the failure and act on it in one move instead of hunting for where it lives.
And a configuration-changed note compares the profile as it stands now against the snapshot the run captured. A green result from last week is only as good as the profile that produced it, so when someone has since swapped the runtime or changed a feature in the image, the run that passed tells you it passed against a profile that no longer exists. The note is how a stale green stops being a false reassurance.
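One plausible way to detect that kind of drift is to fingerprint the profile at test time and compare it with the profile as it stands now. The sketch below assumes the profile can be represented as a JSON-serializable dictionary; the field names and the functions `fingerprint`, `is_stale`, and `changed_fields` are hypothetical, and the page does not say how the platform actually performs the comparison.

```python
import hashlib
import json


def fingerprint(profile: dict) -> str:
    """Hash a canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(profile, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_stale(snapshot: dict, current: dict) -> bool:
    """True when the run's snapshot no longer matches the live profile."""
    return fingerprint(snapshot) != fingerprint(current)


def changed_fields(snapshot: dict, current: dict) -> list[str]:
    """Name the top-level fields that differ, for a human-readable note."""
    keys = set(snapshot) | set(current)
    return sorted(k for k in keys if snapshot.get(k) != current.get(k))
```

For a snapshot of `{"runtime": "claude", "features": ["node"]}` and a current profile whose runtime was swapped to `"copilot"`, `is_stale` is true and `changed_fields` names `runtime`, which is the kind of detail a configuration-changed note needs.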
The detail view also carries a configuration-tested panel, a plain summary of what the run actually booted: the provider and host, the runtime and its source, the identity the sandbox ran as, the resource shape, and the counts of features, runtime tools, and MCP servers it found. It is the record of "this is the thing we tested," so a result is never an abstract pass or fail floating free of the configuration it describes. When a run fails partway in, the view expands the last command output from inside the sandbox, with secrets redacted, so a confusing failure has the actual log behind it rather than only a one-line message.
Not every red result is a problem with the profile's contents. A profile pointed at a host that is offline, or one that needs an entitlement your workspace does not carry, fails to provision with a message that names exactly that (the host could not be reached, or the host type is not enabled here) rather than a vague boot error. The test tells the difference between "the profile is wrong" and "the place it wants to run is not available."
The history that builds up
Because every run is recorded, the test panel is also a short history of the profile's health. The latest run sits at the top; below it, a row of pills marks the recent runs green or red at a glance, and a small strip of stats gives you the pass rate, the mean duration, and the current green streak, all computed over the recent runs still on file. A profile that has passed twenty times running reads differently from one that flaps between green and red, and the panel shows you which you are looking at without your having to open anything.
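The three stats are simple to compute from the recorded runs. The following is a minimal sketch under stated assumptions: the `Run` record and `history_stats` function are hypothetical, runs are assumed to be ordered newest first, and the "green streak" is taken to mean the number of consecutive passes counting back from the latest run.

```python
from dataclasses import dataclass


@dataclass
class Run:
    passed: bool
    duration_s: float  # wall-clock duration of the test run, in seconds


def history_stats(runs: list[Run]) -> dict:
    """Summarize recent runs; `runs` must be ordered newest first."""
    if not runs:
        return {"pass_rate": None, "mean_duration_s": None, "green_streak": 0}
    passes = sum(1 for run in runs if run.passed)
    # Count consecutive passes starting from the latest run.
    streak = 0
    for run in runs:
        if not run.passed:
            break
        streak += 1
    return {
        "pass_rate": passes / len(runs),
        "mean_duration_s": sum(run.duration_s for run in runs) / len(runs),
        "green_streak": streak,
    }
```

Because the stats cover only the runs still on file, a profile with few recent runs will show a pass rate that swings sharply with each new result; the streak is often the steadier signal.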
This is the part that makes a test worth keeping rather than a one-time gate. When a profile that has been solid for a month suddenly fails, the history is right there to tell you it was solid for a month, which is often the first clue about what changed.
Catching a broken profile before it bites, end to end
Priya is about to point a week of agent work at the Node 22 · Insights profile, the one Tom standardized, but she added Postgres to it this morning so the integration tests have a database. Before she trusts it, she opens the profile's Overview tab and hits Run test.
The run comes back red. Most of it is green: the image provisions, the container executes, Node and the other tools are all present, the runtime answers. But the capability-resolution probe failed on a row named db-admin, with a message that it did not resolve because no secret is configured. The Postgres feature is in the image; the credential the test database needs is not. She would have hit this halfway through the first run that touched the database, with an agent stuck and no obvious reason.
She clicks the fix-link on the failed row, which drops her straight into Secrets, adds the database credential, and comes back to the profile. The MCP-reachability row had also shown a warning, the Postgres MCP server not answering, but she leaves that alone: it is a soft signal, and the server comes up with a real run. She hits Run test again. This time it is green across the board, the capability resolves, and the run records a fresh pass against the profile as it now stands. The week's work starts on a profile she has proven, not one she is hoping about.
Who can run a test
Running a test is the lighter of the two profile permissions. sandbox-profile.use, the scope that lets you launch a sandbox from a profile, also lets you test it, so the people who depend on a profile can verify it without being the people who manage it. Reading the test history rides on sandbox-profiles.read. Editing the profile to fix what a test found is the heavier sandbox-profiles.manage, covered on sandbox profiles. The split means an engineer can prove a profile is sound before relying on it, even when changing it is someone else's job.
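The split can be written down as a small lookup. The scope strings below are the ones named on this page; the action names, the `can` helper, and the assumption that scopes are checked individually (with no scope implying another) are hypothetical and only for illustration.

```python
# Scope required for each action, using the scope names from this page.
REQUIRED_SCOPE = {
    "run_test": "sandbox-profile.use",
    "read_history": "sandbox-profiles.read",
    "edit_profile": "sandbox-profiles.manage",
}


def can(action: str, granted: set[str]) -> bool:
    """True when the granted scopes include the one the action needs.

    Assumption: each scope is checked on its own; this sketch does not
    model one scope implying another.
    """
    return REQUIRED_SCOPE[action] in granted
```

An engineer holding `sandbox-profile.use` and `sandbox-profiles.read` can run a test and read the history under this model, but cannot edit the profile to fix what the test found.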
Why testing a profile works this way
The failure a profile test prevents is a specific and familiar one: an agent that gets halfway through real work and stops, because something the environment was supposed to provide was not there. That failure is expensive in a particular way, not because the fix is hard, but because you find it late, with a half-finished run to untangle and no quick signal about which of a dozen profile settings was the culprit. A test moves that discovery to the front, where it costs a click.
The design follows from that. The probes check the things that actually break a run (provisioning, execution, the declared tools, the runtime, the credentials) rather than re-stating the configuration back to you. The two gates are the two failures nothing recovers from, so the outcome is honest about what a green really means. The soft signals are kept soft, because a profile is not broken just because an MCP server blinked. The sandbox is thrown away so a test never leaves a mess, and the result is kept so a test is also a record. And the configuration-changed note exists because the most dangerous test result is a green one that no longer applies.
For a planner, a test is the answer to "is this profile safe to point work at" without asking an engineer. You run it, you read one badge, and you know whether the next launch will start clean or stall.
For an engineer, it is the fast loop you want when you change a profile: edit, test, read the red row, click the fix-link, test again. You find the missing tool or the unset credential in seconds, at your desk, instead of in the middle of an agent's run with the clock going.
For a lead, the history is the signal that a standardized profile is staying healthy. A profile that has passed twenty runs straight is one the team can keep leaning on; a profile that started flapping is one to look at before it costs someone a morning.
For the person who has to account for what an agent will be able to do, a test is evidence captured before the fact: a recorded, time-stamped check of exactly what the profile provisions, which runtime it reached, and which credentials resolved. Each test boots its own isolated sandbox and tears it down, and the trail is kept, so you can show that a profile was sound when work began, not just assert it.