The loop
The four moves
Ask: is this behavior covered?
The durable suite is the answer. If the agent just wrote something new, it isn’t covered yet. If it touched existing behavior, the suite already has a test for it.
Not covered → create
Describe the behavior, as a plan file (planSteps[]) for frontend tests or as code (typically Python) for backend tests, then create and run it. --run --wait chains create → trigger → poll into one blocking command. Exit 0 means the new test passed and is banked.
Already covered → rerun
Replay the existing suite so nothing that used to work breaks silently. Frontend reruns replay the saved script verbatim and are free unless auto-heal engages.
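The create-or-rerun decision can be sketched in Python. This is a minimal sketch under stated assumptions, not documented CLI usage: it assumes the testsprite binary is on PATH and that --run --wait attach to test create as shown, and it deliberately leaves out how a plan file or test ID is passed, since that is not shown in this section. The helper names choose_move, run_move, and cli_runner are hypothetical.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def cli_runner(argv: Sequence[str]) -> int:
    """Run the command for real and return its exit code."""
    return subprocess.run(list(argv)).returncode


def choose_move(covered: bool) -> List[str]:
    """Pick the command for the current move.

    Assumption: arguments such as the plan file or test ID are omitted
    because their exact form is not shown in this section.
    """
    if covered:
        # Existing behavior: replay what the suite already holds.
        return ["testsprite", "test", "rerun"]
    # New behavior: create the test, trigger it, and block until the verdict.
    return ["testsprite", "test", "create", "--run", "--wait"]


def run_move(covered: bool, runner: Runner = cli_runner) -> bool:
    """Return True when the command exits 0, which the loop treats as a pass."""
    return runner(choose_move(covered)) == 0
```

Passing the runner as a parameter keeps the decision logic testable without the CLI installed; an agent harness would use the default runner.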
Why this design works
Why one self-consistent bundle matters
An agent reasons over whatever context you hand it. If that context mixes a failing step from one run with source code from a different run, the agent will confidently “fix” the wrong thing.
testsprite test failure get (and test artifact get) return a bundle where every artifact shares one snapshotId: the failing step, its neighbors, the DOM snapshots rendered as text, the test source, and the root-cause hypothesis all describe the same moment. The CLI refuses to stitch data across runs or code versions. That’s what makes the output safe to feed straight into an agent, with no dashboard scraping and no manual screenshot-pasting.
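If a bundle passes through other tooling (a cache, a queue, a custom wrapper) before it reaches the agent, the same invariant can be re-checked defensively. The sketch below assumes a bundle parsed into a dict with an "artifacts" list whose items each carry a "snapshotId" field; that layout is a guess for illustration, and the real JSON shape may differ.

```python
from typing import Any, Dict, Optional, Set


def snapshot_ids(bundle: Dict[str, Any]) -> Set[Optional[str]]:
    """Collect the snapshotId of every artifact in the bundle.

    Assumption: artifacts live under an "artifacts" key. An artifact
    without a snapshotId contributes None to the set.
    """
    return {artifact.get("snapshotId") for artifact in bundle.get("artifacts", [])}


def is_self_consistent(bundle: Dict[str, Any]) -> bool:
    """True only when every artifact shares exactly one, non-missing snapshotId."""
    ids = snapshot_ids(bundle)
    return len(ids) == 1 and None not in ids
```

A wrapper could refuse to hand a bundle to the agent whenever is_self_consistent returns False, rather than letting it reason over mixed context.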
Coverage compounds
Every passing test joins a durable suite — a lasting record of every requirement the agent has ever gotten right, far bigger than any context window. As the project grows, the suite grows with it, and the “already covered?” question gets answered by real, replayable tests rather than the agent’s memory. A regression is caught the next time the suite runs, not when a user reports it.
The cloud is a black box on purpose
You describe intent; the cloud does the work; you read structured results. Your agent never has to know how the test was driven, only what a real user experienced.
Tests run against your live product, not mocks. A frontend test opens a real browser, navigates your app exactly as a user would, and asserts against real behavior. A backend test executes your test code (typically Python) against real API endpoints. This has two consequences:
- No environment setup on your side. You don’t install a browser engine, configure proxies, or manage versions. The cloud handles it.
- Results reflect production reality. If a test fails, something in the real app is wrong — not a test-harness artifact.
The CLI does not support localhost targets. Testing a localhost app requires the MCP Server, which manages the tunnel for you. See MCP Server.
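An agent can catch this case before it spends a run by checking the target URL up front. The helper below is a hypothetical pre-flight guard, not part of the CLI; it assumes the target is given as a full URL with a scheme and treats the name localhost, its subdomains, and loopback IP addresses as local.

```python
import ipaddress
from urllib.parse import urlparse


def is_localhost_target(url: str) -> bool:
    """Return True when the URL's host is localhost or a loopback address.

    Assumption: the URL includes a scheme (for example http://). Without
    one, urlparse cannot identify the host and this returns False.
    """
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        # Covers 127.0.0.0/8 and ::1.
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so it is an ordinary hostname.
        return False
```

When the guard returns True, the loop can route the test through the MCP Server instead of calling the CLI.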
A machine-readable contract
--output json plus stable exit codes form a contract the loop depends on: every command emits the same JSON shape and the same exit codes across releases, so your agent can branch on results without defensive parsing or dashboard scraping. That stability is what makes the loop safe to run unattended.
See Output & Scripting for the JSON shape, --dry-run, jq, and branching patterns.
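In practice, branching on the contract can be as small as the sketch below. It makes only one assumption about exit codes, taken from this page: 0 means pass. The meaning of individual non-zero codes is not assumed, the JSON payload is returned unparsed beyond json.loads, and placing --output json after test failure get is a guess at flag order. The names interpret and next_step are hypothetical.

```python
import json
from typing import Any, List, Tuple


def interpret(exit_code: int, stdout: str) -> Tuple[bool, Any]:
    """Return (passed, payload) for one command invocation.

    Only exit code 0 counts as a pass. Empty output yields an empty dict
    so callers can always index into the payload safely.
    """
    payload = json.loads(stdout) if stdout.strip() else {}
    return exit_code == 0, payload


def next_step(exit_code: int) -> List[str]:
    """Pick the follow-up command: nothing on success, the failure bundle otherwise."""
    if exit_code == 0:
        return []
    # Assumption: flag placement; see Output & Scripting for the real usage.
    return ["testsprite", "test", "failure", "get", "--output", "json"]
```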
Safe retries
Write commands (project create, test create, test run, test rerun) all carry an idempotency key. The backend deduplicates on this key (time-bounded), so retrying a failed network request never creates a duplicate project, test, or run.
The CLI generates a random key per invocation by default. Pin your own key to make a command repeatable with guaranteed idempotency.
When you replace backend code, a codeVersion token guards against silent overwrites. See Editing & Deleting Tests.
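To make the deduplication behavior concrete, here is a toy in-memory model of time-bounded idempotency. It illustrates the general technique only and is not the backend's implementation: the class name, the TTL value, and the injected clock are all assumptions made for the sketch.

```python
import time
from typing import Callable, Dict, Tuple


class IdempotencyStore:
    """Toy model: the first call with a key does the work, repeats reuse its result."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (time the result was recorded, recorded result)
        self._seen: Dict[str, Tuple[float, str]] = {}

    def execute(self, key: str, create: Callable[[], str]) -> str:
        """Run create() once per key within the TTL window."""
        now = self._clock()
        entry = self._seen.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            # Retry inside the window: return the original result, create nothing.
            return entry[1]
        result = create()
        self._seen[key] = (now, result)
        return result
```

The model also shows why the guarantee is time-bounded: once the window has passed, the same key is treated as a new request.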
Run-scoped vs latest
The CLI gives you two ways to reach failure artifacts: test failure get follows the latest failing run (which can shift if a Portal or scheduled run fires mid-loop), while test artifact get is pinned to a specific runId and never moves. Which one you pick matters whenever multiple runs might overlap.
See Reading Results for the full comparison.
Where the CLI fits
The CLI is one of three surfaces over the same backend and data.
One Platform, Three Surfaces
See how the Web Portal, MCP Server, and CLI compare.
Schedule creation, billing management, crawl/site discovery, and per-step regeneration stay in the Web Portal. The CLI surface is focused on the test lifecycle: create, run, read, fix, rerun.
Where to Go Next
Key Terms
Projects, tests, runs, statuses, credits, scopes, and failure bundles defined
Quickstart
Walk through your first test end to end in about 10 minutes
Running Tests
Triggering runs, waiting for verdicts, and handling every exit code
Agent Integration
Let your coding agent drive the loop on its own