Category 02 / Deep dive | Last verified April 2026

LLM test automation and agentic testing.

Agentic testing is the 2026 phrase for end-to-end test automation in which an LLM agent reads a goal or natural-language scenario and drives a real browser to validate the application. The agent decides what to click, what to assert, and how to recover when the page does not match expectations.

The category is not new. What is new is the replacement of explicit selectors and step lists with LLM-driven decisions at run time. This page maps the capability ladder, lists the vendors at each rung, and summarises the trade-offs each vendor documents.

How an agentic test executes.

A typical agentic test takes a goal ("sign up a new user with a marketing email and verify the welcome email arrives within 30 seconds") and decomposes it into observable browser actions. The LLM agent decides, at each step, which DOM element matches the intended action, which side-effect to wait for, and whether the page state is consistent with the goal. The agent loop is similar in shape to general agentic patterns documented for AI coding agents (agent patterns reference).
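The loop described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: FakePage, run_agent, and scripted_policy are hypothetical names, the page is an in-memory stand-in rather than a real browser, and the policy function marks the place where a real runner would call an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FakePage:
    """In-memory stand-in for a browser page: a set of visible element labels."""
    visible: set[str]
    log: list[str] = field(default_factory=list)

    def click(self, label: str) -> None:
        self.log.append(f"click:{label}")
        # Hypothetical application behaviour: submitting reveals a banner.
        if label == "Submit":
            self.visible = {"Welcome banner"}


# A policy maps (goal, visible elements) to the next label to click,
# or None once it judges the goal to be satisfied.
Policy = Callable[[str, set[str]], Optional[str]]


def run_agent(goal: str, page: FakePage, decide: Policy, max_steps: int = 10) -> bool:
    for _ in range(max_steps):
        action = decide(goal, page.visible)   # observe the page and decide
        if action is None:                    # policy judges the goal met
            return True
        if action not in page.visible:        # decision does not match the page
            return False
        page.click(action)                    # act, then loop to re-observe
    return False                              # step budget exhausted


def scripted_policy(goal: str, visible: set[str]) -> Optional[str]:
    """Placeholder for the LLM call; a real runner would prompt a model here."""
    if "Welcome banner" in visible:
        return None
    return "Submit" if "Submit" in visible else "Sign up"
```

The step budget and the "decision does not match the page" branch are the two exits that keep the loop from running indefinitely or acting on an element that is not there.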

Two design choices distinguish vendors in this category: where the test logic lives (in source-controlled Playwright or Selenium code, or in vendor-managed prompt and metadata blobs), and where the agent executes (in vendor cloud, on developer machines, or on the team's own CI). Each choice has procurement consequences for lock-in and reproducibility.

A capability ladder for agentic test runners.

The list below is a description of observable runner capabilities, not a measurement of vendor quality. Each level describes what a runner can do; vendors that occupy a level may document strengths or trade-offs against the levels above and below.

Level 1. Recorder + replay.

The agent watches a human walk through a flow once and produces a replayable test. The output is brittle when the layout changes. Most modern tools keep this mode as a fallback rather than a primary mode.

Level 2. Plain-English step list.

The agent ingests step-by-step English ("click sign up, enter email, click submit, assert welcome banner") and resolves each step against the live DOM at run time. testRigor occupies this rung as its core paradigm (testRigor docs).
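As an illustration of run-time resolution, the sketch below matches an English step against a list of elements by word overlap. It is a deliberately naive stand-in for the LLM-based matching a real runner would use, and it is not testRigor's algorithm; Element, tokens, and resolve_step are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    role: str   # e.g. "button", "textbox"
    name: str   # accessible name or visible text


def tokens(text: str) -> set[str]:
    """Lower-case word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def resolve_step(step: str, dom: list[Element]) -> Optional[Element]:
    """Pick the element whose name shares the most words with the step.

    Returns None when no element shares any word with the step.
    """
    step_words = tokens(step)
    best, best_score = None, 0
    for el in dom:
        score = len(step_words & tokens(el.name))
        if score > best_score:
            best, best_score = el, score
    return best
```

Because the step is resolved against whatever elements exist at run time, a renamed CSS class does not break it, but an ambiguous or reworded label can.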

Level 3. Goal ingestion with intermediate planning.

The agent ingests a goal ("test new-user signup") and plans the steps itself. Momentic documents this mode (Momentic docs).

Level 4. Goal ingestion with durable Playwright output.

The agent plans and executes the test, then commits the resulting Playwright code to the team's repository for re-use. QA Wolf documents this hybrid model: an agentic generation step followed by a human-readable Playwright artefact that runs in the team's own CI (QA Wolf docs).
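One way to picture the durable-output step is a small renderer that turns a recorded plan into a Playwright test file. The sketch below is an assumption about the general shape of such a step, not QA Wolf's pipeline; Step, render_step, and render_test are hypothetical names, and the emitted TypeScript uses only the standard Playwright locator calls getByRole, getByLabel, and getByText.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str         # "goto", "click", "fill", or "expect_visible"
    target: str       # URL, accessible name, label, or visible text
    value: str = ""   # only used by "fill"


def render_step(step: Step) -> str:
    """Render one recorded step as a line of Playwright TypeScript."""
    # json.dumps yields a double-quoted, escaped string literal.
    t, v = json.dumps(step.target), json.dumps(step.value)
    if step.kind == "goto":
        return f"await page.goto({t});"
    if step.kind == "click":
        return f"await page.getByRole('button', {{ name: {t} }}).click();"
    if step.kind == "fill":
        return f"await page.getByLabel({t}).fill({v});"
    if step.kind == "expect_visible":
        return f"await expect(page.getByText({t})).toBeVisible();"
    raise ValueError(f"unknown step kind: {step.kind}")


def render_test(title: str, steps: list[Step]) -> str:
    """Wrap the rendered steps in a single Playwright test() block."""
    body = "\n".join("  " + render_step(s) for s in steps)
    return (
        "import { test, expect } from '@playwright/test';\n\n"
        f"test({json.dumps(title)}, async ({{ page }}) => {{\n{body}\n}});\n"
    )
```

The point of the design is that the file produced by render_test is ordinary source code: it can be reviewed, committed, and run in CI without the agent that produced it.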

Level 5. Continuous exploratory agents.

The agent runs continuously against an environment, exploring without a fixed step list, and surfaces unexpected behaviour as candidate test cases. This rung remains experimental; no vendor is known to have shipped a stable production-ready offering as of April 2026.
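The idea of exploring without a fixed step list can be shown with a toy model. The sketch below swaps the LLM-driven choice of actions for an exhaustive breadth-first walk over a hypothetical transition map, and flags any reached state that is not in an expected set, together with the action path that led to it. The function name explore and the state names in the usage note are assumptions for illustration.

```python
from collections import deque


def explore(
    start: str,
    transitions: dict[str, dict[str, str]],
    expected: set[str],
    max_states: int = 100,
) -> list[tuple[str, list[str]]]:
    """Breadth-first walk; returns (state, action path) for unexpected states."""
    seen = {start}
    queue = deque([(start, [])])
    findings = []
    visited = 0
    while queue and visited < max_states:
        state, path = queue.popleft()
        visited += 1
        if state not in expected:
            # Candidate test case: the path reproduces the surprise.
            findings.append((state, path))
        for action, nxt in transitions.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return findings
```

For example, with a map in which "home" leads to "login" and "cart", and "cart" leads to "error-500" via "click checkout", the walk reports "error-500" with the path ["click cart", "click checkout"].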

The lock-in question.

The trade-off documented across the category is durability: whether the test asset outlives the vendor relationship. Tools that emit Playwright or Selenium code (QA Wolf, certain configurations of Reflect) keep the test asset portable: if the vendor relationship ends, the team retains a runnable test suite. Tools that manage tests as proprietary YAML or LLM-prompt blobs (testRigor, Momentic, Functionize) do not offer the same portability.

Vendor documentation should be the source of truth for any specific export claim. Capabilities ship and change. The methodology rules require checking vendor docs before quoting any export feature.

Public benchmarks for agentic browser agents.

Outside the testing community, browser agents are evaluated on browser-task benchmarks: WebArena, VisualWebArena, Mind2Web. None of these benchmarks evaluates agentic test runners specifically; they evaluate general-purpose browser agents on tasks like booking flights or filling forms.

On the code side, SWE-bench evaluates LLM performance on real GitHub issues and is a rough proxy for the kind of code comprehension a test-writing agent needs (swebench.com). Stanford HELM aggregates a broader set of LLM evaluations including code (crfm.stanford.edu/helm).

What this means in practice: when a vendor publishes a number for "test pass rate on a proprietary internal benchmark", that number is internal and should be read as marketing material unless the methodology and the test suite are public. The site does not reproduce such numbers.

What agentic testing gets wrong.

The failure modes are well documented. LLM agents over-confidently click the first plausible-looking element, miss conditional flows behind feature flags, and hallucinate states the application does not have. Each vendor publishes its own approach to mitigation, typically a combination of confidence thresholds, retry loops, and human review queues.
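The three mitigations named above compose naturally. The sketch below is a generic illustration under stated assumptions, not any vendor's mechanism: Decision, Guard, and the propose callback are hypothetical names, the confidence score is assumed to be a number between 0 and 1 reported by the agent, and the 0.8 threshold is an arbitrary placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Decision:
    target: str
    confidence: float   # 0.0 to 1.0, as reported by the (hypothetical) agent


@dataclass
class Guard:
    threshold: float = 0.8    # arbitrary placeholder value
    max_attempts: int = 3
    review_queue: list[str] = field(default_factory=list)

    def choose(self, step: str, propose: Callable[[str, int], Decision]) -> Optional[str]:
        """Return a target only when some attempt clears the threshold."""
        for attempt in range(self.max_attempts):      # retry loop
            decision = propose(step, attempt)
            if decision.confidence >= self.threshold:  # confidence threshold
                return decision.target
        # No attempt was confident enough: escalate instead of guessing.
        self.review_queue.append(step)
        return None
```

The design choice is that a low-confidence step produces no browser action at all; it lands in the review queue, which trades run-time coverage for fewer over-confident clicks.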

See the self-healing tests page for a related category of mitigation: when the locator breaks, fall back to a multi-identifier strategy.
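A multi-identifier fallback can be sketched as an ordered list of attribute and value pairs, tried in turn until one matches exactly one element. This is a minimal illustration with the DOM modelled as a list of dictionaries; find_with_fallback and the attribute names are assumptions, not a specific tool's API.

```python
from typing import Optional


def find_with_fallback(
    dom: list[dict],
    identifiers: list[tuple[str, str]],
) -> Optional[dict]:
    """Try each (attribute, value) pair in priority order.

    Returns the first unique match; ambiguous or empty matches fall
    through to the next identifier. Returns None if none is unique.
    """
    for attr, value in identifiers:
        matches = [el for el in dom if el.get(attr) == value]
        if len(matches) == 1:
            return matches[0]
    return None
```

Requiring a unique match is the conservative choice here: an identifier that matches two elements is treated as broken rather than resolved by picking the first.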

Pricing.

Pricing in the agentic category usually mixes per-test-run and per-parallelisation charges, sometimes with a managed-service component (a human reviewing and triaging failures). Specific vendor pricing is normalised on the pricing comparison page.

Cross-reference.

For the broader definition of "agent" used here, see whatisanaiagent.com. For agent-pattern reference, see buildingeffectiveagents.com.