Independent research site. Not affiliated with any vendor named. Benchmarks captured April 2026 on stated repos. Pricing changes frequently -- verify at the source. Affiliate disclosure.

Last verified April 2026

> llm test automation
/ agentic, properly defined

The phrase “agentic testing” is being used loosely by every vendor in the space. We define it precisely with a five-level capability ladder, map current tools to each level, and explain what Level 5 would actually require. The TAM-Eval paper from SANER 2026 is the closest academic benchmark for the full capability spectrum.

> the five-level capability ladder

L0

Traditional automation

Baseline

Scripts written by humans, maintained by humans. Selenium, Pytest, JUnit with hand-authored test cases. AI has no role in authoring or repair. All maintenance burden is on the engineering team.

Selenium · Playwright (unassisted) · Pytest (unassisted) · JUnit (unassisted)
L1

LLM-assisted authoring

Common

The LLM helps a human write test code. The human prompts, the LLM suggests, the human reviews and commits. The LLM does not run tests, observe failures, or repair anything. Net effect: faster test authoring with the same maintenance burden.

GitHub Copilot (baseline) · Cursor · Claude Code
L2

Prompt-to-test generation

Current frontier for most teams

The LLM generates test scripts from a high-level prompt or by observing a real browser session. The human describes the scenario; the LLM produces runnable code. Human review is still required before committing. Playwright MCP is the canonical L2 implementation: the LLM drives a real Chromium browser and generates Playwright code from what it observes.

Playwright MCP + Copilot · Playwright MCP + Claude · testRigor (NLP-to-test) · Diffblue Cover (source-to-JUnit)
L3

Autonomous plan-run-heal

Available today (costly)

The agent plans a test, runs it, observes failures, repairs selectors or logic, and re-runs without human input during the run. A human may set the initial goal and review the final output, but the agent handles the execution loop. This is where QA Wolf and Momentic operate as of Q1 2026.

QA Wolf · Momentic · testRigor (with Vision AI mode)
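The L3 loop described above can be sketched as a bounded run-repair cycle. Everything here is hypothetical scaffolding: `run_test` and `repair` stand in for a real test executor and an LLM repair step.

```python
# Sketch of an L3 plan-run-heal loop: run, observe failure, repair, re-run.
# `run_test` and `repair` are stand-ins for a real executor and LLM repair.

def run_and_heal(test, run_test, repair, max_attempts=3):
    """Return (passed, attempts_used). No human input inside the loop."""
    for attempt in range(1, max_attempts + 1):
        failure = run_test(test)          # None on pass, else failure details
        if failure is None:
            return True, attempt
        test = repair(test, failure)      # e.g. patch a broken selector
    return False, max_attempts

# Toy executor: the test "passes" once its selector has been fixed.
def fake_run(test):
    return None if test["selector"] == "#new-id" else "locator not found"

def fake_repair(test, failure):
    return {**test, "selector": "#new-id"}

passed, attempts = run_and_heal({"selector": "#old-id"}, fake_run, fake_repair)
# → passed is True after the second run (one repair)
```

The bounded attempt count is the important design choice: without it, an agent repairing against a genuinely broken feature would loop forever instead of surfacing a real bug.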
L4

Cross-suite strategy design

Emerging

The agent analyses an entire codebase or product surface area and designs a test strategy: which scenarios to cover, which framework to use, how to balance unit vs integration vs E2E coverage, where the highest-risk code paths are. No commercial tool is fully at Level 4 as of April 2026, but Mabl's latest release includes a limited suite-design recommendation feature.

Mabl (partial, suite recommendation) · TAM-Eval research (academic)
L5

Autonomous release-spanning maintenance

Research only

The agent maintains an entire test strategy across software releases autonomously: detecting when new features require new tests, retiring stale tests after feature removals, rebalancing coverage for newly-high-risk areas, and generating tests for regression risks identified from production telemetry. This is a research-level capability as of April 2026.

Open research problem · TAM-Eval (SANER 2026) is the nearest benchmark

> which tool is at which level

Tool | Capability level | Note
QA Wolf | L3 | Full agentic Playwright output; autonomous run-repair cycle.
Momentic | L3 | Goal-to-test, autonomous, with a proprietary test format.
testRigor | L2-L3 | L2 plain-English generation; L3 with Vision AI mode enabled.
Mabl | L2-L4 | L2 codeless authoring, L3 auto-healing, L4 partial suite design.
Copilot + Playwright MCP | L2 | Prompt-driven, browser-observed generation; human review required.
Diffblue Cover | L2 | RL-driven source-to-JUnit generation; no E2E or run-repair loop.
Qodo | L1-L2 | L1 in IDE suggestion mode; L2 in full generation mode.
Testim | L2 | Codeless recorder with an LLM heal layer; no autonomous planning.
Meticulous | L2 | Trace capture is automated, but replay analysis requires human review.

> playwright mcp deep-dive

Playwright MCP is a Model Context Protocol server published by Microsoft that lets any MCP-compatible LLM (GitHub Copilot, Claude, GPT-4o) control a real Chromium browser. The LLM can navigate pages, click elements, fill forms, observe the resulting DOM and network state, and generate Playwright test code based on what it sees.

This is the most practical Level 2 bridge available today. It costs only your Copilot or Claude subscription. The output is real Playwright code you own and can run independently in CI. Setup takes roughly 30 minutes following the Microsoft Learn walkthrough. The key advantage over vendor E2E tools: zero lock-in, zero per-run fees, and the resulting tests are portable Playwright files, not proprietary YAML.
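For reference, registering the server with an MCP client usually amounts to a few lines of configuration. The shape below is a sketch: the exact file location and key names vary by client (Copilot, Claude Desktop, VS Code each differ), and the package name is as published by Microsoft at time of writing -- verify against the current walkthrough.

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
```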

The Playwright Healer agent extends the MCP pattern for maintenance: when an existing Playwright test fails with a locator error after a UI change, Healer inspects the current DOM, identifies the most likely new selector, patches the test, and re-runs. It does not generate new tests -- it repairs existing ones. The TestDino MCP server adds centralised failure classification and reporting on top of both generation (MCP) and repair (Healer) layers.
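Healer's core decision -- which element in the current DOM most likely corresponds to the broken locator -- can be pictured as attribute-overlap scoring. This is a sketch of the idea only, not Healer's actual algorithm; the attribute dictionaries are invented for illustration.

```python
# Sketch of locator healing: score candidate elements in the current DOM
# against the attributes of the element the broken selector used to match.
# Illustrative only; this is not Healer's actual algorithm.

def score(old_attrs, candidate_attrs):
    """Fraction of the old element's attributes the candidate preserves."""
    if not old_attrs:
        return 0.0
    kept = sum(1 for k, v in old_attrs.items() if candidate_attrs.get(k) == v)
    return kept / len(old_attrs)

def heal(old_attrs, candidates):
    """Pick the candidate selector with the highest attribute overlap."""
    best = max(candidates, key=lambda c: score(old_attrs, c["attrs"]))
    return best["selector"]

old = {"role": "button", "text": "Sign in", "data-test": "login"}
dom = [
    {"selector": "#nav-help",  "attrs": {"role": "link", "text": "Help"}},
    {"selector": "#login-btn", "attrs": {"role": "button", "text": "Sign in"}},
]
print(heal(old, dom))   # → "#login-btn"
```

The same scoring view explains the "semantically close but wrong" failure mode below: when a redesign produces two candidates with similar overlap, the top-scoring one is not guaranteed to be the element the test was actually about.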

When MCP breaks

  • Context window limits on very large SPAs: the DOM snapshot of a 200-component React app can exceed the context window of most LLMs, causing the agent to miss elements.
  • Hallucinated selectors: on pages with unconventional DOM structures, the LLM sometimes generates selectors for elements that do not exist; the resulting test runs, always passes, and catches nothing.
  • Healer misfires on intentional UI changes: if a redesign moves an element to a new DOM location, Healer may patch the old locator with a semantically close but wrong one.
  • Auth and sessions: MCP does not natively handle OAuth flows or multi-tab sessions without additional configuration.
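The first failure mode is usually mitigated by pruning the snapshot before it reaches the model. A crude sketch of budget-based pruning, assuming a flat list of serialized nodes and a rough chars-per-token estimate (both assumptions, not MCP's actual snapshot format):

```python
# Crude sketch of snapshot pruning: keep interactive elements first and
# drop the rest until the serialized DOM fits a token budget.
# The node format and 4-chars-per-token estimate are assumptions.

INTERACTIVE = {"a", "button", "input", "select", "textarea"}

def prune_snapshot(nodes, token_budget, chars_per_token=4):
    """Return a subset of nodes whose serialized size fits the budget."""
    # Interactive elements matter most for test generation; rank them first.
    ranked = sorted(nodes, key=lambda n: n["tag"] not in INTERACTIVE)
    budget = token_budget * chars_per_token
    kept, used = [], 0
    for node in ranked:
        size = len(node["html"])
        if used + size > budget:
            continue
        kept.append(node)
        used += size
    return kept

nodes = [
    {"tag": "div",    "html": "<div>" + "x" * 500 + "</div>"},
    {"tag": "button", "html": '<button id="buy">Buy</button>'},
    {"tag": "input",  "html": '<input name="qty">'},
]
small = prune_snapshot(nodes, token_budget=20)
# The button and input survive; the oversized div is dropped.
```

Real implementations use accessibility-tree snapshots rather than raw HTML for the same reason: they carry the interactive structure at a fraction of the token cost.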

> faq

What is agentic testing?
Agentic testing is LLM-driven test design, autonomous execution, and self-repair without human supervision during a run. We define it on a five-level capability ladder: Level 0 is traditional scripts; Level 1 is LLM-assisted authoring (Copilot); Level 2 is prompt-to-test generation (Playwright MCP); Level 3 is autonomous plan-run-heal without mid-run human input (QA Wolf, Momentic); Level 4 is cross-suite strategy design; Level 5 is fully autonomous test maintenance across releases (still a research problem).
What is Playwright MCP and why does it matter?
Playwright MCP is a Model Context Protocol server published by Microsoft that lets LLMs (GitHub Copilot, Claude, GPT-4o) drive a real Chromium browser during test generation. The LLM can click, type, observe DOM state, and generate Playwright test code based on actual browser interactions. This is the most practical Level 2 bridge available in April 2026 -- it costs only a Copilot subscription and produces real Playwright code you own.
What is the difference between Level 2 and Level 3 agentic testing?
Level 2 (prompt-to-test generation) requires a human to describe the test scenario, review the output, and trigger the run. The LLM generates; the human validates. Level 3 (autonomous run-and-heal) requires no mid-run human input. The agent plans a test, runs it, observes failures, repairs selectors or logic, and re-runs -- all without a human in the loop. QA Wolf and Momentic operate at Level 3. The distinction matters because Level 2 still requires QA staff to supervise every generation and run, while Level 3 removes that supervision from the execution loop.
What is the Playwright Healer and how does it work?
Playwright Healer is an experimental agent layer that sits above a Playwright test suite and automatically repairs failing locators when UI changes break them. When a test fails with a locator error, Healer uses an LLM to inspect the current DOM, identify the most likely new selector for the target element, patch the test, and re-run. It is distinct from Playwright MCP (which generates new tests) -- Healer maintains existing tests.
What would a Level 5 agentic testing system look like?
A Level 5 system would autonomously maintain an entire test strategy across software releases: detecting when new features require new tests, retiring stale tests, rebalancing the suite for coverage gaps, and generating new tests for regression risks identified from production error logs. TAM-Eval (SANER 2026) is the nearest academic benchmark for this capability. No commercial tool is at Level 5 as of April 2026.