Last verified April 2026
> llm test automation
/ agentic, properly defined
The phrase “agentic testing” is being used loosely by every vendor in the space. We define it precisely with a five-level capability ladder, map current tools to each level, and explain what Level 5 would actually require. The TAM-Eval paper from SANER 2026 is the closest academic benchmark for the full capability spectrum.
> the five-level capability ladder
Level 0: Traditional automation
Baseline. Scripts written by humans, maintained by humans: Selenium, Pytest, JUnit with hand-authored test cases. AI has no role in authoring or repair. All maintenance burden is on the engineering team.
Level 1: LLM-assisted authoring
Common. The LLM helps a human write test code. The human prompts, the LLM suggests, the human reviews and commits. The LLM does not run tests, observe failures, or repair anything. Net effect: faster test authoring with the same maintenance burden.
Level 2: Prompt-to-test generation
Current frontier for most teams. The LLM generates test scripts from a high-level prompt or by observing a real browser session. The human describes the scenario; the LLM produces runnable code. Human review is still required before committing. Playwright MCP is the canonical L2 implementation: the LLM drives a real Chromium browser and generates Playwright code from what it observes.
Level 3: Autonomous plan-run-heal
Available today (costly). The agent plans a test, runs it, observes failures, repairs selectors or logic, and re-runs without human input during the run. A human may set the initial goal and review the final output, but the agent handles the execution loop. This is where QA Wolf and Momentic operate as of Q1 2026.
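The execution loop described above can be sketched in a few lines. `run_test` and `propose_repair` below are simulated stand-ins for what would really be a Playwright run and an LLM repair call; the drifted selector is invented.

```python
# Sketch of the L3 plan-run-heal loop: run, observe failure, repair, re-run,
# all inside a bounded attempt budget with no human in the loop.

def plan_run_heal(test, run_test, propose_repair, max_attempts=3):
    """Run a test; on failure, request a repair and re-run, up to a budget."""
    for attempt in range(1, max_attempts + 1):
        result = run_test(test)
        if result["passed"]:
            return {"status": "passed", "attempts": attempt, "test": test}
        test = propose_repair(test, result["error"])
    return {"status": "failed", "attempts": max_attempts, "test": test}

# Simulated environment: a UI change renamed the selector "#buy" to "#buy-now".
def fake_run(test):
    if "#buy-now" in test:
        return {"passed": True, "error": None}
    return {"passed": False, "error": "locator '#buy' not found"}

def fake_repair(test, error):
    # A real agent would inspect the DOM and ask an LLM; here we hard-code it.
    return test.replace("#buy", "#buy-now")

outcome = plan_run_heal("page.click('#buy')", fake_run, fake_repair)
print(outcome["status"], "after", outcome["attempts"], "attempt(s)")
```

The bounded budget is the practical difference from L2: the human sees only the final outcome, not each failure.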
Level 4: Cross-suite strategy design
Emerging. The agent analyses an entire codebase or product surface area and designs a test strategy: which scenarios to cover, which framework to use, how to balance unit vs integration vs E2E coverage, and where the highest-risk code paths are. No commercial tool is fully at Level 4 as of April 2026, but Mabl's latest release includes a limited suite-design recommendation feature.
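One ingredient of suite design can be illustrated with a toy risk ranking. The churn and complexity figures and the scoring formula below are invented for illustration; they are not any vendor's method.

```python
# Toy sketch of one L4 ingredient: rank modules by a made-up risk score
# (recent churn x complexity, discounted by existing E2E coverage) to decide
# where new end-to-end tests would pay off most.

modules = [
    {"name": "checkout", "churn": 42, "complexity": 18, "e2e_tests": 1},
    {"name": "search",   "churn": 30, "complexity": 9,  "e2e_tests": 6},
    {"name": "settings", "churn": 3,  "complexity": 4,  "e2e_tests": 5},
]

def risk_ranked(mods):
    """Highest churn*complexity per existing test first."""
    return sorted(
        mods,
        key=lambda m: m["churn"] * m["complexity"] / (m["e2e_tests"] + 1),
        reverse=True,
    )

plan = risk_ranked(modules)
print([m["name"] for m in plan])
```

A real L4 agent would derive these signals from version-control history and static analysis rather than hand-entered numbers, but the shape of the decision is the same.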
Level 5: Autonomous release-spanning maintenance
Research only. The agent maintains an entire test strategy across software releases autonomously: detecting when new features require new tests, retiring stale tests after feature removals, rebalancing coverage for newly high-risk areas, and generating tests for regression risks identified from production telemetry. This is a research-level capability as of April 2026.
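The release-to-release bookkeeping such an agent would automate can be sketched as a set diff. Feature and test names below are invented.

```python
# Sketch of L5 bookkeeping: diff the feature set between two releases against
# the current suite to find features needing new tests and tests to retire.

def maintenance_plan(prev_features, next_features, tests_by_feature):
    added = next_features - prev_features
    removed = prev_features - next_features
    return {
        # New features with no tests yet -> generation targets.
        "generate_tests_for": sorted(added - tests_by_feature.keys()),
        # Tests attached to removed features -> retirement candidates.
        "retire_tests": sorted(
            t for f in removed for t in tests_by_feature.get(f, [])
        ),
    }

prev = {"login", "search", "wishlist"}
nxt = {"login", "search", "checkout_v2"}
tests = {"login": ["test_login"],
         "wishlist": ["test_add_wish", "test_share_wish"]}

plan = maintenance_plan(prev, nxt, tests)
print(plan)
```

The hard research problems are upstream of this diff: detecting the feature changes reliably and deciding what "stale" means, not executing the plan once it exists.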
> which tool is at which level
| Tool | Capability level | Note |
|---|---|---|
| QA Wolf | L3 | Full agentic Playwright output, autonomous run-repair cycle. |
| Momentic | L3 | Goal-to-test, autonomous with proprietary format. |
| testRigor | L2-L3 | L2 plain-English generation, L3 with Vision AI mode enabled. |
| Mabl | L2-L4 | L2 codeless authoring, L3 auto-healing, L4 partial suite design. |
| Copilot + Playwright MCP | L2 | Prompt-driven browser-observed generation. Human review required. |
| Diffblue Cover | L2 | RL-driven source-to-JUnit generation. No E2E or run-repair loop. |
| Qodo | L1-L2 | L1 in IDE suggestion mode, L2 in full generation mode. |
| Testim | L2 | Codeless recorder with LLM heal layer. No autonomous planning. |
| Meticulous | L2 | Trace capture is automated, but replay analysis requires human review. |
> playwright mcp deep-dive
Playwright MCP is a Model Context Protocol server published by Microsoft that lets any MCP-compatible LLM (GitHub Copilot, Claude, GPT-4o) control a real Chromium browser. The LLM can navigate pages, click elements, fill forms, observe the resulting DOM and network state, and generate Playwright test code based on what it sees.
This is the most practical Level 2 bridge available today. It costs only your Copilot or Claude subscription. The output is real Playwright code you own and can run independently in CI. Setup takes roughly 30 minutes following the Microsoft Learn walkthrough. The key advantage over vendor E2E tools: zero lock-in, zero per-run fees, and the resulting tests are portable Playwright files, not proprietary YAML.
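For reference, registering the server in an MCP client is typically a short JSON entry along these lines (shown in the `mcpServers` shape used by Claude Desktop; other clients use a slightly different key). This is a sketch of the common pattern, not a substitute for the official walkthrough.

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
```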
The Playwright Healer agent extends the MCP pattern for maintenance: when an existing Playwright test fails with a locator error after a UI change, Healer inspects the current DOM, identifies the most likely new selector, patches the test, and re-runs. It does not generate new tests; it repairs existing ones. The TestDino MCP server adds centralised failure classification and reporting on top of both generation (MCP) and repair (Healer) layers.
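The repair step can be illustrated, though this is not Healer's actual algorithm, as nearest-selector matching over the selectors harvested from the current DOM. All selectors below are invented.

```python
# Toy illustration of selector repair: given a failed selector and the
# selectors present in the current DOM, pick the closest candidate by
# string similarity. Real repair agents reason over DOM structure and
# semantics, not just strings.

import difflib

def suggest_selector(failed, candidates, cutoff=0.5):
    """Return the current-DOM selector most similar to the failed one, or None."""
    matches = difflib.get_close_matches(failed, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

current_dom_selectors = ["#submit-order-btn", "#cancel-btn", ".nav-home"]
patched = suggest_selector("#submit-order", current_dom_selectors)
print(patched)
```

String similarity also shows why repair misfires on redesigns: a semantically wrong selector can still be the closest match, which is exactly the failure mode listed below.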
When MCP breaks
- !Context window limits on very large SPAs: the DOM snapshot for a 200-component React app can exceed the context window of most LLMs, causing the agent to miss elements.
- !Hallucinated selectors: on pages with unconventional DOM structures, the LLM sometimes generates selectors for elements that do not exist; the test runs, always passes, and catches nothing.
- !Healer misfires on intentional UI changes: if a redesign moves the element to a new DOM location, Healer may patch the old locator with a semantically-close-but-wrong one.
- !Auth and session: MCP does not natively handle OAuth flows or multi-tab sessions without additional configuration.
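A common mitigation for the first failure mode is pruning the snapshot before it reaches the model. Here is a minimal sketch using the standard-library HTML parser; it assumes a plain-HTML snapshot, whereas real tools typically work from a richer accessibility-tree representation.

```python
# Hedged sketch of context-window mitigation: keep only interactive elements
# from a DOM snapshot before handing it to the LLM.

from html.parser import HTMLParser

INTERACTIVE = {"a", "button", "input", "select", "textarea"}

class InteractiveCollector(HTMLParser):
    """Collect (tag, attributes) pairs for interactive elements only."""
    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE:
            self.elements.append((tag, dict(attrs)))

def prune_snapshot(html: str):
    parser = InteractiveCollector()
    parser.feed(html)
    return parser.elements

page = ("<div><h1>Shop</h1><button id='buy'>Buy</button>"
        "<p>marketing blurb</p><input name='qty'></div>")
print(prune_snapshot(page))
```

Pruning trades completeness for fit: headings and copy are dropped, so an agent relying on visible text to locate elements needs a different strategy.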
> faq