Last verified April 2026

AI testing glossary.

Definitions of the terms that come up most often when comparing AI testing tools. Each entry has a fragment URL (/glossary/#term-id) so the term can be linked into another page or shared directly.

Where a definition references a category page or a peer-reviewed source, the link is inline.

AI tester: In 2026, the phrase most commonly refers to a category of software that uses machine learning or large language models to generate, execute, or maintain software tests. It does not commonly refer to a human role. The closest human role is test engineer or quality engineer, often abbreviated SDET or QE. See the category overview.
Agentic testing: End-to-end test automation in which an LLM agent ingests a goal or natural-language scenario and drives a real browser to validate the application. The agent makes step-level decisions at run time rather than executing a fixed script. See LLM test automation.
Self-healing test: A test or test runner that recovers automatically when a primary locator (CSS selector, XPath) stops resolving. The runner falls back through alternative identifiers (text, role, accessibility label, multi-attribute fingerprint) and continues. See self-healing tests.
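The fallback behaviour described for self-healing tests can be sketched in a few lines. This is a minimal illustration, not how any particular product implements healing: the page is modelled as a plain list of attribute dictionaries, and the names by_attr and find_with_healing are hypothetical helpers invented for this example.

```python
from typing import Callable, Optional

# Hypothetical stand-in for a rendered page: a list of element attribute dicts.
Element = dict[str, str]
Page = list[Element]
Locator = Callable[[Page], Optional[Element]]


def by_attr(name: str, value: str) -> Locator:
    """Build a locator that returns the first element whose attribute matches."""
    def locate(page: Page) -> Optional[Element]:
        return next((el for el in page if el.get(name) == value), None)
    return locate


def find_with_healing(page: Page, locators: list[tuple[str, Locator]]) -> tuple[Element, str]:
    """Try each (label, locator) in order; return the element and the label that resolved."""
    for label, locate in locators:
        element = locate(page)
        if element is not None:
            return element, label
    raise LookupError("no locator resolved")


# The element id was renamed in the new build, so the primary locator is stale.
page = [
    {"id": "btn-submit-v2", "role": "button", "text": "Submit order"},
    {"id": "btn-cancel", "role": "button", "text": "Cancel"},
]
locators = [
    ("id", by_attr("id", "btn-submit")),        # primary locator, no longer resolves
    ("text", by_attr("text", "Submit order")),  # fallback on visible text
    ("role", by_attr("role", "button")),        # last resort, least specific
]

element, healed_by = find_with_healing(page, locators)
print(healed_by, element["id"])
```

Ordering the fallbacks from most to least specific matters: a broad locator such as role alone would match the first button on the page, which may not be the intended one. Reporting which locator resolved lets a reviewer see that the run was healed rather than passing silently.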
Mutation score: The proportion of synthetic mutants (small syntactic changes to source code) that an existing test suite catches. Mutation testing tools generate the mutants, re-run the suite, and report the kill rate. The metric is more rigorous than line coverage because it measures assertion strength rather than execution. See the MuTAP paper (arXiv:2308.16557).
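As a worked illustration of the kill rate, the sketch below scores three toy suites against three hand-written mutants. Real mutation testing tools generate the mutants automatically; the functions original, weak_suite, single_assert_suite, strong_suite, and mutation_score are hypothetical names used only for this example.

```python
from typing import Callable

BinaryOp = Callable[[int, int], int]


def original(a: int, b: int) -> int:
    return a + b


# Hand-written mutants: each is a small syntactic change to `original`.
mutants: list[BinaryOp] = [
    lambda a, b: a - b,      # operator replaced
    lambda a, b: a * b,      # operator replaced
    lambda a, b: a + b + 1,  # constant introduced
]


def weak_suite(fn: BinaryOp) -> bool:
    fn(2, 2)  # executes the code (full line coverage) but asserts nothing
    return True


def single_assert_suite(fn: BinaryOp) -> bool:
    return fn(2, 2) == 4  # 2 * 2 is also 4, so the multiplication mutant survives


def strong_suite(fn: BinaryOp) -> bool:
    return fn(2, 2) == 4 and fn(0, 5) == 5


def mutation_score(suite: Callable[[BinaryOp], bool], candidates: list[BinaryOp]) -> float:
    """Fraction of mutants the suite kills (a mutant is killed when the suite fails on it)."""
    killed = sum(1 for mutant in candidates if not suite(mutant))
    return killed / len(candidates)


for suite in (weak_suite, single_assert_suite, strong_suite):
    assert suite(original)  # every suite passes on the unmutated code
    print(suite.__name__, round(mutation_score(suite, mutants), 2))
```

All three suites execute every line of original, so line coverage cannot tell them apart. The mutation score separates them because only a suite with sufficiently strong assertions fails when the code is changed.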
Flaky test: A test that produces inconsistent results across runs without changes to the system under test. Flakiness usually arises from time-of-day dependencies, randomness, race conditions, or shared mutable state. End-to-end tests tend to be flakier than unit tests because they exercise more of the system.
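One common way to surface a flaky test is to rerun it against unchanged code and compare the outcomes. The sketch below assumes a test is a zero-argument callable returning True on pass; classify and randomness_dependent_test are hypothetical names, and the seeded generator only simulates a randomness-driven failure so that the example is reproducible.

```python
import random
from typing import Callable


def classify(test: Callable[[], bool], runs: int = 20) -> str:
    """Rerun a test against unchanged code and label it by the outcomes seen."""
    outcomes = {test() for _ in range(runs)}
    if outcomes == {True}:
        return "stable-pass"
    if outcomes == {False}:
        return "stable-fail"
    return "flaky"


# Seeded so the example is reproducible; a real flaky test has no such seed.
rng = random.Random(0)


def randomness_dependent_test() -> bool:
    # Simulates a test whose result depends on a random value.
    return rng.random() > 0.3


def deterministic_test() -> bool:
    return sorted([3, 1, 2]) == [1, 2, 3]


flaky_label = classify(randomness_dependent_test)
stable_label = classify(deterministic_test)
print(flaky_label, stable_label)
```

Reruns can demonstrate flakiness but cannot prove stability: a test that fails once in a thousand runs will usually look stable over twenty.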
Visual regression: A test category that captures a baseline image (or DOM snapshot) of a UI state and flags subsequent renders that differ. Modern tools (Applitools, Percy, Chromatic) use AI-tuned thresholds to suppress trivial differences. See the visual regression section.
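The baseline-and-compare idea can be illustrated with a toy pixel diff. This sketch uses fixed, hand-picked thresholds on small grayscale grids; it does not reproduce the AI-tuned thresholds of the tools named above, and changed_ratio and is_regression are hypothetical helper names.

```python
Image = list[list[int]]  # grayscale pixels, 0-255; a toy stand-in for a screenshot


def changed_ratio(baseline: Image, candidate: Image, pixel_tolerance: int = 8) -> float:
    """Fraction of pixels whose brightness differs by more than the tolerance."""
    if len(baseline) != len(candidate) or any(
        len(a) != len(b) for a, b in zip(baseline, candidate)
    ):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in baseline)
    changed = sum(
        1
        for row_a, row_b in zip(baseline, candidate)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > pixel_tolerance
    )
    return changed / total


def is_regression(baseline: Image, candidate: Image, max_changed_ratio: float = 0.01) -> bool:
    """Flag the render only when more than the allowed share of pixels changed."""
    return changed_ratio(baseline, candidate) > max_changed_ratio


baseline = [[200] * 4 for _ in range(4)]

antialiased = [row[:] for row in baseline]
antialiased[0][0] = 203            # tiny brightness shift, within tolerance

broken = [row[:] for row in baseline]
broken[2] = [0, 0, 0, 0]           # a whole row went dark

print(is_regression(baseline, antialiased), is_regression(baseline, broken))
```

The two thresholds do different jobs: the per-pixel tolerance absorbs rendering noise such as antialiasing, while the changed-pixel ratio decides how much of the image must differ before the render is flagged.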
Behavioural diff: A technique related to, but distinct from, visual regression. Instead of diffing pixels, the runner records real user sessions and replays them against a candidate build, flagging unexpected behavioural divergence. Meticulous occupies this sub-category (Meticulous docs).
False-positive diff: A flagged visual or behavioural difference that does not represent an actual regression: animation frames, antialiasing, third-party widget changes, etc. False-positive volume is a trade-off that every visual regression tool has to manage. Meticulous publishes a dedicated guide to managing them (false-positive diffs).
Oracle problem: The challenge of deciding what the correct behaviour of a system under test should be. AI test generators face this problem acutely: a generator can produce many candidate tests, but without a clear oracle, the tests may pass on incorrect behaviour. Mutation testing partially addresses the problem by measuring whether tests catch synthetic bugs.
Test impact analysis: A technique that runs only the subset of tests affected by a code change. Reduces CI time but requires reliable mapping between code and tests. Sealights, Microsoft Test Impact, and several language-specific tools occupy this space.
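The code-to-test mapping behind test impact analysis can be sketched as a simple lookup. The file paths, the coverage_map structure, and the select_tests function below are hypothetical; commercial tools build and maintain this mapping in their own ways.

```python
# Hypothetical mapping from each test file to the source files it exercises,
# for example as recorded from a previous coverage run.
coverage_map: dict[str, set[str]] = {
    "tests/test_cart.py": {"shop/cart.py", "shop/pricing.py"},
    "tests/test_checkout.py": {"shop/checkout.py", "shop/pricing.py"},
    "tests/test_profile.py": {"accounts/profile.py"},
}


def select_tests(changed_files: list[str], mapping: dict[str, set[str]]) -> list[str]:
    """Return the tests affected by a change, or every test if the mapping has a gap."""
    changed = set(changed_files)
    known = set().union(*mapping.values())
    if changed - known:
        # A changed file the mapping has never seen: fall back to the full suite.
        return sorted(mapping)
    return sorted(test for test, files in mapping.items() if files & changed)


print(select_tests(["shop/pricing.py"], coverage_map))
print(select_tests(["accounts/profile.py"], coverage_map))
print(select_tests(["shop/new_module.py"], coverage_map))
```

Falling back to the full suite for unmapped files is one conservative design choice among several. It trades some CI time for safety when the mapping is stale, which is the reliability requirement the definition points to.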
Hermetic test: A test that runs in isolation from external systems (network, time, real database) by using stubs, mocks, or in-memory implementations. Hermetic tests are less flaky and faster but require more authoring effort.
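A small example of making a time-dependent function hermetic, assuming the code under test accepts its clock as a parameter. The greeting function and the fixed_clock helper are hypothetical names chosen for illustration.

```python
from datetime import datetime, timezone
from typing import Callable

Clock = Callable[[], datetime]


def greeting(now: Clock) -> str:
    """Code under test: depends on the time of day, but only through the injected clock."""
    return "Good morning" if now().hour < 12 else "Good afternoon"


def fixed_clock(hour: int) -> Clock:
    """Test double: always reports the same instant, so the result never varies by run time."""
    return lambda: datetime(2026, 4, 1, hour, 0, tzinfo=timezone.utc)


# Hermetic tests: no call to the real system clock, same result at any time of day.
assert greeting(fixed_clock(9)) == "Good morning"
assert greeting(fixed_clock(15)) == "Good afternoon"

# Production code passes the real clock instead.
print(greeting(lambda: datetime.now(timezone.utc)))
```

The extra authoring effort shows up as the injected parameter and the test double. In return, the test no longer depends on when it is run.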
Playwright MCP: Microsoft's Model Context Protocol server for Playwright, exposing browser automation as MCP tools that an LLM client can call. Open source and published by Microsoft (microsoft/playwright-mcp). See Playwright AI.
Model Context Protocol (MCP): An open protocol that lets LLM clients (Claude Desktop, Cursor, others) call external tools and read external resources in a standard way. Anthropic published the specification; many implementations exist. See modelcontextprotocol.io.
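To give a sense of what a tool call looks like on the wire, the sketch below builds a request in the JSON-RPC 2.0 shape that MCP uses, with the tools/call method and name and arguments parameters. The tool name open_page and its url argument are hypothetical placeholders rather than the tools of any real server, and the specification at modelcontextprotocol.io remains the authoritative source for the schema.

```python
import json
from typing import Any


def build_tool_call(request_id: int, tool_name: str, arguments: dict[str, Any]) -> str:
    """Serialise a tool invocation as a JSON-RPC 2.0 request."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# "open_page" and its "url" argument are hypothetical, for illustration only.
request = build_tool_call(1, "open_page", {"url": "https://example.com"})
print(request)
```

In practice an LLM client first discovers which tools a server offers and then issues calls like this one, so test tooling exposed over MCP does not need a client-specific integration.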
RL-based test generation: Reinforcement-learning search over candidate inputs to a method or class, used to produce tests that exercise distinct execution paths. Diffblue Cover is the principal commercial implementation for JVM languages.
LLM-based test generation: Test generation in which a large language model is prompted with code and produces candidate test code. Output quality varies; the MuTAP paper studies how feedback loops can improve mutation score from LLM output.
SWE-bench: An open benchmark of model performance on real GitHub issue resolution. Used as a proxy for the kind of code comprehension a test-writing AI agent needs. See swebench.com.
HELM: Stanford's Holistic Evaluation of Language Models. An open framework that aggregates LLM benchmarks across many scenarios, including code. See crfm.stanford.edu/helm.
MuTAP: A peer-reviewed methodology for using mutation testing to improve LLM-generated tests, published in Information and Software Technology (2024, vol. 171, article 107468). The paper is one of the most-cited evaluations of the LLM test-generation paradigm. See arXiv:2308.16557.
AI tester versus traditional test automation: Traditional test automation refers to scripted test suites in Selenium, Playwright, Cypress, JUnit, pytest, etc. AI testers augment these stacks with generation, healing, or visual diffing. Most production teams in 2026 run both: AI for the parts AI is good at, scripted for the rest.
Manual versus AI testing: Manual exploratory testing remains the dominant pattern for new features and high-risk paths in most teams, per Capgemini's World Quality Report (WQR 2025-26). AI is most adopted for test generation and bug triage.
Regression suite: The set of tests run on every release to confirm that working features still work. AI tools target regression suites most directly: generation expands them, self-healing keeps them green, visual regression catches UI drift.
Exploratory testing: Unscripted testing in which a human (typically a skilled test engineer) probes the application looking for failure modes. AI agents have not displaced this practice in 2026, though some products advertise "continuous exploratory" modes. See AI in QA.
Test oracle: The mechanism that decides whether a test passes or fails. In AI test generation, the oracle is often the developer (review) or a mutation-testing harness (objective). See "oracle problem".
Code coverage: The proportion of source-code lines, branches, or paths exercised by a test suite. A long-established metric, but a weaker signal of test quality than mutation score: lines can be covered without being asserted on. See "mutation score".
Test pyramid: A heuristic for distributing test investment: many fast unit tests, fewer integration tests, very few slow end-to-end tests. AI testing tools redistribute the pyramid: AI E2E generation is faster than human E2E authoring, which can flatten the shape.
World Quality Report (Capgemini): Capgemini's annual industry survey of QA practice. The longest-running primary source for adoption-rate claims about AI in testing; the current edition is WQR 2025-26 (17th edition, published November 2025). See capgemini.com/wqr.

Missing a definition? The site's editorial process documents requested additions. The methodology page describes the citation discipline that any new entry follows.