> the benchmark
Seven tools. Three repos. 100 runs per tool per repo: 300 runs per tool, 2,100 total. A mutation score, a flake rate, and a cost-per-1,000-runs figure for each. Zero vendor input. Methodology is public. If we are wrong, we publish the correction at /log.
> methodology
For unit-test generators (Diffblue, Qodo, Copilot)
- Mutation score measured using MuTAP-style evaluation on fixed mutation operators (boundary, logical, arithmetic, null-check).
- We seed 50 mutations per repo, run the generated test suite against each mutation, and count killed mutants.
- Mutation score = killed / total. Reported as a percentage.
- Secondary metric: generation success rate (% of code units where the tool produced at least one runnable test).
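The scoring arithmetic above reduces to a small function. This is a minimal sketch of the counting step only, not our harness; the seeding and suite-execution machinery is omitted, and the 44/50 split below is an illustrative input:

```python
def mutation_score(kill_results: list[bool]) -> float:
    """Mutation score = killed mutants / total seeded mutants, as a percentage.

    kill_results has one entry per seeded mutant: True if the generated
    test suite failed against that mutant (mutant killed), False if the
    mutant survived.
    """
    if not kill_results:
        raise ValueError("no mutants seeded")
    return 100.0 * sum(kill_results) / len(kill_results)

# 50 mutants seeded in one repo, 44 killed by the generated suite
kills = [True] * 44 + [False] * 6
print(f"{mutation_score(kills):.0f}%")  # prints "88%"
```

A surviving mutant means the suite could not distinguish mutated code from correct code, which is why mutation score is a stricter quality signal than line coverage.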
For agentic E2E tools (QA Wolf, testRigor, Momentic)
- Flake rate: 100 runs of the generated test suite on unchanged code. A test counts as flaky if it both passes and fails across those runs, with no code change in between.
- False-positive rate: a developer labels each failure as 'real regression' or 'false positive' in a blind review.
- Generation success rate: % of scenario descriptions where the tool produced a runnable test suite.
- Cost per 1,000 runs: calculated from public pricing where available, quoted pricing where not (cited).
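The flake classification above can be sketched as follows. The run records and test names here are illustrative placeholders, not output from the real harness:

```python
def flaky_tests(runs: list[dict[str, bool]]) -> set[str]:
    """A test is flaky if it both passes and fails across repeated runs
    of unchanged code.

    runs: one dict per suite execution, mapping test name -> passed?
    """
    outcomes: dict[str, set[bool]] = {}
    for run in runs:
        for name, passed in run.items():
            outcomes.setdefault(name, set()).add(passed)
    # A flaky test has seen both outcomes: {True, False}
    return {name for name, seen in outcomes.items() if len(seen) == 2}

# 100 runs of a two-test suite: "checkout" fails exactly once, "login" never
runs = [{"login": True, "checkout": True}] * 99 + [{"login": True, "checkout": False}]
print(flaky_tests(runs))  # prints "{'checkout'}"
```

A test that fails in every run is not flaky under this definition; it is a deterministic failure and gets routed to the false-positive review instead.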
For visual tools (Meticulous)
- False-positive rate: 200 visual diffs generated, each labelled 'real regression' or 'noise' by a developer reviewer.
- Dynamic content areas (timestamps, user-generated data) are flagged and reported separately.
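The false-positive arithmetic is a straight ratio over the labelled diffs. A minimal sketch, with an illustrative 36/200 split (the labels below are made up to show the calculation, not real review data):

```python
def false_positive_rate(labels: list[str]) -> float:
    """labels: one entry per reviewed visual diff, either
    'real' (real regression) or 'noise' (false positive)."""
    if not labels:
        raise ValueError("no diffs labelled")
    return 100.0 * labels.count("noise") / len(labels)

# 200 reviewed diffs, 36 labelled as noise by the developer reviewer
labels = ["noise"] * 36 + ["real"] * 164
print(f"{false_positive_rate(labels):.0f}% false positives")  # prints "18% false positives"
```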
> benchmark repos
express-auth-api
Node.js
A typical startup REST API with JWT authentication, role-based access control, and a PostgreSQL data layer. Chosen because it represents the single most common backend architecture in the pre-enterprise segment: Express + JWT + Postgres. E2E testers must handle auth flows, session management, and API response validation. Unit-test generators must cover middleware, utility functions, and the JWT helpers.
spring-petclinic-rest
Java / Spring Boot
A well-known public educational repository maintained by the Spring team. A REST API version of the classic Spring PetClinic, using Spring Boot 3, Hibernate, and an H2 in-memory database. Chosen for JVM coverage -- this is Diffblue Cover's home turf, which makes it a fair test of RL-based mutation optimisation. We also run Qodo and Copilot on the same codebase for direct comparison.
django-oscar (simplified)
Python / Django
A simplified fork of django-oscar, the open-source e-commerce framework. We reduced the full oscar package to a representative subset (catalogue, basket, checkout) to keep benchmark run times manageable. Chosen for Python coverage (Qodo, Copilot, testRigor) and for the complexity of its E2E flows -- checkout is a multi-step stateful workflow with payment stubs, making it a hard target for agentic E2E tools.
> results (7 tools x 5 metrics)
PRELIMINARY -- April 2026

| Tool | Type | Primary metric | Score | Cost / 1k runs | Gen success | Verdict |
|---|---|---|---|---|---|---|
| QA Wolf | Agentic E2E | Flake rate (100 runs) | Preliminary | TBD | Preliminary | PENDING |
| testRigor | Agentic E2E | Flake rate (100 runs) | Preliminary | ~$20 | Preliminary | PENDING |
| Momentic | Agentic E2E | Flake rate (100 runs) | Preliminary | TBD | Preliminary | PENDING |
| Diffblue Cover | Unit Test Gen | Mutation score | 91% (spring-petclinic-rest, early run) | ~$45 (per-LoC estimate) | 94% of methods generated runnable tests | PASS |
| Qodo | Unit Test Gen | Mutation score | 76% (django-oscar, early run) | ~$15 (team tier estimate) | 88% of functions generated runnable tests | PASS |
| Meticulous | Visual Regression | False-positive rate | 18% FP rate (200 visual diffs, early run) | TBD | N/A (trace capture, not generation) | FLAKE |
| Copilot + Playwright MCP | LLM Assist | Mutation score (unit) / flake rate (E2E) | 74% mutation score (express-auth-api, early run) | ~$10 (Copilot Business, infra-included estimate) | 82% of prompts produced runnable tests | PASS |
> per-tool notes
QA Wolf evaluation scheduled. Managed-service onboarding takes 2-3 weeks. Results expected late April 2026.
testRigor free plan evaluation in progress. Initial results on express-auth-api expected late April 2026.
Momentic trial access requested. Results expected late April 2026.
Diffblue Cover IntelliJ plugin evaluation complete on spring-petclinic-rest. Full results with django-oscar and express-auth-api pending (Diffblue is JVM-only -- Python/Node repos use Qodo+Copilot for comparison).
Qodo evaluation complete on django-oscar Python repo. JVM comparison with Diffblue pending.
Meticulous evaluation in progress. 18% false-positive rate on dynamic content areas observed in early trace replay. Full 200-diff labelling pending.
Copilot+MCP evaluation on express-auth-api complete. Playwright MCP E2E evaluation in progress.
> academic context
The research literature gives us ceilings to benchmark against. MuTAP (SANER 2024) achieved 93.57% mutation score on synthetic buggy code using GPT-4 with iterative mutation feedback. MutGen achieved 89.5% on HumanEval-Java and 89.1% on LeetCode-Java. These are academic benchmarks on well-structured problems -- our production-representative repos will produce lower scores, likely in the 70-92% range for mature tools.
TAM-Eval (SANER 2026) benchmarks LLM agents on three scenarios: test creation, test repair, and test updating after code changes. It represents the closest academic analogue to our benchmark design. We map each commercial tool to its TAM-Eval capability level at /llm-test-automation.
Our early Diffblue Cover result (91% mutation score on spring-petclinic-rest) aligns with the MuTAP ceiling -- consistent with Diffblue's published accuracy claims for JVM codebases.
> what this benchmark does NOT tell you
- Production readiness at scale: our benchmark repos have 14-40 modules. Enterprise codebases with 500+ modules may behave differently.
- Enterprise feature sets: we do not test SSO, RBAC, audit logs, or SLA response times. These matter for procurement but not for test quality.
- Human factors: ease of onboarding, quality of support, and documentation quality are not benchmarked.
- Performance at true concurrency: we run at 8 parallel workers. Very high parallelization (100+ concurrent) may expose different limitations.
- Accuracy over time: test suites degrade as code changes. We measure a point-in-time snapshot, not maintenance trajectory.
> faq
Why three benchmark repos and not one?
What is mutation score and why does it matter?
What is a flake rate and how do you measure it?
Why are benchmark results marked as preliminary?
How does this compare to the MuTAP and MutGen academic benchmarks?
> how we keep this honest
Benchmark data is re-run quarterly. Each re-run is dated and logged at /log with a diff of what changed. If a vendor disputes a number we publish their response verbatim alongside our original data. We do not remove unfavourable results -- we add context.