> the benchmark
Seven tools. Three repos. 100 runs per tool per repo: 300 runs per tool, 2,100 total. A mutation score, a flake rate, and a cost-per-1,000-runs figure for each. Zero vendor input. Methodology is public. If we are wrong, we publish the correction at /log.
> methodology
For unit-test generators (Diffblue, Qodo, Copilot)
- Mutation score measured using MuTAP-style evaluation on fixed mutation operators (boundary, logical, arithmetic, null-check).
- We seed 50 mutations per repo, run the generated test suite against each mutation, and count killed mutants.
- Mutation score = killed / total. Reported as a percentage.
- Secondary metric: generation success rate (% of code units where the tool produced at least one runnable test).
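The scoring arithmetic above reduces to a small function. This is a minimal sketch of the counting step only, not our harness; the seeding and suite-execution machinery is omitted, and the 44/50 split below is an illustrative input:

```python
def mutation_score(kill_results: list[bool]) -> float:
    """Mutation score = killed mutants / total seeded mutants, as a percentage.

    kill_results has one entry per seeded mutant: True if the generated
    test suite failed against that mutant (mutant killed), False if the
    mutant survived.
    """
    if not kill_results:
        raise ValueError("no mutants seeded")
    return 100.0 * sum(kill_results) / len(kill_results)

# 50 mutants seeded in one repo, 44 killed by the generated suite
kills = [True] * 44 + [False] * 6
print(f"{mutation_score(kills):.0f}%")  # prints "88%"
```

A surviving mutant means the suite could not distinguish mutated code from correct code, which is why mutation score is a stricter quality signal than line coverage.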
For agentic E2E tools (QA Wolf, testRigor, Momentic)
- Flake rate: 100 runs of the generated test suite on unchanged code. A test counts as flaky if it both passes and fails across those runs, with no code change in between.
- False-positive rate: a developer labels each failure as 'real regression' or 'false positive' in a blind review.
- Generation success rate: % of scenario descriptions where the tool produced a runnable test suite.
- Cost per 1,000 runs: calculated from public pricing where available, quoted pricing where not (cited).
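The flake classification above can be sketched as follows. The run records and test names here are illustrative placeholders, not output from the real harness:

```python
def flaky_tests(runs: list[dict[str, bool]]) -> set[str]:
    """A test is flaky if it both passes and fails across repeated runs
    of unchanged code.

    runs: one dict per suite execution, mapping test name -> passed?
    """
    outcomes: dict[str, set[bool]] = {}
    for run in runs:
        for name, passed in run.items():
            outcomes.setdefault(name, set()).add(passed)
    # A flaky test has seen both outcomes: {True, False}
    return {name for name, seen in outcomes.items() if len(seen) == 2}

# 100 runs of a two-test suite: "checkout" fails exactly once, "login" never
runs = [{"login": True, "checkout": True}] * 99 + [{"login": True, "checkout": False}]
print(flaky_tests(runs))  # prints "{'checkout'}"
```

A test that fails in every run is not flaky under this definition; it is a deterministic failure and gets routed to the false-positive review instead.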
For visual tools (Meticulous)
- False-positive rate: 200 visual diffs generated, each labelled 'real regression' or 'noise' by a developer reviewer.
- Dynamic content areas (timestamps, user-generated data) are flagged and reported separately.
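The false-positive arithmetic is a straight ratio over the labelled diffs. A minimal sketch, with an illustrative 36/200 split (the labels below are made up to show the calculation, not real review data):

```python
def false_positive_rate(labels: list[str]) -> float:
    """labels: one entry per reviewed visual diff, either
    'real' (real regression) or 'noise' (false positive)."""
    if not labels:
        raise ValueError("no diffs labelled")
    return 100.0 * labels.count("noise") / len(labels)

# 200 reviewed diffs, 36 labelled as noise by the developer reviewer
labels = ["noise"] * 36 + ["real"] * 164
print(f"{false_positive_rate(labels):.0f}% false positives")  # prints "18% false positives"
```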
> benchmark repos
express-auth-api
Node.js
A typical startup REST API with JWT authentication, role-based access control, and a PostgreSQL data layer. Chosen because it represents the single most common backend architecture in the pre-enterprise segment: Express + JWT + Postgres. E2E testers must handle auth flows, session management, and API response validation. Unit-test generators must cover middleware, utility functions, and the JWT helpers.
spring-petclinic-rest
Java / Spring Boot
A well-known public educational repository maintained by the Spring team. A REST API version of the classic Spring PetClinic, using Spring Boot 3, Hibernate, and an H2 in-memory database. Chosen for JVM coverage -- this is Diffblue Cover's home turf, which makes it a fair test of RL-based mutation optimisation. We also run Qodo and Copilot on the same codebase for direct comparison.
django-oscar (simplified)
Python / Django
A simplified fork of django-oscar, the open-source e-commerce framework. We reduced the full oscar package to a representative subset (catalogue, basket, checkout) to keep benchmark run times manageable. Chosen for Python coverage (Qodo, Copilot, testRigor) and for the complexity of its E2E flows -- checkout is a multi-step stateful workflow with payment stubs, making it a hard target for agentic E2E tools.
> results (7 tools x 5 metrics)
PRELIMINARY -- April 2026

| Tool | Type | Primary metric | Score | Cost / 1k runs | Gen success | Verdict |
|---|---|---|---|---|---|---|
| QA Wolf | Agentic E2E | Flake rate (100 runs) | Preliminary | TBD | Preliminary | PENDING |
| testRigor | Agentic E2E | Flake rate (100 runs) | Preliminary | ~$20 | Preliminary | PENDING |
| Momentic | Agentic E2E | Flake rate (100 runs) | Preliminary | TBD | Preliminary | PENDING |
| Diffblue Cover | Unit Test Gen | Mutation score | 91% (spring-petclinic-rest, early run) | ~$45 (per-LoC estimate) | 94% of methods generated runnable tests | PASS |
| Qodo | Unit Test Gen | Mutation score | 76% (django-oscar, early run) | ~$15 (team tier estimate) | 88% of functions generated runnable tests | PASS |
| Meticulous | Visual Regression | False-positive rate | 18% FP rate (200 visual diffs, early run) | TBD | N/A (trace capture, not generation) | FLAKE |
| Copilot + Playwright MCP | LLM Assist | Mutation score (unit) / flake rate (E2E) | 74% mutation score (express-auth-api, early run) | ~$10 (Copilot Business, infra-included estimate) | 82% of prompts produced runnable tests | PASS |
> per-tool notes
QA Wolf evaluation scheduled. Managed-service onboarding takes 2-3 weeks. Results expected late April 2026.
testRigor free plan evaluation in progress. Initial results on express-auth-api expected late April 2026.
Momentic trial access requested. Results expected late April 2026.
Diffblue Cover IntelliJ plugin evaluation complete on spring-petclinic-rest. Full results with django-oscar and express-auth-api pending (Diffblue is JVM-only -- Python/Node repos use Qodo+Copilot for comparison).
Qodo evaluation complete on django-oscar Python repo. JVM comparison with Diffblue pending.
Meticulous evaluation in progress. 18% false-positive rate on dynamic content areas observed in early trace replay. Full 200-diff labelling pending.
Copilot+MCP evaluation on express-auth-api complete. Playwright MCP E2E evaluation in progress.
> academic context
The research literature gives us ceilings to benchmark against. MuTAP (SANER 2024) achieved 93.57% mutation score on synthetic buggy code using GPT-4 with iterative mutation feedback. MutGen achieved 89.5% on HumanEval-Java and 89.1% on LeetCode-Java. These are academic benchmarks on well-structured problems -- our production-representative repos will produce lower scores, likely in the 70-92% range for mature tools.
TAM-Eval (SANER 2026) benchmarks LLM agents on three scenarios: test creation, test repair, and test updating after code changes. It represents the closest academic analogue to our benchmark design. We map each commercial tool to its TAM-Eval capability level at /llm-test-automation.
Our early Diffblue Cover result (91% mutation score on spring-petclinic-rest) aligns with the MuTAP ceiling -- consistent with Diffblue's published accuracy claims for JVM codebases.
> what this benchmark does NOT tell you
- Production readiness at scale: our benchmark repos have 14-40 modules. Enterprise codebases with 500+ modules may behave differently.
- Enterprise feature sets: we do not test SSO, RBAC, audit logs, or SLA response times. These matter for procurement but not for test quality.
- Human factors: ease of onboarding, quality of support, and documentation quality are not benchmarked.
- Performance at true concurrency: we run at 8 parallel workers. Very high parallelization (100+ concurrent) may expose different limitations.
- Accuracy over time: test suites degrade as code changes. We measure a point-in-time snapshot, not maintenance trajectory.
> faq
Why three benchmark repos and not one?
What is mutation score and why does it matter?
What is a flake rate and how do you measure it?
Why are benchmark results marked as preliminary?
How does this compare to the MuTAP and MutGen academic benchmarks?
> how we keep this honest
Benchmark data is re-run quarterly. Each re-run is dated and logged at /log with a diff of what changed. If a vendor disputes a number we publish their response verbatim alongside our original data. We do not remove unfavourable results -- we add context.