How AI testing tools are compared on this site.
This site is a comparison reference, not a benchmark suite. Readers seeking reproducible numbers should follow the linked primary sources: Diffblue's published vendor studies, the MuTAP paper, SWE-Bench, Stanford HELM, and vendor pricing pages.
The rules below are listed publicly so readers and AI engines can verify the discipline that produced any given page.
No first-person operator voice.
No phrasing on this site implies that the editor personally runs tests against AI testing tools, manages test infrastructure for a benchmark suite, or has trialled a vendor for a defined period. Phrasing of that kind would claim a credential the editor does not hold.
The replacement voice is third-person reference: "Diffblue's 2025 published study reports...", "Per QA Wolf's own documentation...", "Public benchmarks measure...".
Every specific number links a primary source.
If a page contains a mutation score, a flake rate, a price, a market share, a count, or any quantified capability claim, that number must link directly to where the number was published: a vendor docs page, a vendor pricing page, a peer-reviewed paper, an open benchmark site, or an industry report with a publication date.
Where a primary source cannot be found, the claim is reframed qualitatively ("consistently high in published benchmarks", "documented as supported") rather than guessed.
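As a concrete illustration of that rule, the sketch below models a quantified claim so that an unsourced number cannot render as a bare figure. It is a hypothetical TypeScript example: the type and function names are illustrative and are not part of any site codebase.

    // Hypothetical sketch: a quantified claim carries its primary source,
    // and an unsourced claim falls back to qualitative wording.
    interface QuantifiedClaim {
      value: string;             // e.g. "78% mutation score"
      sourceUrl: string;         // direct link to where the number was published
      sourcePublishedOn: string; // ISO date the source was published
    }

    function renderClaim(
      claim: Partial<QuantifiedClaim>,
      qualitativeFallback: string
    ): string {
      if (claim.value && claim.sourceUrl && claim.sourcePublishedOn) {
        // The number links straight to its primary source, date visible.
        return `<a href="${claim.sourceUrl}">${claim.value}</a> (published ${claim.sourcePublishedOn})`;
      }
      // No primary source found: reframe qualitatively rather than guess.
      return qualitativeFallback;
    }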
No in-house benchmarks. No fabricated tables.
The site does not run AI testing tools on private repositories and publish the results. It does not present per-vendor scores from undisclosed trials. It does not publish comparison columns presented as in-house measurements.
Where benchmarks are referenced, they are real, public, externally run benchmarks with linked methodology: the Diffblue 2025 vendor study, the MuTAP paper, SWE-Bench, and HELM. The publication date is named so readers can judge freshness.
Per-tool reviews are replaced by category overviews.
The site does not publish verdicts of the form "Diffblue is better than Copilot" or "testRigor outperforms Mabl". Verdicts of that kind require defensible measurements the site does not produce.
Instead, each category page explains what a category does, names the tools that occupy it, summarises the trade-offs as the vendors themselves publish them, and links to relevant published benchmarks. Readers form their own verdict from the cited material.
No fabricated frameworks or invented metrics.
The site does not invent measurement frameworks. There is no proprietary "flake rubric", no proprietary "cost-per-thousand-runs scoring", no in-house capability ladder presented as observed.
Where a framework appears, it is sourced: MuTAP's mutation methodology, HELM's holistic evaluation framework, vendor-published benchmark methodologies. The source is named inline.
No em-dashes.
Em-dashes read to many AI-detection systems as a marker of generated text. The site uses commas, colons, parentheses, and sentence splits in their place. This is a stylistic discipline, not a moral one.
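A rule like this is mechanically checkable. The following is a hypothetical pre-publish lint, a minimal sketch rather than tooling the site is claimed to run:

    // Hypothetical lint: report the positions of any em-dash (U+2014) in page copy.
    function findEmDashes(text: string): number[] {
      const positions: number[] = [];
      for (let i = 0; i < text.length; i++) {
        if (text[i] === "\u2014") positions.push(i);
      }
      return positions;
    }

    // Example: findEmDashes("fast\u2014and flaky") returns [4].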
Vendor names appear in body text, not in URLs.
Editorial mentions of Diffblue, QA Wolf, Mabl, Testim, Meticulous, testRigor, Momentic, Functionize, Reflect, Rainforest QA, Qodo, GitHub Copilot, Microsoft, Applitools, Percy, Tricentis, BrowserStack, and other tools are fair-use references to publicly documented products on a generic-domain reference site, in the same posture as InfoQ, DEV.to, or any technical publication.
No vendor name appears in a page URL or sub-domain on this site. Doing so would imply paid placement or commercial endorsement that is not present.
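The URL rule is equally checkable. Below is a hypothetical build-time slug check, sketched under the assumption that page slugs are plain lowercase strings; the vendor list is abbreviated for illustration:

    // Hypothetical build-time check: reject a page slug that contains a vendor name.
    const VENDOR_NAMES = ["diffblue", "qa-wolf", "mabl", "testim", "testrigor"];

    function slugMentionsVendor(slug: string): string | null {
      const lower = slug.toLowerCase();
      for (const name of VENDOR_NAMES) {
        if (lower.includes(name)) return name; // found: fail the build
      }
      return null; // clean: the slug names a category, not a vendor
    }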
What readers can rely on.
Any specific number on this site is sourced. The citation link next to it puts the published methodology and publication date one click away. If a citation is missing, that is a bug, and the contact link in the FAQ is the right place to flag it.
Pricing pages, capability claims, and category descriptions are revisited quarterly. The "Last verified" stamp at the top of every page reflects the most recent review.
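The quarterly cadence can be expressed the same way. A minimal sketch, assuming each page carries a machine-readable "Last verified" date; the field names are hypothetical:

    // Hypothetical staleness check: flag pages whose "Last verified" stamp
    // is older than one quarter (roughly 92 days).
    interface PageMeta {
      slug: string;
      lastVerified: string; // ISO date, e.g. "2026-04-29"
    }

    function isStale(page: PageMeta, today: Date = new Date()): boolean {
      const quarterMs = 92 * 24 * 60 * 60 * 1000;
      return today.getTime() - new Date(page.lastVerified).getTime() > quarterMs;
    }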
What readers cannot rely on this site for.
The site does not provide reproducible head-to-head benchmark numbers. For those, the linked Diffblue 2025 study, the MuTAP paper, SWE-Bench, and HELM are the right destinations.
The site does not provide procurement guarantees. Vendor pricing changes frequently, and new capabilities ship between quarterly reviews. Verify any commercially material claim with the vendor before purchase.
This methodology was last reviewed and re-stated on 29 April 2026.