Synthesis | Last verified April 2026

The economics of test flakiness: what it costs, what AI does about it.

Flaky tests cost real money in CI minutes, engineer attention, and deploy velocity. AI flake-detection and self-healing tools reduce some of this cost. The marketing claims around flake reduction are often more dramatic than the math supports. This page walks through the real cost components, the tools that address each, and the operational discipline that makes any of it work.

The four causes of flakiness

Timing. A test asserts that a UI element is visible before the page has finished rendering. Sometimes the render finishes in time and the test passes; sometimes it does not and the test fails. Fix: explicit waits with proper polling, not arbitrary sleeps.
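As a minimal sketch of an explicit wait with polling, the helper below re-checks a condition until it holds or a deadline passes. The name `wait_until` and its default `timeout` and `interval` values are illustrative assumptions, not part of any particular test framework; most UI frameworks ship their own equivalent.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 5.0,
               interval: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True as soon as the condition holds, False on timeout.
    (Hypothetical helper; names and defaults are assumptions.)
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        # Short sleep between polls, not one arbitrary long sleep.
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for "the element becomes visible after rendering finishes".
    visible_at = time.monotonic() + 0.2
    rendered = wait_until(lambda: time.monotonic() >= visible_at, timeout=2.0)
    assert rendered, "element never became visible"
```

Unlike a fixed sleep, the wait returns as soon as the condition is true, so the test is both faster on a quick render and more tolerant of a slow one.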

Shared mutable state. Two tests share a database, a cache, or a global state. Running test A then B passes; running B then A fails. Fix: hermetic tests with isolated state, or careful ordering and reset.
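One way to picture hermetic state: each test builds and tears down its own in-memory database, so execution order cannot matter. This is a sketch using Python's standard `sqlite3` module; the `fresh_db` helper and the `users` table are hypothetical names chosen for illustration.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def fresh_db() -> Iterator[sqlite3.Connection]:
    """Yield a brand-new in-memory database that no other test can see."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY)")
    try:
        yield conn
    finally:
        conn.close()  # state is discarded with the connection


def test_a_inserts_user() -> None:
    with fresh_db() as db:
        db.execute("INSERT INTO users VALUES ('alice')")
        count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
        assert count == 1


def test_b_expects_empty_table() -> None:
    # Would fail after test A if both tests shared one database.
    with fresh_db() as db:
        count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
        assert count == 0
```

Because neither test can observe the other's writes, running A then B and B then A give the same result.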

Network instability. A test depends on an external service that is occasionally slow or down. Fix: mock the external service, or use a stable test-environment replica.
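A minimal sketch of mocking an external service, assuming the code under test accepts its HTTP fetcher as a parameter. The function `get_exchange_rate`, the URL, and the payload shape are all hypothetical; only `unittest.mock.Mock` is a real standard-library API.

```python
from typing import Callable
from unittest import mock


def get_exchange_rate(fetch: Callable[[str], dict], currency: str) -> float:
    """Code under test: asks an external service for a rate.

    `fetch` is injected so that tests can replace the network call.
    """
    payload = fetch(f"https://rates.example.invalid/latest?to={currency}")
    return float(payload["rate"])


def test_exchange_rate_without_network() -> None:
    # The mock answers instantly and identically on every run.
    fake_fetch = mock.Mock(return_value={"rate": "1.25"})
    assert get_exchange_rate(fake_fetch, "EUR") == 1.25
    fake_fetch.assert_called_once_with(
        "https://rates.example.invalid/latest?to=EUR"
    )
```

The test no longer depends on the remote service being up or fast; the trade-off is that it also no longer verifies the real service's behaviour, which is what a stable test-environment replica is for.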

Locator brittleness. A test references a UI element by a selector that sometimes resolves and sometimes does not (often because the locator depends on dynamic content). Fix: more robust locators, or self-healing locator tooling.

Self-healing tools (Mabl, Testim, Functionize, Reflect) address only the fourth cause. The first three remain engineering problems regardless of the AI testing platform in use.

The direct CI cost

A flaky test that fails 5 percent of the time on a 1,000-PR-per-month team forces a retry on roughly 50 PRs per month, assuming the test runs once per PR. If the test takes 2 minutes to run and the team retries up to twice, that is at most 50 × 2 × 2 = 200 extra CI minutes per month for that one flaky test; the expected figure is lower, because most first retries pass. Across a typical suite with dozens of flaky tests at varying flake rates, the total can run into thousands of CI minutes per month.
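The arithmetic above can be sketched as a small function. It reports both the worst case (every flaky failure uses all allowed retries) and an expected value; the expected value rests on an assumption that is not in the text above, namely that each attempt fails independently with the same probability. The function name and parameters are illustrative.

```python
from typing import Tuple


def retry_minutes(prs_per_month: int,
                  flake_rate: float,
                  test_minutes: float,
                  max_retries: int) -> Tuple[float, float]:
    """Return (expected, worst_case) extra CI minutes per month for one test.

    Assumes one run of the test per PR and independent attempts
    (both are simplifying assumptions).
    """
    # Worst case: every PR that hits the flake burns every allowed retry.
    worst_case = prs_per_month * flake_rate * max_retries * test_minutes
    # Expected: retry k only happens if the first k attempts all failed.
    retries_per_pr = sum(flake_rate ** k for k in range(1, max_retries + 1))
    expected = prs_per_month * retries_per_pr * test_minutes
    return expected, worst_case


if __name__ == "__main__":
    expected, worst = retry_minutes(1_000, 0.05, 2.0, 2)
    print(f"expected about {expected:.0f} min, worst case {worst:.0f} min")
```

With the figures used above (1,000 PRs, 5 percent, 2 minutes, 2 retries) the worst case is 200 minutes and the expected value under the independence assumption is about 105 minutes.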

At GitHub Actions Linux rates (see AI testing in GitHub Actions for the cost framing), the dollar cost of CI minutes from flaky-test retries is modest in absolute terms but visible on the bill. For teams running Windows or macOS runners or expensive vendor-managed test platforms with per-minute charging, the cost is larger.

The indirect cost: engineer attention

The CI cost is the easy part. The harder cost is engineer attention diverted to triage. When a developer's PR fails because of a flaky test, the developer has to: read the failure, recognise it as a flake (or fail to recognise it), retry, wait again, and either merge despite the original failure or investigate further. Each cycle takes 5 to 15 minutes of focused attention.

At a fully loaded engineer cost of around $100 to $200 per hour, even modest flake rates burn meaningful engineering budget. A 50-engineer team with 1,000 PRs per month and a 5 percent flake-encounter rate hits about 50 flake encounters per month; at roughly 10 minutes each, that is 500 minutes (about 8.3 hours) of engineer triage time per month. At $150 per hour, that is roughly $1,250 per month, or $15,000 per year, before counting the productivity drag from broken flow state.
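The triage estimate can be parameterised so a team can plug in its own numbers. This is a sketch; the function name is hypothetical and the model is deliberately simple (a flat number of minutes per flake encounter, no flow-state penalty). It works from unrounded hours, so 500 minutes is treated as about 8.33 hours.

```python
from typing import Tuple


def triage_cost(prs_per_month: int,
                encounter_rate: float,
                minutes_per_encounter: float,
                hourly_rate: float) -> Tuple[float, float, float]:
    """Return (hours_per_month, cost_per_month, cost_per_year)."""
    encounters = prs_per_month * encounter_rate
    hours = encounters * minutes_per_encounter / 60.0
    monthly = hours * hourly_rate
    return hours, monthly, monthly * 12


if __name__ == "__main__":
    hours, monthly, yearly = triage_cost(1_000, 0.05, 10, 150)
    print(f"{hours:.1f} h/month, ${monthly:,.0f}/month, ${yearly:,.0f}/year")
```

Varying `minutes_per_encounter` across the 5-to-15-minute range and `hourly_rate` across $100 to $200 gives a band rather than a single figure, which is usually the more defensible thing to present.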

Google's 2019 paper on flaky tests in continuous integration (research.google) reported the scale of this problem at search-scale operations. For a 50-engineer company the absolute numbers are smaller but the proportional drag is comparable.

The indirect cost: deploy velocity

Flaky tests delay merges. Delayed merges delay deploys. Delayed deploys slow the feedback loop between code change and production validation. The DORA State of DevOps Report (dora.dev) measures deployment frequency as a key DevOps metric; flaky tests are one of the contributors to slow deployment frequency in many organisations.

The honest framing: deploy velocity is a metric leadership pays attention to, and flake management is a high-leverage way to improve it. The conversation with executives is easier when framed as deploy velocity rather than as test maintenance.

What AI flake-detection actually does

AI flake-detection systems (built into CircleCI, GitLab CI, and several commercial platforms) identify which tests pass and fail intermittently across recent runs with no code change in between. The detection itself is a well-defined statistical problem, and AI-augmented detection improves on simple heuristics by classifying ambiguous cases more reliably.
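For orientation, the simple heuristic that such systems improve on can be sketched in a few lines: a test is flagged when the same commit has produced both a pass and a fail. This is an illustrative baseline, not a description of how any named platform works; the `Run` record and `find_flaky` function are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Run(NamedTuple):
    test: str     # test identifier
    commit: str   # commit SHA the run executed against
    passed: bool


def find_flaky(runs: Iterable[Run]) -> List[str]:
    """Return tests that both passed and failed on at least one commit."""
    outcomes: Dict[Tuple[str, str], Set[bool]] = defaultdict(set)
    for run in runs:
        outcomes[(run.test, run.commit)].add(run.passed)
    flaky = {test for (test, _), seen in outcomes.items()
             if seen == {True, False}}
    return sorted(flaky)


if __name__ == "__main__":
    history = [
        Run("test_login", "abc123", True),
        Run("test_login", "abc123", False),  # same commit, mixed result
        Run("test_cart", "abc123", False),
        Run("test_cart", "def456", True),    # different commit: not evidence
    ]
    print(find_flaky(history))
```

The heuristic misses flakes that were never re-run on the same commit and says nothing about impact ranking, which is where the more sophisticated classification earns its keep.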

The reporting surface (a dashboard of flaky tests, ranked by impact, with suggested actions) makes the flake problem visible to engineering leadership in a way that ad-hoc per-PR encounters do not. This visibility is often the first useful contribution of AI flake-detection; the technical detection is incremental, but the visibility is transformational for organisations where flakes were previously invisible.

Quarantine is the standard operational response: move the flaky test out of the blocking path, file a follow-up engineering task to investigate, keep the pipeline moving. Done well, quarantine is a triage tool; done poorly, it becomes a graveyard of permanently-quarantined tests that no longer test anything.
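One way to keep quarantine from turning into a graveyard is to record each entry with a follow-up ticket and an expiry date, then surface anything overdue. The sketch below is an assumption about how a team might structure this, not a feature of any specific CI system; the `Quarantine` record, its fields, and the ticket IDs are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class Quarantine:
    test: str      # quarantined test identifier
    ticket: str    # follow-up engineering task (hypothetical ID format)
    expires: date  # date by which the test must be fixed or deleted


def overdue(entries: Iterable[Quarantine], today: date) -> List[Quarantine]:
    """Return quarantine entries whose expiry date has already passed."""
    return [entry for entry in entries if entry.expires < today]


if __name__ == "__main__":
    entries = [
        Quarantine("test_checkout_flow", "ENG-101", date(2026, 4, 1)),
        Quarantine("test_search_filters", "ENG-102", date(2026, 5, 1)),
    ]
    for entry in overdue(entries, today=date(2026, 4, 15)):
        print(f"{entry.test} is past its quarantine expiry ({entry.ticket})")
```

A scheduled job that fails, or at least warns, when `overdue` returns anything keeps the quarantine list a triage tool rather than a permanent parking lot.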

Retry is a tactical response: run the test again hoping for a different outcome. Sometimes that is the right call (the test was flaky and the retry succeeded); often it masks the problem (the retry succeeded by chance and the underlying flake persists). The right discipline is to log retries, treat sustained retry rates as bugs, and not let retries-forever become normal.
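A minimal sketch of bounded retries that leave a trail: every retry is logged and counted per test, so sustained retry rates can be reviewed later. The wrapper `run_with_retry` and the `retry_counts` counter are hypothetical names; real test runners usually provide their own retry hooks.

```python
import logging
from collections import Counter
from typing import Callable

logger = logging.getLogger("flake_retries")
retry_counts: Counter = Counter()  # test name -> number of retries consumed


def run_with_retry(name: str,
                   test_fn: Callable[[], None],
                   max_retries: int = 2) -> int:
    """Run `test_fn`, retrying on AssertionError up to `max_retries` times.

    Returns the number of retries that were needed. Re-raises the last
    failure once the retry budget is exhausted.
    """
    for attempt in range(max_retries + 1):
        try:
            test_fn()
        except AssertionError:
            if attempt == max_retries:
                raise  # bounded: never retry forever
            retry_counts[name] += 1
            logger.warning("retrying %s after failed attempt %d",
                           name, attempt + 1)
        else:
            return attempt
    raise RuntimeError("unreachable")  # loop always returns or raises
```

A test that passes only after retries still shows up in `retry_counts`, which is the signal a plain "green on retry" pipeline throws away.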

Fix is the strategic response: address the underlying cause. This is real engineering work and is the only response that produces durable improvement.

What self-healing tools do (and do not do)

Self-healing locator tools (Mabl, Testim, Functionize, Reflect) address locator-brittleness specifically. When a primary selector fails, the runner falls back to alternative identifiers and continues the test. This is real value when locator brittleness is the dominant flake cause.
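The fallback idea can be sketched generically: try an ordered list of selectors and report when anything other than the primary one matched. This illustrates the concept only; it is not how Mabl, Testim, Functionize, or Reflect implement it, and the `find_with_fallback` function, the selectors, and the dictionary standing in for a page are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def find_with_fallback(lookup: Callable[[str], Optional[T]],
                       selectors: Sequence[str]) -> Tuple[T, str, bool]:
    """Try each selector in order and return the first match.

    Returns (element, selector_used, healed), where `healed` is True when
    the primary selector failed and a fallback matched instead.
    """
    for index, selector in enumerate(selectors):
        element = lookup(selector)
        if element is not None:
            return element, selector, index > 0
    raise LookupError(f"no selector matched: {list(selectors)}")


if __name__ == "__main__":
    # Stand-in for a rendered page: selector -> element.
    page = {"[data-testid=submit]": "<button>Submit</button>"}
    element, used, healed = find_with_fallback(
        page.get,
        ["#submit-btn-4821", "[data-testid=submit]", "text=Submit"],
    )
    print(used, healed)
```

Surfacing the `healed` flag matters: a silent fallback keeps the test green but hides that the primary locator has rotted.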

Self-healing does not address timing flakes, shared-state flakes, or network-instability flakes. For teams whose flake profile is dominated by these other causes, self-healing does not move the needle. The honest evaluation involves understanding the team's actual flake profile before assuming self-healing will solve it.

See the self-healing tests category page for the detailed mechanism and Mabl vs Testim for the most common vendor pairing.

The honest budget conversation

For a typical 50-engineer team with moderate flake rates, the realistic numbers are: $1,000 to $5,000 per year in direct CI cost from retries, $10,000 to $50,000 per year in indirect engineer-time cost, and a modest but real impact on deploy velocity. The two quantified lines sum to $11,000 to $55,000; with an allowance for the deploy-velocity impact, the total cost of flakes lands in roughly the $15,000 to $60,000 range annually.

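AI flake-detection tools that cost $10,000 to $30,000 per year can pay for themselves on the indirect cost line alone, but only under the right conditions: a 30 percent cut in flake-encounter rates saves $3,000 to $15,000 per year against a $10,000 to $50,000 indirect cost, which covers a tool priced at the low end of that range only when the team's indirect cost is at the high end. Self-healing tools that cost more (Mabl, Testim, similar) need to deliver larger reductions to justify the cost; they often do for teams whose flake profile is locator-dominated, and often do not for teams whose flake profile is elsewhere.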
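A break-even check makes the payback condition concrete. The sketch assumes savings scale linearly with the reduction in flake-encounter rate, which is a simplification; the function names are illustrative.

```python
def annual_savings(indirect_cost: float, reduction: float) -> float:
    """Savings if flake encounters drop by `reduction` (0.30 = 30 percent)."""
    return indirect_cost * reduction


def breakeven_reduction(tool_cost: float, indirect_cost: float) -> float:
    """Fractional reduction needed for savings to equal the tool's cost.

    A result above 1.0 means the tool cannot pay for itself on the
    indirect cost line alone.
    """
    return tool_cost / indirect_cost


if __name__ == "__main__":
    for indirect in (10_000, 30_000, 50_000):
        needed = breakeven_reduction(10_000, indirect)
        print(f"indirect ${indirect:,}: need {needed:.0%} reduction "
              f"to cover a $10,000 tool")
```

Running the same check against the team's own indirect-cost estimate, rather than a vendor's headline reduction figure, shows quickly whether the purchase clears the bar.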

The honest conversation with finance includes the indirect cost; the conversation that focuses only on CI minutes understates the value of flake management by an order of magnitude.

Frequently asked questions

What counts as a flaky test?

A test that passes and fails intermittently on the same code. Common causes: timing dependencies, shared mutable state, network instability, locator brittleness, environment drift. Tests that fail for a real reason (the code is wrong) are not flakes; tests that pass when they should fail are also not flakes (they are bugs, but a different kind).

How much do flaky tests actually cost?

The direct cost is CI minutes consumed by retries. The indirect cost is engineer attention diverted to triage and the deploy-velocity penalty when flakes block merges. Google's 2019 paper put the engineering productivity cost at significant levels for a search-scale operation; for a typical 50-engineer company the cost is smaller in absolute terms but still meaningful relative to engineering budget.

Does AI flake detection actually work?

Yes, with caveats. Detecting which tests are flaky (versus genuinely failing or genuinely passing) is a well-defined statistical problem and AI-augmented detection improves on simple heuristics. Detecting why a test is flaky and how to fix it is harder; AI can suggest hypotheses but the engineer still does the diagnosis.

Quarantine vs retry vs fix?

Quarantine moves the test out of the blocking path; retry runs it again hoping for a different outcome; fix removes the underlying flakiness. The right strategy is fix when feasible, quarantine when the fix is delayed, retry sparingly because retries hide the underlying signal. Retries-forever is the worst pattern because it normalises flakiness.

Does adopting AI testing reduce flakes?

Self-healing locator tools reduce one specific class of flake (locator-brittleness). They do not address timing flakes, shared-state flakes, or network-instability flakes. Treating self-healing as a complete flake-management solution misses three of the four major flake causes.
