The Testing Disease
Why Most AI-Generated Tests Are Theater
Dylan Ausherman
AUSH AI — dausherman@aush.solutions
Paper 002
April 17, 2026
Abstract
AI-assisted test generation has become standard practice, but the tests produced by pointing a language model at source code are systematically tautological. They inherit the source's assumptions, cannot disagree with the code that generated them, and pass regardless of whether the code is correct. Code coverage, the industry's default quality metric, measures execution rather than verification and is easily satisfied by tests that prove nothing. This paper argues the problem is not the capability of the AI but the direction of flow: tests written from source are collaborators with the code, while tests written from specifications are adversaries that can expose it. We propose a specification-first architecture in which an analyzer produces behavioral specifications from requirements, an isolated executor writes tests from those specifications, and a neutral mutation-testing auditor validates that the tests actually catch bugs. The system operates in seven phases (Infection, Incubation, Outbreak, Mutation, Spread, Immunity, Pandemic) with enforced boundaries between analysis, execution, and audit. We demonstrate that this division prevents the self-referential failure mode that produces high-coverage, low-defense test suites and provides the first honest measurement of test quality at scale.
Keywords: AI-assisted testing · mutation testing · specification-first testing · test tautology · property-based testing · Testing Trophy · software quality · engineering architecture
I asked an AI to write tests for my code.
It did. Coverage hit 94%. Every test passed. I shipped. A bug came back within the week that any real test would have caught.
When I went back and looked at the tests, I understood immediately. They weren't wrong. They were agreeing. Every test matched the shape of the source code it was testing, because the source was what the AI read to write the tests. The tests were photographs of the code, not interrogations of it.
That's when I realized the problem wasn't the AI. It was the direction of flow.
Tests Are Not Documentation
The way most people think about tests is wrong, and AI makes the wrongness visible.
A test is not documentation of how the code works. If that's what you want, read the code. A test is an adversarial claim. It says: this is what should happen, and I will fail if it doesn't. The word "should" does all the work. The test has to know what the right answer is, independent of what the code produces. Otherwise it's not a test. It's an echo.
When an AI reads source code and writes tests from it, the tests inherit the code's assumptions. If the code has a bug, the AI writes tests that pass against the bug. If the code handles an edge case incorrectly, the AI writes a test that confirms the incorrect handling. There's no outside reference point. The test and the code are the same claim stated twice.
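The failure mode is easy to reproduce. A minimal sketch, where the discount helper, its assumed 50 percent cap requirement, and both tests are invented for illustration:

```typescript
// Hypothetical helper. The (assumed) requirement: the discount rate is capped at 50%.
function applyDiscount(price: number, discount: number): number {
  return price - price * discount; // bug: no cap applied
}

// Test written FROM the source: the expected value was computed by reading the
// code, so it encodes the bug and passes.
const tautological = applyDiscount(100, 0.8) === 20;

// Test written FROM the spec: "a customer never pays less than half price."
// It has an outside reference point, so it can disagree with the code.
const specDriven = applyDiscount(100, 0.8) >= 50;

console.log({ tautological, specDriven }); // { tautological: true, specDriven: false }
```

The source-derived test and the code are the same claim stated twice; only the spec-derived test can fail for a real reason.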
Erich Gamma and Kent Beck called this being "test infected." The idea is that once a developer writes their first real test, watches it fail on a real bug, and feels the relief of catching something before users did, they can't stop. They're infected. The disease spreads through the codebase. But the infection doesn't come from the test passing. It comes from the test catching something. Without that moment of catch, tests are decoration.
A test that could never fail is not a test. It's a recording.
Coverage Is a Compliant Metric
Code coverage is the standard answer to "how much do we test?" and it is a terrible answer.
Coverage tells you what lines of code were executed during a test run. It does not tell you whether anything about those lines was actually verified. You can run a function, watch it execute, and assert nothing. Coverage goes up. Quality does not move.
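A sketch of how cheap coverage is to satisfy, using a hypothetical helper:

```typescript
// Hypothetical helper under "test".
function parseAmount(input: string): number {
  const n = Number(input.replace(/[$,]/g, ""));
  if (Number.isNaN(n)) throw new Error(`not a number: ${input}`);
  return n;
}

// This drives line coverage of parseAmount to 100%: the happy path and the
// error path both execute. It asserts nothing, so any wrong answer passes
// just as green as the right one.
function coverageOnlyTest(): void {
  parseAmount("$1,234");
  try { parseAmount("abc"); } catch { /* swallowed, unchecked */ }
}

coverageOnlyTest(); // green. Coverage: 100%. Verified: nothing.
```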
I've seen codebases with 100 percent line coverage and a 20 percent mutation score. Mutation score measures something completely different: if you deliberately change the source code to introduce a bug, do the tests notice? A 20 percent mutation score means that 80 percent of the deliberately introduced bugs slipped through the test suite undetected. The suite was performing. It ran. It passed. It would have passed with the bug or without it.
Coverage is a compliant metric. It tells you whether you wrote tests, not whether the tests work. The moment you treat it as proof of quality, you've traded a hard question for an easy one.
Mutation score is harder to game. The only way to raise it is to write tests that actually catch something. A surviving mutant is a hole in the immune system. You can see it. You can fix it. You can measure progress honestly.
The Direction of Flow
Here's the reframe that changes everything.
In the broken flow, specifications are implicit, source code is explicit, and tests are written from source:
Requirements in someone's head → Source code → Tests derived from source → Nothing verifiable
In the correct flow, specifications are explicit and tests come from specs:
Requirements → Specifications → Tests written from specifications → Source code validated against tests
The difference is not cosmetic. In the first flow, the source code is the oracle. The tests defer to it. They can never disagree, because they came from it. In the second flow, the specification is the oracle. The source code is the thing being validated. The tests are instruments measuring whether the code does what the spec says. They can absolutely disagree with the code, and when they do, the code is the thing that's wrong.
This is the rule: tests must verify behavior, not implementation. Behavior is described in the language of the problem. Implementation is the particular shape of the code that happens to produce that behavior. When you test behavior, your tests survive refactoring. When you test implementation, your tests break on every rename.
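A sketch of the difference, using a hypothetical normalizeTags helper whose assumed spec is "trim whitespace, deduplicate case-insensitively":

```typescript
// Hypothetical helper under test.
function normalizeTags(tags: string[]): string[] {
  return [...new Set(tags.map((t) => t.trim().toLowerCase()))].sort();
}

const out = normalizeTags([" Rust", "rust", "Go "]);

// Behavioral test: stated in the language of the problem. It survives any
// refactor of normalizeTags that preserves the contract.
const behavioral =
  out.length === 2 && out.includes("rust") && out.includes("go");

// Implementation-shaped test: pins the exact array the current code happens to
// produce, ordering included. If the spec doesn't require that order, a valid
// refactor breaks this test for no real reason.
const implementationShaped =
  JSON.stringify(out) === JSON.stringify(["go", "rust"]);

console.log({ behavioral, implementationShaped }); // both true today; only one is durable
```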
AI can write tests well, but only if it writes them from specifications. The moment you hand it source code as the primary input, it writes tautologies.
The Biology of a Real Test Suite
If a test is an adversarial claim, a test suite is an immune system. And immune systems have properties that test suites should copy.
The first property is adaptation. An immune system doesn't memorize every pathogen it's ever seen. It recognizes patterns of threat and responds generatively to new ones. Property-based testing works the same way. Instead of writing a test for one specific input, you describe a property that should hold for all inputs. "The output of sort should have the same length as the input." The testing library generates thousands of random inputs, tries the property, and shrinks any failure down to the smallest reproducing case. You're not testing a specific scenario. You're testing a behavioral claim that covers entire classes of scenario at once.
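A hand-rolled sketch of that loop (real libraries such as fast-check do generation and shrinking far better; this only shows the shape):

```typescript
// Run a property against many random inputs; on failure, shrink by deleting
// elements while the failure persists, and return the smallest failing case.
function property(check: (xs: number[]) => boolean, runs = 1000): number[] | null {
  const rand = (n: number) => Math.floor(Math.random() * n);
  for (let i = 0; i < runs; i++) {
    let xs = Array.from({ length: rand(20) }, () => rand(100) - 50);
    if (check(xs)) continue;
    let shrunk = true;
    while (shrunk) {
      shrunk = false;
      for (let j = 0; j < xs.length; j++) {
        const smaller = xs.slice(0, j).concat(xs.slice(j + 1));
        if (!check(smaller)) { xs = smaller; shrunk = true; break; }
      }
    }
    return xs; // smallest failing case found
  }
  return null; // property held for every generated input
}

// The property from the text: sort preserves length.
const counterexample = property((xs) => [...xs].sort().length === xs.length);
console.log(counterexample); // null: the property held
```

Run the same loop against a false property, say "JavaScript's default sort() orders numbers numerically," and it will usually shrink a failure down to a two-element case.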
The second property is memory. An immune system remembers what it's encountered. A real test suite remembers what went wrong. Every production incident becomes a test case the next time you infect the codebase. Every vault entry about a weird edge case becomes a spec requirement. The disease gets smarter because the tests remember what the code forgot.
The third property is audit. Your immune system is periodically challenged by real pathogens. Your test suite should be periodically challenged by mutations. Mutation testing introduces synthetic bugs into the source code and measures how many the test suite catches. It's the closest thing to an honest audit that exists in software. If your tests can't catch a mutation on line 147, then a real bug on line 147 will not be caught either.
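Here is the audit in miniature, hand-rolled on a clamp function (Stryker automates this; the mutants and suites are illustrative):

```typescript
type Clamp = (x: number, lo: number, hi: number) => number;

// The mutants below are the kind of small edits a mutation tool generates
// from the original: (x, lo, hi) => (x < lo ? lo : x > hi ? hi : x)
const mutants: Clamp[] = [
  (x, lo, hi) => (x < lo ? hi : x > hi ? hi : x), // lo-branch returns hi
  (x, lo, hi) => (x < lo ? lo : x > hi ? lo : x), // hi-branch returns lo
  (x, lo, hi) => (x < lo ? lo : x > hi ? hi : lo), // pass-through returns lo
];

// Weak suite: one interior value. It runs, it passes, and only one mutant
// changes its answer.
const weakSuite = (f: Clamp) => f(5, 0, 10) === 5;

// Spec-driven suite: checks behavior beyond both bounds.
const strongSuite = (f: Clamp) =>
  f(5, 0, 10) === 5 && f(-3, 0, 10) === 0 && f(99, 0, 10) === 10;

const killed = (suite: (f: Clamp) => boolean) =>
  mutants.filter((m) => !suite(m)).length;

console.log({ weak: killed(weakSuite), strong: killed(strongSuite) }); // { weak: 1, strong: 3 }
```

Two surviving mutants under the weak suite are two real bugs that would ship past green tests.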
The fourth property is blast radius. When an immune system fails, the failure is not uniform. Some failures take down the whole system. Some are localized. You triage by impact, not by count. Test suites should do the same. A thousand unit tests on utility functions that never change is less valuable than one integration test on the payment flow. Risk-based prioritization measures where bugs are likely to live, and invests testing effort there first. The strongest single predictor of future bugs in a file is how often that file has changed in the past. The combination of churn and complexity is the most dangerous territory in any codebase.
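One way to operationalize churn times complexity (the fields, weights, and hub bonus here are assumptions, not a standard formula):

```typescript
interface FileStats {
  path: string;
  churn: number;       // commits touching the file recently
  complexity: number;  // e.g. cyclomatic complexity
  importers: number;   // how many files import it (hub-ness)
}

function riskScore(f: FileStats): number {
  // Churn × complexity is the core signal; hubs get a multiplier because
  // a bug in them has a wider blast radius.
  const hubBonus = 1 + Math.log2(1 + f.importers) / 4;
  return f.churn * f.complexity * hubBonus;
}

const files: FileStats[] = [
  { path: "src/payment/charge.ts", churn: 40, complexity: 18, importers: 12 },
  { path: "src/utils/pad.ts", churn: 2, complexity: 3, importers: 30 },
];

// Rank descending: spend testing effort on the risky surface first.
files.sort((a, b) => riskScore(b) - riskScore(a));
console.log(files.map((f) => f.path)); // payment flow first, stable utility last
```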
What to Build
The system I'm calling the Testing Disease has seven phases, mapped to an infection lifecycle.
Infection is the scan. An analysis pass crawls the entire codebase, builds a dependency graph, measures complexity and churn, identifies hubs (files that many others import) and orphans (files imported by nothing). Each file gets a risk score. The output is a manifest: what's in the codebase, what's risky, what matters.
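A toy version of the scan (the file graph and the hub threshold are invented for illustration):

```typescript
// Adjacency list: each file maps to the files it imports.
const imports: Record<string, string[]> = {
  "app.ts": ["api.ts", "ui.ts"],
  "api.ts": ["http.ts", "auth.ts"],
  "ui.ts": ["http.ts"],
  "http.ts": [],
  "auth.ts": ["http.ts"],
  "legacy.ts": ["http.ts"], // imports things, but nothing imports it
};

// Count importers per file.
const importerCount: Record<string, number> = {};
for (const file of Object.keys(imports)) importerCount[file] = 0;
for (const deps of Object.values(imports))
  for (const dep of deps) importerCount[dep] = (importerCount[dep] ?? 0) + 1;

const hubs = Object.keys(importerCount).filter((f) => importerCount[f] >= 3);
const orphans = Object.keys(importerCount).filter(
  (f) => importerCount[f] === 0 && f !== "app.ts" // entry points aren't orphans
);

console.log({ hubs, orphans }); // http.ts is a hub; legacy.ts is an orphan
```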
Incubation is the strategy. For each high-risk module, a test specification is generated. The specification describes behavior in the language of the problem, not the shape of the code. Different module types get different specification formats. User-facing workflows get Gherkin. Calculation engines get table-based specs. State machines and hooks get contract specs. Pre-release hardening gets risk-weighted specs.
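For the calculation-engine case, a table-based spec can literally be a table: rows come from the requirement, not from reading the code (grossToNet and its rows are hypothetical):

```typescript
interface Row { input: number; taxRate: number; expected: number }

// Spec rows written from requirements. The test below is mechanical.
const spec: Row[] = [
  { input: 100, taxRate: 0.2, expected: 80 },
  { input: 0, taxRate: 0.2, expected: 0 },
  { input: 100, taxRate: 0, expected: 100 },
];

// Implementation under test.
function grossToNet(gross: number, taxRate: number): number {
  return gross * (1 - taxRate);
}

const failures = spec.filter((r) => grossToNet(r.input, r.taxRate) !== r.expected);
console.log(failures.length === 0 ? "spec satisfied" : failures);
```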
Outbreak is the execution. An AI executor, separate from the analyzer, reads the specifications and writes tests. The executor never reads the source code as the primary input. It reads the spec, writes tests from the spec, runs them in a sandbox, and iterates until they pass. When the source code contradicts the spec, the test fails and the executor logs it as a discovered bug. It does not modify the source to match the test. It reports.
Mutation is the audit. Stryker mutates the source code systematically and the test suite is re-run. Every surviving mutant is a gap. Gaps are triaged. Real gaps get supplemental specs and new tests. Equivalent mutants are marked. The loop continues until the mutation score clears a threshold based on module risk.
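In StrykerJS terms, the risk-based threshold lives in the config's `thresholds` block, where `break` fails the run below that score. A sketch for a high-risk module (the glob and the numbers are illustrative):

```json
{
  "testRunner": "jest",
  "mutate": ["src/payment/**/*.ts"],
  "reporters": ["clear-text", "html"],
  "thresholds": { "high": 90, "low": 75, "break": 70 }
}
```

Lower-risk modules would get their own config with a looser `break`.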
Spread is the incremental update. When the code changes, the disease scans the diff, updates risk scores for changed files, regenerates specs for changed behavior, and writes new tests. Only the affected surface area is touched. Existing tests that still apply are left alone.
Immunity is the report. A production readiness document summarizes the state of every testing layer, identifies unresolved bugs, lists coverage gaps, and issues a verdict. Pass, conditional pass, warn, or fail. The report is produced from data, not prose.
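"Produced from data, not prose" means the verdict is a pure function of measured inputs. A sketch (field names and thresholds are assumptions):

```typescript
interface Readiness {
  mutationScore: number; // 0..1, from the mutation audit
  openBugs: number;      // bugs discovered by spec-driven tests, unresolved
  riskGaps: number;      // high-risk modules below their mutation threshold
}

type Verdict = "pass" | "conditional-pass" | "warn" | "fail";

function verdict(r: Readiness): Verdict {
  if (r.openBugs > 0 || r.mutationScore < 0.5) return "fail";
  if (r.riskGaps > 0) return "warn";
  if (r.mutationScore < 0.8) return "conditional-pass";
  return "pass";
}

console.log(verdict({ mutationScore: 0.86, openBugs: 0, riskGaps: 0 })); // "pass"
```

No human judgment enters at report time; judgment went into choosing the thresholds.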
Pandemic is the full reset. When the manifest is stale, the framework has changed, or mutation scores have collapsed, the whole pipeline runs from scratch. Existing tests are not deleted. They're integrated. Conflicts are triaged.
The point of this structure is that each phase has a specific job and a specific gate. No phase produces output that the next phase has to guess at. The analyzer produces a manifest. The strategist produces specs. The executor produces tests. The auditor produces a score. The monitor produces reports. Nothing is implicit.
Claude Analyzes. Codex Executes. Stryker Validates.
This is the division of labor that makes it work.
Claude handles the cognitive work. Reading the codebase. Writing specifications. Analyzing mutation results. Identifying gaps. Producing reports. Claude does not write tests.
Codex handles execution. Reading specifications. Writing tests from specs. Running them in a sandbox. Iterating on failures. Writing supplemental tests to kill surviving mutants. Codex does not modify source code.
Stryker, or whatever mutation tool you use, is the neutral third party. It doesn't care what the tests were written from. It introduces bugs and observes. Its output is the only metric neither Claude nor Codex can game.
This division is important because it enforces the direction of flow. Specifications come from a system that doesn't write tests. Tests come from a system that doesn't write specs. The source code is validated by both, from opposite directions. Nothing is self-referential.
What to Stop Doing
- Stop letting AI write tests from source code. If your test generator reads the function and writes tests for it, your tests will never disagree with the function. Generate tests from specifications or requirements, not from the code itself.
- Stop treating coverage as a quality metric. Coverage measures whether lines executed. It does not measure whether anything was verified. Stop using it as a gate.
- Stop writing tests that pass immediately. If your first test run passes on the first try, the tests probably aren't testing anything. Good tests fail for real reasons while you're writing them. If you never see a test fail, you're not writing tests. You're writing assertions about things you already believed.
- Stop mocking everything. Tests with mocks for every dependency test the mocks, not the code. Integration tests, with real databases and real network behavior mocked only at the HTTP boundary, are where confidence actually lives.
- Stop measuring test quality by count. Ten mutation-validated integration tests are worth more than a thousand tautological unit tests. Count what catches bugs. Nothing else.
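The "mock only at the HTTP boundary" point above can be sketched like this (getUserName, the URL, and the fake are all hypothetical):

```typescript
type Fetch = (url: string) => Promise<{ status: number; json: () => Promise<unknown> }>;

// Real client code under test: parsing, error handling, and URL construction
// are all exercised for real.
async function getUserName(fetchImpl: Fetch, id: string): Promise<string> {
  const res = await fetchImpl(`https://api.example.com/users/${id}`);
  if (res.status !== 200) throw new Error(`HTTP ${res.status}`);
  const body = (await res.json()) as { name: string };
  return body.name;
}

// The fake lives at the HTTP boundary and nowhere else.
const fakeFetch: Fetch = async (url) =>
  url.endsWith("/users/42")
    ? { status: 200, json: async () => ({ name: "Ada" }) }
    : { status: 404, json: async () => ({}) };

getUserName(fakeFetch, "42").then((name) => console.log(name)); // "Ada"
```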
The Principle
Tests written from source code are collaborators. Tests written from specifications are adversaries. Only adversaries prove anything.
A test that could never fail is not a test. A test suite with no mutations surviving is a real immune system. A coverage report is not proof. A specification is not implementation. A bug that ships past green tests means the tests were theater.
Treat testing the way a body treats infection. Scan constantly. Identify risk by pattern, not policy. Build antibodies from the shape of what should happen, not from the shape of what does. Audit the antibodies against mutations. Remember what got through. Harden the response.
Coverage is the tally. Mutation is the mirror. Specifications are the oracle. Build in that order and the tests become real.
Part of a series. The system architecture behind this paper (the commands, the hooks, the spec formats, and the analysis pipeline) lives in a version for the inner circle. That version is injectable into any repo and builds the full pipeline. Reach out if you want the deep layer.