Why Most A/B Test “Winners” Don’t Replicate in Production
Most CRO teams accept a 5% false positive rate without realizing their actual rate is closer to 25–40%. The nominal alpha of 0.05 is a floor — every shortcut you take on top of that pushes the real rate up. Peeking. Multiple metrics. Slicing by segment. Re-running “almost significant” tests. Each one compounds.
This guide breaks down where false positives actually come from, and the corrections that keep your program honest.
What a False Positive Actually Is
A false positive (Type I error) happens when your test declares a winner but the true underlying effect is zero. You ship the variant, the dashboard reverts to baseline, and you spend the next quarter wondering why your “+12% revenue lift” never showed up in the P&L.
The 5% threshold is a contract: if every assumption holds, then in tests where there is no real effect you’ll declare a false winner 1 time in 20. In practice, most teams violate the assumptions before the test even launches. See the statistical significance fundamentals for the underlying math.
The Multiple Comparisons Problem
If you run one test at 95% confidence, your false positive rate is 5%. If you check five metrics in the same test and treat them as independent, the chance that at least one of them shows a false positive (the family-wise error rate) jumps to:
1 − (0.95)^5 = 22.6%
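Ten metrics: 40%. Twenty metrics: 64%. Correlated metrics inflate the rate more slowly than this, but it usually still climbs well past 5%. This is why your “non-significant on primary, but significant on mobile checkout” finding is almost certainly noise.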
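The arithmetic above is easy to reproduce. The sketch below assumes the metrics are independent, which is the same simplification the formula makes; the function name is illustrative only.

```python
def family_wise_error_rate(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** num_metrics


# Print the inflation for a few metric counts.
for k in (1, 5, 10, 20):
    print(f"{k:>2} metrics -> {family_wise_error_rate(k):.1%}")
```

Swapping in your own metric count shows how quickly an uncorrected dashboard drifts away from the nominal 5%.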
The Bonferroni correction
The simplest fix: divide your alpha by the number of comparisons. Testing five metrics? Use alpha = 0.01 for each. It’s conservative — meaning it costs you statistical power — but it’s defensible.
| Metrics tested | Naive alpha | Bonferroni alpha (per metric) | Family-wise error with Bonferroni |
|---|---|---|---|
| 1 | 0.05 | 0.050 | 5% |
| 3 | 0.05 | 0.017 | ≤ 5% |
| 5 | 0.05 | 0.010 | ≤ 5% |
| 10 | 0.05 | 0.005 | ≤ 5% |
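As a minimal sketch of the correction in the table, the function below divides alpha by the number of metrics and flags which ones clear the stricter bar. The metric names and p-values are made up for illustration.

```python
def bonferroni_significant(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Flag each metric as significant only if p <= alpha / number of metrics."""
    if not p_values:
        return {}
    threshold = alpha / len(p_values)
    return {name: p <= threshold for name, p in p_values.items()}


# Hypothetical results from one test with three metrics.
results = {"cvr": 0.004, "rpv": 0.03, "aov": 0.20}
print(bonferroni_significant(results))
```

Here `rpv` would pass a naive 0.05 threshold but fails the corrected threshold of about 0.017, which is the power cost the correction carries.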
Better: Benjamini-Hochberg
For exploratory analysis with many metrics, controlling the false discovery rate (FDR) rather than the family-wise error rate is usually smarter. Benjamini-Hochberg (BH) accepts that some discoveries will be false but caps the expected proportion of false ones among everything you declare significant. Most teams will never need this, but it’s the right tool when you’re slicing 20+ secondary metrics.
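For readers who want to see the mechanics, here is a small sketch of the standard Benjamini-Hochberg step-up procedure: sort the p-values, find the largest rank k whose p-value is at most (k / m) × FDR, and reject everything up to that rank. The p-values in the example are invented.

```python
def benjamini_hochberg(p_values: list[float], fdr: float = 0.05) -> list[bool]:
    """Return a reject/keep flag per p-value, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first

    # Largest rank whose p-value sits under its step-up threshold.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank

    # Everything ranked at or below max_rank is declared a discovery.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for five secondary metrics.
print(benjamini_hochberg([0.5, 0.038, 0.004, 0.028, 0.019]))
```

In this example Bonferroni (threshold 0.01) would keep only the smallest p-value, while BH keeps four, which illustrates why it suits exploratory work with many metrics.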
The simplest rule: designate one primary metric before the test starts. Everything else is exploratory and gets a higher bar.
Peeking: The Silent False Positive Factory
Sequential decision-making (checking the dashboard daily and stopping when results “look significant”) destroys your error rate. At 95% nominal confidence with daily checks over a 14-day test, your real false positive rate runs roughly 20–30%, depending on traffic and how early the first look happens.
Why? Because p-values fluctuate. Early in a test, with small samples, random variance routinely produces p < 0.05 swings that disappear with more data. If you stop the moment you see green, you’re cherry-picking those swings.
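A quick simulation makes the effect of peeking concrete. This is an idealized sketch, not a model of any particular tool: it assumes an A/A test (no true effect), equal traffic every day, and samples large enough that each day's contribution to the z-statistic is approximately standard normal. The test stops the first time the running z-statistic crosses the usual two-sided threshold.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(days: int = 14, alpha: float = 0.05,
                                simulations: int = 20_000, seed: int = 0) -> float:
    """Share of A/A tests declared 'significant' when checked once per day."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    false_positives = 0
    for _ in range(simulations):
        cumulative = 0.0
        for day in range(1, days + 1):
            cumulative += rng.gauss(0.0, 1.0)           # one day's evidence under the null
            if abs(cumulative / day ** 0.5) > z_crit:   # running z-statistic after `day` days
                false_positives += 1
                break                                   # stop at the first "green" result
    return false_positives / simulations


print(f"one look at the end: {peeking_false_positive_rate(days=1):.1%}")
print(f"daily looks, 14 days: {peeking_false_positive_rate(days=14):.1%}")
```

With a single look the rate stays near the nominal 5%; with daily looks it lands several times higher, even though nothing about the data changed.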
Three ways to stop peeking
- Pre-commit to a sample size. Calculate it before launch using a sample size guide. Don’t look until you hit it.
- Use sequential testing methods designed for continuous monitoring (more below).
- Hide intermediate results from stakeholders. If they can’t see the number, they can’t pressure you to stop.
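For the first option, a pre-committed sample size can be sketched with the standard normal-approximation formula for comparing two proportions. The baseline rate, relative lift, and 80% power below are assumptions for illustration; a dedicated calculator may use a slightly different variant of the formula.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per arm for a two-sided test of two conversion rates."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical inputs: 5% baseline conversion, detect a 10% relative lift.
print(sample_size_per_arm(baseline_rate=0.05, relative_lift=0.10))
```

Halving the detectable lift roughly quadruples the required sample, which is why the minimum effect you care about should be decided before launch.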
Sequential Testing: Peeking, Done Right
Sequential testing is built for continuous monitoring. Instead of a fixed sample size, you compute a test statistic that accounts for repeated looks. Two common approaches:
Always-Valid p-values (mSPRT): used by Optimizely and Statsig. The p-value remains valid no matter when you stop. The cost: you need more data overall to detect the same effect.
Group sequential designs (O’Brien-Fleming, Pocock boundaries): pre-specify a small number of interim analyses (e.g., 25%, 50%, 75%, 100% of planned sample). Stop early only if you cross a pre-computed boundary. O’Brien-Fleming boundaries are very strict at early looks and relax as the test progresses; Pocock boundaries apply the same, moderately strict threshold at every look.
Sequential methods cost roughly 10–30% more data than fixed-sample frequentist tests but let you stop early on big winners (or losers) without inflating error rates. For most teams running 5+ tests/month, this is worth the trade.
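To illustrate how group sequential boundaries keep the error rate in check, the sketch below simulates A/A tests with four equally spaced looks and compares three stopping rules. The constants 2.361 (Pocock) and 2.024 (O'Brien-Fleming) are commonly tabulated values for four looks at a two-sided alpha of 0.05; treat them as assumptions and verify them against a dedicated group sequential package before relying on them. As in any idealized simulation, each stage's evidence is modeled as a standard normal increment.

```python
import random

LOOKS = 4
NAIVE = [1.960] * LOOKS                                   # uncorrected threshold at every look
POCOCK = [2.361] * LOOKS                                  # same stricter threshold at every look
OBRIEN_FLEMING = [2.024 * (LOOKS / k) ** 0.5              # very strict early, relaxing to ~2.02
                  for k in range(1, LOOKS + 1)]


def type_one_error(boundaries: list[float], simulations: int = 20_000, seed: int = 1) -> float:
    """Share of A/A tests that cross any boundary at any interim look."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(simulations):
        cumulative = 0.0
        for look, bound in enumerate(boundaries, start=1):
            cumulative += rng.gauss(0.0, 1.0)             # evidence from one equal-sized stage
            if abs(cumulative / look ** 0.5) >= bound:    # z-statistic at this look
                rejections += 1
                break
    return rejections / simulations


for name, bounds in [("naive", NAIVE), ("Pocock", POCOCK), ("O'Brien-Fleming", OBRIEN_FLEMING)]:
    print(f"{name:>16}: {type_one_error(bounds):.1%}")
```

The naive rule inflates the error rate well above 5%, while both corrected boundaries should hold it close to the nominal level.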
P-Hacking: The Unintentional Kind
Almost nobody p-hacks deliberately. It happens through small, reasonable-seeming decisions:
- “Let’s run it one more week — it was almost significant”
- “Mobile is showing a clear win, let’s segment to that”
- “Excluding bot traffic should clean it up”
- “Let’s switch from CVR to RPV as the primary metric”
- “Returning customers behaved weirdly — let’s exclude them”
Each decision, made post-hoc after seeing the data, gives you another roll of the dice. After three or four of them, you’ve manufactured significance out of noise. The common A/B testing mistakes guide covers more of these.
The fix isn’t moral — it’s procedural. Write down your analysis plan before launch: primary metric, secondary metrics, segments, exclusions, stopping criteria. Then follow it. Treat post-hoc findings as hypotheses to test, not conclusions to ship.
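One lightweight way to make the plan binding is to store it as an immutable record and check every proposed analysis against it. This is only a sketch: the field names, the `is_confirmatory` helper, and the example values are all hypothetical, and a shared document works just as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: the plan cannot be edited after launch
class AnalysisPlan:
    primary_metric: str
    secondary_metrics: tuple[str, ...]
    segments: tuple[str, ...]        # segments you committed to analyzing
    exclusions: tuple[str, ...]      # traffic you committed to excluding
    sample_size_per_arm: int
    min_duration_days: int

    def is_confirmatory(self, metric: str, segment: Optional[str] = None) -> bool:
        """True only for the primary metric on all traffic or a pre-registered segment."""
        return metric == self.primary_metric and (segment is None or segment in self.segments)


plan = AnalysisPlan(
    primary_metric="cvr",
    secondary_metrics=("rpv",),
    segments=("desktop",),
    exclusions=("internal_traffic",),
    sample_size_per_arm=31_000,
    min_duration_days=14,
)
print(plan.is_confirmatory("cvr"))            # planned analysis
print(plan.is_confirmatory("cvr", "mobile"))  # post-hoc segment: treat as a hypothesis
```

Anything that fails the check is not forbidden; it simply gets labeled exploratory and queued for a follow-up test.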
Real False Positive Rates from 1000+ Tests
From an internal review of A/B test programs we’ve audited (across roughly 1,200 tests):
| Practice | Stated alpha | Estimated real false positive rate |
|---|---|---|
| Fixed sample, single primary metric, no peeking | 5% | 5–7% |
| Single metric + occasional peeking | 5% | 12–18% |
| 5+ metrics tracked, no correction | 5% | 22–28% |
| Daily peeking + multiple metrics | 5% | 30–40% |
| Daily peeking + segment slicing + metric switching | 5% | 50%+ |
Compare this with the typical 33% win rate and you see the problem. If none of your variants had any real effect, a 30% false positive rate alone would produce “winners” in about 30% of tests, which is nearly the 33% you observe. In other words, most of your “winners” may be false, and your win rate is close to what you’d get from testing nothing.
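The comparison between win rate and false positive rate can be turned into a rough estimate of how many declared winners are false. The sketch below assumes every test either has no effect or has a real effect detected with a fixed power; the 80% power figure is an assumption, not something measured in the audit.

```python
def false_share_of_winners(win_rate: float, false_positive_rate: float,
                           power: float = 0.80) -> float:
    """Estimated fraction of declared winners that are false positives."""
    if not (0 < false_positive_rate <= win_rate <= power):
        raise ValueError("expected 0 < false_positive_rate <= win_rate <= power")
    if power == false_positive_rate:
        raise ValueError("power must exceed the false positive rate")
    # Solve win_rate = fpr * null_share + power * (1 - null_share) for null_share,
    # the share of tests in which the variant truly has no effect.
    null_share = (power - win_rate) / (power - false_positive_rate)
    false_wins = false_positive_rate * null_share
    true_wins = power * (1 - null_share)
    return false_wins / (false_wins + true_wins)


print(f"real rate 30%: {false_share_of_winners(0.33, 0.30):.0%} of winners are false")
print(f"real rate  5%: {false_share_of_winners(0.33, 0.05):.0%} of winners are false")
```

Under these assumptions the same 33% win rate means very different things: at a true 5% error rate only a small minority of winners are false, while at 30% the large majority are.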
This is why declared lift and observed production lift diverge so sharply. A program-wide post-test audit comparing winning test predictions to subsequent 90-day production performance is the cheapest way to find out how bad your real rate is. We covered this pattern in lessons from 1000 A/B tests.
Bayesian vs Frequentist: When to Switch
Bayesian testing doesn’t eliminate false positives — it reframes them. Instead of a p-value, you get a probability that B beats A (e.g., “92% probability B is better”). It handles continuous monitoring more naturally and gives you the answer most stakeholders actually want.
When Bayesian is the better tool:
- You need continuous monitoring without sequential correction overhead
- Stakeholders struggle with p-value interpretation
- You have meaningful prior data from similar tests
- You want to express results as “expected loss” or “expected lift”
When frequentist is still better:
- High-stakes binary decisions (ship or kill)
- Regulatory or audit contexts that expect p-values
- You don’t trust your priors and don’t want them influencing results
The deep dive on frequentist vs Bayesian covers the trade-offs in detail. Neither approach makes false positives vanish — they shift the burden to different assumptions.
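As a sketch of what the Bayesian outputs in this section look like in practice, the code below estimates the probability that B beats A and the expected loss from shipping B, using Monte Carlo draws from Beta posteriors. It assumes a binary conversion metric and a flat Beta(1, 1) prior on each arm; the visitor and conversion counts are invented, and commercial tools may use different priors or models.

```python
import random


def probability_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                          draws: int = 50_000, seed: int = 0) -> tuple[float, float]:
    """Return (P(B > A), expected conversion-rate loss if B is shipped)."""
    rng = random.Random(seed)
    wins = 0
    loss_if_ship_b = 0.0
    for _ in range(draws):
        # Posterior under a flat Beta(1, 1) prior: Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
        loss_if_ship_b += max(rate_a - rate_b, 0.0)  # only counts draws where A is better
    return wins / draws, loss_if_ship_b / draws


# Hypothetical data: 500/10,000 conversions on A, 560/10,000 on B.
prob, loss = probability_b_beats_a(500, 10_000, 560, 10_000)
print(f"P(B beats A) = {prob:.1%}, expected loss if B ships = {loss:.5f}")
```

A high probability here is still not a guarantee: stopping the moment it crosses a threshold reintroduces the same selection effect as peeking at p-values.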
A Practical Protocol for Honest Testing
- Pre-register your analysis plan. Primary metric, sample size, segments, exclusions — written down before launch.
- One primary metric. Everything else is secondary and gets Bonferroni-corrected if you act on it.
- No peeking, or use sequential methods. Pick one.
- Minimum 14 days, regardless of sample size. Catches weekday/weekend cycles.
- Replicate big wins before full rollout. A 30% lift on a small sample should be re-tested at scale, not shipped.
- Run quarterly false positive audits. Compare predicted vs realized lift across 90-day production windows.
The teams with the highest sustained win rates aren’t the ones running the most tests — they’re the ones whose declared winners actually hold up.
Frequently Asked Questions
What’s the difference between a Type I and Type II error?
A Type I error (false positive) is concluding a variant wins when there’s no real effect. A Type II error (false negative) is missing a real winner because the test was underpowered. Most teams obsess over Type I and ignore Type II — but underpowered tests waste just as much budget as false wins.
Can I trust a result that hit significance after I extended the test?
Probably not. Extending a test because it “wasn’t quite significant” is a form of peeking: you only extend tests that were close, so each extension is another chance for noise to cross the threshold, and the final result carries an inflated false positive rate. The cleanest fix is to pre-register a sample size and accept the result you get.
How do I know if my testing tool corrects for peeking?
Check whether the tool uses sequential testing or fixed-sample frequentist methods. Optimizely and Statsig use always-valid p-values. VWO uses a Bayesian framework. GA4 Experiments and most older tools use naive fixed-sample tests that break under continuous monitoring. The A/B testing tools comparison covers this per vendor.
What’s the cheapest way to reduce false positives in my program?
Three changes, in order of impact: (1) designate a single primary metric and stop acting on secondary metrics without correction; (2) commit to a pre-calculated sample size and stop peeking; (3) run a holdout group for 30 days after shipping winners to catch the false positives that slip through.