Most A/B testing programs fail for one of three reasons: tests are stopped too early, sample sizes are too small to detect realistic effects, or the program runs in isolation from the business rather than tied to revenue. The articles below address each in detail.

Statistical rigor matters more than test volume. A program that runs 4 well-designed tests per month, with proper sample-size planning and significance discipline, will outperform one that runs 12 sloppy tests whose results nobody trusts. Speed without rigor produces a graveyard of "winners" that ship to production, fail to replicate, and quietly drag conversion down.

The pieces here cover two layers. The methodology layer includes [sample size](/blog/ab-testing-sample-size-guide) and MDE (minimum detectable effect) calculation, [statistical significance](/blog/ab-testing-statistical-significance) thresholds, [Bayesian vs frequentist](/blog/bayesian-vs-frequentist-ab-testing) tradeoffs, peeking and stopping rules, [multivariate testing](/blog/multivariate-testing-guide), and how to handle failed or inconclusive tests. The program layer includes tooling, reporting cadence, hypothesis intake, and what separates a mature experimentation team from a "we run tests sometimes" function.

If you're trying to figure out where to start testing or whether you have the traffic for it, the [free AI CRO audit](/audit) generates prioritized test ideas with predicted sample sizes for your specific traffic.

Articles in this topic (23)

Frequently asked

How long should an A/B test run?

Until it reaches its pre-calculated sample size at the agreed-upon power (usually 80%) and confidence level (usually 95%, i.e. a 5% significance level), not until a fixed number of days has passed. In practice, this means 2–4 weeks for most pages with 10K+ monthly visitors. Stopping earlier inflates false-positive rates dramatically; the result you see at day 3 is often the opposite by day 14.
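As a rough illustration of how a pre-calculated sample size turns into a run time, here is a minimal sketch using the standard two-proportion z-test approximation. The function names, the 3% baseline, the 20% relative MDE, and the 50,000 monthly visitors are all hypothetical inputs chosen for the example, not figures from any particular tool.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # rate we want to be able to detect
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def weeks_to_run(n_per_variant, monthly_visitors, variants=2):
    """Whole weeks needed, assuming traffic is split evenly and arrives steadily."""
    weekly_visitors = monthly_visitors * 12 / 52
    return math.ceil(n_per_variant * variants / weekly_visitors)


# Hypothetical page: 3% baseline conversion, 20% relative lift as the MDE.
n = sample_size_per_variant(baseline_rate=0.03, relative_mde=0.20)
print(n, "visitors per variant")
print(weeks_to_run(n, monthly_visitors=50_000), "weeks")
```

With these inputs the requirement comes out at roughly 14,000 visitors per variant. Halving the MDE roughly quadruples that number, so lower-traffic pages or smaller expected lifts can stretch the run time well beyond a few weeks.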

Bayesian or frequentist — which should I use?

Frequentist testing is the traditional industry default, and many testing tools report frequentist statistics natively (Optimizely's Stats Engine is a sequential frequentist method, for example), though some, such as VWO's SmartStats, default to Bayesian. Bayesian gives you probability-of-being-best framing that is easier for stakeholders to interpret but requires more methodological care. For most teams, pick whichever your tool defaults to, learn it deeply, and resist switching mid-program; the discipline matters more than the framework.
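To make the difference in framing concrete, the sketch below evaluates the same made-up result both ways: a two-sided p-value from a pooled two-proportion z-test, and a Monte Carlo estimate of the probability that B beats A under uniform Beta(1, 1) priors. Both function names and the conversion counts are assumptions for illustration; commercial tools use their own, more elaborate engines.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Frequentist view: two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Bayesian view: P(rate_B > rate_A) with Beta(1, 1) priors, by simulation."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + conversions, 1 + non-conversions).
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += b > a
    return wins / draws


# Made-up data: 300/10,000 conversions on A, 345/10,000 on B.
print("p-value:", round(two_proportion_p_value(300, 10_000, 345, 10_000), 3))
print("P(B > A):", round(prob_b_beats_a(300, 10_000, 345, 10_000), 3))
```

On this data the p-value lands a little above 0.05 while the probability that B is better sits in the mid-90s percent. The two numbers answer different questions, which is one reason switching frameworks mid-program makes results hard to compare.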

Why do my "winning" tests not hold up in production?

Three usual causes: peeking at results before reaching sample size (inflates false positives), running too many simultaneous tests on overlapping audiences (interaction effects), or measuring the wrong primary metric (e.g., click-through rate when revenue is what you actually care about). Audit each before running another test.
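The first cause, peeking, is easy to demonstrate with a small simulation. The sketch below runs A/A tests, where both arms share the same true conversion rate so every "significant" result is a false positive, and compares checking the p-value at every look against checking once at the end. The function names and all parameters (14 looks, 200 visitors per arm per look, a 5% conversion rate) are hypothetical.

```python
import math
import random
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to detect
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(sims=400, looks=14, visitors_per_look=200,
                      rate=0.05, alpha=0.05, seed=1):
    """Return (false-positive rate with peeking, false-positive rate at fixed horizon)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _sim in range(sims):
        conv_a = conv_b = n = 0
        flagged_at_any_look = False
        for _look in range(looks):
            # Both arms draw from the same true rate, so there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                flagged_at_any_look = True  # a peeker would have stopped here
        peeking_hits += flagged_at_any_look
        fixed_hits += p < alpha  # p now holds the value at the final look only
    return peeking_hits / sims, fixed_hits / sims


peeking_rate, fixed_rate = simulate_aa_tests()
print("false positives when stopping at the first significant look:", peeking_rate)
print("false positives when testing once at the planned sample size:", fixed_rate)
```

The fixed-horizon rate should hover near the nominal 5%, while the peeking rate comes out several times higher, even though neither arm is actually better.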
