How to Run an A/B Test: Complete Guide

By Denys Pankov · January 30, 2026 · 8 min read

Everything You Need to Know About Running A/B Tests That Actually Move Revenue

A/B testing sounds simple: show version A to half your visitors, version B to the other half, see which wins. In practice, most A/B tests fail — not because the ideas are bad, but because the process is broken.

This guide covers the complete A/B testing process from hypothesis to analysis, with the behavioral science and statistical rigor that separates real optimization from random changes.


What Is an A/B Test?

An A/B test (split test) is a controlled experiment where you show two versions of a page or element to different segments of visitors, then measure which version performs better against a predefined metric.

  • Control (A): The current version
  • Variation (B): The modified version with one or more changes
  • Primary metric: The KPI you’re trying to improve (conversion rate, revenue per visitor, add-to-cart rate, etc.)

The A/B Testing Process (Step by Step)

Step 1: Research and Observation

Never start with “let’s test a different button color.” Start with understanding WHY visitors aren’t converting.

Research methods:

  • Quantitative: GA4 funnel analysis, page performance data, device/source segmentation
  • Qualitative: Heatmaps, session recordings, user surveys, customer interviews
  • Heuristic analysis: Systematic UX evaluation against behavioral science principles

Note: Our AI audit engine automates the heuristic analysis step — scanning your pages against 40+ behavioral science tactics and delivering prioritized observations in minutes.

Output: A list of observations — specific, evidence-backed findings about what’s happening on your site.

Step 2: Generate Hypotheses

Every test needs a structured hypothesis:

“If we [specific change], then [metric] will [direction] because [behavioral science reason].”

Good hypothesis:

“If we add the number of 5-star reviews next to the product title on mobile, then add-to-cart rate will increase because social proof reduces purchase uncertainty for first-time visitors.”

Bad hypothesis:

“If we change the button color to green, conversion will increase.”

The difference? The good hypothesis is specific, measurable, and grounded in behavioral science. The bad one is a guess.

Step 3: Prioritize

You’ll generate more test ideas than you can run. Prioritize using a structured framework:

ICE Score (basic):

  • Impact (1-10): How much will this move the metric?
  • Confidence (1-10): How sure are you it will work?
  • Ease (1-10): How easy is it to implement?
  • ICE Score = Impact x Confidence x Ease
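
As a rough illustration, here is a minimal Python sketch of ICE scoring for a backlog of ideas (the idea names and scores below are hypothetical):

```python
# Minimal ICE-scoring sketch: rank candidate tests by Impact x Confidence x Ease.
ideas = [
    {"name": "Show 5-star review count on mobile product pages", "impact": 7, "confidence": 6, "ease": 8},
    {"name": "Shorten the checkout form", "impact": 8, "confidence": 5, "ease": 4},
    {"name": "Change button color to green", "impact": 2, "confidence": 3, "ease": 9},
]

for idea in ideas:
    idea["ice"] = idea["impact"] * idea["confidence"] * idea["ease"]

# Highest ICE score first: that is the test to build next.
for idea in sorted(ideas, key=lambda i: i["ice"], reverse=True):
    print(f'{idea["ice"]:>4}  {idea["name"]}')
```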

AXR Score (advanced — acceleroi’s framework):

Adds heuristic win rates, lever multipliers, and priority boosts to ICE for more accurate prediction. Our AI calculates AXR scores automatically.

Step 4: Design and Develop the Variation

Design guidelines:

  • Change only what your hypothesis specifies — don’t redesign the entire page
  • Keep the change meaningful enough to detect — a single word change on a low-traffic page won’t reach significance
  • Design for both desktop and mobile (or target one device specifically)
  • Consider above-the-fold vs below-the-fold visibility

Development guidelines:

  • Use your A/B testing tool’s visual editor for simple changes
  • Use custom code for structural changes
  • Ensure the variation loads without flash of original content (FOOC)
  • Test across all major browsers and devices

Step 5: Calculate Sample Size and Duration

Before launching, calculate how long the test needs to run:

Key inputs:

  • Baseline conversion rate (your current CVR)
  • Minimum detectable effect (MDE) — the smallest improvement worth detecting (typically 5-20% relative)
  • Statistical significance level — typically 95% (alpha = 0.05)
  • Statistical power — typically 80% (beta = 0.20)
  • Daily traffic to the test page

Example calculation:

  • Baseline CVR: 3.0%
  • MDE: 10% relative (detecting a lift from 3.0% to 3.3%)
  • Significance: 95%
  • Power: 80%
  • Required sample: ~53,000 visitors per variation
  • At 2,000 daily visitors (about 1,000 per variation per day): ~53 days minimum
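
You can sanity-check these numbers with a standard two-proportion power calculation. Below is a minimal sketch using statsmodels (assuming it is installed); exact figures vary slightly between calculators depending on the approximation used:

```python
# Sample-size sketch for a two-proportion test (two-sided, alpha = 0.05, power = 0.80).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.030                      # current CVR
mde_rel = 0.10                        # 10% relative lift
target = baseline * (1 + mde_rel)     # 3.3%

effect = proportion_effectsize(target, baseline)   # Cohen's h
n_per_variation = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)

daily_visitors = 2_000                # total traffic to the test page, split 50/50
days = (2 * n_per_variation) / daily_visitors
print(f"~{n_per_variation:,.0f} visitors per variation, ~{days:.0f} days")
# Roughly 53,000 per variation and ~53 days at this traffic level.
```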

Note: Never stop a test early because it “looks like a winner.” Early results are unreliable. Run for the full calculated duration AND at least 2 full business cycles (typically 2+ weeks).

Step 6: Launch and Monitor

Pre-launch checklist:

  • QA on Chrome, Safari, Firefox, Edge (desktop + mobile)
  • Verify tracking fires correctly for the primary metric
  • Confirm traffic split is working (50/50 or your chosen allocation); see the sample ratio mismatch check sketched after this list
  • Set calendar reminder for minimum test duration
  • Document the hypothesis, expected outcome, and success criteria
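
One check worth automating is a sample ratio mismatch (SRM) test: if the observed split drifts far from the intended allocation, the setup is probably broken. A minimal sketch using scipy, with hypothetical visitor counts:

```python
# Sample ratio mismatch (SRM) check: is the observed split consistent with 50/50?
from scipy.stats import chisquare

visitors_a, visitors_b = 10_050, 9_940            # hypothetical observed counts
total = visitors_a + visitors_b
expected = [total * 0.5, total * 0.5]             # intended 50/50 allocation

stat, p_value = chisquare([visitors_a, visitors_b], f_exp=expected)
if p_value < 0.01:
    print(f"Possible SRM (p = {p_value:.4f}): check targeting, redirects, and bot filtering")
else:
    print(f"Split looks healthy (p = {p_value:.4f})")
```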

During the test:

  • Monitor for technical issues (broken layouts, tracking errors, JavaScript errors)
  • Do NOT make decisions based on early data
  • Do NOT change anything on the test page
  • Do NOT add more variations mid-test

Step 7: Analyze Results

When to call a test:

  • Minimum sample size reached
  • Minimum duration reached (2+ weeks)
  • Statistical significance reached (95%+ for Frequentist, or strong posterior probability for Bayesian)

How to analyze:

Frequentist approach:

  • p-value < 0.05 = statistically significant
  • Look at the confidence interval — does it include zero?
  • Check for segment-level effects (device, traffic source, visitor type)
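
As an illustration of the Frequentist readout, here is a minimal sketch with statsmodels (assuming a recent version; the conversion counts are hypothetical):

```python
# Two-proportion z-test plus a confidence interval for the lift (B minus A).
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

conversions = [1_750, 1_590]          # B, A (hypothetical)
visitors = [53_000, 53_000]

z_stat, p_value = proportions_ztest(conversions, visitors)
ci_low, ci_high = confint_proportions_2indep(
    conversions[0], visitors[0], conversions[1], visitors[1], compare="diff"
)
print(f"p-value: {p_value:.4f}")
print(f"95% CI for the lift (B - A): [{ci_low:.4%}, {ci_high:.4%}]")
# Call it significant only if p < 0.05 and the interval excludes zero.
```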

Bayesian approach (recommended):

  • What’s the probability that B beats A?
  • What’s the expected revenue impact?
  • What’s the risk of implementing B? (expected loss)
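
A minimal Bayesian sketch of the same readout, using Beta posteriors and Monte Carlo sampling (counts are hypothetical; flat Beta(1, 1) priors assumed):

```python
# Bayesian A/B readout: probability that B beats A, and expected loss if we ship B.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical results: (conversions, visitors)
a_conv, a_n = 1_590, 53_000
b_conv, b_n = 1_750, 53_000

# Beta(1, 1) prior + binomial likelihood -> Beta posterior over each variation's CVR.
samples_a = rng.beta(1 + a_conv, 1 + a_n - a_conv, size=200_000)
samples_b = rng.beta(1 + b_conv, 1 + b_n - b_conv, size=200_000)

prob_b_beats_a = (samples_b > samples_a).mean()
expected_loss_b = np.maximum(samples_a - samples_b, 0).mean()   # risk of shipping B

print(f"P(B beats A): {prob_b_beats_a:.1%}")
print(f"Expected loss if we ship B: {expected_loss_b:.5f} (absolute CVR points)")
```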

Note: We use Bayesian analysis at acceleroi. It answers the question decision-makers actually care about: “What’s the probability this change will make us more money?” — rather than the Frequentist question: “How surprised should I be if there’s no difference?”

Step 8: Document and Share Learnings

Every test — winner or loser — is a data point. Document:

  • What you tested and why
  • The result (with confidence level)
  • Revenue impact (actual or projected)
  • What you learned about user behavior
  • Implications for future tests

This builds your experimentation knowledge base and improves future hypothesis quality.


Common A/B Testing Mistakes

| Mistake | Why It’s a Problem | How to Avoid It |
| --- | --- | --- |
| Stopping tests early | False positives — you implement a change that doesn’t actually work | Calculate sample size upfront, commit to the full duration |
| Testing without a hypothesis | Random changes don’t build knowledge or compound results | Always start with research, observation, hypothesis |
| Testing too many things at once | Can’t attribute results to any specific change | One hypothesis per test. Use multivariate testing for multiple variables. |
| Ignoring sample size requirements | Underpowered tests produce unreliable results | Calculate required sample size before launch. If you don’t have enough traffic, test higher-impact changes. |
| Only looking at overall results | Missing segment-specific effects (mobile might win while desktop loses) | Always check device, traffic source, and visitor type segments |
| Measuring the wrong metric | Optimizing CTR might decrease revenue if click quality drops | Choose a primary metric tied to revenue. Monitor guardrail metrics. |

A/B Testing Tools Comparison (2026)

| Tool | Best For | Starting Price |
| --- | --- | --- |
| VWO | Mid-market eCommerce, all-in-one platform | $350/mo |
| Optimizely | Enterprise, feature experimentation | Custom pricing |
| Convert | Privacy-focused, Shopify integration | $299/mo |
| AB Tasty | European market, personalization | Custom pricing |
| Google Optimize (sunset) | Was free — now look at alternatives | N/A |
| Shopify native A/B | Shopify stores, basic tests | Included in Shopify |

How AI Is Changing A/B Testing

The biggest shift in A/B testing in 2026 is AI-powered hypothesis generation and prioritization:

  1. AI-generated test ideas — Computer vision analyzes your pages and suggests experiments based on behavioral science principles
  2. Predictive prioritization — AXR scoring predicts which tests are most likely to win based on historical data
  3. Automated analysis — AI interprets session recordings and heatmaps at scale to identify testing opportunities
  4. Faster learning — AI connects experiment outcomes to heuristic win rates, making future predictions more accurate

This doesn’t replace human creativity and strategic thinking — it supercharges the research and prioritization phases so more time is spent on high-impact work.


Frequently Asked Questions

How long should an A/B test run?

Minimum 2 weeks (to capture weekly patterns), or until you reach the calculated sample size — whichever is longer. Never stop early just because results “look significant.”

What’s a good A/B test win rate?

Industry averages suggest that roughly 1 in 3 tests produces a statistically significant winner. With strong research and behavioral-science-grounded hypotheses, you can push this higher — but losing tests are still valuable data.

How many tests should I run per month?

Depends on your traffic. High-traffic sites (100K+ monthly sessions) can run 4-6 concurrent tests. Lower-traffic sites should focus on 1-2 high-impact tests at a time with larger variations.

Can I A/B test with low traffic?

Yes, but you need to adjust your approach: test bigger changes (higher MDE), use Bayesian statistics, and supplement with qualitative research for directional validation.
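
To see why bigger changes help on low traffic, the sketch below reuses the earlier power calculation across several relative MDEs (statsmodels assumed installed; the 3% baseline is illustrative):

```python
# Required sample per variation shrinks quickly as the minimum detectable effect grows.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.03
for mde_rel in (0.05, 0.10, 0.20, 0.30):
    effect = proportion_effectsize(baseline * (1 + mde_rel), baseline)
    n = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
    )
    print(f"MDE {mde_rel:.0%}: ~{n:,.0f} visitors per variation")
```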


Note: Get AI-generated A/B test ideas for your website. Our AI audit analyzes your pages against 40+ behavioral science heuristics and delivers prioritized experiment hypotheses — ready to test.

See where your store is leaking revenue

Our AI-powered audit analyzes your pages against 48 behavioral science heuristics and shows you exactly what to fix first — in under 60 seconds.

Get Instant CRO Audit → Book Strategy Call