Segmentation in A/B Testing

By Denys Pankov · March 11, 2026 · 10 min read

When Slicing A/B Test Results Helps — and When It Lies to You

“The test was flat overall, but mobile users converted 14% better.” This is one of the most common — and most dangerous — sentences in A/B testing. Sometimes it’s a real insight. More often it’s a post-hoc fishing expedition that produces a false winner you’ll ship to the whole site.

Segmentation isn’t bad. Done badly, it’s the single biggest source of false-positive shipping decisions in mature CRO programs.

22.6% Family-wise error when testing 5 segments at 95% confidence (alpha = 0.05)
3 Maximum segments worth analyzing per test
2× Primary sample size needed to detect segment-level effects
50% Of "segment-level wins" that don't replicate

This guide separates the segments worth pre-planning from the ones you should ignore.


Pre-Planned vs Post-Hoc: The Distinction That Matters

A pre-planned segment is one you declared a hypothesis about before the test launched: “We expect this checkout change to help mobile users more than desktop users because the form is more painful on mobile.”

A post-hoc segment is one you discovered after looking at the results: “The overall test was flat, but when we sliced by device, mobile won.”

These look identical in a dashboard. They are not statistically equivalent.

Pre-planned segments have a defined hypothesis space — you’ve committed to checking that segment before the data exists. Post-hoc segments are drawn from an effectively unlimited pool of possible slices. If you check enough segments, you’ll find one that “wins” purely by chance. This is the multiple comparisons problem by another name.
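
To see how easily chance produces a "winner", here is a minimal simulation sketch of A/A tests (no real difference between arms) sliced into several segments. The function names, the 10% conversion rate, the segment sizes, and the seed are all assumptions chosen for illustration, and the z-test is a standard pooled two-proportion test rather than any particular platform's method.

```python
import random
from math import sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic (0.0 when there is no variance)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se

def share_with_false_segment_winner(n_segments, n_per_arm=300, cvr=0.10,
                                    n_tests=200, z_crit=1.96, seed=7):
    """Share of simulated A/A tests where at least one segment looks significant.

    Both arms use the same true conversion rate, so every "significant"
    segment found here is a false positive.
    """
    rng = random.Random(seed)
    tests_with_winner = 0
    for _ in range(n_tests):
        for _ in range(n_segments):
            conv_a = sum(rng.random() < cvr for _ in range(n_per_arm))
            conv_b = sum(rng.random() < cvr for _ in range(n_per_arm))
            z = two_proportion_z(conv_a, n_per_arm, conv_b, n_per_arm)
            if abs(z) > z_crit:  # two-sided test at roughly 95% confidence
                tests_with_winner += 1
                break  # one false winner is enough for this test
    return tests_with_winner / n_tests

for k in (1, 5, 10):
    rate = share_with_false_segment_winner(k)
    print(f"{k:>2} segments checked: {rate:.0%} of A/A tests show a 'winner'")
```

The exact percentages depend on the seed, but the share of tests with a false winner should climb steeply as more slices are checked.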

The procedural fix: write down the segments you plan to analyze before launch. Limit it to 2–3. Anything you find post-hoc is a hypothesis for a future test, not a conclusion to ship.


Simpson’s Paradox: When Segments Disagree With the Aggregate

Simpson’s paradox is the situation where every segment shows a positive effect, but the aggregate shows negative — or vice versa. It’s not a paradox so much as a math artifact, and it shows up more often in A/B tests than most CRO teams realize.

A simplified example. You test a new product page with these results:

| Segment | Control CVR | Variant CVR | Lift |
| --- | --- | --- | --- |
| Mobile (75% of traffic) | 2.0% | 2.2% | +10% |
| Desktop (25% of traffic) | 5.0% | 5.4% | +8% |
| Aggregate | 2.75% | 3.0% | +9% |

Now imagine the device mix differs between control and variant: the variant accidentally gets more desktop traffic. This is not a full reversal, but it is the same mechanism that produces one:

| Segment | Control (mobile-heavy) | Variant (desktop-heavy) | Apparent lift |
| --- | --- | --- | --- |
| Mobile | 2.0% (80%) | 2.2% (60%) | +10% |
| Desktop | 5.0% (20%) | 5.4% (40%) | +8% |
| Aggregate | 2.6% | 3.48% | +34% |

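The aggregate looks like a huge win, but the within-segment effects are unchanged. The “lift” comes from the traffic mix shift, not the variant. This is why sample ratio mismatch (SRM) checks are non-negotiable. If your control and variant don’t match the configured split (usually 50/50) to within statistical tolerance, your aggregate numbers are unreliable. Note that in this example both arms could receive the same total traffic, so an overall SRM check alone would pass; run the check within each pre-planned segment as well.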
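
The arithmetic behind both tables can be reproduced in a few lines. This is a sketch using only the conversion rates and traffic shares from the example above; the function and variable names are made up for illustration.

```python
# Conversion rates per segment, copied from the example tables.
CONTROL_CVR = {"mobile": 0.020, "desktop": 0.050}
VARIANT_CVR = {"mobile": 0.022, "desktop": 0.054}

def aggregate_cvr(cvr_by_segment, traffic_share):
    """Traffic-weighted average conversion rate across segments."""
    assert abs(sum(traffic_share.values()) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(cvr_by_segment[s] * traffic_share[s] for s in cvr_by_segment)

def relative_lift(control, variant):
    return variant / control - 1

# Case 1: both arms see the same 75/25 device mix.
same_mix = {"mobile": 0.75, "desktop": 0.25}
lift_same_mix = relative_lift(aggregate_cvr(CONTROL_CVR, same_mix),
                              aggregate_cvr(VARIANT_CVR, same_mix))

# Case 2: the variant accidentally receives more desktop traffic.
control_mix = {"mobile": 0.80, "desktop": 0.20}
variant_mix = {"mobile": 0.60, "desktop": 0.40}
lift_skewed_mix = relative_lift(aggregate_cvr(CONTROL_CVR, control_mix),
                                aggregate_cvr(VARIANT_CVR, variant_mix))

print(f"Same mix in both arms: {lift_same_mix:+.1%}")    # +9.1%
print(f"Skewed mix:            {lift_skewed_mix:+.1%}")  # +33.8%
```

The per-segment rates are identical in both cases; only the weights differ, which is where the extra apparent lift comes from.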

Most modern testing platforms flag SRM automatically. If yours doesn’t, switch tools or run the chi-squared check manually before trusting any aggregate result. The A/B testing tools comparison covers which platforms catch this.
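
If you need to run the chi-squared check by hand, a minimal sketch using only the Python standard library might look like the following. The function name and the visitor counts are hypothetical; the p-value uses the closed form for one degree of freedom, which applies to a two-arm test.

```python
from math import erfc, sqrt

def srm_p_value(n_control, n_variant, expected_control_share=0.5):
    """Chi-squared goodness-of-fit p-value for a two-arm traffic split."""
    total = n_control + n_variant
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))

# Hypothetical counts: (control visitors, variant visitors).
counts = {
    "overall": (10_000, 10_000),
    "mobile":  (8_000, 6_000),
    "desktop": (2_000, 4_000),
}
for name, (n_control, n_variant) in counts.items():
    p = srm_p_value(n_control, n_variant)
    status = "OK" if p > 0.05 else "SRM: investigate before trusting results"
    print(f"{name:<8} p = {p:.3g}  {status}")
```

With these made-up counts the overall split passes while both device segments fail, which is the situation in the mix-shift example: equal totals can hide an unequal device mix.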


When to Slice by Device, Source, or Customer Type

Three segmentation cuts are worth pre-planning because they have legitimate, predictable interaction effects:

Device (mobile vs desktop)

Justified when the change affects layout, form complexity, or interaction patterns that work differently across screen sizes. A new checkout flow, a redesigned PDP gallery, anything sticky-header-related — pre-plan device segments.

Not justified when the change is universal (copy, pricing, color). Adding device segments just for completeness costs you power without adding insight.

Traffic source (paid vs organic vs direct vs email)

Justified when the change affects messaging-to-page fit. A landing page redesign might help cold paid traffic and hurt warm email traffic (or vice versa). Pre-plan source segments when intent differs meaningfully across channels.

Not justified for site-wide UX changes. The “paid converts differently” effect exists in every test; segmenting on it for everything just adds noise.

Customer type (new vs returning)

Justified when the change affects discovery or familiarity. A new homepage hero matters more to first-time visitors. A loyalty program change matters more to repeat buyers.

Not justified for checkout or conversion-funnel changes where both segments behave similarly.


The Statistical Cost of Multiple Segments

Every segment you analyze increases your false positive risk. If you test the primary metric on three segments at 95% confidence with no correction, and treat the tests as independent:

1 − (0.95)^3 = 14.3% family-wise error

Five segments: 22.6%. Ten segments: 40%. The fix is either:

  1. Bonferroni correction — divide alpha by the number of segments. For 3 segments at family alpha = 0.05, each test needs p < 0.017.
  2. Higher confidence threshold per segment — use 99% confidence for segment-level decisions, 95% for primary.
  3. Demote segments to “exploratory only” — segment-level findings inform future tests, not ship decisions.
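
The numbers above follow from two one-line formulas. Here is a short sketch, with illustrative function names, that assumes the segment tests are independent (overlapping segments would behave somewhat differently):

```python
def family_wise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(n_tests, family_alpha=0.05):
    """Per-test significance threshold under Bonferroni correction."""
    return family_alpha / n_tests

for k in (3, 5, 10):
    uncorrected = family_wise_error(k)
    per_test = bonferroni_alpha(k)
    corrected = family_wise_error(k, alpha=per_test)
    print(f"{k:>2} segments: uncorrected {uncorrected:.1%}, "
          f"Bonferroni threshold p < {per_test:.4f}, "
          f"corrected family-wise error {corrected:.1%}")
```

With the correction applied, the family-wise error stays at or just below the 5% target regardless of how many segments are tested, at the price of a stricter threshold for each one.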

There’s also a power cost. Detecting a segment-level effect of a given size requires roughly the same sample size within that segment as the primary test needs overall, but each segment holds only a fraction of the data. Realistically you need at least 2× the primary sample size if you want segment-level statistical power: 2× covers two equal-sized segments, and a segment with a 25% traffic share needs roughly 4×. See the sample size guide for how to budget for this.
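
A rough way to budget for segment-level power is sketched below. It uses a common normal-approximation formula for a two-sided two-proportion test, and it assumes the segment has the same baseline rate and the same true lift as the aggregate. The function names, the 3% baseline, and the 10% relative lift are hypothetical inputs, not figures from this article.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_arm(baseline_cvr, relative_lift, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)  # unpooled variance approximation
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

def total_per_arm_for_segment(segment_share, **test_params):
    """Overall per-arm traffic needed so one segment alone reaches the target power."""
    return ceil(visitors_per_arm(**test_params) / segment_share)

params = {"baseline_cvr": 0.03, "relative_lift": 0.10}  # hypothetical inputs
primary = visitors_per_arm(**params)
print(f"Primary test:           {primary:,} visitors per arm")
for share in (0.50, 0.25):
    needed = total_per_arm_for_segment(share, **params)
    print(f"Segment with {share:.0%} share: {needed:,} per arm "
          f"({needed / primary:.1f}x primary)")

# A Bonferroni-corrected threshold raises the requirement further.
stricter = visitors_per_arm(alpha=0.025, **params)
print(f"At alpha = 0.025:       {stricter:,} visitors per arm within the segment")
```

The multiplier is roughly one divided by the smallest segment's traffic share, before any extra cost from a corrected alpha.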


Primary Metric Segments vs Exploratory Segments

Make the distinction explicit in your test plan:

Primary metric segments — pre-registered, corrected for multiple comparisons, sufficient sample size budgeted. These can drive ship decisions.

Exploratory segments — anything else you want to look at. These generate hypotheses, never decisions. Findings here go into the research backlog for future tests.

The template:

Primary metric: Add-to-cart rate (aggregate)
Pre-registered segments (2): Mobile vs desktop, Paid vs organic
Exploratory: Everything else (results not used for ship decisions)
Alpha: 0.05 primary, 0.025 per pre-registered segment (Bonferroni)
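
One way to keep yourself honest is to encode the plan before launch and let it decide how each finding is labeled. The sketch below is a hypothetical structure (the class, field names, and labels are invented for illustration), mirroring the template above with a Bonferroni-corrected alpha per pre-registered segment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnalysisPlan:
    """Test plan written before launch; frozen so it cannot be edited afterwards."""
    primary_metric: str
    pre_registered_segments: tuple[str, ...]
    family_alpha: float = 0.05

    @property
    def segment_alpha(self) -> float:
        # Bonferroni: split the family alpha across pre-registered segments.
        # Assumes at least one pre-registered segment.
        return self.family_alpha / len(self.pre_registered_segments)

    def classify(self, segment: str, p_value: float) -> str:
        """Label a segment finding according to the plan."""
        if segment not in self.pre_registered_segments:
            return "exploratory: add to research backlog"
        if p_value < self.segment_alpha:
            return "significant: can inform ship decision"
        return "not significant at corrected alpha"

plan = AnalysisPlan(
    primary_metric="add_to_cart_rate",
    pre_registered_segments=("device", "traffic_source"),
)
print(plan.segment_alpha)                   # 0.025
print(plan.classify("device", 0.030))       # not significant at corrected alpha
print(plan.classify("hour_of_day", 0.001))  # exploratory: add to research backlog
```

Note that a device result with p = 0.03 would pass an uncorrected 0.05 threshold but fails the corrected one, and that a post-hoc slice is labeled exploratory no matter how small its p-value.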

Stick to this and your declared winners are far more likely to hold up in production.


Common Segmentation Pitfalls & How to Avoid Them

| Pitfall | What Happens | How to Fix |
| --- | --- | --- |
| Post-hoc fishing | You find a “segment winner” that doesn’t replicate | Pre-register segments before test launch. Treat post-hoc findings as hypotheses, not results. |
| Over-correcting for multiple comparisons | Alpha becomes so strict (0.01 per segment) that you miss real effects | Limit pre-registered segments to 2–3 so the Bonferroni alpha stays workable, or demote segment findings to exploratory (don’t ship on them). |
| Segment size too small | Mobile segment has only 50 conversions per arm, too few for statistical power | Budget 2× the primary sample if you want segment power. Otherwise treat segment findings as exploratory. |
| Simpson’s paradox | Aggregate shows negative, segments show positive | Check Sample Ratio Mismatch (SRM) before trusting any segment breakdown. |
| Wrong segment for the change | You slice a button color test by device; one device segment shows a difference by chance, even though the change should affect all users equally | Match segment to hypothesis. Button color affects all users equally; test it on aggregate, not by segment. |

Segments That Almost Never Help

A short list of segments that look promising but rarely produce shippable insight:

  • Hour of day — too noisy, segment populations too small, no actionable lever
  • Browser version — same. Unless you’re debugging a render bug, skip it.
  • First-touch UTM campaign — meaningful in isolation but interacts with everything else; almost always Simpson’s-paradox-prone.
  • Geographic regions smaller than country — variance is huge, samples are tiny.
  • Logged-in vs logged-out for transactional sites — rarely changes the answer for typical CRO tests.

If a segment isn’t tied to a specific, actionable change you’d make, it’s a distraction.


A Practical Segmentation Workflow

  1. Before launch: Write the analysis plan. Primary metric, 2–3 pre-registered segments with hypotheses, Bonferroni-corrected alpha.
  2. At launch: Verify SRM via your testing platform. If imbalanced, fix before proceeding.
  3. During test: Don’t peek at segments mid-test. Treat them the same as the primary metric — no early stopping based on a segment.
  4. At analysis: Report primary first, then pre-registered segments with corrected p-values, then exploratory findings separately labeled.
  5. For exploratory wins: Convert to a new test hypothesis. Do not ship a “mobile-only” change based on a post-hoc finding without re-testing on mobile traffic specifically.

This discipline is what separates programs with 33% sustained win rates from programs with 60% declared wins and 15% replicated lift.


Tools & Platforms with Built-In Segmentation Safeguards

Modern testing platforms help prevent segmentation errors:

| Platform | SRM Check | Multiple Comparisons | Pre-Registration | Comments |
| --- | --- | --- | --- | --- |
| Optimizely | Automatic flagging | Correction suggestions | Built-in | Best-in-class segmentation management |
| Statsig | Automatic, clear warnings | Auto-Bonferroni | Built-in | Modern, data-science-focused interface |
| VWO | Manual chi-squared test | Not automated | Manual | Good reporting, less automated safeguards |
| Convert | Manual process | Manual Bonferroni | Manual | Requires discipline, no platform safety nets |
| AB Tasty | Manual | Suggestions, not forced | Manual | Middle ground on automation |

Recommendation: Use a platform (Optimizely, Statsig) that automates SRM checks and multiple comparisons warnings. The platform can’t make you pre-register, but it can flag when you’re taking unnecessary statistical risk.



Frequently Asked Questions

What’s a sample ratio mismatch (SRM) check?

A statistical check that your control and variant received traffic in the ratio you configured. For a 50/50 split with about 10,000 visitors per arm, a chi-squared test on the observed counts should show p > 0.05. A failed SRM check means your randomization or tracking is broken, usually because of a bot filter difference, a redirect issue, or a tracking bug. Most modern platforms (Optimizely, Statsig, GrowthBook) flag SRM automatically.

Should I segment by new vs returning customers?

Pre-plan it when the change affects discovery (homepage, hero, navigation) or familiarity (loyalty, account features). Skip it for checkout or PDP changes where both segments behave similarly. The data is noisier than you expect — returning customers are a small fraction of typical traffic.

How do I budget sample size for segment-level analysis?

Roughly 2× the primary metric sample size if you want adequate power within each segment. If you can only afford the primary sample, treat all segment findings as exploratory. See the sample size calculation guide for the exact math.

What if I see a clear segment effect after the test ends?

Treat it as a hypothesis, not a result. Design a follow-up test specifically targeted to that segment. The original test wasn’t powered to detect segment-level effects, and post-hoc segmentation has an inflated false positive rate. The follow-up test, properly designed, will either confirm or kill the effect.
