False Positives in A/B Testing: Why Your Winning Tests Aren't Winning
Your test reached 95% confidence. The variant beat the control by 12%. Your testing tool turned green and declared a winner.
You shipped the change.
Two weeks later, conversions are the same. Or lower.
What happened?
You got a false positive. And it's far more common than the CRO industry wants to admit.
What Is a False Positive in A/B Testing?
A false positive (also called a Type I error) happens when your test declares a winner that isn't actually better.
The result looks statistically significant. The tool says it's real. But the lift was random noise — a coincidence in the data, not a genuine improvement driven by your change.
You acted on it. And nothing improved.
The cruel irony: most teams never find out. They move on to the next test. The "winner" sits in production. The real conversion rate never moved.
Why False Positives Are So Common
Peeking at Results Too Early
This is the single biggest cause of false positives in e-commerce CRO.
You launch a test. Three days in, you check the dashboard. The variant is up 20% at 85% confidence. You wait another day. Now it's at 91%. Two more days: 95%. You call it.
The problem: statistical significance isn't stable when you check it continuously. Early in a test, sample sizes are tiny. Random fluctuations look like trends. Checking every day and stopping when you hit your threshold inflates your false positive rate from 5% to 30% or higher.
This practice is called peeking, and it's the norm at most companies.
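If you want to see the damage for yourself, the sketch below simulates it (assuming Python with NumPy and SciPy, and made-up traffic numbers): both variants share the exact same true conversion rate, yet checking every day and stopping on the first green result declares a false winner far more often than the advertised 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

BASE_RATE = 0.05          # true conversion rate for BOTH variants: no real difference
VISITORS_PER_DAY = 1_000  # hypothetical traffic per variant per day
DAYS = 28
ALPHA = 0.05              # the usual 95% confidence threshold
N_SIMS = 2_000

def false_positive_rate(peek_daily: bool) -> float:
    """Share of no-difference experiments that still declare a 'winner'."""
    hits = 0
    for _ in range(N_SIMS):
        # cumulative conversions per day for two identical variants
        conv_a = rng.binomial(VISITORS_PER_DAY, BASE_RATE, size=DAYS).cumsum()
        conv_b = rng.binomial(VISITORS_PER_DAY, BASE_RATE, size=DAYS).cumsum()
        checkpoints = range(1, DAYS + 1) if peek_daily else [DAYS]
        for day in checkpoints:
            n = day * VISITORS_PER_DAY
            table = [[conv_a[day - 1], n - conv_a[day - 1]],
                     [conv_b[day - 1], n - conv_b[day - 1]]]
            if stats.chi2_contingency(table)[1] < ALPHA:  # tool shows "95%+ confidence"
                hits += 1
                break
    return hits / N_SIMS

print(f"Check once at the end:      {false_positive_rate(peek_daily=False):.1%}")
print(f"Peek daily, stop on winner: {false_positive_rate(peek_daily=True):.1%}")
```

On runs like this, the check-once arm stays near the nominal 5% while the daily-peeking arm typically lands several times higher. Same data, same tool, different stopping rule.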
Running Tests on Too Little Data
A test with 200 conversions per variant isn't reliable. The math doesn't care that your business is small.
You need enough conversions to detect a real effect versus random noise. The smaller your sample, the wider the confidence interval — and the more likely a random swing looks like a win.
Rule of thumb: run your test until both variants have at least 300–500 conversions. Preferably more.
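To make that concrete, here's a quick sketch (normal-approximation interval, made-up traffic, Python with SciPy) of how the uncertainty around the same observed 5% conversion rate shrinks as conversions accumulate:

```python
from scipy.stats import norm

def conversion_rate_ci(conversions: int, visitors: int, confidence: float = 0.95):
    """Normal-approximation (Wald) confidence interval for a conversion rate."""
    rate = conversions / visitors
    z = norm.ppf(1 - (1 - confidence) / 2)
    margin = z * (rate * (1 - rate) / visitors) ** 0.5
    return rate - margin, rate + margin

# Same observed 5% conversion rate, very different certainty (hypothetical traffic)
for conversions, visitors in [(100, 2_000), (500, 10_000), (2_500, 50_000)]:
    low, high = conversion_rate_ci(conversions, visitors)
    print(f"{conversions:>5} conversions / {visitors:>6} visitors -> "
          f"95% CI: {low:.2%} to {high:.2%}")
```

With only 100 conversions the interval spans roughly 4% to 6%, which is wide enough for a purely random swing to look like a double-digit lift.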
Stopping When the Result Looks Good (But Not Bad)
Most teams stop tests when the variant wins. But they let losing tests run longer — hoping they'll turn around.
This asymmetry creates bias. You're more likely to stop a test at a lucky spike than at a representative sample. The result: your "winners" are disproportionately lucky flukes.
Segmentation Mining After the Fact
The test overall showed no effect. But then you slice it by device, by traffic source, by new vs. returning — and suddenly mobile users in paid search show a 25% lift.
Exciting. But likely meaningless.
When you cut data into enough segments, some will show significance by chance. The more slices you make post-hoc, the more false positives you generate. This is called the multiple comparisons problem.
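Here's a rough simulation of that trap (hypothetical traffic, eight slices treated as independent for simplicity): the change does nothing in any segment, yet slicing the same no-effect test eight ways regularly surfaces a "significant" one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

N_SEGMENTS = 8                 # e.g. device x source x new/returning slices
VISITORS_PER_SEGMENT = 2_000   # per variant, hypothetical
TRUE_RATE = 0.04               # identical for control and variant in every segment
N_SIMS = 1_000

tests_with_a_fluke = 0
for _ in range(N_SIMS):
    for _ in range(N_SEGMENTS):
        conv_a = rng.binomial(VISITORS_PER_SEGMENT, TRUE_RATE)
        conv_b = rng.binomial(VISITORS_PER_SEGMENT, TRUE_RATE)
        table = [[conv_a, VISITORS_PER_SEGMENT - conv_a],
                 [conv_b, VISITORS_PER_SEGMENT - conv_b]]
        if stats.chi2_contingency(table)[1] < 0.05:
            tests_with_a_fluke += 1   # at least one segment looks "significant"
            break

print(f"No-effect tests with a falsely 'significant' segment: "
      f"{tests_with_a_fluke / N_SIMS:.0%}")
```

A large fraction of these no-effect tests surface at least one spuriously "significant" slice, even though nothing changed anywhere.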
The Confidence Level Problem
"95% confidence" doesn't mean what most people think.
It means: if there were truly no difference between variants, you'd see a result this extreme only 5% of the time by chance.
That 5% is your false positive rate — per test. Run 20 tests and you should expect one false positive just from chance.
Most CRO programs run 10–20 tests per month. With sloppy methodology, that false positive rate compounds fast.
And if you're using 90% confidence to "save time"? You're accepting a 10% false positive rate per test. Run 10 tests and you should expect, on average, one false winner even if none of your changes did anything.
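The arithmetic is easy to run for your own program. The sketch below uses a hypothetical test count and assumes the worst case where none of the tested changes has a real effect:

```python
# How per-test false positive risk compounds across a testing program,
# assuming the worst case where none of the tested changes has a real effect.
TESTS_PER_MONTH = 15  # hypothetical program velocity

for confidence in (0.95, 0.90):
    alpha = 1 - confidence
    expected_flukes = alpha * TESTS_PER_MONTH
    p_at_least_one = 1 - (1 - alpha) ** TESTS_PER_MONTH
    print(f"{confidence:.0%} confidence: ~{expected_flukes:.1f} expected false winners/month, "
          f"{p_at_least_one:.0%} chance of at least one")
```

At 15 tests a month, dropping from 95% to 90% confidence doubles the expected number of false winners.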
How to Catch False Positives Before They Fool You
Pre-register Your Hypothesis
Before you launch, write down:
- What you're testing
- What metric is the primary goal
- What minimum effect size you're looking for
- How long you'll run the test
- How many conversions you need
Commit to this upfront. Don't switch the primary metric mid-test just because the one you chose isn't looking good.
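It doesn't need to be fancy. Even a plain record checked into your testing backlog does the job; the field names and values below are purely illustrative.

```python
# A minimal pre-registration record, written before launch and never edited
# once traffic starts flowing. All names and numbers are hypothetical.
preregistration = {
    "test_name": "PDP sticky add-to-cart bar",
    "hypothesis": "A sticky add-to-cart bar on mobile PDPs lifts conversion rate",
    "primary_metric": "conversion rate (orders / sessions)",
    "minimum_detectable_effect": "10% relative",
    "minimum_duration": "2 full weeks",
    "required_conversions_per_variant": 1_600,  # from the sample size calculation
}
```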
Use a Sample Size Calculator
Tools like Optimizely's sample size calculator or VWO's calculator will tell you how much traffic (and therefore how many conversions) each variant needs before your test is worth stopping.
Input:
- Your current conversion rate
- The minimum detectable effect you care about (e.g., 10% relative improvement)
- Your desired confidence level (95%)
Run the test until you hit that number. Not before.
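If you'd rather see what's under the hood, those calculators boil down to a standard two-proportion power calculation. A sketch of it, assuming a normal approximation, 80% power, and made-up inputs:

```python
from scipy.stats import norm

def visitors_per_variant(baseline_rate: float,
                         relative_mde: float,
                         alpha: float = 0.05,
                         power: float = 0.80) -> int:
    """Approximate per-variant sample size for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for 95% confidence
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return round((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical inputs: 3% baseline conversion rate, 10% relative lift worth detecting
n = visitors_per_variant(baseline_rate=0.03, relative_mde=0.10)
print(f"~{n:,} visitors per variant (~{round(n * 0.03):,} baseline conversions)")
```

A 3% baseline with a 10% relative minimum detectable effect lands around 53,000 visitors per variant, roughly 1,600 baseline conversions, which is why "we'll just run it for a few days" so rarely holds up. Dedicated calculators may differ slightly depending on the exact formula they use.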
Run for Full Business Cycles
One week of data isn't enough. People behave differently on Monday versus Friday. Different days bring different traffic quality, different intent.
Run every test for at least two full weeks, even if you hit your sample size earlier. That averages out day-of-week effects that would otherwise inflate a result captured during a short-lived peak.
Check for Pre-test Bias with A/A Tests
Before running an A/B test, consider running an A/A test: same page, same content, two variants. If your testing tool shows a "winner" in an A/A test, your setup has problems. You're not measuring correctly.
This reveals tracking issues, cookie inconsistencies, or bucketing problems that would corrupt real test results.
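Two quick checks on an A/A test tell you most of what you need. The numbers below are hypothetical, and both checks lean on SciPy:

```python
from scipy import stats

# Hypothetical numbers from an A/A test report (same page served to both buckets)
visitors_a, conversions_a = 10_240, 512
visitors_b, conversions_b = 10_310, 521

# 1) Sample ratio mismatch (SRM): is the configured 50/50 split actually 50/50?
expected = [(visitors_a + visitors_b) / 2] * 2
srm_p = stats.chisquare([visitors_a, visitors_b], f_exp=expected).pvalue
print(f"SRM check p-value: {srm_p:.3f} (a tiny value means bucketing is broken)")

# 2) Conversion difference between two identical experiences
table = [[conversions_a, visitors_a - conversions_a],
         [conversions_b, visitors_b - conversions_b]]
aa_p = stats.chi2_contingency(table)[1]
print(f"A/A conversion p-value: {aa_p:.3f} (should be above 0.05 in most runs)")
```

A tiny p-value on the first check means traffic isn't splitting the way you configured it; a "significant" result on the second means your measurement pipeline, not your page, is producing winners. Expect the occasional false alarm here too, since an A/A test is subject to the same 5% noise floor as any other test.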
Don't Chase Segments Unless Pre-specified
You can segment results — but only if you decided to before the test. "Mobile users who came from paid social and are returning visitors" as a segment is valid only if you planned to measure it.
If you're slicing post-hoc because the overall result was flat, apply a Bonferroni correction or use a stricter confidence threshold (99%) for the segment.
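As a concrete sketch (segment names and p-values invented for illustration), the Bonferroni correction is just dividing your threshold by the number of slices you looked at:

```python
# Bonferroni: divide the significance threshold by the number of slices inspected,
# so the family-wide false positive rate stays near the original 5%.
ALPHA = 0.05
segment_p_values = {            # hypothetical post-hoc slices
    "mobile / paid search": 0.021,
    "mobile / organic":     0.340,
    "desktop / paid":       0.048,
    "desktop / organic":    0.610,
}

adjusted_alpha = ALPHA / len(segment_p_values)
print(f"Adjusted threshold: {adjusted_alpha:.4f}")
for segment, p in segment_p_values.items():
    verdict = "significant" if p < adjusted_alpha else "not significant after correction"
    print(f"{segment:<22} p={p:.3f} -> {verdict}")
```

Notice that the exciting mobile/paid-search result at p = 0.021 no longer clears the adjusted bar of 0.0125. That's the point: it probably never deserved to.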
Signs You Might Be Running on False Positives
Ask yourself:
- Do you check test results daily?
- Do your tests often show "wins" that don't hold up when measured in your main analytics tool?
- Have you implemented changes from tests where revenue didn't visibly improve?
- Do you regularly end tests early because they look decisive?
- Do your tests frequently find significant results in very small segments?
If you answered yes to two or more: your testing program likely has a false positive problem.
The Inconvenient Truth About Statistical Significance
Statistical significance tells you the result probably isn't random.
It doesn't tell you the result is meaningful, practical, or durable.
A 2% lift in conversion rate with 99% confidence is still a 2% lift. If that's on 1,000 monthly conversions, you're talking about 20 extra orders. Is that worth the implementation cost? Will it hold as traffic scales or seasonality shifts?
Significance is the starting line, not the finish line.
The questions that actually matter:
- Is the effect size large enough to move the business?
- Does the lift hold across traffic segments, not just in aggregate?
- Is there a plausible mechanism — a real reason why this change should work?
What Good A/B Testing Looks Like
Hypothesis before launch. Know what you're testing and why. Have a clear reason to believe the variant will win.
Fixed sample size. Calculate before launch. Don't stop until you hit it.
Full business cycles. Minimum two weeks. No exceptions.
Single primary metric. One KPI decides the winner. Revenue per visitor or conversion rate — pick one upfront.
Post-test verification. After calling the winner, verify the lift in your main analytics tool (GA4, Segment, etc.). If the tools disagree, investigate before shipping.
Holdout groups. For high-stakes changes, run a holdout — keep 10% of traffic on the old experience for 30 days post-launch. See if the lift holds.
Final Thought
Most CRO programs are running faster than the data can support.
Tests end early. Segments get mined. Flukes get shipped. The dashboard shows a string of wins while the revenue line barely moves.
That's not optimization. That's a false confidence generator.
The fix isn't complicated: slow down, calculate your sample size, run for full cycles, and treat significance as the starting point — not the finish line.
Better methodology won't make your program slower. It'll make your wins actually win.
Is your testing program generating real wins — or just noise?
We audit A/B testing setups and analytics implementations for e-commerce brands. Book a free 45-minute strategy call and we'll diagnose your program before you ship another false positive.
