
A/B Test Sample Size Calculator

Calculate the required sample size and test duration for any A/B test, before you launch. Uses the standard two-proportion z-test formula. Pick your baseline conversion rate, minimum detectable effect, significance level, and statistical power — the tool returns visitors per variation, total required visitors, estimated days, and a feasibility verdict. Includes a shareable URL to send your assumptions to teammates. 100% in-browser, no signup.


How to Use This Tool

  1. Enter your baseline conversion rate. This is the current control's conversion percentage — the rate you measure today, before any change. If 5 out of every 100 visitors convert, type 5. Pull this from analytics or your A/B testing platform's prior tests on the same page. Lower baselines need bigger samples to detect the same relative lift, so accuracy here matters.
  2. Set your minimum detectable effect (MDE). MDE is the smallest relative lift you want the test to be able to detect with confidence. 10 means a 10% relative improvement (5% baseline becomes 5.5%). Smaller MDEs require dramatically more visitors: 5% MDE costs ~4× the visitors of 10% MDE; 2% MDE costs ~25×. Practical defaults: 10% for most tests, 5% only when traffic is very high, 15–20% for low-traffic sites.
  3. Choose significance level and statistical power. Significance (90 / 95 / 99%) controls false-positive risk — the chance you'll declare a winner that isn't real. 95% is the default; use 99% for high-stakes decisions; 90% only for low-risk early exploration. Power (80 / 90 / 95%) controls false-negative risk — the chance you'll miss a real effect. 80% is the default. Higher significance and higher power both increase required sample size.
  4. Pick one-tailed vs two-tailed. Two-tailed (the default) detects whether the variation differs from control in either direction (better OR worse). One-tailed only detects "better" (or only "worse"). Use two-tailed unless you have a specific reason — one-tailed is ~20% smaller sample but loses the ability to catch surprise downsides.
  5. Enter daily traffic and number of variations. Traffic is the number of visitors per day reaching the test page (not site-wide). Variations defaults to 2 (control + treatment); use 3 for an A/B/C test, 4 for A/B/C/D. The tool divides total required visitors by daily traffic to estimate test duration.
  6. Click Calculate (or press Ctrl/Cmd+Enter). The tool shows visitors per variation, total visitors, days estimate, and a feasibility verdict (green <14 days, yellow 15–30 days, red >30 days). Read the recommendation panel for actionable advice if the test is too long. Click Copy Shareable Link to send a URL with all your assumptions baked in — teammates can open it and see exactly the same calculation.
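
For anyone who wants to see the arithmetic behind the Calculate button, below is a minimal Python sketch of the calculation the steps above describe. It assumes the standard two-proportion z-test sample-size formula and equal traffic per variation (scipy assumed available); all names and the example inputs are illustrative, not the tool's actual source.

    import math
    from scipy.stats import norm

    def sample_size_per_variation(baseline, mde, significance=0.95, power=0.80,
                                  two_tailed=True):
        """Visitors needed in each arm to detect a relative lift of `mde`."""
        p1 = baseline                       # e.g. 0.05 for a 5% conversion rate
        p2 = p1 * (1 + mde)                 # a 10% relative MDE turns 5% into 5.5%
        alpha = 1 - significance
        z_alpha = norm.ppf(1 - alpha / 2) if two_tailed else norm.ppf(1 - alpha)
        z_beta = norm.ppf(power)
        n = ((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p2 - p1) ** 2
        return math.ceil(n)

    per_variation = sample_size_per_variation(0.05, 0.10)   # ~31,000 with the defaults
    variations = 2
    daily_traffic = 1_000                                    # visitors/day reaching the test page
    total = per_variation * variations
    days = math.ceil(total / daily_traffic)
    verdict = "green" if days < 14 else ("yellow" if days <= 30 else "red")
    print(per_variation, total, days, verdict)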

About A/B Testing & Sample Sizes

A/B testing (also called split testing or a randomized experiment) is the practice of showing two or more versions of a web page, app screen, email, or ad to a randomly divided audience and measuring which performs better on a primary metric — usually conversion rate. The math underlying it is the two-proportion z-test, a 100-year-old statistical technique that remains the industry standard at every major A/B testing platform: Optimizely, VWO, Google Optimize (RIP), AB Tasty, Eppo, Statsig, GrowthBook. Despite the simplicity of the headline math (compare two proportions, see if the difference is bigger than chance), production A/B testing is notoriously easy to get wrong — sample-size errors, peeking at results, novelty effects, and the multiple-comparison problem all conspire to produce false positives at rates much higher than the headline 5% advertised by 95% significance.

Why sample size matters more than anything else. The single biggest reason A/B tests give wrong answers is undersampling: stopping the test before reaching the statistically required number of visitors. With a tiny sample, random variation between groups is large compared to any real effect, and the difference you observe is dominated by noise. The classic example: 100 visitors per arm, baseline 5% conversion. Even with no real difference between control and treatment, you'll see 4-7% conversion on each arm purely by chance, and the "winning" arm will switch back and forth daily as new visitors arrive. Only by reaching the calculated sample size do you average out that noise enough to detect a real effect. This calculator returns the sample size derived from the standard formula n = ((zα + zβ)² × (p₁(1 − p₁) + p₂(1 − p₂))) / (p₂ − p₁)², where p₁ is the baseline, p₂ = p₁ × (1 + MDE), and the z-values come from significance and power tables.
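
The undersampling point is easy to see in a quick simulation (a sketch, not part of the tool; numpy assumed available): two arms with the same true 5% conversion rate still produce visibly different observed rates at 100 visitors per arm, and only converge as the sample approaches the calculated requirement.

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (100, 1_000, 31_000):    # ~31,000 is the required n per arm for a 10% MDE at a 5% baseline
        control = rng.binomial(n, 0.05) / n      # observed rate; no real difference between arms
        treatment = rng.binomial(n, 0.05) / n
        print(f"n={n:>6}: control {control:.1%} vs treatment {treatment:.1%}")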

Significance and power, plain English. Significance (also called confidence level) is your tolerance for false positives — declaring a winner when there's actually no difference. 95% significance means a 5% false-positive rate IF you only test once and don't peek at intermediate data. Power is your tolerance for false negatives — missing a real winner. 80% power means you'll catch 80% of real effects of the size you specified as MDE. The trade-off: tightening either knob (95%→99% significance, 80%→90% power) requires a larger sample. Industry-standard defaults are 95% significance + 80% power, matching what every major platform uses out-of-the-box. Increase to 99% for high-stakes irreversible decisions (checkout flow, pricing); drop to 90% only for low-stakes exploration where speed matters more than rigor.
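
Because the baseline and MDE terms in the formula don't change, the required sample scales with (zα + zβ)² alone, so the cost of each significance/power combination can be expressed as a multiplier of the 95%/80% default. A short sketch (scipy assumed available):

    from itertools import product
    from scipy.stats import norm

    def z_sum(significance, power, two_tailed=True):
        alpha = 1 - significance
        z_alpha = norm.ppf(1 - alpha / 2) if two_tailed else norm.ppf(1 - alpha)
        return z_alpha + norm.ppf(power)

    default = z_sum(0.95, 0.80) ** 2                 # the industry-standard combination
    for sig, pw in product((0.90, 0.95, 0.99), (0.80, 0.90, 0.95)):
        mult = z_sum(sig, pw) ** 2 / default
        print(f"{sig:.0%} significance, {pw:.0%} power -> {mult:.2f}x the default sample")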

MDE: the most-debated input. Minimum detectable effect determines how small a lift the test can reliably catch. Smaller MDEs cost dramatically more sample because the formula has (p2 − p1)2 in the denominator: halving MDE quadruples sample size; quartering it multiplies it by 16. Setting MDE realistically is hard because you're predicting the size of a change before you've made it. Heuristics: button-color tweaks lift 1-3% (need huge samples to detect); copy changes 3-7%; new layouts 5-15%; checkout flow rewrites 10-30%; entirely new landing pages 30%+. Most product teams set MDE too low and end up with tests that take 12 weeks to run. Better strategy: use 10% MDE as default, raise to 20% for low-traffic sites, only drop to 5% on the highest-traffic pages where the marginal value of small lifts is real.
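
A rough back-of-the-envelope for that quadratic relationship: ignoring the small shift in the variance term, the required sample scales with 1/MDE², so the cost relative to a 10% MDE test is approximately (10% / MDE)².

    for mde in (0.20, 0.10, 0.05, 0.02):
        multiplier = (0.10 / mde) ** 2       # approximate cost relative to a 10% MDE test
        print(f"MDE {mde:.0%}: ~{multiplier:.3g}x the visitors of a 10% MDE test")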

Test duration constraints. Beyond reaching the statistical sample, real-world tests need to satisfy three other constraints: (1) Run a minimum of 1-2 full business cycles (typically 7-14 days) to capture day-of-week and time-of-day variation. Tuesday traffic converts differently from Saturday traffic. Stopping a test on day 4 because it "hit significance" is one of the most common mistakes in A/B testing. (2) Avoid running so long that seasonality, marketing campaigns, or external events bias the result — typically max 4-6 weeks. (3) Watch for novelty and primacy effects: visitors react differently to a new feature in the first few days, then settle into normal usage. Tests under 7 days often catch only the novelty signal, not steady-state. The feasibility colors in this tool encode these heuristics: green <14 days = test will run cleanly; yellow 15-30 days = practical but watch for confounds; red >30 days = consider raising MDE or accepting a longer wait.

Common pitfalls this calculator helps avoid. (1) Stopping early because the test "hit significance" — this is peeking, and it inflates real false-positive rate from a planned 5% to 14%+ depending on how often you check. Commit to the calculated sample size before launch. (2) Running multiple metrics and reporting whichever wins — this is the multiple-comparison problem; declare your primary metric before launch and treat secondary metrics as exploratory only. (3) Mid-test changes to the variants or audience definitions — resets the experiment and invalidates collected data. (4) Ignoring sample-ratio mismatches (SRM) — if 50/50 traffic split is configured but actuals are 53/47, your randomization is broken and ALL conclusions are suspect. (5) Treating "not significant" as "no effect" — insufficient sample isn't proof of no effect, it's proof you can't tell either way. Re-run with a larger sample or higher MDE.
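
For pitfall (4), one common way to check for a sample-ratio mismatch is a chi-square goodness-of-fit test of the observed split against the configured allocation; the 53/47 example above fails it decisively. A sketch with made-up counts (scipy assumed available):

    from scipy.stats import chisquare

    observed = [53_000, 47_000]                  # visitors actually bucketed into each arm
    expected = [sum(observed) / 2] * 2           # what a true 50/50 split would give
    stat, p_value = chisquare(observed, f_exp=expected)
    if p_value < 0.001:
        print(f"SRM detected (p = {p_value:.2e}) - randomization is suspect")
    else:
        print(f"Split is consistent with 50/50 (p = {p_value:.3f})")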

Bayesian alternatives. Modern A/B testing tools (Eppo, Statsig, GrowthBook) increasingly default to Bayesian methods rather than the frequentist z-test this calculator uses. Bayesian methods treat the true effect as a probability distribution that updates as data arrives, which mathematically permits continuous monitoring (peeking) without inflating false-positive rates. They also produce more intuitive output: instead of "p < 0.05", you get statements like "87% probability that B is better than A by at least 5%". Bayesian shines on low-traffic sites and rapid iteration; frequentist remains the rigorous gold standard for high-stakes confirmatory tests. This tool uses frequentist because (a) the math is well-understood, (b) sample-size formulas are exact and pre-computable, (c) most readers will be using Optimizely, VWO, or similar frequentist platforms. If you're on a Bayesian platform, treat this tool's output as a conservative upper bound on sample size.
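
For a flavor of the Bayesian framing, here is a minimal sketch (not how any particular platform implements it): with uniform Beta(1, 1) priors, each arm's conversion rate has a Beta posterior, and Monte Carlo draws turn those posteriors into statements like the one quoted above. The conversion counts are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(42)
    a_conv, a_n = 500, 10_000            # control: conversions, visitors (illustrative)
    b_conv, b_n = 560, 10_000            # variant

    a_post = rng.beta(1 + a_conv, 1 + a_n - a_conv, size=100_000)   # posterior draws
    b_post = rng.beta(1 + b_conv, 1 + b_n - b_conv, size=100_000)
    print(f"P(B > A) = {(b_post > a_post).mean():.1%}")
    print(f"P(B beats A by >= 5% relative) = {(b_post > a_post * 1.05).mean():.1%}")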

At EmproIT, our Conversion Rate Optimization team designs and runs statistically sound A/B tests for ecommerce, SaaS, and lead-gen funnels — rigorous experimental design, hypothesis tracking, sample-size pre-registration, and dashboards that catch sample-ratio mismatches in real time. Pair this calculator with our ROI Calculator to convert lift percentages into projected revenue, our UTM Link Builder to ensure variant traffic is tracked separately, and our Ad Spend Calculator to project the budget needed to drive variant traffic at the calculated rate.

Frequently Asked Questions

What is statistical significance?

Statistical significance measures how unlikely a difference between two groups would be if it were produced by random noise alone rather than by your treatment. Expressed as a confidence level: 95% significance means there's only a 5% chance of seeing a result this extreme if there were truly no effect. The complement is alpha (the p-value threshold) — 5% for 95%, 1% for 99%, 10% for 90%. Choose your confidence level before running the test, not after. 95% is the default in marketing/CRO; 90% is acceptable for early-stage exploration; 99% is needed for high-stakes irreversible decisions. Higher confidence requires larger samples. This calculator uses the standard two-proportion z-test sample-size formula: n = ((zα + zβ)²(p₁(1 − p₁) + p₂(1 − p₂))) / (p₂ − p₁)².

What is MDE (Minimum Detectable Effect)?

MDE is the smallest improvement you want the test to be able to detect with confidence. Expressed as a relative percentage of baseline: 10% MDE on a 5% baseline means detecting lifts that take conversion to 5.5%+. Smaller MDEs require dramatically larger samples: detecting 5% lift takes ~4× the visitors of 10% lift, and ~16× the visitors of 20% lift. Typical lift ranges by change type: button-color tweaks 1-3%; copy changes 3-7%; new layouts 5-15%; checkout rewrites 10-30%; new landing pages 30%+. Start at 10% MDE, drop to 5% only when traffic is very high, raise to 15-20% when traffic is low. Setting MDE too small is the #1 cause of tests that take months and never finish.

One-tailed vs two-tailed: which should I use?

Two-tailed tests check whether the variation is different from control in either direction (better OR worse). One-tailed only checks "better" (or only "worse"). Two-tailed is the default in Optimizely, VWO, Eppo, Statsig because it's conservative and protects against directional bias. One-tailed requires you to commit upfront that you don't care about a worse result — if your variant performs worse, you treat it the same as no effect, which is rarely correct. The math difference: one-tailed needs ~20% smaller sample for the same significance, but loses the ability to catch surprise downside. Use two-tailed unless you have a specific directional hypothesis. This calculator defaults to two-tailed and lets you switch.

What's a good significance level — 90%, 95%, or 99%?

95% is the standard default and what you should use unless there's a clear reason otherwise. It balances false-positive risk (5%) with sample-size feasibility. Use 90% for early-stage exploration with low cost of a wrong call (e.g., minor copy tweaks). Use 99% for high-stakes decisions where rolling out a worse variant would be costly: checkout rewrites, pricing, navigation overhauls, anything affecting site-wide architecture. Trade-off: at the default 80% power, 99% significance requires roughly 1.5× the sample of 95%, extending a 4-week test to about 6 weeks. Don't change significance levels mid-test — that's a form of p-hacking.

How long should A/B tests run?

Three constraints: (1) reach the statistical sample size shown by this calculator; (2) run a minimum of 1-2 full business cycles (1 week minimum, 2 weeks preferred) to capture day-of-week and time-of-day variation; (3) avoid running so long that seasonality biases results (typically max 4-6 weeks). When the calculator shows 90+ days needed, that test is not realistic — raise MDE, increase traffic, or accept that the question is unanswerable with current resources. Best practice: target 14-21 day tests as the sweet spot. Less than 14 days risks weekend/holiday bias; more than 30 days risks novelty effects and external noise. This calculator color-codes feasibility green/yellow/red along these lines.

What if I don't have enough traffic for a statistically valid test?

Five options for low-traffic sites: (1) Use a higher MDE — test only big swings (20%+) where the effect size is large enough for smaller samples. (2) Use micro-conversions as your primary metric — test on add-to-cart or email signup (10-30× higher base rate, faster signal) instead of final purchase. (3) Use Bayesian / sequential testing methods (Eppo, Statsig, GrowthBook), which are designed to support earlier decisions under continuous monitoring. (4) Run BIGGER tests less often — 4 strategic tests per quarter on high-impact changes beats 12 small tests. (5) Accept that some questions are unanswerable with current traffic and use qualitative methods (user interviews, session recordings, surveys) instead.

What is the difference between Bayesian and frequentist A/B testing?

Frequentist (the classical method, what this calculator uses) treats the true effect as fixed and asks: "given there is no effect, what's the probability of seeing data this extreme by chance?" That's the p-value. You set a threshold (p<0.05 for 95%), commit before the test, accept the outcome at the end. Pro: well-understood, conservative. Con: requires fixed sample size, can't peek. Bayesian treats the true effect as a probability distribution that updates as data arrives. You ask: "given the data so far, what's the probability that B is better by at least X%?" Pro: allows continuous monitoring, intuitive output. Con: requires choosing a prior (subjective), more complex to explain. Tools like Eppo and Statsig default to Bayesian; Optimizely and VWO use frequentist. Both work; pick based on team familiarity and traffic level.

What is peeking and why is it bad?

Peeking is checking your A/B test results before reaching the planned sample size and stopping early when you see a "winner". It massively inflates false-positive rates: peeking 5 times during a test bumps your real false-positive rate from a planned 5% to ~14%; peeking 10 times pushes it past 20%. Most early stops on "winning" tests are noise, not signal. Why? Because conversion-rate differences fluctuate randomly throughout a test — eventually one variant will look ahead by chance, and stopping there bakes that fluctuation into your decision. To peek safely, use one of: (1) Bayesian methods (handle peeking by design), (2) sequential testing methods like SPRT or always-valid p-values (Optimizely Stats Engine, Eppo), or (3) discipline. This calculator gives you the planned sample size; commit before launch.
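
The inflation is easy to reproduce in a simulation (a sketch under assumed parameters, not a rigorous derivation; numpy and scipy assumed available): both arms share the same true 5% conversion rate, yet stopping at the first of five interim looks that shows p < 0.05 fires far more often than the nominal 5%.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    n_per_arm, looks, runs, false_positives = 20_000, 5, 2_000, 0

    for _ in range(runs):
        a = rng.random(n_per_arm) < 0.05              # control conversions (no real effect)
        b = rng.random(n_per_arm) < 0.05              # treatment, identical true rate
        for frac in np.linspace(1 / looks, 1.0, looks):
            n = int(n_per_arm * frac)
            diff = b[:n].mean() - a[:n].mean()
            pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(diff) / se > norm.ppf(0.975):   # "significant" at 95%, two-tailed
                false_positives += 1
                break

    print(f"False-positive rate with {looks} peeks: {false_positives / runs:.1%}")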

Compound Conversion Rate Quarter After Quarter

Our Conversion Rate Optimization team runs statistically sound A/B tests, implements winning variants, and compounds your conversion rate every quarter.

Let's Talk