Calculate statistical significance, p-value, confidence intervals, and plan your experiments with our comprehensive A/B testing calculator. Includes both frequentist and Bayesian analysis.
You can be 95% confident that Variation B outperforms Control A.
| Metric | Value | Notes |
|---|---|---|
| Relative Uplift | +20.00% | Change vs. Control |
| P-Value | 0.0101 | Significant |
| Bayesian Probability | 99.5% | Chance B beats A |
| Observed Power | 73.0% | Low power |
| Z-Score | 2.5726 | |
| Effect Size (Cohen's h) | 0.0364 | Small |
| Std Error (Diff) | ±0.2720% | |
| CI for Difference | 0.17% to 1.23% | 95% confidence interval |
| Control Rate | 3.50% | 350 / 10,000 |
| Variation Rate | 4.20% | 420 / 10,000 |
Our A/B test calculator offers two modes: Analyze Results to evaluate completed experiments and Plan Experiment to determine required sample sizes before you start.
- Plan ahead: use Plan mode before starting your test to know exactly how many visitors you need.
- Don't peek: checking results early and stopping as soon as they look "significant" leads to false positives.
- Run full weeks: user behavior varies by day of the week, so run for at least 1-2 complete weeks.
- Test one change at a time: isolate variables so you can tell what caused the change in conversions.
| Metric | Good Value | Interpretation |
|---|---|---|
| P-Value | < 0.05 | Results are statistically significant |
| Bayesian Probability | > 95% | Very likely that B beats A |
| Observed Power | > 80% | Adequate sample size to detect effect |
| Cohen's h | > 0.2 | Effect is practically meaningful |
| SRM P-Value | > 0.01 | No sample ratio mismatch |
Statistical significance measures how unlikely your observed difference would be if the control and variation truly performed the same. A result is typically considered significant when the p-value is below 0.05, meaning that if there were no real difference, a result at least this extreme would occur less than 5% of the time.
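If you want to reproduce the p-value in the example above, a minimal Python sketch of the pooled two-proportion z-test looks like this (the function name is illustrative, not the calculator's internal API):

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))                  # two-sided p-value
    return z, p_value

z, p = two_proportion_z_test(350, 10_000, 420, 10_000)
print(f"z = {z:.4f}, p = {p:.4f}")                        # ≈ 2.5726 and 0.0101
```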
95% confidence is the industry standard for A/B testing. It corresponds to a 5% significance level, so at most about 1 in 20 tests with no real underlying difference will be falsely declared a winner. For high-stakes decisions (like pricing changes), consider using 99% confidence.
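As an illustration, the 95% interval for the difference shown in the example can be reproduced with an unpooled standard error; this is a sketch, not necessarily the calculator's exact formula:

```python
from math import sqrt
from scipy.stats import norm

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Confidence interval for the absolute difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)   # unpooled SE
    z = norm.ppf(1 - (1 - confidence) / 2)                     # 1.96 for 95%
    diff = p_b - p_a
    return diff - z * se, diff + z * se

lo, hi = diff_confidence_interval(350, 10_000, 420, 10_000)
print(f"95% CI for the difference: {lo:.2%} to {hi:.2%}")      # ≈ 0.17% to 1.23%
```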
Frequentist testing gives you a p-value and tells you if results are 'significant' or not. Bayesian testing gives you a probability that one variation is better than another (e.g., '95% chance B beats A'). Bayesian is often more intuitive for business decisions.
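One common way to compute a "chance B beats A" figure is Monte Carlo sampling from Beta posteriors with uniform priors; the sketch below is an assumption about how such a number can be derived, not the calculator's own implementation:

```python
import numpy as np

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=200_000, seed=0):
    """Estimate P(rate_B > rate_A) using Beta(1, 1) priors and Monte Carlo sampling."""
    rng = np.random.default_rng(seed)
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, samples)   # posterior draws for A
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, samples)   # posterior draws for B
    return (post_b > post_a).mean()

print(f"P(B beats A) ≈ {prob_b_beats_a(350, 10_000, 420, 10_000):.1%}")  # ≈ 99.5%
```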
Run your test until you reach the required sample size calculated before starting. Never stop early just because results look significant; doing so leads to false positives. Also run for at least 1-2 full weeks to account for day-of-week variations.
SRM occurs when the traffic split between variations is significantly different from what you intended (e.g., 50/50). This often indicates a bug in your experiment setup and can invalidate your results. Our calculator automatically detects SRM.
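A standard way to check for SRM is a chi-square goodness-of-fit test against the intended split. The sketch below illustrates the idea; it is not necessarily how this calculator performs the check:

```python
from scipy.stats import chisquare

def srm_p_value(visitors, expected_ratios=(0.5, 0.5)):
    """Chi-square goodness-of-fit p-value for a sample ratio mismatch check."""
    total = sum(visitors)
    expected = [total * r for r in expected_ratios]
    _, p_value = chisquare(visitors, f_exp=expected)
    return p_value

print(f"{srm_p_value([10_000, 10_000]):.4f}")   # 1.0000 -> split matches the plan
print(f"{srm_p_value([10_000, 9_500]):.4f}")    # ≈ 0.0003 -> below 0.01, investigate
```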
MDE is the smallest improvement you want to be able to detect. A 20% MDE on a 5% baseline means you want to detect if the new version achieves at least 6%. Smaller MDEs require larger sample sizes.
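A rough per-group sample size can be estimated with the standard two-proportion formula; exact figures vary slightly between calculators depending on whether pooled or unpooled variance is used, so treat this sketch as an approximation:

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation for a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)       # 5% baseline with a 20% MDE -> 6%
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(sample_size_per_group(0.05, 0.20))     # ≈ 8,155 visitors per group
```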
Statistical power (typically 80-90%) is the probability of detecting a real effect when it exists. Low power means you might miss real improvements. Our calculator shows observed power to help you understand if your sample size was adequate.
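Observed (post-hoc) power can be approximated directly from the z-score; this illustrative snippet reproduces the 73.0% figure from the example above:

```python
from scipy.stats import norm

def observed_power(z, alpha=0.05):
    """Post-hoc power: the chance a test of this size would flag the observed effect."""
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(abs(z) - z_crit) + norm.cdf(-abs(z) - z_crit)

print(f"{observed_power(2.5726):.1%}")   # ≈ 73.0%, matching the example result
```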
Use two-sided (default) when you want to detect any difference - positive or negative. Use one-sided only when you're certain the variation can't perform worse than control and you only care about improvements.
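For an effect in the hypothesized direction, the one-sided p-value is simply half the two-sided one, as this short illustration shows:

```python
from scipy.stats import norm

z = 2.5726                                   # z-score from the example above
p_two_sided = 2 * (1 - norm.cdf(abs(z)))     # detects a difference in either direction
p_one_sided = 1 - norm.cdf(z)                # only asks "is B better than A?"
print(f"two-sided p = {p_two_sided:.4f}, one-sided p = {p_one_sided:.4f}")  # ≈ 0.0101 vs 0.0050
```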