How A/B Test Scores Are Calculated
The real challenge for us in implementing A/B Testing, as well as for you in interpreting the results, is to ensure that your tests are reliable and the results significant to your business.
A/B Testing is grounded in statistics: it relies on a method that is sound, trustworthy, and verifiable. The concern here is with ensuring unbiased and accurate results. We also use proven mathematics, so that the underlying statistical calculations represent (or reliably predict) real differences between A and B.
The Method and Math at a glance
- Randomness: The assignment of a user to scenario A or B is purely random.
- Statistical significance: Our statistical significance threshold is 95% confidence, which corresponds to a p-value of 0.05 or less (the confidence score can be roughly described as 1 minus the p-value). A test is conclusive (statistically significant) when the confidence score is >= 95%.
- Mathematical formula: The calculation of the scores uses a two-tailed test. This means that we are not assuming anything in advance, especially not that B is better than A. We test for a difference in both directions: whether A is better than B or B is better than A. By using the two-tailed approach, we avoid making any assumptions about A or B; either outcome is possible (see the sketch just after this list).
- Relevance Improvement: That said, the concern with A/B Testing is usually improvement: every change to B is usually intended to be better than the current main index (A). In other words, A/B Testing is trying to help us find a better configuration than the one we already have.
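To make the two-tailed calculation concrete, here is a minimal sketch of a pooled two-proportion z-test in Python. It illustrates the general technique described above, not the exact code used to compute the scores; the function names and the traffic numbers at the end are invented for the example.

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_tailed_confidence(conversions_a: int, users_a: int,
                          conversions_b: int, users_b: int) -> float:
    """Return a confidence score (0 to 1) that A and B really differ.

    Uses a two-tailed, pooled two-proportion z-test: no direction is assumed,
    so a large difference in either direction counts as evidence.
    """
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    if std_err == 0:
        return 0.0
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - normal_cdf(abs(z)))  # two-tailed p-value
    return 1 - p_value                      # confidence score

# Hypothetical traffic: 10% conversion on A, 11% on B, 10,000 users each.
confidence = two_tailed_confidence(1000, 10_000, 1100, 10_000)
print(f"confidence = {confidence:.1%}")
print("conclusive" if confidence >= 0.95 else "not yet conclusive")
```

The 95% threshold then becomes a simple comparison against this confidence score, as in the last line of the sketch.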
Statistical Significance or Chance
When you run your tests, you may get results that show, say, a 4% increase in Conversion or CTR. Statistical significance is concerned with whether that 4% increase is due to chance or reflects a real difference. The statistical question is whether your sample group truly represents the larger population: does the 4% hold only for that sample group, or does it reasonably predict the behavior of the larger population?
If the sample does not represent the larger population, then your results are due to chance (or luck) and are insignificant. Chance is clearly not a good basis for decision-making; in fact, chance of this kind is considered a sampling error. Statistical significance (our confidence score) is aimed at distinguishing chance from a real effect. When you reach 95% confidence, the difference between A and B is no longer a mere chance happening but something you can confidently expect (or predict) to happen again in the future.
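As an illustration of chance versus significance, here is a small, self-contained sketch (Python, with made-up conversion rates) that judges the same observed lift, 10.0% versus 10.4% conversion (a 4% relative increase), at two very different sample sizes, using the same two-tailed pooled z-test as above.

```python
from math import erf, sqrt

def confidence_for(rate_a: float, rate_b: float, n_per_variant: int) -> float:
    """Two-tailed confidence that two observed conversion rates really differ,
    given n_per_variant users in each variant (pooled z-test)."""
    pooled = (rate_a + rate_b) / 2  # equal sample sizes, so a simple mean
    std_err = sqrt(pooled * (1 - pooled) * 2 / n_per_variant)
    z = abs(rate_b - rate_a) / std_err
    p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
    return 1 - p_value

# The same observed lift (10.0% vs 10.4% conversion) at two sample sizes.
for n in (5_000, 200_000):
    print(f"n = {n:>7} users per variant: "
          f"confidence = {confidence_for(0.100, 0.104, n):.1%}")
```

With 5,000 users per variant the lift is indistinguishable from chance (confidence around 50%); with 200,000 users per variant it comfortably clears the 95% threshold.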
Large, distributed samples.
Large samples of data are necessary for confidence. If we flip a coin 1,000 times, we expect to approach a 50/50 split of heads and tails. In the same way, the more people we ask a question, the closer we get to a consensus on which answers to expect. This consensus falls into a pattern (the famous bell curve), with an average response and responses spread a few standard deviations on either side of that average. The more a response deviates from the average, the less often we expect to see it again.
The point of building up this kind of distribution is to stabilize the variation. We feel more confident with each new user event because we start to see them all falling into the same pattern, close to the average: users' actions vary according to a predictable distribution. Once this starts to happen, our confidence in the results grows. We can start to trust that the average of the sample is indeed the real average we can expect from the larger population.
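The effect of sample size on this stability can be seen with a small simulation (illustrative Python, not part of the product): as we flip a fair coin more times, the observed share of heads settles near 50%, and the expected spread around it, the standard error, shrinks like 0.5 / sqrt(n).

```python
import random
from math import sqrt

random.seed(7)  # fixed seed so the illustration is reproducible

def observed_heads_ratio(flips: int) -> float:
    """Simulate `flips` fair coin flips and return the share of heads."""
    return sum(random.random() < 0.5 for _ in range(flips)) / flips

for n in (100, 1_000, 10_000, 100_000):
    ratio = observed_heads_ratio(n)
    std_err = 0.5 / sqrt(n)  # standard error of the observed proportion
    print(f"{n:>7} flips: observed {ratio:.3f} heads "
          f"(expected 0.500 +/- about {std_err:.3f})")
```

The same narrowing happens with conversion rates: the more users in each variant, the closer the observed rate sits to the true rate, and the more meaningful a given difference between A and B becomes.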
Sample diversity.
Random assignment of users to A and B does not always produce an honest, reliable balance. You may get unlucky: the randomness can start out by putting your most devoted clients in A and a completely different set of users in B, which makes for an unfortunate testing scenario. Fortunately, the math makes it hard for such a skewed sample to reach a 95% confidence score.
But as a test continues to run over time, its random allotment of users tends to even out this imbalance (see the sketch below).
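A quick simulation (again purely illustrative, with an invented split between "devoted" and "casual" users) sketches why: early on, the share of devoted users in each variant can drift apart by chance, but as more users are randomly assigned, the shares in A and B converge toward the same value.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

def random_user() -> str:
    """Hypothetical population: 20% devoted users, 80% casual users."""
    return "devoted" if random.random() < 0.20 else "casual"

counts = {"A": {"devoted": 0, "casual": 0}, "B": {"devoted": 0, "casual": 0}}
checkpoints = {100, 1_000, 10_000, 100_000}

for i in range(1, 100_001):
    variant = random.choice("AB")  # purely random 50/50 assignment
    counts[variant][random_user()] += 1
    if i in checkpoints:
        shares = {v: counts[v]["devoted"] / sum(counts[v].values()) for v in "AB"}
        print(f"after {i:>7} users: devoted share in A = {shares['A']:.1%}, "
              f"in B = {shares['B']:.1%}")
```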
One caveat about letting tests run over time: be careful about when you test. Testing during a sales campaign, a major holiday, or some other unusual event can undermine the reliability of the results.