Bad A/B testing is worse than none at all

Image: The red bottle or the blue bottle. (Reuters/Arnd Wiegmann)

A/B testing is all the rage in online marketing now, but you could be doing it all wrong. And the results of a poor test can do more harm than good, experts say.

The concept of A/B testing is simple: Compare how two approaches—different email subject lines or a new website design, for instance—affect user response, and go with the better one. But if you don’t understand the math, A/B testing can actually fool you into picking the less successful option. Martin Goodson, a machine-learning specialist and VP of data science at Skimlinks, outlines three concepts that “come as second nature to statisticians” but are often forgotten by those selling or using A/B testing (pdf).

1. Do you know the statistical power of your test?

Statistics isn’t wizardry, and not all tests are created equal. Statistical power is the probability that a test will detect a difference between two values when that difference truly exists.

But many A/B tests don’t run long enough or use large enough sample sizes to have high statistical power. While it’s important to fix your sample size in advance (more on that in a minute), it also needs to be large, or you’re unlikely to spot real trends. Goodson says a good rule of thumb is 6,000 conversion events (e.g., people clicking on a headline).
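To see why sample size matters, here is a minimal simulation sketch (not Goodson’s own calculation; the 5% baseline rate, 10% uplift, and sample sizes are illustrative assumptions). It estimates power as the fraction of simulated tests in which a real uplift is flagged as significant:

```python
import math
import random

def z_stat(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-statistic for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se if se > 0 else 0.0

def simulate_conversions(n, p):
    """Normal approximation to Binomial(n, p), to keep the simulation fast."""
    return min(n, max(0, round(random.gauss(n * p, math.sqrt(n * p * (1 - p))))))

def estimate_power(base_rate, relative_uplift, n_per_arm, trials=4000):
    """Fraction of simulated tests where a true uplift reaches |z| > 1.96."""
    lifted = base_rate * (1 + relative_uplift)
    hits = 0
    for _ in range(trials):
        conv_a = simulate_conversions(n_per_arm, base_rate)
        conv_b = simulate_conversions(n_per_arm, lifted)
        if abs(z_stat(conv_a, n_per_arm, conv_b, n_per_arm)) > 1.96:
            hits += 1
    return hits / trials

random.seed(0)
low = estimate_power(0.05, 0.10, n_per_arm=2_000)    # small test: low power
high = estimate_power(0.05, 0.10, n_per_arm=50_000)  # large test: high power
print(f"power with 2,000 visitors per arm:  {low:.2f}")
print(f"power with 50,000 visitors per arm: {high:.2f}")
```

With only 2,000 visitors per arm, most runs miss a genuine 10% uplift entirely; at 50,000 per arm (a few thousand conversions, in the neighborhood of Goodson’s rule of thumb) the test catches it nearly every time.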

2. Don’t peek

A cardinal sin of some A/B testing software, Goodson says, is that it monitors the results continuously and stops as soon as a significant result is achieved. This produces false positives 80% of the time. It’s like running a smaller, weaker test—more error-prone and an all-around bad idea—except it’s your own fault for being impatient.
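The peeking problem is easy to demonstrate with an A/A test, in which both arms are identical, so any “significant” result is by definition a false positive. This sketch (illustrative parameters, not from Goodson’s paper) compares stopping at the first significant peek against testing once at a pre-set sample size:

```python
import math
import random

def z_stat(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-statistic for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se if se > 0 else 0.0

def conversions(n, p):
    """Normal approximation to Binomial(n, p)."""
    return min(n, max(0, round(random.gauss(n * p, math.sqrt(n * p * (1 - p))))))

def false_positive_rate(rate=0.05, batch=1_000, n_checks=50, runs=1_000, peek=True):
    """A/A test: both arms have the same true rate, so every hit is false.

    peek=True checks after every batch and stops at the first |z| > 1.96;
    peek=False checks once, at the pre-set final sample size."""
    fps = 0
    for _ in range(runs):
        ca = cb = n = 0
        significant = False
        for _ in range(n_checks):
            ca += conversions(batch, rate)
            cb += conversions(batch, rate)
            n += batch
            if peek and abs(z_stat(ca, n, cb, n)) > 1.96:
                significant = True
                break
        if not peek:
            significant = abs(z_stat(ca, n, cb, n)) > 1.96
        fps += significant
    return fps / runs

random.seed(0)
peek_rate = false_positive_rate(peek=True)
fixed_rate = false_positive_rate(peek=False)
print(f"peeking after every batch: {peek_rate:.2f}")
print(f"fixed sample size:         {fixed_rate:.2f}")
```

The fixed-size test produces false positives at roughly the 5% rate the significance threshold promises; the peeking version declares a nonexistent winner many times more often, because it gives random noise fifty chances to cross the line.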

3. If uplift degrades over time, take note

In statistics, regression to the mean is a well-known phenomenon. If 100 people take a quiz on a subject they’re unfamiliar with, for example, some will still do well through a combination of knowledge and chance. But if you take the top 50 scorers and test them again, most will do worse: the luck that inflated their first scores won’t repeat, so their results drift back toward the mean.
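The quiz example above can be simulated directly. In this sketch, everyone guesses on a 20-question multiple-choice quiz (pure chance, an illustrative simplification), so the top scorers’ advantage is entirely luck and vanishes on the retest:

```python
import random

random.seed(0)
N_PEOPLE, N_QUESTIONS, P_GUESS = 100, 20, 0.25

def quiz_score():
    """Score on a 20-question multiple-choice quiz answered by pure guessing."""
    return sum(random.random() < P_GUESS for _ in range(N_QUESTIONS))

# First round: everyone takes the quiz; pick out the top 50 scorers.
first = [quiz_score() for _ in range(N_PEOPLE)]
top_half = sorted(range(N_PEOPLE), key=lambda i: first[i], reverse=True)[:50]

# Compare the top scorers' first-round mean with their retest mean.
first_mean = sum(first[i] for i in top_half) / 50
retest_mean = sum(quiz_score() for _ in top_half) / 50
print(f"top 50, first round: {first_mean:.1f}")
print(f"same 50 on retest:   {retest_mean:.1f}")
```

The retest mean falls back to the chance level of about 5 correct answers out of 20, well below the group’s first-round average: selecting on an outcome that is partly luck guarantees the follow-up looks worse.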

The same is true in A/B testing: If results seem to be less successful in practice, it’s time to reconsider your statistical methods. That winning headline was probably a fluke, and you’re guilty of bad math.


Correction: A previous version of this article stated that Martin Goodson is an employee of Qubit. While he did write the referenced paper on A/B testing during his employment there, he is now employed by Skimlinks.