p-Hacking your A/B tests
Faking results using real experiments
Suppose you wanted to publish a believable report showing an experimental result you know to be false. The experiment must be statistically sound to be believable. Suppose in particular you have an experiment that will show the correct result 95% of the time. How do you publish a false result anyway?
Easy: Run that experiment repeatedly, until you just happen to hit that 5% chance of it yielding the false result. On average that takes about 20 runs. Then publish only that false result.
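To see how few repetitions that really takes, here's a minimal simulation sketch (my own illustration, assuming only a 5% chance of a false result on each independent run, not code from any actual study):

```python
import random

# Simulate re-running a 95%-accurate experiment until the 5% fluke appears.
def runs_until_false_result(false_rate=0.05):
    runs = 1
    while random.random() >= false_rate:   # keep re-running until the false result shows up
        runs += 1
    return runs

random.seed(0)
samples = [runs_until_false_result() for _ in range(100_000)]
print(sum(samples) / len(samples))   # ~20 runs on average (geometric distribution, mean 1/0.05)
```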
This is in fact what pharmaceutical companies used to do to get drugs approved by the FDA1. To stop this behavior, companies are now required to pre-register studies and publish all results2.
1 The United States government agency that approves drugs for sale.
2 Although there are plenty of recent cases where companies were allowed to submit only a subset of trials, got approved anyway, and the drugs were later found to be ineffective.
This is also what happened in the social sciences, resulting in the Replication Crisis. An experiment would be run once, often with a small number of college students. Occasionally something “interesting” happened, and that was published. Journals don’t publish papers where “nothing interesting” happened, and they don’t wait for other teams to reproduce a result before publishing, which would weed out false-positives. Nobel Laureates have admitted that their own work suffers from this problem, and have called for “replication rings” to solve it. But scientists are people too, and often would prefer fame for an amazing result over waiting to discover that their result is a false-positive.
Before you shake your finger at them, point that finger at yourself. Because you’re doing this too.
Faking all your A/B tests (unintentionally)
To see how this unfolds, consider this simple example: You’re testing whether a coin is fair, i.e. that it comes up heads just as often as tails. Your experiment is to flip it 270 times, measuring how often it comes up heads. According to the binomial distribution, if the coin is fair, 90% of the time you will find between 45% and 55% of the flips to be heads.
So, you run the experiment, and you get 57% heads. You conclude the coin is biased, and you say “I’m 90% sure of that.” Is this the right conclusion? Probably?
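If you want to check those numbers yourself, here's a quick sketch (my own, using scipy's binomial distribution; the 270 flips and the 45%–55% band come from the example above):

```python
from scipy.stats import binom

n, p = 270, 0.5        # 270 flips of a genuinely fair coin
low, high = 122, 148   # roughly 45% to 55% heads

# Probability a fair coin lands inside the 45%-55% band:
prob_inside = binom.cdf(high, n, p) - binom.cdf(low - 1, n, p)
print(f"P(45%-55% heads | fair coin) = {prob_inside:.2f}")   # about 0.90

# 57% heads is about 154 of 270, well outside the band:
prob_as_extreme = 2 * (1 - binom.cdf(153, n, p))
print(f"P(result at least that lopsided | fair coin) = {prob_as_extreme:.3f}")   # about 0.02
```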
Now imagine you have 10 coins, and you want to see whether they are fair. So you run the experiment once per coin. 9 of the tests result in “fair coin”, but one test shows “biased coin”.
Should we conclude that the one coin is biased? Almost surely not. Because even if all coins were fair, we know that 10% of the time the test will incorrectly conclude “biased”. So this result of 9 / 1 is exactly the result we’d expect if all coins were fair.
But wait a second… what if in fact 9 coins were fair but 1 was biased? Then this is also the most likely result! It could have also come up 8 / 2, but the 9 / 1 result is the most likely.
So: This is the most likely result both if all 10 coins are fair and if only 9 coins are fair.
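Here's a simulation sketch (my own, following the setup above: 270 flips per coin, “biased” declared whenever heads falls outside 45%–55%) confirming the first half of that: even when every one of the ten coins is fair, you get about one “biased” verdict per batch on average.

```python
import random

# Flip one fair coin 270 times; report whether the test (wrongly) calls it biased.
def looks_biased(n_flips=270, low=0.45, high=0.55):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    frac = heads / n_flips
    return frac < low or frac > high

random.seed(1)
batches = 2_000
false_alarms = sum(
    sum(looks_biased() for _ in range(10))   # test a batch of 10 fair coins
    for _ in range(batches)
)
print(false_alarms / batches)   # ~1.0 "biased" verdict per batch of 10 fair coins
```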
So… what exactly can you conclude from this result? Only that this procedure by itself is insufficient, and needs to be augmented before we can conclude anything.
The insight is: You are making exactly this mistake with your A/B tests.
You are running a bunch of A/B tests. You’re looking for (something like) 90% confidence. Mostly they don’t work. Occasionally one works, maybe one out of ten. And you conclude that was a successful test. But this is exactly what we just did with coin-flipping. It doesn’t actually mean the test was successful.
In the real world it’s often even worse, like using confidence of 85% or 80%, where false-positives are much more likely.
Or not even picking a confidence level, not deciding how large a sample size N you need to draw a conclusion, and instead just “running the test until we get a result that looks conclusive.”
This particular error of “stopping when conclusive” is called “p-hacking” by statisticians, and it’s been a well-documented fallacy since the 1950s. The reason it’s a fallacy is that when N is small, random fluctuations will often cause a result that looks like “90% confidence of a positive result,” whereas if you continue the experiment, the data move back into the territory of “negative result.”
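Here's a sketch of how badly “running until conclusive” misbehaves (my own illustration, not from the article: an A/B test with no real difference, a peek at the running z-score after every 100 visitors, and an early stop whenever it crosses the 90% two-sided threshold):

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.95)   # ~1.645, the 90% two-sided threshold

def peeking_test(max_n=2000, batch=100):
    """A/B test with NO real effect, stopped early the moment it 'looks conclusive'."""
    successes = 0
    for n in range(batch, max_n + 1, batch):
        successes += sum(random.random() < 0.5 for _ in range(batch))
        z = abs(successes - n * 0.5) / (0.25 * n) ** 0.5   # z-score vs. "no difference"
        if z > Z_CRIT:
            return True    # declared a winner that isn't real
    return False

random.seed(2)
trials = 2_000
fp_rate = sum(peeking_test() for _ in range(trials)) / trials
print(f"False positive rate with peeking: {fp_rate:.0%}")   # well above the nominal 10%
```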
I show some fun real-world examples of p-hacking, and counter-examples when the experiment is done properly, in this video:
Marketers: The accidental p-hackers
Marketers have been making these p-hacking errors in A/B testing for many years. You are too.
There is data. A study3 of more than 2,100 real-world A/B tests across Optimizely’s customer base found a 40% false positive rate. Marketers never knew it; indeed, the Optimizely software said the tests were “significant”! The marketers never had a chance.
3 Here is the study, and here is a blog post from the author, addressing concerns and caveats.
So, roughly half the time you think “this A/B test was successful,” it isn’t.
This explains another phenomenon that you’re probably familiar with if you’ve done a lot of A/B testing:
- You run tests. Sometimes one is significant. You keep that result and continue testing new variants.
- You keep repeating this process, keeping the designs that are “better.”
- Over time… one is 10% better. Another is 20% better. Another 10% better.
- So, that should be about 45% better overall (see the quick check after this list).
- You look back between now and months ago when you first started all this… and you don’t see a 45% improvement! Often, there’s no improvement at all.
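For the record, here's where the 45% comes from; improvements compound multiplicatively, so three real lifts of 10%, 20%, and 10% should stack up like this:

```python
lifts = [0.10, 0.20, 0.10]   # the three "winning" tests from the list above
total = 1.0
for lift in lifts:
    total *= 1 + lift        # improvements multiply rather than add
print(f"{total - 1:.0%}")    # ~45% better overall, if the lifts were real
```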
Why didn’t all those improvements add up? Because they were false-positives.
How to stop fooling yourself
The easiest thing is to run the test again.
If false positives are 1 in 10 at 90% confidence, then you should be able to run a second test, and get the same result.
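The arithmetic of why this works is simple (a back-of-the-envelope sketch, assuming the two runs are independent):

```python
p_false_once = 0.10                # one lucky test at 90% confidence
p_false_twice = p_false_once ** 2  # two independent lucky tests in a row
print(f"Fooled by one test:     {p_false_once:.0%}")    # 10%
print(f"Fooled by two in a row: {p_false_twice:.0%}")   # 1%
```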
And don’t stop tests early. I know you’re excited. Just wait.
That’s it? Almost. There’s a much smarter way to go about this. And if you want to keep your job even with the rise of AI, you need to be smarter than just running a bunch of variants.
Form a theory, test the theory, extend the theory
Too often A/B tests are just “throwing shit at the wall.” We excuse this behavior with statements like “No one knows which headline will work, it’s impossible to guess, so we just try things.”
Not only is this thoughtless and lazy, it also means you haven’t learned anything, regardless of the test result.
You don’t want to be a random-phrase slinger. AI can do that too, and AI isn’t a good marketer. Instead, you want to create validated learning.
To do this, form a theory, then design experiments to test the theory. Example theories:
- At this point, visitors are ready to buy, so send them down a purchase funnel with a restricted UX.
- Here people want to learn more, so talk about options and let them explore features rather than being crammed down a funnel.
- People are on the fence, so we should be more forceful and confident in our language.
- People can’t see well, especially on mobile devices, so we should have higher contrast colors and less text.
- Pictures work better than paragraphs, especially since people hate reading and half of them don’t speak English natively.
- People are more likely to click things that look like buttons than things that look like links.
- People from marketing channel X are more likely to be in a Y state of mind, and to be excited by Z.
Perhaps some ideas already popped into your mind. Good! Start writing down those theories. Then make designs that would work better if that theory were true.
It’s not “shit at the wall” because this time you have a specific Theory of Customer that your shit is designed to test. And that makes all the difference:
The negative result
So let’s say you pick a theory, run a test, and it fails. Is your theory disproved?
Not quite yet. Perhaps your implementation wasn’t the best manifestation of the theory. Not extreme enough, or had other issues that covered up the good effects. If you feel this might be the case, run a new experiment.
But if you’re still not getting positive results after a few iterations, you have accumulated evidence that the theory is incorrect. That’s called “learning.” Which wasn’t happening when you “threw shit at the wall.” Now you know that you need to invent a new, different theory and test that.
How useful, and directed.
The positive result
Suppose you had a positive result. Hooray!
Is the theory proven? No, because you read the first half of the article, so you know that positive A/B tests are often false. So, what do you do?
You lean even further into the theory. Run another test that’s even more extreme, or a different form of the same concept.
If the theory is truly correct, that will work again, perhaps even better! If it reverts to nothing, you know it wasn’t a real result.
Now you’re not fooling yourself. You’re finding theories that are actually correct. That’s what “validated learning” looks like.
Since you’ve actually learned something, you can extend the theory. What else is probably true? What new designs and text and pictures would leverage those insights even more? Now you might make multiple leaps of improvement, rather than going back to spraying random things on your website.
And you’re a smart marketer that AI cannot replace.
Most theories won’t be right (or at least not impactful enough to matter). Most tests will come up negative. That’s frustrating but also the truth.
You do want the truth…
Don’t you?
https://longform.asmartbear.com/p-hacking/
© 2007-2024 Jason Cohen @asmartbear