At the moment I'm looking at different split testing protocols and trying to decide which one is best. This is going to be quite a big piece of work and I have lots more ideas to look at and "what ifs" to answer.
In this post I'm just going to share a couple of charts showing the results of various testing methodologies. I haven't actually tested the tests (so to speak). Instead I have written a simulator which allows me to check lots of different ideas more quickly. The simulator is not 100% accurate (of course) - I'm still figuring out if it is inaccurate in any important ways.
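To give a flavour of what the simulator does, here is a minimal sketch of the same idea (my own illustration, not the actual simulator code; all names and parameters are mine). Each variation has a hidden true conversion rate, traffic is split evenly across the variations that are still live, and every so often a pluggable strategy is asked whether to pause one of them. The Random and Cheating strategies from the chart are included as examples.

```python
import random

def simulate(true_rates, strategy, n_visitors=10_000, check_every=1_000, seed=0):
    rng = random.Random(seed)
    live = list(range(len(true_rates)))   # variations still receiving traffic
    trials = [0] * len(true_rates)        # visitors seen per variation
    successes = [0] * len(true_rates)     # conversions per variation
    total_conversions = 0

    for i in range(1, n_visitors + 1):
        v = rng.choice(live)              # split traffic evenly across live variations
        trials[v] += 1
        if rng.random() < true_rates[v]:
            successes[v] += 1
            total_conversions += 1
        # Periodically ask the strategy whether a variation should be paused.
        if i % check_every == 0 and len(live) > 1:
            to_pause = strategy(live, trials, successes, rng)
            if to_pause is not None:
                live.remove(to_pause)

    return total_conversions


def random_strategy(live, trials, successes, rng):
    # "Random" strategy: pause a randomly chosen live variation.
    return rng.choice(live)


def make_cheating_strategy(true_rates):
    # "Cheating" strategy: knows the true rates and always pauses the genuinely worst one.
    def strategy(live, trials, successes, rng):
        return min(live, key=lambda v: true_rates[v])
    return strategy


if __name__ == "__main__":
    rates = [0.05, 0.06, 0.04]            # hidden true conversion rates
    print("random  :", simulate(rates, random_strategy))
    print("cheating:", simulate(rates, make_cheating_strategy(rates)))
```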
This chart shows what happens when different p values are used in testing. I have also included a random strategy (pick which variation to pause at random) and a cheating strategy that knows which version is actually the best and always chooses it.
As you can see, the p value used doesn't make much difference unless you are aiming for a very accurate test (p=0.01). This is because the extra errors made by the less stringent tests are more than cancelled out by the faster decisions they allow.
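To make the pause rule concrete, here is a rough sketch of a significance-test strategy that plugs into the simulator above (again my own illustration, assuming an ANOVA-style test over each variation's 0/1 conversion outcomes; the real simulator may differ in detail). When the test is significant at the chosen p threshold, the worst observed performer is paused.

```python
from scipy.stats import f_oneway

def make_anova_strategy(p_threshold):
    # Pause rule sketch: run a one-way ANOVA over each live variation's 0/1
    # conversion outcomes; if significant at p_threshold, pause the variation
    # with the lowest observed conversion rate.
    def strategy(live, trials, successes, rng):
        if any(trials[v] == 0 for v in live):
            return None                   # not enough data yet
        groups = [[1] * successes[v] + [0] * (trials[v] - successes[v]) for v in live]
        _, p_value = f_oneway(*groups)
        if p_value < p_threshold:
            return min(live, key=lambda v: successes[v] / trials[v])
        return None
    return strategy
```

Sweeping `p_threshold` (0.01, 0.05, 0.1 and so on) through a strategy like this is the kind of comparison the chart above shows.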
I first heard about using bandits for website optimisation in Steve Hanov's blog post on the subject. Coding this bandit is relatively easy (it's labelled "Naive Bandit" on the chart, and sketched below). Steve's post doesn't address when to pause variations or add new ones. The "ANOVA Bandit" is a bandit that pauses poorly performing variations using ANOVA at p=0.1. The "Bandit Always Add" adds new variations at every opportunity and never pauses anything.
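A minimal epsilon-greedy bandit in the spirit of Steve's post might look something like this (a sketch with my own naming, not his code): 10% of the time show a random variation, and the rest of the time show the one with the best observed conversion rate.

```python
import random

class NaiveBandit:
    def __init__(self, n_variations, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.trials = [0] * n_variations
        self.successes = [0] * n_variations

    def choose(self):
        # Explore: with probability epsilon, show a random variation.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.trials))
        # Exploit: show the variation with the best observed rate
        # (untried variations count as 1.0 so each gets shown at least once).
        rates = [s / t if t else 1.0 for s, t in zip(self.successes, self.trials)]
        return max(range(len(rates)), key=rates.__getitem__)

    def update(self, variation, converted):
        # Record the outcome of showing this variation to a visitor.
        self.trials[variation] += 1
        if converted:
            self.successes[variation] += 1
```

In a simulation each visitor calls `choose()`, is shown that variation, and the result is fed back with `update()`; the bandit never pauses anything, it just starves poor variations of traffic.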
For comparison I have included ANOVA with p value 0.1 from the previous chart, as well as the Random and Cheating strategies.
So from this little snippet of information I suggest the following: