It started with a tweet…

fwiw, that’s the wrong thing to measure there. If they get one big donator falling into either side of the test it badly skews the numbers.

— dan barker (@danbarker) February 21, 2017

Dan points out that because some donations are a lot bigger than others the results of an A/B test can be skewed if one very big donation coincidentally falls into one test variation.

This is a very fair point, but surely there should be a way to check statistically whether a difference is due to random chance or to one page genuinely performing better.

There are many ways this could be modelled; here is my quick attempt to show how it could work in practice.

First attempt

To start with I’m going to try to demonstrate what Dan is talking about. I’ll make some donation data for two identical pages (i.e. an A/A test) and show what kind of result this gives.

Then I’ll add a single large donation to one page; you will see how a naive approach suddenly concludes that that page is better.

    library(knitr)  # for kable()

    set.seed(2413)
    conversion_rate <- 0.05
    # each visit converts with probability 0.05; converters donate an amount
    # drawn from a normal distribution (mean $10, sd $3), everyone else gives 0
    page1 <- rbinom(1000, 1, conversion_rate) * rnorm(1000, mean = 10, sd = 3)
    page2 <- rbinom(1000, 1, conversion_rate) * rnorm(1000, mean = 10, sd = 3)
    df <- data.frame(Page1 = mean(page1), Page2 = mean(page2))
    kable(df)
| Page1 | Page2 |
|------:|------:|
| 0.4861572 | 0.4927313 |

By random chance, the donations per visit for page 2 are slightly higher than for page 1.

    df <- data.frame(Page1 = max(page1), Page2 = max(page2))
    kable(df)
| Page1 | Page2 |
|------:|------:|
| 17.90275 | 19.54638 |

But the maximum donation size isn’t that different.
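As a quick aside: a naive frequentist check on the raw per-visit amounts (a plain Welch t-test; my illustration, not something the original analysis relies on) also can't separate the pages at this point:

    # naive check: two-sample t-test on the raw donations per visit
    t.test(page1, page2)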

To see the chance that one page is better than the other I will use the following Stan model:

data {
  int<lower=0> N1;
  int<lower=0> N2;
  int<lower=0,upper=1> donated1[N1];
  int<lower=0,upper=1> donated2[N2];
  real donation_amount1[N1];
  real donation_amount2[N2];
}
parameters {
  real<lower=0,upper=1> convr1;
  real<lower=0,upper=1> convr2;
  real donation_mean1;
  real donation_mean2;
  real<lower=0> donation_sd1;
  real<lower=0> donation_sd2;
}
model {
  for (i in 1:N1) {
    donated1[i] ~ bernoulli(convr1);
    if (donated1[i]==1) {
      donation_amount1[i] ~ normal(donation_mean1, donation_sd1);
    }
  }
  for (i in 1:N2) {
    donated2[i] ~ bernoulli(convr2);
    if (donated2[i]==1) {
      donation_amount2[i] ~ normal(donation_mean2, donation_sd2);
    }
  }
}
  

And fit it to the generated data:

    library(rstan)
    rstan_options(auto_write = TRUE)
    options(mc.cores = parallel::detectCores())
    model <- "..."  # the full Stan program listed above, pasted in as one string

    # page1/page2 hold the donation amount per visit (0 for non-donors)
    d <- list(N1 = 1000, N2 = 1000,
              donation_amount1 = page1, donation_amount2 = page2,
              donated1 = as.integer(page1 > 0), donated2 = as.integer(page2 > 0))

    fit <- stan(model_code = model, data = d, iter = 10000, chains = 4, seed = 152, refresh = -1)

Finally, extract the statistic of interest from the samples:

    samples <- extract(fit)
    sum(samples$convr1 > samples$convr2)/20000
## [1] 0.73795

There is a roughly 74% chance that the conversion rate for page 1 is better than that of page 2.
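(The denominator 20000 is the number of post-warmup draws: 4 chains × 10000 iterations, with the first half of each chain discarded as warmup by default. Equivalently, without hard-coding the draw count:)

    # same posterior probability, letting R count the draws
    mean(samples$convr1 > samples$convr2)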

    sum(samples$donation_mean1*samples$convr1 > samples$donation_mean2*samples$convr2)/20000
## [1] 0.4725

And a 53% chance (1 − 0.4725) that page 2 generates more donation dollars per visit than page 1; donation dollars per visit is the conversion rate times the mean donation size, which is the product the code above compares.

Most people either wouldn’t take any action off these numbers, or would declare the challenger to have failed and the incumbent page to be the winner.

Now add a large donation to page 2 and redo the computation:

    # prepend a single large ($1000) donation to page 2
    page2 <- c(1000, page2)

    d <- list(N1 = 1000, N2 = 1001,
              donation_amount1 = page1, donation_amount2 = page2,
              donated1 = as.integer(page1 > 0), donated2 = as.integer(page2 > 0))

    fit <- stan(model_code = model, data = d, iter = 10000, chains = 4, seed = 152, refresh = -1)
    samples <- extract(fit)
    sum(samples$convr1 > samples$convr2)/20000
## [1] 0.70045

A 70% chance that page 1 has a better conversion rate than page 2 (this is expected - we’ve only added one extra conversion).

    sum(samples$donation_mean1*samples$convr1 > samples$donation_mean2*samples$convr2)/20000
## [1] 0.0971

But now there is a 90% chance that page 2 generates more donation dollars per visit!

This is exactly what Dan was talking about - the test result gets massively skewed by a single random donation.
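To see the size of the distortion in the raw numbers: that one gift roughly triples page 2’s observed donations per visit.

    # raw donations per visit, before and after the single $1000 gift
    mean(page2[-1])  # ~0.49: the original 1000 visits
    mean(page2)      # ~1.49: with the $1000 donation included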

“Skewed” is a bit of a loaded word here; “changed” would be more neutral. But my intuition says that a single donation should not change the result this much.

As I’ve said before, when the model doesn’t agree with intuition one of the following things must be going on:

  1. My intuition is wrong
  2. The model is wrong
  3. All of the above

For the purposes of this discussion, let’s go with option 2 and try to find another model.

Second attempt

Our first model was that, for each page, users convert with probability convr and then those that convert make a donation with an amount drawn from a normal distribution having mean donation_mean and standard deviation donation_sd.

The model fails because the donation amounts are not normally distributed: there are a lot of small donations and a handful of much larger ones. The large donations distort our estimate of the mean even though there are only a few of them.

Picking a distribution with heavier tails reduces this problem.
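To get a feel for the difference: under a normal(10, 3), a $1000 donation is essentially impossible, while a Pareto assigns it a small but real probability. (The shape value below is an illustrative choice that gives the Pareto a mean of about $10, not a fitted estimate.)

    # probability of a single donation exceeding $1000
    pnorm(1000, mean = 10, sd = 3, lower.tail = FALSE)  # numerically 0 under the normal
    # the Pareto(m, shape) survival function is (m/x)^shape
    (0.1 / 1000)^1.01                                   # ~9e-5: rare, but entirely possible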

Change the model to the following:

data {
  int<lower=0> N1;
  int<lower=0> N2;
  int<lower=0,upper=1> donated1[N1];
  int<lower=0,upper=1> donated2[N2];
  real<lower=0> donation_amount1[N1];
  real<lower=0> donation_amount2[N2];
}
parameters {
  real<lower=0,upper=1> convr1;
  real<lower=0,upper=1> convr2;
  real<lower=0> pareto_shape1;
  real<lower=0> pareto_shape2;
}
model {
  for (i in 1:N1) {
    donated1[i] ~ bernoulli(convr1);
    if (donated1[i]==1) {
      donation_amount1[i] ~ pareto(0.1,pareto_shape1);
    }
  }
  for (i in 1:N2) {
    donated2[i] ~ bernoulli(convr2);
    if (donated2[i]==1) {
      donation_amount2[i] ~ pareto(0.1,pareto_shape2);
    }
  }
}
  

Instead of a normal distribution, this model assumes that the donation amounts come from a Pareto distribution with minimum 0.1 and shape pareto_shape.
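For reference, the mean of a Pareto(m, shape) distribution is shape · m / (shape − 1), finite only when shape > 1; this is the quantity compared in the code below.

    # mean of a Pareto(m, shape); infinite for shape <= 1
    pareto_mean <- function(m, shape) ifelse(shape > 1, shape * m / (shape - 1), Inf)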

Fit the model on the data:

    model <- "data {
      int<lower=0> N1;
      int<lower=0> N2;
      int<lower=0,upper=1> donated1[N1];
      int<lower=0,upper=1> donated2[N2];
      real<lower=0> donation_amount1[N1];
      real<lower=0> donation_amount2[N2];
    }
    parameters {
      real<lower=0,upper=1> convr1;
      real<lower=0,upper=1> convr2;
      real<lower=0> pareto_shape1;
      real<lower=0> pareto_shape2;
    }
    model {
      for (i in 1:N1) {
        donated1[i] ~ bernoulli(convr1);
        if (donated1[i]==1) {
          donation_amount1[i] ~ pareto(0.1,pareto_shape1);
        }
      }
      for (i in 1:N2) {
        donated2[i] ~ bernoulli(convr2);
        if (donated2[i]==1) {
          donation_amount2[i] ~ pareto(0.1,pareto_shape2);
        }
      }
   }"
    fit <- stan(model_code = model, data = d, iter = 10000, chains = 4, seed = 152, refresh = -1)
    samples <- extract(fit)
    sum(samples$convr1 > samples$convr2)/20000
## [1] 0.6973
    # shape * 0.1 / (shape - 1) is the posterior mean donation size under Pareto(0.1, shape)
    sum(samples$pareto_shape1*0.1/(samples$pareto_shape1-1) > samples$pareto_shape2*0.1/(samples$pareto_shape2-1))/20000
## [1] 0.4013

The conversion rate numbers don’t change: there is still a 70% chance that page 1 has a better conversion rate than page 2.

But the donated-dollars comparison has changed a lot: there is now a 60% chance that page 2 is better than page 1, compared to 90% under the previous model. (Strictly, the statistic above compares the posterior mean donation sizes rather than multiplying by the conversion rates as before.)

This seems more reasonable to me, but in real life there would still be more work to do: I picked the Pareto distribution because it is a heavy-tailed distribution that I’ve heard of. It might not be the best fit for real donation data; if the tail is too fat we will reject pages that are actually better, and if the tail is too thin we will declare winners that aren’t actually any better.
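One cheap way to check, as a sketch (the helper code below is mine, not from the original analysis): fit the Pareto shape by maximum likelihood, for which the estimator with a known minimum m is n / Σ log(xᵢ / m), and compare observed donation quantiles against the fitted distribution.

    # rough goodness-of-fit check: observed donations vs a fitted Pareto
    donations <- page1[page1 > 0]
    m <- 0.1
    shape_hat <- length(donations) / sum(log(donations / m))  # maximum likelihood estimate
    p <- ppoints(length(donations))
    fitted_q <- m / (1 - p)^(1 / shape_hat)                   # Pareto quantile function
    plot(fitted_q, quantile(donations, p),
         xlab = "fitted Pareto quantiles", ylab = "observed quantiles")
    abline(0, 1)  # large departures from this line suggest the tail is wrong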