---
title: Split Testing, Simpson's Paradox and Causation
tags: testing,R,maths
---

Consider the following split test results:

Treatment    Trials    Conversions    Conversion Rate
A              4974            611              12.2%
B              5026            511              10.1%
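
A quick way to check whether the difference in the aggregate numbers is significant is a two-sample proportion test in base R (a minimal sketch using the counts from the table above):

# Test of equal conversion rates using the aggregate counts above;
# prop.test is in base R and the p-value comes out well below 0.05
prop.test(x = c(611, 511), n = c(4974, 5026))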

This is a convincing victory for treatment A (at least 95% confidence). But look what happens when we segment the test results by channel:

Channel    Treatment    Trials    Conversions    Conversion Rate
X          A              1991             64               3.2%
Y          A              2983            547              18.3%
X          B              2977            129               4.3%
Y          B              2049            382              18.6%

For each channel, treatment B has performed better than treatment A. This is an example of Simpson's Paradox which, as Will Critchlow points out, can lead to problems in split testing when traffic from different channels is not quite evenly distributed between the different treatments. The reversal happens because channel Y converts far better than channel X and treatment A happened to receive a larger share of the channel Y traffic, which inflates its aggregate conversion rate.
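
This is easy to reproduce in R. The following sketch rebuilds the segmented table above and compares the per-channel rates with the pooled rates:

# Rebuild the segmented results and show how pooling reverses the comparison
segments <- data.frame(
  channel     = c("X", "Y", "X", "Y"),
  treatment   = c("A", "A", "B", "B"),
  trials      = c(1991, 2983, 2977, 2049),
  conversions = c(64, 547, 129, 382)
)

# Per-channel conversion rates: B beats A in both X and Y
segments$rate <- segments$conversions / segments$trials
segments

# Pooled over channels: A appears to beat B
pooled <- aggregate(cbind(trials, conversions) ~ treatment, data = segments, FUN = sum)
pooled$rate <- pooled$conversions / pooled$trials
pooled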

In this blog post I will discuss a potential way around this problem.

1 A very brief introduction to Causation and DAGs.

The nature of cause and effect is a tricky philosophical problem and I am not familiar enough with all the ins and outs to give a succinct treatment here. So here is a whirlwind of bullet points to get started with:

  • Causes have effects and vice versa. An effect cannot be its own cause.
  • If two variables are conditionally independent (given some set of other variables) then there is no direct causal link between them.
  • Cause and effect relationships can be modelled as a Directed Acyclic Graph (DAG) where a directed edge from node A to node B indicates a causal relationship between A and B (A causes B in some sense). The graph needs to be acyclic, otherwise you might end up with strange loops where A causes B which causes C which causes A.
  • If you have a DAG you can use science to determine if it is a correct model or not (when I say science I mean the methodology of experimentation/observation). But if you don't have a DAG in mind science won't help you come up with one.
  • Instead you can start by assuming edges between all variables and then start removing edges based on conditional independence tests (there is a small sketch of such a test after this list). This gives the skeleton of a DAG (i.e. no directed edges)
  • Some edges can be directed using further conditional independence tests.
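
For two binary variables, a plain chi-squared test checks marginal independence and the Cochran-Mantel-Haenszel test checks independence conditional on a third variable. Here is a minimal sketch in base R on simulated data (the variable names and rates are made up for illustration):

# Simulated binary data where source influences treatment and treatment
# influences conversion, so source affects conversion only through treatment
set.seed(1)
source     <- rbinom(10000, 1, 0.5)
treatment  <- rbinom(10000, 1, ifelse(source == 1, 0.4, 0.6))
conversion <- rbinom(10000, 1, ifelse(treatment == 1, 0.10, 0.05))

# Marginal test: source and treatment are associated, so this should reject
chisq.test(table(source, treatment))

# Conditional test: source vs conversion, stratified by treatment; with this
# data generating process the test should (typically) give a large p-value
mantelhaen.test(table(source, conversion, treatment))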

I am not doing this justice at all, so back to split testing.

2 Back to split testing

So normally, when we run a split test we have the following causal structure in our heads:

 digraph G {
  size="4,3"
  ratio=expand
  treatment[label="Treatment"]
  conversion[label="Conversion"]
  treatment->conversion
}

This is unrealistic and, as we saw in the example, it can cause problems.

When we also take the traffic source into account there are three new graphs to consider:

 digraph G {
  size="4,3"
  ratio=expand
  treatment[label="Treatment"]
  conversion[label="Conversion"]
  source[label="Source"]
  treatment->conversion
}

The first case is simple; the traffic source makes no difference to the conversion rate and the only thing that does is the treatment group of the user.

 digraph G {
  size="4,3"
  ratio=expand
  treatment[label="Treatment"]
  conversion[label="Conversion"]
  source[label="Source"]
  treatment->conversion
  source->conversion
}

In the second case both the treatment and the source of the traffic influence the conversion rate. Normal statistical tests for significance will still give the right answer here because there is no confounding between the source and the treatment.

 digraph G {
  size="4,3"
  ratio=expand
  treatment[label="Treatment"]
  conversion[label="Conversion"]
  source[label="Source"]
  treatment->conversion
  source->conversion
  source->treatment
}

It is in the third case that Simpson's Paradox (and more general problems with how your split test is set up) can occur.

The R library pcalg can be used to infer causal structure from raw data. Installing pcalg is a little more complicated than for most R libraries but there are clear instructions here.
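
For reference, here is a rough sketch of one way to do the install (the Bioconductor dependencies and the BiocManager route are my assumptions; follow the linked instructions if they differ):

# pcalg depends on some Bioconductor packages (graph, RBGL and, for plotting,
# Rgraphviz), so install those first and then pcalg itself from CRAN
install.packages("BiocManager")
BiocManager::install(c("graph", "RBGL", "Rgraphviz"))
install.packages("pcalg")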

The following code examples create simulated data and then use pcalg to compute the skeleton of the causal graph.

library(pcalg)
completelyRandom <- function() {
  s <- sample(0:1, 10000, T)
  t <- sample(0:1, 10000, T)
  equalConversion <- function(source,treatment) {
    r <- runif(1, 0, 1)
    return(r<0.1) #10% conversion rate for all
  }
  c <- mapply(equalConversion, s, t)
  df <- data.frame(source=s, treatment=t, conversion=c)
  suffStat <- list(dm=df, nlev=c(2,2,2), adaptDF=FALSE)
  pc.fit <- skeleton(suffStat, indepTest = disCItest, p = ncol(df), alpha = 0.05)
  plot(pc.fit, main = "No Relationship At All", labels=c("Channel","Treatment","Conversion"))
}
completelyRandom()

2013-07-19-no-relationship.png

The data is random so the skeleton shows no causal relationship between any of the variables.

library(pcalg)
treatmentEffect <- function() {
  s <- sample(0:1, 10000, T)
  t <- sample(0:1, 10000, T)
  convert <- function(source,treatment) {
    r <- runif(1, 0, 1)
    if(treatment) {
      return(r<0.1) } #Treatment 1 10% conversion rate
    else {return(r<0.05) } #Treatment 2 5% conversion rate
  }
  c <- mapply(convert, s, t)
  df <- data.frame(source=s, treatment=t, conversion=c)
  suffStat <- list(dm=df, nlev=c(2,2,2), adaptDF=FALSE)
  pc.fit <- skeleton(suffStat, indepTest = disCItest, p = ncol(df), alpha = 0.05)
  plot(pc.fit, main = "Treatment Influences Conversion", labels=c("Channel","Treatment","Conversion"))
}
treatmentEffect()

2013-07-19-treatment-effect.png

This could be used instead of a normal split test significance calculation: if there is an edge between Treatment and Conversion in the graph then there is a significant difference between the treatments.

library(pcalg)
sourceAndTreatmentEffect <- function() {
  s <- sample(0:1, 10000, T)
  t <- sample(0:1, 10000, T)
  convert <- function(source,treatment) {
    r <- runif(1, 0, 1)
    if(source && treatment) {
      return(r<0.2) } #source 1, treatment 1 converts 20%
    else if (source && !treatment) {
      return(r<0.1)} #source 1, treatment 0 converts 10%
    else if (!source && !treatment) {
      return(r<0.05)} #source 0, treatment 0 converts 5%
    else {return(r<0.01) } #source 0, treatment 1 converts 1%
  }
  c <- mapply(convert, s, t)
  df <- data.frame(source=s, treatment=t, conversion=c)
  suffStat <- list(dm=df, nlev=c(2,2,2), adaptDF=FALSE)
  pc.fit <- skeleton(suffStat, indepTest = disCItest, p = ncol(df), alpha = 0.05)
  plot(pc.fit, main = "Both Source and Treatment Influence Conversion", labels=c("Channel","Treatment","Conversion"))
}
sourceAndTreatmentEffect()

2013-07-19-source-treatment-effect.png

And finally there is something similar to the case given in the example where the traffic source influences the treatment:

library(pcalg)
biasedTreatment <- function() {
  s <- sample(0:1, 10000, T)
  treatment <- function(source) {
    r <- runif(1, 0, 1)
    if(source) {
      return(r<0.4) } #source 1 gets treatment 1 40% of the time
    else {
      return(r<0.6)} #source 0 gets treatment 1 60% of the time
  }
  t <- sapply(s, treatment)
  convert <- function(source,treatment) {
    r <- runif(1, 0, 1)
    if(source && treatment) {
      return(r<0.2) } #source 1, treatment 1 converts 20%
    else if (source && !treatment) {
      return(r<0.18)} #source 1, treatment 0 converts 18%
    else if (!source && !treatment) {
      return(r<0.03)} #source 0, treatment 0 converts 3%
    else {return(r<0.05) } #source 0, treatment 1 converts 5%
  }
  c <- mapply(convert, s, t)
  df <- data.frame(source=s, treatment=t, conversion=c)
  suffStat <- list(dm=df, nlev=c(2,2,2), adaptDF=FALSE)
  pc.fit <- skeleton(suffStat, indepTest = disCItest, p = ncol(df), alpha = 0.05)
  plot(pc.fit, main = "Source Influences Treatment", labels=c("Channel","Treatment","Conversion"))
}
biasedTreatment()

2013-07-19-biased-treatment.png

3 Where to go from here

Running tests using this causal framework is a way to avoid tests being confounded without having to run A/A/B/B tests or anything like that.

Run the test data through the skeleton function from pcalg and observe the graph. If there is no link between source and treatment (or whatever additional variables you consider) and the graph has an edge between treatment and conversion then the test is good.
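
As a concrete sketch of that check (the helper name, the column names and the 0/1 coding are illustrative assumptions, not part of pcalg):

library(pcalg)
library(graph) # for coercing the fitted graphNEL object to an adjacency matrix

# Hypothetical helper: fit the skeleton and return its adjacency matrix so the
# source-treatment and treatment-conversion links can be checked directly.
# Assumes df has binary (0/1) columns named source, treatment and conversion.
checkSplitTest <- function(df, alpha = 0.05) {
  suffStat <- list(dm = df, nlev = rep(2, ncol(df)), adaptDF = FALSE)
  fit <- skeleton(suffStat, indepTest = disCItest, p = ncol(df), alpha = alpha)
  adj <- as(fit@graph, "matrix")
  dimnames(adj) <- list(names(df), names(df))
  if (adj["source", "treatment"] != 0) {
    warning("source and treatment are linked; the test may be confounded")
  }
  adj # adj["treatment", "conversion"] != 0 means the treatments differ significantly
}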

Any link between source and conversion indicates that it might be worth factoring the results by source; it could be that different variations work better with different traffic sources. Sometimes it will be practical to take advantage of this but sometimes it will just be over-segmentation.

In the case where the source variable is linked with the treatment variable it is necessary to split out the data into a table with more columns (as in the example) before deciding on a course of action. Or, if you are confident that the test is set up properly, just wait for the link to disappear as more data comes in.

Authored by Richard Fergie