Posted on July 19, 2013

Consider the following split test results:

Treatment | Trials | Conversions | Conversion Rate |
---|---|---|---|
A | 4974 | 611 | 12.3% |
B | 5026 | 511 | 10.2% |

This is a convincing victory for variation A (at least 95% confidence). But look what happens when we segment the test results by channel:

Channel | Treatment | Trials | Conversions | Conversion Rate |
---|---|---|---|---|
X | A | 1991 | 64 | 3.2% |
Y | A | 2983 | 547 | 18.3% |
X | B | 2977 | 129 | 4.3% |
Y | B | 2049 | 382 | 18.6% |

For each channel, treatment B has performed better than treatment A. This is an example of Simpson’s Paradox, which, as Will Critchlow points out, can lead to problems in split testing when traffic from different channels is not evenly distributed between the treatments.
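To make the reversal concrete, the figures from the two tables can be recomputed in a few lines of base R. This is only a sketch: the names `results`, `pooled` and `y_share` are made up for illustration, and `prop.test` is one reasonable choice of significance test, not necessarily the one behind the 95% figure quoted above.

```r
# Counts copied from the segmented table, one row per channel/treatment cell.
results <- data.frame(
  channel     = c("X", "Y", "X", "Y"),
  treatment   = c("A", "A", "B", "B"),
  trials      = c(1991, 2983, 2977, 2049),
  conversions = c(64, 547, 129, 382)
)
results$rate <- results$conversions / results$trials

# Pooled view: sum over channels, then compute one rate per treatment.
pooled <- aggregate(cbind(trials, conversions) ~ treatment, data = results, FUN = sum)
pooled$rate <- pooled$conversions / pooled$trials

# Two-sample test of proportions on the pooled counts (A vs B).
pooled_test <- prop.test(pooled$conversions, pooled$trials)

# Fraction of each treatment's traffic that came from the high-converting channel Y.
y_trials <- results$trials[results$channel == "Y"]
names(y_trials) <- results$treatment[results$channel == "Y"]
y_share <- y_trials[as.character(pooled$treatment)] / pooled$trials

print(results)
print(pooled)
print(pooled_test$p.value)
print(round(y_share, 2))
```

Treatment A received roughly 60% of its traffic from channel Y, against roughly 41% for treatment B. That imbalance is why A wins on the pooled numbers while losing in both channels.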

In this blog post I will discuss a potential way around this problem.

The nature of cause and effect is a tricky philosophical problem, and I am not familiar enough with all the ins and outs to give a succinct treatment here. So here is a whirlwind of bullet points to get started with:

- Causes have effects and vice versa. An effect cannot be its own cause.
- If two events are conditionally independent, then there can be no direct causal link between them.
- Cause and effect relationships can be modelled as a Directed Acyclic Graph (DAG) where a directed edge from node A to node B indicates a causative relationship between A and B (A causes B in some sense). The graph needs to be acyclic, otherwise you might end up with strange loops where A causes B which causes C which causes A.
- If you have a DAG you can use science to determine if it is a correct model or not (when I say science I mean the methodology of experimentation/observation). But if you don’t have a DAG in mind, science won’t help you come up with one.
- Instead you can start by assuming edges between all variables and then start removing edges based on conditional independence tests. This gives the skeleton of a DAG (i.e. no directed edges).
- Some edges can be directed using further conditional independence tests.
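The edge-removal idea in the last two bullets can be illustrated without any causal inference library. The sketch below simulates a simple chain a -> b -> c (the names, seed, sample size and probabilities are all invented for illustration) and uses two base R tests: `chisq.test` for marginal independence and `mantelhaen.test` as a stand-in for a conditional independence test given `b`. It is not how pcalg implements its tests, just the same idea in miniature.

```r
set.seed(42)
n <- 10000

# Simulated chain: a influences b, and b influences c_var.
# There is no direct effect of a on c_var.
a     <- rbinom(n, 1, 0.5)
b     <- rbinom(n, 1, ifelse(a == 1, 0.8, 0.2))
c_var <- rbinom(n, 1, ifelse(b == 1, 0.7, 0.3))

# Marginally, a and c_var are associated (through b),
# so an edge between them survives this first test.
p_marginal <- chisq.test(table(a, c_var))$p.value

# Conditioning on b: the Cochran-Mantel-Haenszel test asks whether a and c_var
# are still associated within the strata defined by b.
p_given_b <- mantelhaen.test(factor(a), factor(c_var), factor(b))$p.value

print(p_marginal)
print(p_given_b)
```

The marginal p-value should be tiny, while the conditional one typically is not, so the edge between `a` and `c_var` would be dropped once `b` is conditioned on.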

I am not doing this justice at all, so back to split testing.

So normally, when we run a split test we have the following causal structure in our heads:

```
digraph G {
size="4,3"
ratio=expand
treatment[label="Treatment"]
conversion[label="Conversion"]
treatment->conversion
}
```

This is unrealistic and, as we saw in the example, it can cause problems.

When we also consider the traffic source there are three new graphs to consider:

```
digraph G {
size="4,3"
ratio=expand
treatment[label="Treatment"]
conversion[label="Conversion"]
source[label="Source"]
treatment->conversion
}
```

The first case is simple; the traffic source makes no difference to the conversion rate and the only thing that does is the treatment group of the user.

```
digraph G {
size="4,3"
ratio=expand
treatment[label="Treatment"]
conversion[label="Conversion"]
source[label="Source"]
treatment->conversion
source->conversion
}
```

In the second case, both the treatment and the source of the traffic influence the conversion rate. Normal statistical tests for significance will still give the right answer here because the source does not confound the treatment.

```
digraph G {
size="4,3"
ratio=expand
treatment[label="Treatment"]
conversion[label="Conversion"]
source[label="Source"]
treatment->conversion
source->conversion
source->treatment
}
```

It is in the third case that Simpson’s Paradox (and more general problems with how your split test is set up) can occur.

The R library pcalg can be used to infer causal structure from raw data. Installing pcalg is a little more complicated than for most R libraries but there are clear instructions here.

The following code examples create simulated data and then use pcalg to compute the skeleton of the causal graph.

```
library(pcalg)

completelyRandom <- function(n = 10000) {
  # Source and treatment are both assigned uniformly at random.
  src <- sample(0:1, n, replace = TRUE)
  trt <- sample(0:1, n, replace = TRUE)
  # 10% conversion rate for everyone, whatever the source or treatment.
  conv <- as.integer(runif(n) < 0.1)
  df <- data.frame(source = src, treatment = trt, conversion = conv)
  # All three variables are binary, coded 0/1.
  suffStat <- list(dm = data.matrix(df), nlev = c(2, 2, 2), adaptDF = FALSE)
  pc.fit <- skeleton(suffStat, indepTest = disCItest, p = ncol(df), alpha = 0.05)
  plot(pc.fit, main = "No Relationship At All",
       labels = c("Channel", "Treatment", "Conversion"))
}
completelyRandom()
```

The data is random, so the skeleton should show no edges between any of the variables (with alpha = 0.05 a spurious edge will occasionally appear by chance).

```
library(pcalg)

treatmentEffect <- function(n = 10000) {
  # Source and treatment are assigned independently at random.
  src <- sample(0:1, n, replace = TRUE)
  trt <- sample(0:1, n, replace = TRUE)
  # Treatment 1 converts at 10%, treatment 0 at 5%; source has no effect.
  conv <- as.integer(runif(n) < ifelse(trt == 1, 0.10, 0.05))
  df <- data.frame(source = src, treatment = trt, conversion = conv)
  # All three variables are binary, coded 0/1.
  suffStat <- list(dm = data.matrix(df), nlev = c(2, 2, 2), adaptDF = FALSE)
  pc.fit <- skeleton(suffStat, indepTest = disCItest, p = ncol(df), alpha = 0.05)
  plot(pc.fit, main = "Treatment Influences Conversion",
       labels = c("Channel", "Treatment", "Conversion"))
}
treatmentEffect()
```

This could be used instead of a normal split test - if there is an edge between Treatment and Conversion in the graph then there is a significant difference between treatments.
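For comparison, the conventional check on the same kind of data might look like the sketch below. It regenerates data with the same 10% and 5% conversion rates used in `treatmentEffect` (the seed and variable names are arbitrary choices for illustration) and runs a standard chi-squared test of independence between treatment and conversion using base R.

```r
set.seed(1)
n <- 10000

# Same setup as treatmentEffect(): random assignment, 10% vs 5% conversion.
treatment  <- sample(0:1, n, replace = TRUE)
conversion <- as.integer(runif(n) < ifelse(treatment == 1, 0.10, 0.05))

# 2x2 contingency table and a chi-squared test of independence.
tab <- table(treatment, conversion)
split_test <- chisq.test(tab)

print(tab)
print(split_test$p.value)
```

With an effect this large and 10,000 trials, the p-value should fall far below 0.05, which lines up with an edge appearing between Treatment and Conversion in the skeleton.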

```
library(pcalg)

sourceAndTreatmentEffect <- function(n = 10000) {
  # Source and treatment are assigned independently at random.
  src <- sample(0:1, n, replace = TRUE)
  trt <- sample(0:1, n, replace = TRUE)
  # Conversion rate depends on both source and treatment:
  # source 1: 20% (treatment 1) or 10% (treatment 0)
  # source 0:  1% (treatment 1) or  5% (treatment 0)
  rate <- ifelse(src == 1,
                 ifelse(trt == 1, 0.20, 0.10),
                 ifelse(trt == 1, 0.01, 0.05))
  conv <- as.integer(runif(n) < rate)
  df <- data.frame(source = src, treatment = trt, conversion = conv)
  # All three variables are binary, coded 0/1.
  suffStat <- list(dm = data.matrix(df), nlev = c(2, 2, 2), adaptDF = FALSE)
  pc.fit <- skeleton(suffStat, indepTest = disCItest, p = ncol(df), alpha = 0.05)
  plot(pc.fit, main = "Both Source and Treatment Influence Conversion",
       labels = c("Channel", "Treatment", "Conversion"))
}
sourceAndTreatmentEffect()
```

And finally there is something similar to the case given in the example where the traffic source influences the treatment:

```
library(pcalg)

biasedTreatment <- function(n = 10000) {
  src <- sample(0:1, n, replace = TRUE)
  # Assignment is biased by source: source 1 gets treatment 1 with
  # probability 0.4, source 0 gets treatment 1 with probability 0.6.
  trt <- as.integer(runif(n) < ifelse(src == 1, 0.4, 0.6))
  # Conversion rate depends on both source and treatment:
  # source 1: 20% (treatment 1) or 18% (treatment 0)
  # source 0:  5% (treatment 1) or  3% (treatment 0)
  rate <- ifelse(src == 1,
                 ifelse(trt == 1, 0.20, 0.18),
                 ifelse(trt == 1, 0.05, 0.03))
  conv <- as.integer(runif(n) < rate)
  df <- data.frame(source = src, treatment = trt, conversion = conv)
  # All three variables are binary, coded 0/1.
  suffStat <- list(dm = data.matrix(df), nlev = c(2, 2, 2), adaptDF = FALSE)
  pc.fit <- skeleton(suffStat, indepTest = disCItest, p = ncol(df), alpha = 0.05)
  plot(pc.fit, main = "Source Influences Treatment",
       labels = c("Channel", "Treatment", "Conversion"))
}
biasedTreatment()
```
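It can help to work out what this simulation should produce before looking at the graph. The sketch below uses only the probabilities hard-coded in `biasedTreatment` (the variable names are mine) and applies Bayes' rule and the law of total probability to get the expected pooled conversion rate for each treatment.

```r
# Parameters taken from biasedTreatment(): source is 0 or 1 with equal probability.
p_source        <- c(s0 = 0.5, s1 = 0.5)
p_treat1_by_src <- c(s0 = 0.6, s1 = 0.4)   # P(treatment = 1 | source)

# Conversion rates by source (rows) and treatment (columns).
conv <- matrix(c(0.03, 0.18,    # treatment 0: source 0, source 1
                 0.05, 0.20),   # treatment 1: source 0, source 1
               nrow = 2, dimnames = list(c("s0", "s1"), c("t0", "t1")))

# Bayes' rule: mix of sources within each treatment group, P(source | treatment).
joint_t1 <- p_source * p_treat1_by_src
joint_t0 <- p_source * (1 - p_treat1_by_src)
src_given_t1 <- joint_t1 / sum(joint_t1)
src_given_t0 <- joint_t0 / sum(joint_t0)

# Expected pooled conversion rate per treatment (law of total probability).
pooled_t1 <- sum(conv[, "t1"] * src_given_t1)
pooled_t0 <- sum(conv[, "t0"] * src_given_t0)

print(c(t0 = pooled_t0, t1 = pooled_t1))
```

Treatment 1 is better within each source (5% vs 3% and 20% vs 18%), yet its expected pooled rate is 11% against 12% for treatment 0. This mirrors the reversal in the opening example.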

Running tests using this causal framework is a way to avoid tests being confounded without having to run AABB tests or anything like that.

Run the test data through the skeleton function from pcalg and observe the graph. If there is no link between source and treatment (or whatever additional variables you consider) and the graph has an edge between treatment and conversion then the test is good.
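A quick way to check the link between source and treatment on its own, without building the whole graph, is a plain test of independence on the trial counts. The sketch below applies base R's `chisq.test` to the counts from the opening example. The matrix layout and names are mine, and this is meant as a complement to the skeleton plot, not a replacement for it.

```r
# Trials per channel (rows) and treatment (columns), from the segmented table.
trials <- matrix(c(1991, 2983,    # treatment A: channel X, channel Y
                   2977, 2049),   # treatment B: channel X, channel Y
                 nrow = 2,
                 dimnames = list(channel = c("X", "Y"), treatment = c("A", "B")))

# If assignment were independent of channel, both columns would show
# roughly the same channel mix.
balance_test <- chisq.test(trials)

print(prop.table(trials, margin = 2))   # channel mix within each treatment
print(balance_test$p.value)
```

A very small p-value here says the traffic mix differs between treatments, which is the situation in which the pooled result cannot be trusted.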

Any link between source and conversion indicates that it might be worth factoring the results by source; it could be that different variations work better with different traffic sources. Sometimes it will be practical to take advantage of this, but sometimes it will be over-segmentation.

In the case where the source variable is linked with the treatment variable, it is necessary to split out the data into a table with more columns (as in the example) before deciding on a course of action. Or, if you are confident that the test is set up properly, just wait for the link to disappear as more data comes in.
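One way to act on the segmented table, sketched below, is a stratified comparison that looks at the treatment effect within each channel and then combines the strata. This uses base R's `mantelhaen.test` on the counts from the opening example. It is offered as one possible option and was not part of the analysis above; the array layout and names are mine.

```r
# 2 x 2 x 2 array: treatment x outcome x channel, counts from the opening example.
counts <- array(
  c(64, 129, 1927, 2848,      # channel X: converted (A, B), then not converted (A, B)
    547, 382, 2436, 1667),    # channel Y: converted (A, B), then not converted (A, B)
  dim = c(2, 2, 2),
  dimnames = list(treatment = c("A", "B"),
                  outcome   = c("converted", "not_converted"),
                  channel   = c("X", "Y"))
)

# Cochran-Mantel-Haenszel test: association between treatment and outcome,
# controlling for channel. The estimate is a common odds ratio of A relative to B.
stratified <- mantelhaen.test(counts)

print(stratified$estimate)
print(stratified$p.value)
```

An odds ratio below 1 means that A converts worse than B once the channel is held fixed, in line with the per-channel rows of the table.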