A/B Testing
Imagine that you’ve got a great product and a beautiful website with plenty of visitors, but, and here’s the rub, not as many sales as you’d like.
There are all kinds of changes you could make, but how can you tell whether whatever you’ve changed has actually made a difference? And how can you tell whether any differences are simply down to chance?
It turns out that there’s a tried and tested way of answering these questions, and it’s called A/B testing.
To run an A/B test you create two different versions of something, usually versions of a web page, and then randomly serve one of them to each user, creating a simple randomized trial. You then record whether the user converts, by making a sale, creating an account, or whatever it is you wish to achieve.
A/B testing lets you compare version A, usually your current or control option, with version B, the new option. You can then compare the conversion rates of A and B to see whether there is any difference, and whether that difference is statistically significant.
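Before any analysis, here is a minimal sketch of that serving step, just to make the mechanics concrete. The serve_page() helper is hypothetical, a stand-in for whatever actually renders the chosen version; it isn’t part of the code used later in this post.
# a minimal sketch of random assignment; serve_page() is hypothetical
n.users <- 10
visits <- data.frame(user = 1:n.users,
                     # randomly assign each visitor to the control (A) or the new page (B)
                     page = sample(c('A', 'B'), n.users, replace = TRUE),
                     stringsAsFactors = FALSE)
# serve_page(visits$page)   # hypothetical: show each visitor their assigned version
# ...then record converted = 1 or 0 against each visit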
We’ll begin our project by simulating some data with different conversion rates, using different binomial distributions.
require(dplyr)
require(ggplot2)
require(cowplot)
# decide how many samples
size <- 1000
# create 4 different page simulations
# with different conversion rates
pages <- data.frame(group = c('Control', 'A', 'B', 'C'),
                    c.rate = c(0.03, 0.05, 0.02, 0.08),
                    stringsAsFactors = FALSE)
# seed for reproducibility
set.seed(1234)
control <- rbinom(size, 1, pages$c.rate[1])
a <- rbinom(size, 1, pages$c.rate[2])
b <- rbinom(size, 1, pages$c.rate[3])
c <- rbinom(size, 1, pages$c.rate[4])
# put into a (tidy) dataframe
df <- data.frame(page=rep(x = c('Control', 'A', 'B', 'C'), each=size),
                 converted=c(control, a, b, c),
                 stringsAsFactors = FALSE)
Looking at some summary statistics for our simulated data:
# get some summary stats
df %>%
  group_by(page) %>%
  summarise(Num=n(),
            num.converted=sum(converted),
            num.n.converted=n()-sum(converted),
            c.rate.sim=100*mean(converted)) %>%
  left_join(pages, by=c("page" = "group")) %>%
  mutate(c.rate=100*c.rate)
## # A tibble: 4 x 6
## page Num num.converted num.n.converted c.rate.sim c.rate
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 A 1000 33 967 3.3 5
## 2 B 1000 21 979 2.1 2
## 3 C 1000 86 914 8.6 8
## 4 Control 1000 42 958 4.2 3
To assess whether our two pages’ conversion rates are actually different, we need to perform a hypothesis test. Our null hypothesis, H0, is that the two pages’ conversion rates are equal, whereas our alternative hypothesis, HA, is that they differ.
Pearson’s chi-squared test of independence
If our samples are large enough and our conversion rates aren’t too close to 0 or 1, then we can use Pearson’s chi-squared test.
This gives us a p-value: the probability of getting results at least this extreme, assuming that the null hypothesis is true. So if the p-value is small (p < 0.05 is the usual threshold), the observed results are unlikely under the null hypothesis, and we reject it in favour of the alternative hypothesis.
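To make the input to the test concrete, here is the same comparison done by hand on an explicit 2 x 2 contingency table, using the Control and A counts from the summary above; the dplyr pipelines below build exactly this kind of table before passing it to chisq.test().
# a minimal sketch: the chi-squared test on an explicit 2 x 2 table
counts <- matrix(c(42, 958,    # Control: converted, not converted
                   33, 967),   # A:       converted, not converted
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c('Control', 'A'), c('converted', 'not.converted')))
chisq.test(counts)  # X-squared = 0.88658, p-value = 0.3464, matching the result below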
Comparing our Control with A, B & C
Comparing Control and A
df %>%
  filter(page=='Control' | page=='A') %>%
  group_by(page) %>%
  summarise(num.converted=sum(converted),
            num.n.converted=n()-sum(converted)) %>%
  select(-page) %>%
  chisq.test()
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 0.88658, df = 1, p-value = 0.3464
Comparing Control and B
df %>%
  filter(page=='Control' | page=='B') %>%
  group_by(page) %>%
  summarise(num.converted=sum(converted),
            num.n.converted=n()-sum(converted)) %>%
  select(-page) %>%
  chisq.test()
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 6.5557, df = 1, p-value = 0.01045
Comparing Control and C
df %>%
  filter(page=='Control' | page=='C') %>%
  group_by(page) %>%
  summarise(num.converted=sum(converted),
            num.n.converted=n()-sum(converted)) %>%
  select(-page) %>%
  chisq.test()
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 15.433, df = 1, p-value = 8.548e-05
Comparing the new pages to our control we get:
page | p-value | Significant | Simulated Conversion Rate (%) | Actual Rate (%) |
---|---|---|---|---|
Control | – | – | 4.2 | 3 |
A | 0.3464 | No | 3.3 | 5 |
B | 0.01045 | Yes | 2.1 | 2 |
C | 8.548e-05 | Yes | 8.6 | 8 |
This shows that page A isn’t significantly different from the control; any difference could be down to chance.
Page B is significantly different from the control, however its conversion rate is lower than the control’s.
Page C is significantly different from the control, and its conversion rate is much higher. This would be the page to change to.
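As an optional follow-up (not part of the original analysis above), a two-sample proportion test on the Control and C counts from the earlier summary would also put a confidence interval around the size of that uplift:
# an optional sketch: estimate the uplift for page C with a confidence interval
prop.test(x = c(86, 42), n = c(1000, 1000))
# the test statistic matches the chi-squared result above (X-squared = 15.433),
# and conf.int gives a 95% confidence interval for the difference in conversion rates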
Visual Comparison
If we create a Monte Carlo simulation of our web hits and conversions, we can get a visual comparison to check whether our results make sense.
# get sample distributions
each.sample.size <- 250
num.samples <- 100
getMean <- function(nIndex, num) {
  # draw a random sample of 'num' visits for the chosen page and return its conversion rate
  mean(sample_n(df %>% filter(page==pages$group[nIndex]), num)$converted)
}
s.con <- data.frame(page=pages$group[1],
                    c.rate=replicate(num.samples, getMean(1, each.sample.size)))
s.a <- data.frame(page=pages$group[2],
                  c.rate=replicate(num.samples, getMean(2, each.sample.size)))
s.b <- data.frame(page=pages$group[3],
                  c.rate=replicate(num.samples, getMean(3, each.sample.size)))
s.c <- data.frame(page=pages$group[4],
                  c.rate=replicate(num.samples, getMean(4, each.sample.size)))
# combine the sample distributions
s.df <- rbind(s.con, s.a, s.b, s.c)
# get the summary stats
s.df %>%
  group_by(page) %>%
  summarise(Num=n(),
            median=median(c.rate),
            sd=sd(c.rate),
            min=min(c.rate),
            min.05=mean(c.rate)-(1.96*sd),
            mean=mean(c.rate),
            max.05=mean(c.rate)+(1.96*sd),
            max=max(c.rate)) %>%
  left_join(pages, by=c("page" = "group"))
## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## character vector and factor, coercing into character vector
## # A tibble: 4 x 10
## page Num median sd min min.05 mean max.05
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Control 100 0.040 0.009609003 0.020 0.022646355 0.04148 0.06031364
## 2 A 100 0.036 0.010187415 0.012 0.014912666 0.03488 0.05484733
## 3 B 100 0.020 0.007504988 0.004 0.007010223 0.02172 0.03642978
## 4 C 100 0.086 0.014375906 0.048 0.058423225 0.08660 0.11477678
## # ... with 2 more variables: max <dbl>, c.rate <dbl>
# graph the sample distributions for each 'page'
plot.density <-
  s.df %>%
  ggplot(aes(x=c.rate, colour=page)) +
  geom_density() +
  theme(axis.text.y=element_blank()) +
  ggtitle("Conversion Rate Variance for each Page")
# boxplot using a custom box size
# where the box size is 0.05 < p < 0.95
# and the whisker length is min, max
plot.boxplot <-
  s.df %>%
  ggplot(aes(x=page, y=c.rate, colour=page)) +
  stat_summary(geom = "boxplot",
               fun.data = function(x) setNames(quantile(x, c(0.0, 0.05, 0.5, 0.95, 1.00)),
                                               c("ymin", "lower", "middle", "upper", "ymax")),
               position = "dodge") +
  coord_flip() +
  theme(axis.text.y=element_blank())
plot_grid(plot.density, plot.boxplot, ncol = 1, nrow = 2)
The density plot and the box plot both show the same Monte Carlo generated data, simulating conversion rates for each page.
The box plot has been modified so that each box spans 0.05 < p < 0.95 for that page, with the whiskers showing the min/max values; this clearly shows how the distributions overlap.
From these graphs it can be seen that:
- Control and A are very similar
- Control and B are similar but significantly different
- Note that the median lines are not within the 0.05 < p < 0.95 range of the other’s box
- Control and C are clearly different
From these visual results it’s also clear that we would choose to use version C of the page.
The GitHub repository for this project can be viewed here.