Generating a Sample of AdWords Data Using R

Posted on December 2, 2010

I am working on an AdWords data analysis tool. I’ve got the basics done, but I need to do some testing. I could use client data for this, but I am slightly hesitant: partly because I am concerned about data security, and partly because clients don’t like to be told that I am running experiments on their accounts.

So the problem is getting AdWords-like data from a source other than AdWords.

I decided to solve this by making it all up. Faking data is something that human beings are pretty bad at, so I think the following line of attack is best:

  1. Create an account structure manually. This means that the campaigns, ad groups, adverts and keywords are put together by hand, by me. The test account will have semantic structure in a way that would be beyond my ability to generate automatically.
  2. Performance statistics are generated using the statistical programming language R (stupid name, great tool). The main advantage of this is that it will help me learn more R. It will also save time and hopefully produce more ‘real’-looking figures than if I did it by hand.

The account structure work was simple but tedious. From this I got a lot of advert-keyword pairs for which to generate performance data. I have chosen to ignore the content/meaning of the advert-keyword pair; instead my program will generate data for n pairs without knowledge of what they are. Figuring out a CTR to assign to a given advert-keyword pair based on its content is too difficult for me to do realistically.

Here is a quick walkthrough of the code.

Start by defining the number of advert-keyword pairs to generate data for:

numberofadkwds <- 100

I assume that the true click-through rate for each advert-keyword pair in the account follows a normal distribution with the following mean and standard deviation:

ctr <- 0.0285
ctrsd <- 0.02

Define a distribution for the average position, conversion rate and CPC in a similar way:

avgpos <- 2.3
avgpossd <- 4

convr <- 0.0543
convrsd <- 0.04

avgcpc <- 60 #in pence
cpcsd <- 10

It would be better to model the average position and CPC by running a simulation auction with other bidders, but this way is a bit simpler. As it stands, this is the part of the model I am least happy with.

The next step is to create a vector with the search volume for each advert-keyword pair. A true simulation would generate search volumes for each keyword and then split them between the adverts, but again I have chosen to ‘Keep It Simple, Stupid’ and sample the search volume from a Pareto distribution. I chose a Pareto distribution because it models the idea of the ‘long tail of search’ very well. I picked the parameters by experimenting until the output looked right.

searchvolume <- round(100*rpareto(numberofadkwds,6,1))
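Note that rpareto() is not part of base R; it comes from an add-on package, and the argument order (shape versus scale) differs between packages. If you would rather not depend on one, a Pareto sample can be drawn in base R by inverse transform sampling. The following is only a sketch: the helper name rpareto_base and the parameter values are assumptions for illustration, not the ones used above.

```r
# Hypothetical helper (not from the original program): draw n values from a
# classical (type I) Pareto distribution with minimum `scale` and tail index `shape`.
rpareto_base <- function(n, scale, shape) {
  u <- runif(n)              # uniform draws on the open interval (0, 1)
  scale / u^(1 / shape)      # inverse CDF of the Pareto distribution
}

set.seed(42)                 # only so the demo is repeatable
# Illustrative parameters; tune them until the long tail looks right
searchvolume_demo <- round(100 * rpareto_base(100, scale = 1, shape = 1.5))
summary(searchvolume_demo)
```

Because every draw is at least `scale`, no advert-keyword pair in this sketch gets fewer than 100 searches; most sit near that floor and a few are much larger, which is the long-tail shape described above.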

The true CTR and conversion rate for each advert-keyword pair are generated by sampling from normal distributions with the parameters defined above. Taking the absolute value keeps the rates non-negative:

truectr <- abs(rnorm(numberofadkwds,ctr,ctrsd))
trueconvr <- abs(rnorm(numberofadkwds,convr,convrsd))
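One side effect of abs() is that any negative draw is reflected back to a positive rate, which nudges the average slightly above the nominal mean. Since the standard deviations are large relative to the means, it is worth checking how often this happens. A quick sketch using the parameters defined earlier (the p_neg_ names are mine):

```r
ctr <- 0.0285;  ctrsd <- 0.02
convr <- 0.0543; convrsd <- 0.04

# Probability that a raw normal draw is below zero and so gets reflected by abs()
p_neg_ctr   <- pnorm(0, mean = ctr,   sd = ctrsd)
p_neg_convr <- pnorm(0, mean = convr, sd = convrsd)

round(c(ctr = p_neg_ctr, convr = p_neg_convr), 3)
```

With these parameters several percent of the pairs are affected, which seems acceptable for test data but would matter in a more careful model.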

Now I define a function which, when given a search volume, true CTR and true conversion rate, returns the daily statistics for an advert-keyword pair:

dailystats <- function(impressions, ctr, convr) {
  # Daily impressions vary, so scale the search volume by a draw from a normal distribution
  dailyimpressions <- floor(rnorm(1, 1, 0.1) * impressions)
  # Sample the ad position for each one of these impressions, then average
  dailyavgpos <- mean(ceiling(abs(rnorm(dailyimpressions, avgpos, avgpossd))))
  # Number of clicks: one Bernoulli draw per impression, summed
  # (rbern() is not in base R; it needs an add-on package)
  dailyclicks <- sum(rbern(dailyimpressions, ctr))
  # Similarly for conversions: one Bernoulli draw per click
  dailyconv <- sum(rbern(dailyclicks, convr))
  # Sample a CPC in pence for each click from a normal distribution about the mean,
  # then divide by 100 to get the cost in pounds.
  # This is the bit of the model I am most unhappy about
  dailycost <- sum(abs(round(rnorm(dailyclicks, avgcpc, cpcsd)))) / 100
  # Return the result as a one-row data.frame
  result <- data.frame(impressions = dailyimpressions, clicks = dailyclicks,
                       cost = dailycost, pos = dailyavgpos, conv = dailyconv)
  result
}
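Summing n Bernoulli(p) draws is the same as taking a single Binomial(n, p) draw, so the click and conversion counts can also be sampled with base R's rbinom(). That avoids the add-on package and the per-impression vector. A minimal sketch; the demo_ values are illustrative and not part of the model above:

```r
set.seed(1)                  # only so the demo is repeatable

demo_impressions <- 500      # illustrative daily impressions
demo_ctr   <- 0.03           # illustrative true click-through rate
demo_convr <- 0.05           # illustrative true conversion rate

# One binomial draw replaces sum(rbern(demo_impressions, demo_ctr))
demo_clicks <- rbinom(1, size = demo_impressions, prob = demo_ctr)
# Conversions can only come from clicks, so clicks is the number of trials
demo_conv <- rbinom(1, size = demo_clicks, prob = demo_convr)

c(clicks = demo_clicks, conversions = demo_conv)
```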

This next function is an unnecessarily complicated wrapper around dailystats() which, given a day d, returns the stats for every advert-keyword pair on that day:

dayStatsForAllAdKwds <- function(d) {
  x <- t(mapply(dailystats, searchvolume, truectr, trueconvr))
  # `result` is not defined in this post; presumably the account-structure data frame, one row per pair
  data.frame(day = d, campaign_id = result$campaign_id, adgroup_id = result$adgroup_id,
             keyword_id = result$keyword_id, advert_id = result$advert_id,
             impressions = unlist(x[, 1]), clicks = unlist(x[, 2]), cost = unlist(x[, 3]),
             pos = unlist(x[, 4]), conv = unlist(x[, 5]))
}
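The wrapper reads its ID columns from a data frame called result that holds the hand-built account structure, which is not shown here. To run the code without it, a stand-in with the same four columns and one row per advert-keyword pair will do. Everything below is hypothetical: the ID values and the way pairs are grouped are made up purely for illustration.

```r
numberofadkwds <- 100

# Hypothetical stand-in for the account structure (all IDs are invented)
result <- data.frame(
  campaign_id = rep(1:2,  each = 50),       # 2 campaigns of 50 pairs each
  adgroup_id  = rep(1:10, each = 10),       # 10 ad groups of 10 pairs each
  keyword_id  = seq_len(numberofadkwds),    # one keyword per pair
  advert_id   = rep(1:20, each = 5)         # each advert shared by 5 keywords
)

head(result)
```

The only property the wrapper relies on is that the number of rows matches numberofadkwds.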

The next part of the program defines the number of days to generate data for and then counts this many days back from the current date:

numberofdays<-100

#The : operator drops the Date class, leaving day counts since 1970-01-01
daterange <- Sys.Date():(Sys.Date() - numberofdays)

#Format as string in YYYY-MM-DD format
#Origin set to 01/01/1970 so that Sys.Date() is today
dates<-strftime(as.Date(daterange,origin = "1970-01-01"),"%Y-%m-%d")
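An alternative that stays in the Date class throughout is seq(), which avoids the round trip through day counts and the origin argument. This is a sketch of the same idea rather than the code I ran; dates_alt is a name I made up. Note that counting from today back to numberofdays ago gives numberofdays + 1 dates, so the sketch asks for that many to match.

```r
numberofdays <- 100

# Step backwards one day at a time from today, keeping the Date class
dates_alt <- format(
  seq(Sys.Date(), by = "-1 day", length.out = numberofdays + 1),
  "%Y-%m-%d"
)

head(dates_alt, 3)
```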

The final part of the program is a loop that generates everything. It takes a little while to run.

generateStats <- function() {
  a <- c() #initialise accumulator
  for (d in dates) {
    a<-rbind(a,dayStatsForAllAdKwds(d))
  }
  a
}
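Part of the running time comes from growing the accumulator with rbind() on every pass, which copies everything collected so far each time. A common alternative is to build one data frame per day in a list and bind them once at the end. A minimal sketch of the pattern, using a hypothetical stub (day_stub) in place of dayStatsForAllAdKwds() so that it runs on its own:

```r
# Hypothetical stub standing in for dayStatsForAllAdKwds(): 3 rows per day
day_stub <- function(d) {
  data.frame(day = d, clicks = rpois(3, lambda = 5), stringsAsFactors = FALSE)
}

demo_dates <- c("2010-12-01", "2010-12-02")

# Build one data frame per day, then bind them all in a single call
allstats <- do.call(rbind, lapply(demo_dates, day_stub))

nrow(allstats)
```

Swapping day_stub for dayStatsForAllAdKwds and demo_dates for dates should give the same result as the loop, with less copying.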

I will make the R file for this available to download once I’ve figured out a bit more about how to do this with Zotonic. If you are desperate for a copy then contact me.