An introduction and guide to the R programming language for web analysts
If you are new to programming then take your time reading. There are lots of new concepts - don’t expect to be able to scan through and still get good value for your time.
Last updated on Sunday 09 February 2020
This document is an extension and improvement of the course materials I originally prepared for the MeasureCamp training session on using R.
The goal of the document (and the training session) is to help web analysts get to grips with coding using R and enable them to solve the kind of problems that web analysts try to solve.
Throughout the document, code samples are presented in the following format:
First the R code which can be copied or typed into the console (if you don’t know what the console is - this will be explained later) and then the output you can expect to see when it is run.
There will be some code samples where no output is presented. This is either because when the code is run there is no output to the console or because I have chosen to not make the results of evaluating that code public.
I highly recommend running all the sample code you come across.
This document is available under a CC Attribution License.
A little bit of context about me might help you decide whether or not my advice is worth listening to. If your background, experiences and desired outcomes are completely different from mine then take this document with the pinch of salt it deserves.
I started my career working in PPC and began to make an effort to learn more about web analytics after gaps and errors in client implementations led me to take actions that appeared awesome but which were actually harmful.
My favourite example of this is when I sent tonnes of traffic from Nigeria to a lead generation form - the form got filled in a lot (which is what was being tracked) but the client did not make very much money from this.
My experience spans a wide range of sectors but it is very narrow when it comes to tools. Almost all (95%+) of my work is done using Google Analytics.
I studied Maths at university which, with the courses I chose, involved about 8-12 hours of programming tuition using the languages Maple and Matlab. At the end of university I could not use programming to produce useful business outcomes. However, I could produce a sweet animated gif showing how a Fourier series converges.
It wasn’t long after entering the professional world that I began to think that learning to code might be useful. This was probably at the time when the “everyone must learn to code” movement was getting started.
I began learning a language called Haskell because I had studied a small amount of category theory at university which meant that I got some of the jokes about how difficult Haskell is to learn. Those of you who know me well will know that picking the difficult choice like this is fairly typical behaviour for me.
If you want to learn more about computing and maths and aren’t in a massive rush to get something useful done then I really recommend you have a good look at Haskell. I believe that one day Haskell, or a language inspired by Haskell, will take over the world but that day will not be soon and right now Haskell is shit for the kind of tasks web analysts typically do.
The first language is the hardest to learn so once I had a bit of basic coding knowledge I was able to start dabbling in Python and then later R. With the sort of tasks I do right now most of my coding is in R.
I often get asked “should I learn R?” type questions. The answer is not an emphatic “YES” and the trade offs to consider are complicated.
For an analyst who has no experience with programming the “should I learn R?” question is actually two questions:
The answer to the first question is an emphatic yes because being able to program enables you to use more powerful tools than you would use otherwise.
There is a significant opportunity cost to learning to code, especially to start with, when doing simple things in code takes way longer than doing them in Excel. But the payoff is huge, especially when you consider that you have the whole of the rest of your career to recoup the investment.
I am going to reiterate that point because it is very important. To start with, everything will take longer and you won’t be able to do anything cool. The stuff you start off coding will be the type of thing that would take you less than a minute in Excel. It is worth it to spend the time to build this foundation. This is not specific to R - you will have this problem with any programming language.
R is good for doing analytical tasks so it is appropriate for analysts to learn as a first language. The code samples in this document will hopefully help speed you over the “I can do this 10x quicker in Excel” gap.
Once you ignore the fact that learning new things generally makes you a better person, the case for learning R over another language that you already know is much weaker. For me, the main strengths of R compared to other languages I have used are:
In all of these areas, particularly 1 and 3, Python is rapidly catching up with R. So if the language you already know is Python then knowing R too will not increase the amount of cool stuff you can do by as much as if the language you already know is PHP.
Another thing to think about is the blurring of the boundary between analysis and the operations that implement the insight. It is increasingly common for a hybrid coder/analyst to define behaviours that identify a user segment and to code site functionality that does something special with this user segment. If this kind of work is your end goal and you only have time to learn one programming language then first you should try to remove your “one programming language” constraint and then if that doesn’t work you should learn Python.
Web analysts frequently have responsibility for site tagging too, so it is often necessary to know JavaScript as well.
The following table has a rough and ready comparison of R, Python and JavaScript for web analytical tasks:
Task | R | Python | JavaScript |
---|---|---|---|
Site tagging | 0/5 | 0/5 | 5/5 |
Data analysis | 5/5 | 3.5/5 (and rapidly improving) | I used to say 1/5, but I haven’t kept up with developments here so no idea! |
Presentation and charting | 4/5 | 2/5 | 3/5 to 5/5 depending on if you can use d3.js or not |
Building website features | 2/5 | 5/5 | 5/5 |
Perhaps I will also have to do a guide on Python!
Firstly install R from one of these pages
That is all you need to do but you will probably find things way easier if you also install a piece of software called RStudio. RStudio sits on top of R and presents a much less intimidating user interface. I highly recommend it for beginners.
Install RStudio from here. The rest of this tutorial will assume you are using RStudio.
Both R and RStudio are free software both in terms of “free as in free beer” and “free as in freedom”. There is much to be said for and against the use of free software in business but my view is that for things like programming tools free software completely dominates the non free alternatives.
On opening RStudio you should see a pane on the left called the console. You enter R commands in this pane with the cursor after the “>” character. Press enter to run the command.
If you get the answer 17 when you type “2+15” and then press enter then things have installed correctly. You have got over the first hurdle. Well done.
You have just seen (and replicated for yourself) an example of adding two numbers together. The other foundational arithmetic functions work in similar ways. Try these examples out and test out your own.
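As a sketch of the other basic operations (the numbers are arbitrary), subtraction, multiplication and division look like this. The text after each "#" is only a note showing the result you should see:

```r
10 - 3    # 7
4 * 6     # 24
20 / 8    # 2.5
```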
The next set of examples is similar but slightly more complicated. I’ve included comments in the code to make it easier to understand. A comment is a line that starts with a “#” character. A comment line does nothing to change the behaviour of a program - it is only there for a person reading it.
Note that there is no output from the above code block. On with the examples
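A few slightly more involved calculations, sketched with arbitrary numbers and commented line by line:

```r
# Brackets control the order of evaluation
(2 + 3) * 4      # 20

# Raise a number to a power with ^
2 ^ 10           # 1024

# %% gives the remainder after division
17 %% 5          # 2

# %/% gives the whole number part of a division
17 %/% 5         # 3
```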
Not every arithmetical operation makes sense
In this error message the “binary operator” is division. It is a binary operation because it takes two numbers as input - the numerator and the denominator. The “non-numeric argument” is the letter ‘a’.
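One way to reproduce an error of this kind is to divide a number by a piece of text. In the sketch below the division is wrapped in tryCatch (a base R function) purely so that the message can be captured and a longer script keeps running; typing 1 / "a" directly into the console shows the error itself. The variable name is just for illustration.

```r
# Dividing a number by a piece of text makes no sense, so R raises an error.
# tryCatch catches that error and hands back its message as text.
error_text <- tryCatch(
  1 / "a",
  error = function(e) conditionMessage(e)
)

error_text
```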
Ok, now you can throw away your pocket calculator or abacus and use R instead.
It is far more useful to do things to a whole series of numbers rather than one or two numbers at a time. R has excellent support for this.
There are two main representations of a series of elements:
Unless specifically mentioned otherwise you will use vectors in this tutorial.
Create a vector with more than one element like this:
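A minimal sketch, using the base R function c (short for "combine") and arbitrary numbers:

```r
# c() combines its arguments into a single vector
c(1, 2, 3, 4, 5)

# The colon operator is a shortcut for a run of consecutive whole numbers
1:5
```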
There are some functions that operate on a whole vector and return a single answer. I call these aggregate functions. There will be more about functions in general later.
Other functions operate on each element of the vector and return another vector as the result
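To make the difference concrete, here is a sketch with made-up numbers. sum, mean and length each return a single value, while sqrt returns a vector with one result per element:

```r
# Aggregate functions: a whole vector in, a single value out
sum(c(4, 9, 16, 25))      # 54
mean(c(4, 9, 16, 25))     # 13.5
length(c(4, 9, 16, 25))   # 4

# Element-wise functions: a vector in, a vector of the same length out
sqrt(c(4, 9, 16, 25))     # 2 3 4 5
```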
In fact, when we were doing simple arithmetic before, we weren’t just adding numbers together; we were adding together vectors of numbers - but the vectors only had one element
This means we can do things like this:
Or this:
In the above example, the shorter vector is duplicated until it is the same length as the longer vector.
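The same behaviour can be sketched with a few arbitrary vectors:

```r
# Two vectors of the same length are added element by element
c(1, 2, 3) + c(10, 20, 30)     # 11 22 33

# A single number is reused for every element of the longer vector
c(1, 2, 3) * 2                 # 2 4 6

# A shorter vector is repeated until it matches the longer one
c(1, 2, 3, 4) + c(10, 20)      # 11 22 13 24
```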
As always, there are some things we can try to do that just don’t make sense
As you can see, you get a warning that R isn’t quite sure if you want to be doing this or not.
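A typical way to trigger this kind of warning (sketched with arbitrary numbers) is to add two vectors where the longer length is not a multiple of the shorter one:

```r
# The longer vector has 3 elements and the shorter has 2, so the shorter one
# cannot be repeated a whole number of times. R still returns a result
# but warns that the lengths do not fit together.
c(1, 2, 3) + c(10, 20)    # 11 22 13, plus a warning
```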
You can assign a particular result to a variable to make it easier to refer to later
In R people generally use “<-” to assign to a variable, but you might also see code where people use “=”. There are some subtle differences between the two methods but this is not important right now. In this document we will use “<-”.
This code block assigns a vector to the variable “myvector”
You can see the value of a variable by entering it into the console
We can use all the normal vector functions on our variable as if we had typed the whole thing out each time
You can also overwrite a variable by reusing the name when you assign something else
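Putting those steps together in one sketch (the values are made up):

```r
# Assign a vector to the variable myvector
myvector <- c(3, 1, 4, 1, 5)

# Entering the name on its own prints the value
myvector

# The usual vector functions work on the variable
sum(myvector)       # 14
myvector * 2        # 6 2 8 2 10

# Assigning to the same name again overwrites the old value
myvector <- c(10, 20)
length(myvector)    # 2
```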
It will help you out a lot if you use variable names that semantically link with what they reference. For example call a vector of daily sessions “sessions” rather than “vector”.
You have already seen some functions like “sum” and “length”.
To get more information on what a function does type a question mark followed by the name of the function into the console.
You don’t need to be able to understand all of a function’s help file to use the function.
Some functions have optional arguments. Here is an example:
Summing over a vector that contains an NA returns NA as the result. This makes sense when you think that it is meaningless to add NA to a number.
Instead we use the optional argument “na.rm=TRUE” to tell the sum function to ignore the NA values.
Use the function help to see a list of function arguments and what they mean (e.g. “?sum”).
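A short sketch of the effect of na.rm, using a made-up vector with one missing value:

```r
# NA marks a missing value, so the total is unknown
sum(c(1, NA, 3))                 # NA

# na.rm = TRUE drops the NA values before summing
sum(c(1, NA, 3), na.rm = TRUE)   # 4
```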
Creating your own functions is a great way to save time when doing repetitive tasks.
Functions are created using the function function (!!?!). Here is a very simple example function that takes no arguments and always returns the answer 42.
When copying this code into RStudio, don’t copy the “+” symbols that appear at the start of each line - they show that R is expecting more input before it starts computing. This means you don’t always have to fit all your commands on one line.
Here is an example that takes two arguments and adds them together:
The final example inserts an element at the start of a vector:
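The three kinds of function described above could be sketched as follows. The names answer, add and prepend are chosen here for illustration only:

```r
# No arguments: always returns 42
answer <- function() {
  42
}
answer()                      # 42

# Two arguments: returns their sum
add <- function(x, y) {
  x + y
}
add(2, 15)                    # 17

# Puts a new element at the start of a vector
prepend <- function(element, vec) {
  c(element, vec)
}
prepend(0, c(1, 2, 3))        # 0 1 2 3
```

The value of the last expression inside the curly brackets is what the function returns.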
So far we have looked at vectors which contain one dimensional data. Far more common is to have a table of two dimensional data. The most common way of working with this kind of structure in R is called a data frame.
A data frame is a list of vectors. Recall that a list is a series of elements not necessarily of the same type, and you can see why a data frame has to be a list of vectors rather than a vector of vectors: each column can hold a different type of data.
You can see that when creating your own data frame the columns are just vectors with a column name. Use the names function to see the column names for a data frame.
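As a sketch, a small data frame can be built with the base R function data.frame. The name food matches the data frame used later in this section, but the columns and values below are made up for illustration:

```r
# Each named argument becomes a column; every column is a vector of one type.
# stringsAsFactors = FALSE keeps the text column as plain text.
food <- data.frame(
  name     = c("apple", "banana", "cheese", "bread", "egg"),
  calories = c(52, 89, 402, 265, 155),
  vegan    = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

food
names(food)   # "name" "calories" "vegan"
```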
Use the column numbers and row numbers to select rows, columns and elements from a data frame.
An easier way to select a column is like this:
Using a “$” followed by the name of the column is much easier to understand than trying to remember which is the eighth column when you are reading old code.
You can also select a column like this:
Use this method rather than the “$” method when:
Once you have selected a column it behaves just like a vector:
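The selection methods above can be sketched on a small made-up data frame (the food columns and values here are assumptions for illustration). The double square bracket form is one standard way of selecting a column by a name held in a string:

```r
food <- data.frame(
  name     = c("apple", "banana", "cheese", "bread", "egg"),
  calories = c(52, 89, 402, 265, 155),
  stringsAsFactors = FALSE
)

food[2, ]            # second row, all columns
food[, 2]            # second column, as a vector
food[2, 1]           # row 2, column 1: "banana"

food$calories        # select a column by name
food[["calories"]]   # the same column, with the name given as a string

# A selected column behaves like any other vector
mean(food$calories)  # 192.6
```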
You can easily add a new column to a data frame in a similar way to how you assign a variable.
Filter rows in a data frame like this:
Filters can be combined using “&” for AND and “|” for OR.
To see the number of rows in a data frame use the function “nrow” rather than length. Using length returns the length of the containing list rather than the length of one of the vectors that contains the data in a column.
There is also a function called “ncol” for the number of columns.
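A sketch of adding a column, filtering and counting rows, again on a made-up food data frame (columns, values and the kilojoules column name are assumptions for illustration):

```r
food <- data.frame(
  name     = c("apple", "banana", "cheese", "bread", "egg"),
  calories = c(52, 89, 402, 265, 155),
  vegan    = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Add a new column by assigning to a column name that does not exist yet
food$kilojoules <- food$calories * 4.184

# Keep only the rows where the condition is TRUE
food[food$calories > 100, ]

# Combine conditions with & (AND) and | (OR)
food[food$calories > 100 & food$vegan, ]
food[food$calories < 60 | food$calories > 400, ]

nrow(food)     # 5 rows
ncol(food)     # 4 columns, including the new one
length(food)   # 4 as well: the number of columns, not rows
```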
If you already have mad SQL skills (and non-digital analysts frequently do) then these methods of manipulating data frames can be a bit tedious.
Fortunately there is a library called “sqldf” that means you can use your knowledge of SQL when manipulating a data frame.
Now we can run SQL queries against our food data frame:
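A minimal sketch, assuming the sqldf library is installed and using a made-up food data frame (the columns and values are assumptions for illustration). The sqldf function takes a SQL query as a string and treats data frames as tables:

```r
# install.packages("sqldf")   # run once if the library is not installed yet
library(sqldf)

food <- data.frame(
  name     = c("apple", "banana", "cheese", "bread", "egg"),
  calories = c(52, 89, 402, 265, 155),
  stringsAsFactors = FALSE
)

# The data frame name is used as the table name in the query
high_calorie <- sqldf("SELECT name, calories FROM food WHERE calories > 100 ORDER BY calories DESC")
high_calorie
```

The result is itself a data frame, so everything from the previous sections still applies to it.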
R comes with several open data sets built in. For these exercises we will use the “mtcars” data set which contains some statistics about certain makes/models of car. Here is how to load it up:
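One common way to do this, sketched with the base R functions data and head, is shown below. head prints the first six rows:

```r
# Make the built-in data set available under the name mtcars
data(mtcars)

# Show the first six rows
head(mtcars)
```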
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
You can find out more about the data by typing “?mtcars” into the console.
This section covers importing data into R through two methods:
The first is by far the easiest so we will start with that.
There is a simple function for importing from CSV files called “read.csv”. The CSV file is converted into a data frame when it is loaded into R.
##     ONSCode CommissioningRegionCode CommissioningRegionName
## 1 E38000056                     01C    NHS Eastern Cheshire
## 2 E38000151                     01R      NHS South Cheshire
## 3 E38000189                     02D          NHS Vale Royal
## 4 E38000194                     02E          NHS Warrington
## 5 E38000196                     02F       NHS West Cheshire
## 6 E38000208                     12F              NHS Wirral
##   TotalAdmissions MaleAdmissions FemaleAdmissions  X AdmissionsPer100K
## 1              20             NA               NA NA                10
## 2              43             14               29 NA                24
## 3              23              6               17 NA                23
## 4              16             NA               NA NA                 8
## 5              45             10               35 NA                20
## 6              23             NA               NA NA                 7
##   MaleAdmissionsPer100K FemaleAdmissionsPer100K
## 1                    NA                      NA
## 2                    16                      32
## 3                    12                      33
## 4                    NA                      NA
## 5                     9                      30
## 6                    NA                      NA
Use “?read.csv” to see more information about this function. There are many optional arguments that enable you to work with things like tab separated files or files without proper column headers.
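A self-contained sketch of read.csv, assuming a tiny made-up CSV file that is first written to a temporary location so the example runs anywhere (the file contents and variable names are for illustration only):

```r
# Create a small CSV file in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("date,sessions", "2020-02-01,120", "2020-02-02,95"), csv_path)

# The first line of the file is used for the column names by default
visits <- read.csv(csv_path, stringsAsFactors = FALSE)
visits

# For a tab separated file, the sep argument changes the separator:
# read.csv("myfile.tsv", sep = "\t")
```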
You can write CSV files using the function “write.csv”. The following example writes the mtcars data frame to a tab separated variable file.
You should change the file argument to save the data to a location of your choice.
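A sketch of writing a tab separated file. Note that write.csv itself ignores attempts to change the separator, so this sketch uses the closely related write.table, which accepts a sep argument. The temporary file path is a placeholder:

```r
# Placeholder location; replace with a path of your choice
out_path <- tempfile(fileext = ".tsv")

# sep = "\t" gives tab separated output; col.names = NA leaves a blank
# header cell above the row names so the columns line up
write.table(mtcars, file = out_path, sep = "\t", quote = FALSE, col.names = NA)

# Read the file back in to check the round trip
check <- read.delim(out_path, row.names = 1)
dim(check)   # 32 11
```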
This is where things get tricky and a lot of people start to get errors which they can’t find a way past. Take your time and proceed with close attention to detail.
There are four parts to getting this working:
You will need a Google Analytics account with some data to do all of this.
The library we will be using to access the API comes with a default token. But it is better to create your own because then you won’t use up the quota on the default token and you won’t have stuff break because someone else has used up all the quota.
Follow these instructions precisely.
Common mistakes to be avoided include the following:
Here are the instructions. The design of the Google developer console changes rapidly; hopefully this is still accurate.
Let me know any parts that are particularly unclear and I will illustrate with screenshots.
You should now see your client ID and client secret on the screen. If you also see things like “email address” and “javascript origins” then you have generated the wrong type of client ID and should start again from step seven.
Store the client id and client secret in some R variables.
An R library is a collection of functions written by someone else that they have made available for you to use. There is a centralised collection of vetted R libraries called CRAN. Libraries which are available on CRAN are very easy to install.
The function to install a CRAN library is install.packages. To start with, install the library “Rcpp” - not having an up to date version of this library causes some people problems later in the process and installing the latest version now does no harm and can prevent errors later on.
If you get errors about a lack of file system permissions doing this then you will have to ask your IT department to install libraries for you.
There are four main choices of libraries to work with Google Analytics in R:
All the libraries are similar in both how they operate and what you can do with them.
In this section you will authorise R to access Google Analytics data and create a token file which saves the details. This means you will not have to authorise every time and it enables you to automate things to run on a server; just make sure the token file is on the server.
First load the library into R using the library function. Sometimes you will see people use the function require instead. The difference between them is beyond the scope of this tutorial.
First we will use the default token just to check that everything is working.
Allow the library access to Google Analytics which creates a file called .httr-oauth in your working directory which contains your credentials. You don’t need to do anything with this file; just be aware that it is there.
Now list your accounts to check everything is working
You should see a list of accounts that you have access to. This is how you find the view id (important later).
Now how do we do this with our own token?
You should be able to get the list of accounts again:
The first thing to do is to figure out the view ID of the Google Analytics view you want.
We are almost ready to grab some data. But first we will install another library that makes it easier to work with dates.
Here we go!
You will now have a data frame called “sessions” with the daily sessions total for the last two years.
If this is working for you then you have got the R part working.
A big motivation for many people using the API is to avoid sampled data. Sampling occurs when a combination of the number of hits and the query dimensions exceeds a threshold. The API can sometimes be used to reduce the amount of or avoid sampling by making a series of requests and summing the results.
For example, if you are getting sampled data for last month’s report it might be possible to avoid the sampling by requesting data for one day at a time.
The library has a cool “anti_sample” feature that tries to figure out the frequency with which to request data to avoid sampling. Sometimes it will download daily data, sometimes less frequently than that.
The argument for this is “anti_sample”.
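The underlying idea of requesting one day at a time and summing the results can be sketched in base R. This is not how the library implements anti_sample; fetch_sessions is a hypothetical stand-in that returns fake data where a real version would call the API:

```r
# Hypothetical stand-in for a function that requests data for one date range.
# A real version would call the Google Analytics API here.
fetch_sessions <- function(start_date, end_date) {
  data.frame(date = start_date, sessions = 100)
}

# One request per day instead of one request for the whole month
days <- seq(as.Date("2020-01-01"), as.Date("2020-01-31"), by = "day")
daily_results <- lapply(days, function(day) fetch_sessions(day, day))

# Stack the daily results into one data frame and add them up
all_days <- do.call(rbind, daily_results)
sum(all_days$sessions)   # 3100 with the stand-in data
```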
Filters are a bit more complicated in the latest version of the API. But this extra complexity makes it easier to combine predefined filters.
> twitter <- dim_filter(dimension="source",operator="EXACT",expressions="twitter")
> facebook <- dim_filter(dimension="source",operator="EXACT",expressions="facebook")
> events <- met_filter("totalEvents", "GREATER_THAN", 2)
>
> unsampled <- google_analytics_4(viewid,
+ date_range=c(twoyearsago,yesterday),
+ metrics="sessions",
+ dimensions="date",
+ anti_sample=TRUE,
+ dim_filters = filter_clause_ga4(list(twitter, facebook), operator = "OR"),
+ met_filters = filter_clause_ga4(list(events))
+ )
There is just a short set of exercises here which are mainly based around knowing the Google Analytics reporting API. The best resource for getting to grips with this is the Query Explorer.
ggplot is the best platform for making non-interactive visualisations/charts in the world.
First, as you might expect, install the library.
First you will see a series of example charts using the sessions data frame made earlier with data pulled from Google Analytics.
Plot the data as translucent points and a smoothed line
Plot a histogram of the number of daily sessions
Change the theme to something more minimal
Add a horizontal red line to see which days are better and worse than the mean
Change the axis labels, make them bigger and add a chart title
> ggplot(sessions,aes(x=date,y=sessions)) +
+ geom_line() +
+ theme_minimal() +
+ xlab("Date") +
+ ylab("Number of sessions") +
+ ggtitle("Sessions per day over two years") +
+ theme(axis.title.x = element_text(size = 20),
+ axis.title.y = element_text(size = 20),
+ plot.title = element_text(size = 25))
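The other charts described above can be sketched in the same style. The sessions data frame below is a synthetic stand-in (made-up numbers) so that the example runs without Google Analytics access; geom_point, geom_smooth, geom_histogram and geom_hline are standard ggplot2 functions, but these are not necessarily the exact calls behind the original charts:

```r
library(ggplot2)

# Synthetic stand-in for the sessions data frame pulled from Google Analytics
set.seed(1)
sessions <- data.frame(
  date     = seq(as.Date("2019-01-01"), by = "day", length.out = 200),
  sessions = round(1000 + 200 * sin(2 * pi * (1:200) / 7) + rnorm(200, sd = 50))
)

# Translucent points with a smoothed line on top
p_points <- ggplot(sessions, aes(x = date, y = sessions)) +
  geom_point(alpha = 0.3) +
  geom_smooth()

# Histogram of the number of daily sessions
p_hist <- ggplot(sessions, aes(x = sessions)) +
  geom_histogram(bins = 30)

# Minimal theme plus a horizontal red line at the mean
p_mean <- ggplot(sessions, aes(x = date, y = sessions)) +
  geom_line() +
  geom_hline(yintercept = mean(sessions$sessions), colour = "red") +
  theme_minimal()
```

Typing the name of a plot (for example p_points) into the console draws it.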
ggplot implements an idea called “the grammar of graphics” which is a way of thinking about and describing charts.
Essentially there are three elements:
For an example, consider this bar chart:
There are three aesthetics here:
I feel I am not explaining this very well - this is at least partly because it is a difficult and unusual concept. The ggplot book explains this in more detail and has tonnes of nice examples [the affiliate link is not mine - it is the author’s own]. For something that focusses more on the grammar of graphics concept and less on how ggplot implements it, and which is free, check out the Layered Grammar of Graphics paper in PDF.
Anyway, the following image will hopefully make it a little bit clearer how all this works in ggplot.
The following example (using the mtcars data) shows a few more aesthetics (colour and size).
> ggplot(mtcars, aes(x=mpg,y=qsec, size=wt*1000, color=as.factor(cyl))) +
+ geom_point(alpha=0.8) +
+ xlab("Miles per gallon") +
+ ylab("Quarter mile time (s)") +
+ scale_color_discrete(guide = guide_legend(title = "Cylinders")) +
+ scale_size_continuous(guide = guide_legend(title = "Weight (lbs)")) +
+ theme_minimal()
ggplot is a big library with a very broad scope so it is often difficult to know what is and isn’t possible as well as what functions are provided for you. Thankfully, because of ggplot’s fairly unique name Googling for help works well. This is not true for R in general.
The official documentation is also excellent.
These exercises use built-in data sets (like mtcars) because then everyone is starting from the same place so it is possible for me to provide solutions. I encourage you to try similar plots on data pulled in from Google Analytics.
My guess is that you will find these exercises harder than the others because you have not been introduced to all the functions you will need to complete them.
R has good library support for many methods of forecasting. I suspect that it has the best and easiest support out of all the languages you might use for this.
The forecast package is the one we will use in this tutorial.
The examples in this section use the “sessions” data frame containing daily Google Analytics sessions generated earlier.
To start to use the forecast package we must convert this data frame into a timeseries. For the purpose of this example we are interested in forecasts that take into account weekly seasonality (by which I mean the common phenomenon that some days of the week are normally better than others).
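A minimal sketch of the conversion using the base R function ts. The sessions data frame here is a synthetic stand-in with made-up numbers (ten weeks of daily data with a weekly pattern and a gentle upward trend):

```r
# Synthetic stand-in for the sessions data frame
set.seed(42)
weekly_pattern <- c(100, 120, 130, 125, 110, 60, 50)
sessions <- data.frame(
  date     = seq(as.Date("2020-01-06"), by = "day", length.out = 70),
  sessions = rep(weekly_pattern, times = 10) + 0:69 + rnorm(70, sd = 5)
)

# frequency = 7 tells R that the pattern repeats every seven observations
sessions_ts <- ts(sessions$sessions, frequency = 7)
frequency(sessions_ts)   # 7
```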
Once you have created a timeseries you can do interesting things very easily.
Above you see a collection of four charts.
This analysis makes the assumption that the results you see are the sum of the underlying trend, the weekly seasonal factors and some randomness. These assumptions fit a lot of metrics you see in web analytics.
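One way to produce a four-panel chart of this kind is the base R function decompose, sketched here on a synthetic weekly time series (made-up numbers). The check at the end illustrates the additive assumption: trend plus seasonal plus random gives back the observed values.

```r
# Synthetic daily data: weekly pattern + upward trend + noise
set.seed(42)
weekly_pattern <- c(100, 120, 130, 125, 110, 60, 50)
sessions_ts <- ts(rep(weekly_pattern, times = 10) + 0:69 + rnorm(70, sd = 5),
                  frequency = 7)

# Split the series into trend, seasonal and random parts (additive by default)
parts <- decompose(sessions_ts)
plot(parts)   # panels: observed, trend, seasonal, random

# The trend is NA for the first and last few days, so compare the rest
rebuilt <- as.numeric(parts$trend + parts$seasonal + parts$random)
keep <- !is.na(rebuilt)
all.equal(rebuilt[keep], as.numeric(sessions_ts)[keep])   # TRUE
```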
A nice and simple forecasting model is Holt-Winters [by simple, I mean simple to use]. I don’t want to get too much into the maths here, but if forecasting is very important for you then you should have a look for yourself.
Building the forecast model is very easy:
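A sketch using HoltWinters and predict from base R's stats package, on a synthetic weekly time series (made-up numbers). This is an assumption about the general approach rather than the exact code of this tutorial, which uses the forecast package on top of the model:

```r
# Synthetic daily data: weekly pattern + upward trend + noise
set.seed(42)
weekly_pattern <- c(100, 120, 130, 125, 110, 60, 50)
sessions_ts <- ts(rep(weekly_pattern, times = 10) + 0:69 + rnorm(70, sd = 5),
                  frequency = 7)

# Fit the model; the smoothing parameters are estimated from the data
model <- HoltWinters(sessions_ts)

# Predict the next 14 days
next_two_weeks <- predict(model, n.ahead = 14)
next_two_weeks
```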