R for web analysts

An introduction and guide to the R programming language for web analysts

If you are new to programming then take your time reading. There are lots of new concepts here - don't expect to be able to scan through quickly and still get good value from your time.

About this document

Last updated on Wednesday 02 January 2019

This document is an extension and improvement of the course materials I originally prepared for the MeasureCamp training session on using R. The original materials are available here

The goal of the document (and the training session) is to help web analysts get to grips with coding in R so that they can solve the kinds of problems web analysts face.

Unlike at the training sessions, I am not there in person to help people along, so I hope this document contains enough detail, examples and background for people to progress without me. This means it has ended up being quite long.


Throughout the document, code samples are presented in the following format:

>     2+2
## [1] 4

First comes the R code, which can be copied or typed into the console (if you don't know what the console is, it will be explained later), followed by the output you can expect to see when the code is run.

There will be some code samples where no output is presented. This is either because running the code produces no console output or because I have chosen not to make the results of evaluating that code public.

I highly recommend running all the sample code you come across.

This document is available under a CC Attribution License.

About me

A little bit of context about me might help you decide whether or not my advice is worth listening to. If your background, experiences and desired outcomes are completely different from mine then take this document with the pinch of salt it deserves.

Web analyst background

I started my career working in PPC and began to make an effort to learn more about web analytics after gaps and errors in client implementations led me to take actions that appeared awesome but which were actually harmful.

My favourite example of this is when I sent tonnes of traffic from Nigeria at a lead generation form - the form got filled in a lot (which is what was being tracked) but the client did not make very much money from this.

My experience spans a wide range of sectors but it is very narrow when it comes to tools. Almost all (95%+) of my work is done using Google Analytics.

Programming background

I studied Maths at university which, with the courses I chose, involved about 8-12 hours of programming tuition using the languages Maple and Matlab. At the end of university I could not use programming to produce useful business outcomes. However, I could produce a sweet animated gif showing how a Fourier series converges.

It wasn't long after entering the professional world that I began to think that learning to code might be useful. This was probably at the time when the "everyone must learn to code" movement was getting started.

I began learning a language called Haskell because I had studied a small amount of category theory at university which meant that I got some of the jokes about how difficult Haskell is to learn. Those of you who know me well will know that picking the difficult choice like this is fairly typical behaviour for me.

If you want to learn more about computing and maths and aren't in a massive rush to get something useful done then I really recommend you have a good look at Haskell. I believe that one day Haskell, or a language inspired by Haskell, will take over the world but that day will not be soon and right now Haskell is shit for the kind of tasks web analysts typically do.

The first language is the hardest to learn so once I had a bit of basic coding knowledge I was able to start dabbling in Python and then later R. With the sort of tasks I do right now most of my coding is in R.

Why learn R?

I often get asked "should I learn R?" type questions. The answer is not an emphatic "YES", and the trade-offs to consider are complicated.

Learning R for analysts who are first time coders

For an analyst who has no experience with programming the "should I learn R?" question is actually two questions:

  1. Should I learn to code?
  2. Should the first language I learn be R?

The answer to the first question is an emphatic yes because being able to program enables you to use more powerful tools than you would use otherwise.

There is a significant opportunity cost to learning to code, especially to start with, when doing simple things in code takes way longer than doing them in Excel. But the payoff is huge, especially when you consider that you have the whole of the rest of your career to recoup the investment.

I am going to reiterate that point because it is very important. To start with, everything will take longer and you won't be able to do anything cool. The stuff you start off coding will be the type of thing that would take you less than a minute in Excel. It is worth spending the time to build this foundation. This is not specific to R - you will have this problem with any programming language.

R is good for doing analytical tasks so it is appropriate for analysts to learn as a first language. The code samples in this document will hopefully help speed you over the "I can do this 10x quicker in Excel" gap.

Learning R for analysts who can code in something else already

Once you set aside the fact that learning new things generally makes you a better person, the case for learning R over another language that you already know is much weaker. For me, the main strengths of R compared to other languages I have used are:

  1. The library and language support for the type of tasks that analysts do is excellent
  2. Libraries implementing cutting edge techniques straight out of academia are more likely to be available in R than elsewhere. This sounds really cool, but if you are the sort of person where this is actually a really useful feature then this guide is far too basic for you.
  3. The support for charting and visualisation is extremely good

In all of these areas, particularly 1 and 3, Python is rapidly catching up with R. So if the language you already know is Python then knowing R too will not increase the amount of cool stuff you can do by as much as if the language you already know is PHP.

Other points

Another thing to think about is the blurring of the boundary between analysis and the operations that implement the insight. It is increasingly common for a hybrid coder/analyst to define behaviours that identify a user segment and to code site functionality that does something special with this user segment. If this kind of work is your end goal and you only have time to learn one programming language then first you should try to remove your "one programming language" constraint and then if that doesn't work you should learn Python.

Web analysts frequently have responsibility for site tagging too, so it is often necessary to know some JavaScript as well.

The following table has a rough and ready comparison of R, python and javascript for web analytical tasks:

Task                      | R   | Python                        | JavaScript
Site tagging              | 0/5 | 0/5                           | 5/5
Data analysis             | 5/5 | 3.5/5 (and rapidly improving) | I used to say 1/5, but I haven't kept up with developments here so no idea!
Presentation and charting | 4/5 | 2/5                           | 3/5 to 5/5 depending on whether you can use d3.js
Building website features | 2/5 | 5/5                           | 5/5

Perhaps I will also have to do a guide on Python!

Getting Started

Installing the necessary programs

Firstly install R from one of these pages

That is all you need to do, but you will probably find things way easier if you also install a piece of software called RStudio. RStudio sits on top of R and presents a much less intimidating user interface. I highly recommend it for beginners.

Install RStudio from here. The rest of this tutorial will assume you are using RStudio.

Both R and RStudio are free software both in terms of "free as in free beer" and "free as in freedom". There is much to be said for and against the use of free software in business, but my view is that for things like programming tools, free software completely dominates the non-free alternatives.

On opening RStudio you should see a pane on the left called the console. You enter R commands in this pane with the cursor after the ">" character. Press enter to run the command.

>     2+15
## [1] 17

If you get the answer 17 when you type "2+15" and then press enter then things have installed correctly. You have got over the first hurdle. Well done.

The very basics

You have just seen (and replicated for yourself) an example of adding two numbers together. The other foundational arithmetic functions work in similar ways. Try these examples out and then try some of your own.

>     3-21.5
## [1] -18.5
>     2*5
## [1] 10
>     9/5
## [1] 1.8

The next set of examples is similar but slightly more complicated. I've included comments in the code to make it easier to understand. A comment is a line that starts with a "#" character. A comment line does nothing to change the behaviour of a program - it is only there for a person reading the code.

>     # This line is a comment
>     # Comments don't return any output

Note that there is no output from the above code block. On with the examples:

>     2 + (5*5)
## [1] 27
>     2*pi # pi is the number pi (3.14159...)
## [1] 6.283185

Not every arithmetical operation makes sense:

>     1/0
## [1] Inf
>     # Dividing the number 5 by the letter 'a'
>     # This makes no sense!
>     5/'a'
## Error in 5/"a": non-numeric argument to binary operator

In this error message the "binary operator" is division. It is a binary operation because it takes two numbers as input - the numerator and the denominator. The "non-numeric argument" is the letter 'a'.

Ok, now you can throw away your pocket calculator or abacus and use R instead.
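Two more arithmetic operators worth knowing early on, sketched here as a small extra example, are exponentiation and the modulo (remainder) operator:

```r
# raise a number to a power
2^10
## [1] 1024

# the remainder after division (modulo)
17 %% 5
## [1] 2

# integer division (discard the remainder)
17 %/% 5
## [1] 3
```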

It is far more useful to do things to a whole series of numbers rather than one or two numbers at a time. R has excellent support for this.

There are two main representations of a series of elements:

  1. A vector - all the elements must have the same type e.g. all numbers or all text
  2. A list - the elements can have different types

Unless specifically mentioned otherwise, you will use vectors in this tutorial.
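To see the difference for yourself, compare what happens when you mix types. A vector silently converts everything to a common type (here, text), while a list keeps each element as it is:

```r
# a vector forces everything to the same type -
# the number and the logical both become text
c(1, "two", TRUE)
## [1] "1"    "two"  "TRUE"

# a list preserves each element's type,
# so the first element is still a number
mixed <- list(1, "two", TRUE)
mixed[[1]] + 10
## [1] 11
```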

Create a vector with more than one element like this:

>     c("This","is","a","vector","of","words")
## [1] "This"   "is"     "a"      "vector" "of"     "words"

There are some functions that operate on a whole vector and return a single answer. I call these aggregate functions. There will be more about functions in general later.

>     # get the length of a vector (the number of elements)
>     length(c("This","is","a","vector","of","words"))
## [1] 6
>     # add up all the elements in a vector
>     sum(c(1,2,3,4,5))
## [1] 15
>     sum(c("This","is","a","vector","of","words"))
## Error in sum(c("This", "is", "a", "vector", "of", "words")): invalid 'type' (character) of argument

Other functions operate on each element of the vector and return another vector as the result:

>     # make everything uppercase
>     toupper(c("This","is","a","vector","of","words"))
## [1] "THIS"   "IS"     "A"      "VECTOR" "OF"     "WORDS"
>     # adds 2 to each element of the vector
>     c(1,2,3,4,5) + 2
## [1] 3 4 5 6 7

In fact, when we were doing simple arithmetic before, we weren't just adding numbers together; we were adding together vectors of numbers - but the vectors only had one element:

>     length(5)
## [1] 1
>     # use '==' to check if two things are equal
>     "word" == c("word")
## [1] TRUE

This means we can do things like this:

>     c(1,2,3) + c(pi,15,1/0)
## [1]  4.141593 17.000000       Inf

Or this:

>     c(1,2,3,4) * c(1,2)
## [1] 1 4 3 8

In the above example, the shorter vector is recycled (its elements are repeated) until it is the same length as the longer vector.

As always, there are some things we can try to do that just don't make sense:

>     c(1,2,3,4) + c(1,2,3)
## Warning in c(1, 2, 3, 4) + c(1, 2, 3): longer object length is not a
## multiple of shorter object length
## [1] 2 4 6 5

As you can see, you get a warning that R isn't quite sure if you want to be doing this or not.


Variables

You can assign a particular result to a variable to make it easier to refer to later.

In R people generally use "<-" to assign to a variable, but you might also see code where people use "=". There is a subtle difference between the two, but it is not important right now. In this document we will use "<-".

This code block assigns a vector to the variable "myvector"

>     myvector <- c("This","is","a","vector","of","words")

You can see the value of a variable by entering it into the console

>     myvector
## [1] "This"   "is"     "a"      "vector" "of"     "words"

We can use all the normal vector functions on our variable as if we had typed the whole thing out each time

>     length(myvector)
## [1] 6
>     toupper(myvector)
## [1] "THIS"   "IS"     "A"      "VECTOR" "OF"     "WORDS"

You can also overwrite a variable by reusing the name when you assign something else

>     myvector <- c(1,2,3,4)
>     myvector
## [1] 1 2 3 4
>     sum(myvector)
## [1] 10

It will help you out a lot if you use variable names that semantically link with what they reference. For example call a vector of daily sessions "sessions" rather than "vector".


Functions

You have already seen some functions like "sum" and "length".

To get more information on what a function does type a question mark followed by the name of the function into the console.

>     # help with the sum function
>     ?sum

You don't need to be able to understand all of a function's help file to use the function.

Some functions have optional arguments. Here is an example:

>     # create a variable called "x"
>     # the "NA" is used for a blank or missing value
>     # it is quite common to have these
>     x <- c(1,2,3,4,NA)
>     sum(x)
## [1] NA

Summing over a vector that contains an NA returns NA as the result. This makes sense when you think that it is meaningless to add NA to a number.

>     sum(x, na.rm=TRUE)
## [1] 10

Instead we use the optional argument "na.rm=TRUE" to tell the sum function to ignore the NA values.

Use the function help to see a list of function arguments and what they mean (e.g. "?sum").
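"na.rm" is not unique to sum - many aggregate functions accept it. The related function "is.na" tests each element of a vector for missingness, which gives a handy way to count NAs:

```r
x <- c(1, 2, 3, 4, NA)

# mean and max also take na.rm
mean(x, na.rm = TRUE)
## [1] 2.5
max(x, na.rm = TRUE)
## [1] 4

# TRUE where an element is missing
is.na(x)
## [1] FALSE FALSE FALSE FALSE  TRUE

# TRUE counts as 1 when summed, so this counts the NAs
sum(is.na(x))
## [1] 1
```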

Making your own functions

Creating your own functions is a great way to save time when doing repetitive tasks.

Functions are created using the function function (!!?!). Here is a very simple example function that takes no arguments and always returns the answer 42.

>     theAnswer <- function() {
+       return(42)
+     }
>     theAnswer()
## [1] 42

When copying this code into RStudio, don't copy the "+" symbols that appear at the start of each line - they show that R is expecting more input before it starts computing. This means you don't always have to fit all your commands on one line.

Here is an example that takes two arguments and adds them together:

>     adder <- function(x,y) {
+       return(x+y)
+     }
>     adder(32,2454)
## [1] 2486

The final example inserts an element at the start of a vector:

>     addToStartOfVector <- function(element,oldvector) {
+         newvector <- c(element,oldvector)
+         return(newvector)
+     }
>     addToStartOfVector(5,c(1,2,3,4))
## [1] 5 1 2 3 4
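Your own functions can have optional arguments too, by giving an argument a default value - just like "na.rm=TRUE" in sum. Here is a toy function (the name "power" is just for illustration):

```r
# "exponent" defaults to 2 when the caller doesn't supply it
power <- function(x, exponent = 2) {
  return(x ^ exponent)
}
power(3)
## [1] 9
power(3, exponent = 3)
## [1] 27
```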

Data frames

So far we have looked at vectors which contain one dimensional data. Far more common is to have a table of two dimensional data. The most common way of working with this kind of structure in R is called a data frame.

>     foods <- data.frame(meal=c("Breakfast","Breakfast","Breakfast","Lunch","Dinner"),
+                         food=c("Bacon","Sausage","Beans","Pork Pie","Raveoli"),
+                         amount=c(2,2,387,1,35)
+                        )
>     foods
##        meal     food amount
## 1 Breakfast    Bacon      2
## 2 Breakfast  Sausage      2
## 3 Breakfast    Beans    387
## 4     Lunch Pork Pie      1
## 5    Dinner  Raveoli     35

A data frame is a list of vectors. Recall that a list is a series of elements that are not necessarily of the same type, and you can see why a data frame has to be a list of vectors rather than a vector of vectors.

You can see that when creating your own data frame the columns are just vectors with a column name. Use the names function to see the column names for a data frame.

>     names(foods)
## [1] "meal"   "food"   "amount"

Use the column numbers and row numbers to select rows, columns and elements from a data frame.

>     # select the third row
>     foods[3,]
##        meal  food amount
## 3 Breakfast Beans    387
>     # select the first column
>     foods[,1]
## [1] Breakfast Breakfast Breakfast Lunch     Dinner   
## Levels: Breakfast Dinner Lunch
>     # see the number of beans consumed
>     foods[3,3]
## [1] 387

An easier way to select a column is like this:

>     foods$meal
## [1] Breakfast Breakfast Breakfast Lunch     Dinner   
## Levels: Breakfast Dinner Lunch

Using a "$" followed by the name of the column is much easier to understand than trying to remember which is the eighth column when you are reading old code.

Once you have selected a column it behaves just like a vector:

>     # use "?mean" if you don't know what this function does
>     mean(foods$amount)
## [1] 85.4

You can easily add a new column to a data frame in a similar way to how you assign a variable.

>     foods$cumulativeAmount <- cumsum(foods$amount)
>     # what does cumsum do?
>     # how do you find this information out?
>     foods
##        meal     food amount cumulativeAmount
## 1 Breakfast    Bacon      2                2
## 2 Breakfast  Sausage      2                4
## 3 Breakfast    Beans    387              391
## 4     Lunch Pork Pie      1              392
## 5    Dinner  Raveoli     35              427

Filter rows in a data frame like this:

>     # note the position of the comma - easy to miss!
>     foods[foods$amount > 5,]
##        meal    food amount cumulativeAmount
## 3 Breakfast   Beans    387              391
## 5    Dinner Raveoli     35              427

Filters can be combined using "&" for AND and "|" for OR.

>     foods[foods$amount > 5 & foods$meal == "Breakfast",]
##        meal  food amount cumulativeAmount
## 3 Breakfast Beans    387              391
>     foods[foods$amount > 5 | foods$meal == "Breakfast",]
##        meal    food amount cumulativeAmount
## 1 Breakfast   Bacon      2                2
## 2 Breakfast Sausage      2                4
## 3 Breakfast   Beans    387              391
## 5    Dinner Raveoli     35              427

To see the number of rows in a data frame use the function "nrow" rather than length. Using length returns the length of the containing list rather than the length of one of the vectors that contains the data in a column.

>     nrow(foods)
## [1] 5

There is also a function called "ncol" for the number of columns.
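There is also a "dim" function that returns both counts at once, and - because a data frame is a list of column vectors - "length" on a data frame counts columns rather than rows, which trips people up. A quick sketch using a small throwaway data frame:

```r
tinyframe <- data.frame(meal = c("Breakfast", "Lunch"),
                        amount = c(2, 1))

# rows then columns
dim(tinyframe)
## [1] 2 2

# length counts the columns, not the rows
length(tinyframe)
## [1] 2
```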


Using SQL on data frames

If you already have mad SQL skills (and non-digital analysts frequently do) then these methods of manipulating data frames can be a bit tedious.

Fortunately there is a library called "sqldf" that means you can use your knowledge of SQL when manipulating a data frame.

>     # first install the package
>     install.packages("sqldf")
>     # then load the package
>     library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite

Now we can run SQL queries against our food data frame:

>     sqldf("SELECT * FROM foods WHERE amount > 5 OR meal = 'Breakfast'")
##        meal    food amount cumulativeAmount
## 1 Breakfast   Bacon      2                2
## 2 Breakfast Sausage      2                4
## 3 Breakfast   Beans    387              391
## 4    Dinner Raveoli     35              427
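If you would rather stay in base R, GROUP BY style summaries are also possible with the "aggregate" function. This sketch rebuilds a cut-down version of the foods data under a different name so that it runs on its own:

```r
meals <- data.frame(meal = c("Breakfast", "Breakfast", "Lunch"),
                    amount = c(2, 387, 1))

# total amount per meal - roughly
# SELECT meal, SUM(amount) FROM meals GROUP BY meal
aggregate(amount ~ meal, data = meals, FUN = sum)
##        meal amount
## 1 Breakfast    389
## 2     Lunch      1
```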

Exercises

R comes with several open data sets built in. For these exercises we will use the "mtcars" data set, which contains some statistics about certain makes/models of car. Here is how to load it up:

>     data(mtcars)
>     # now you have a data frame called "mtcars" with the data
>     head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

You can find out more about the data by typing "?mtcars" into the console.

  1. What is the mean mpg across all the cars?
    >     mean(mtcars$mpg)
    ## [1] 20.09062
  2. What is the maximum number of cylinders seen in the data?
    >     max(mtcars$cyl)
    ## [1] 8
  3. What is the difference in the mean mpg between cars with fewer than 5 cylinders and those with 5 or more cylinders?
    >     fewcylinders <- mtcars[mtcars$cyl < 5,]
    >     manycylinders <- mtcars[mtcars$cyl >= 5,]
    >     difference <- mean(manycylinders$mpg) - mean(fewcylinders$mpg)
    >     abs(difference)
    ## [1] 10.01602
  4. What proportion of the tested cars have an automatic transmission?
    >     automatic <- mtcars[mtcars$am == 0,]
    >     nrow(automatic) / nrow(mtcars)
    ## [1] 0.59375
  5. Which car has the fastest quarter mile time?
    >     # order the data
    >     fastest <- mtcars[order(mtcars$qsec),]
    >     # get the car name from the first row
    >     row.names(fastest)[1]
    ## [1] "Ford Pantera L"
  6. Write a function which takes an mpg value as input and returns a vector of all cars that have a better mpg than the input value.
    >     betterMpg <- function(targetmpg) {
    +       filteredcars <- mtcars[mtcars$mpg > targetmpg,]
    +       return( row.names(filteredcars) )
    +     }
    >     betterMpg(30)
    ## [1] "Fiat 128"       "Honda Civic"    "Toyota Corolla" "Lotus Europa"

Importing data

This section covers importing data into R through two methods:

  1. CSV files
  2. The Google Analytics API

The first is by far the easiest so we will start with that.

Importing from CSV files

There is a simple function for importing from CSV files called "read.csv". The CSV file is converted into a data frame when it is loaded into R.

>     # yes - it works with urls and local files
>     hospitaladmissions <- read.csv("http://www.eanalytica.com/files/UK-Admissions.csv")
>     head(hospitaladmissions)
##     ONSCode CommissioningRegionCode CommissioningRegionName
## 1 E38000056                     01C   NHS Eastern Cheshire 
## 2 E38000151                     01R     NHS South Cheshire 
## 3 E38000189                     02D         NHS Vale Royal 
## 4 E38000194                     02E         NHS Warrington 
## 5 E38000196                     02F      NHS West Cheshire 
## 6 E38000208                     12F             NHS Wirral 
##   TotalAdmissions MaleAdmissions FemaleAdmissions  X AdmissionsPer100K
## 1              20             NA               NA NA                10
## 2              43             14               29 NA                24
## 3              23              6               17 NA                23
## 4              16             NA               NA NA                 8
## 5              45             10               35 NA                20
## 6              23             NA               NA NA                 7
##   MaleAdmissionsPer100K FemaleAdmissionsPer100K
## 1                    NA                      NA
## 2                    16                      32
## 3                    12                      33
## 4                    NA                      NA
## 5                     9                      30
## 6                    NA                      NA

Use "?read.csv" to see more information about this function. There are many optional arguments that enable you to work with things like tab separated files or files without proper column headers.

You can write CSV files using the function "write.csv". Note that "write.csv" always uses a comma as the separator, so to write a tab separated file you need the more general function "write.table". The following example writes the mtcars data frame to a tab separated file.

>     # \t means tab
>     write.table(mtcars, file="/tmp/mtcars.tsv", sep="\t")

You should change the file argument to save the data to a location of your choice.

Google Analytics

This is where things get tricky and a lot of people start to get errors which they can't find a way past. Take your time and proceed with close attention to detail.

There are four parts to getting this working:

  1. Generating OAuth client ID and client secret to enable you to access the API
  2. Installing a relevant R library
  3. Generating an authorisation token using the library
  4. Finally, getting some data out of Google Analytics

You will need a Google Analytics account with some data to do all of this.

Generating the client ID and client secret

The library we will be using to access the API comes with a default token. But it is better to create your own because then you won't use up the quota on the default token and you won't have stuff break because someone else has used up all the quota.

Follow these instructions precisely.

Common mistakes to be avoided include the following:

  1. Reusing the client id and client secret from an old project of the wrong type. It is safest to generate a fresh project for use with R.
  2. Generating the wrong type of credentials.
  3. Forgetting to enable the Analytics API.

Here are the instructions. The design of the Google developer console changes rapidly; hopefully this is still accurate.

Let me know any parts that are particularly unclear and I will illustrate with screenshots.

  1. Go to https://console.developers.google.com/project
  2. Click "create project"
  3. The project name and the project id are not very important, but you should name it something meaningful so you don't accidentally delete it at a later date. Ignore the advanced options and click "create".
  4. Wait for the project to be created. When it is created you are automatically redirected to the project page
  5. Search for "analytics" and enable the "Analytics API" by selecting it from the list and clicking "Enable".
  6. Click "Go to credentials"
  7. Select "Other UI (e.g. Windows, CLI tool)" in the "Where will you be calling the API from?" dropdown
  8. Check "User data" for the "What data will you be accessing" field. And click "What credentials do I need?" to proceed.
  9. Create a client id - the default of "Other client 1" is fine.
  10. You will also have to fill in some details for the OAuth consent screen. This is shown to people when they authorise your app. Only the project name and email address fields are compulsory.
  11. Eventually you will be shown your Client ID. You will also need the Client secret. If you don't see the secret, click "Done" and navigate to "Credentials".
  12. Click your Client ID in the list and you should see the Client ID and Client Secret.

You should now see your client ID and client secret on the screen. If you also see things like "email address" and "javascript origins" then you have generated the wrong type of client ID and should start again from step seven.

Store the client id and client secret in some R variables.

>     clientid <- "YOUR CLIENT ID"
>     clientsecret <- "YOUR CLIENT SECRET"

Installing R libraries

An R library is a collection of functions written by someone else that they have made available for you to use. There is a centralised collection of vetted R libraries called CRAN. Libraries which are available on CRAN are very easy to install.

The function to install a CRAN library is install.packages. To start with, install the library "Rcpp" - not having an up to date version of this library causes some people problems later in the process, and installing the latest version now does no harm and can prevent errors later on.

>     install.packages("Rcpp")

If you get errors about a lack of file system permissions doing this then you will have to ask your IT department to install libraries for you.

There are four main choices of libraries to work with Google Analytics in R:

  1. RGA
  2. RGoogleAnalytics - this is supported by Tatvic
  3. rga - not available on CRAN so slightly harder to install. There are some API techniques to reduce how often sampled data is returned; this library makes these easier to use than the others.
  4. googleAnalyticsR - this is the option we will use here

All the libraries are similar in both how they operate and what you can do with them.

>     install.packages("googleAnalyticsR")

Generating a token

In this section you will authorise R to access Google Analytics data and create a token file which saves the details. This means you will not have to authorise every time and it enables you to automate things to run on a server; just make sure the token file is on the server.

First load the library into R using the library function. Sometimes you will see people use the function require instead. The difference between them is beyond the scope of this tutorial.

>     library(googleAnalyticsR)
## 2019-01-02 16:48:04> Default Google Project for googleAnalyticsR is now set.  
##  This is shared with all googleAnalyticsR users. 
##  If making a lot of API calls, please: 
##  1) create your own Google Project at https://console.developers.google.com 
##  2) Activate the Google Analytics Reporting API 
##  3) set options(googleAuthR.client_id) and options(googleAuthR.client_secret) 
##  4) Reload the package.
## 2019-01-02 16:48:04> Set API cache
## 2019-01-02 16:48:04> No environment argument found, looked in GA_AUTH_FILE

First we will use the default token just to check that everything is working.

>     ga_auth()

Allowing the library access to Google Analytics creates a file called .httr-oauth in your working directory which contains your credentials. You don't need to do anything with this file; just be aware that it is there.

Now list your accounts to check everything is working

>     ga_account_list()

You should see a list of accounts that you have access to. This is how you find the view id (important later).

Now how do we do this with our own token?

>     options(googleAuthR.client_id = clientid)
>     options(googleAuthR.client_secret = clientsecret)
>     ga_auth()

You should be able to get the list of accounts again:

>     ga_account_list()

Get some data

The first thing to do is to figure out the view ID of the Google Analytics view you want.

>     viewid <- "ID NUMBER"

We are almost ready to grab some data. But first we will install another library that makes it easier to work with dates.

>     install.packages("lubridate")

Here we go!

>     library(lubridate)
>     yesterday <- today() - days(1)
>     twoyearsago <- today() - days(365*2)
>     sessions <- google_analytics_4(viewid,
+                                    date_range=c(twoyearsago,yesterday),
+                                    metrics="sessions",
+                                    dimensions="date")

You will now have a data frame called "sessions" with the daily sessions total for the last two years. (When the lubridate package loads you may also see a message that it masks the base function "date"; this is expected.)

>     head(sessions)
##         date sessions
## 1 2015-09-22      169
## 2 2015-09-23      172
## 3 2015-09-24      162
## 4 2015-09-25      155
## 5 2015-09-26      103
## 6 2015-09-27      122

If this is working for you then you have got the R part working.

Avoiding sampling

A big motivation for many people using the API is to avoid sampled data. Sampling occurs when a combination of the number of hits and the query dimensions exceeds a threshold. The API can sometimes be used to reduce or avoid sampling by making a series of requests and summing the results.

For example, if you are getting sampled data for last month's report it might be possible to avoid the sampling by requesting data for one day at a time.
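To make that idea concrete, here is a rough sketch of the manual version: one request per day, then stick the results together. This is just to illustrate the principle (the dates are arbitrary examples); the anti_sample feature described next does this for you, more cleverly.

>     # a sketch: request each day separately, then combine
>     dates <- seq(as.Date("2018-12-01"), as.Date("2018-12-31"), by="day")
>     daily <- lapply(dates, function(d) {
+         google_analytics_4(viewid,
+                            date_range=c(d, d),
+                            metrics="sessions",
+                            dimensions="source")
+     })
>     unsampled <- do.call(rbind, daily)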

The library has a cool "anti_sample" feature that tries to figure out how frequently to request data in order to avoid sampling. Sometimes it will download daily data, sometimes less frequently than that.

The argument for this is "anti_sample"

>     unsampled <- google_analytics_4(viewid,
+                                     date_range=c(twoyearsago,yesterday),
+                                     metrics="sessions",
+                                     dimensions="date",
+                                     anti_sample=TRUE
+                                    )

Filters are a bit more complicated in the latest version of the API. But this extra complexity makes it easier to combine predefined filters.

>     twitter <- dim_filter(dimension="source",operator="EXACT",expressions="twitter")
>     facebook <- dim_filter(dimension="source",operator="EXACT",expressions="facebook")
>     events <- met_filter("totalEvents", "GREATER_THAN", 2)
>     unsampled <- google_analytics_4(viewid,
+                                     date_range=c(twoyearsago,yesterday),
+                                     metrics="sessions",
+                                     dimensions="date",
+                                     anti_sample=TRUE,
+                                     dim_filters = filter_clause_ga4(list(twitter, facebook), operator = "OR"),
+                                     met_filters = filter_clause_ga4(list(events))
+                                     )

There is just a short set of exercises here, mainly based around knowing the Google Analytics Reporting API. The best resource for getting to grips with this is the Query Explorer.

  1. Query the API using more than one dimension (e.g. date and browser).
    >     # this is just an example.
    >     # there are many dimensions you could use
    >     browsersessions <- google_analytics_4(viewid,
    +                                           date_range=c(twoyearsago, yesterday),
    +                                           dimensions=c("date","browser"),
    +                                           metrics="sessions"
    +                                          )
  2. Query the API using more than one metric.
    >      sessionsevents <- google_analytics_4(viewid,
    +                                           date_range=c(twoyearsago, yesterday),
    +                                           dimensions=c("date"),
    +                                           metrics=c("sessions","totalEvents")
    +                                          )
  3. Query the API using your own filter. The segment reference documentation might be useful here.
  4. Query the API pulling data from last month. The dates should be generated dynamically so that you can run the exact same code when you rerun the report next month. Be careful - what happens when you run your code in January?
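As a hint for the last exercise, lubridate's floor_date makes it easy to generate last month's start and end dates without tripping up in January: subtracting from the first of the current month rolls back through the year boundary correctly.

>     library(lubridate)
>     monthstart <- floor_date(today(), "month") - months(1)
>     monthend <- floor_date(today(), "month") - days(1)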


ggplot is the best platform in the world for making non-interactive visualisations/charts.

First, as you might expect, install the library.

>     install.packages("ggplot2")

To start, you will see a series of example charts using the sessions data frame made earlier, with data pulled from Google Analytics.

>     # load the library
>     library(ggplot2)
>     ggplot(sessions,aes(x=date,y=sessions)) + geom_line()

Plot the data as translucent points and a smoothed line

>     ggplot(sessions,aes(x=date,y=sessions)) +
+         geom_point(alpha=0.2) +
+         geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Plot a histogram of the number of daily sessions

>     ggplot(sessions,aes(x=sessions)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
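As the message suggests, you can silence the warning by choosing a binwidth yourself. The right value depends on your data, so the 25 below is just an example:

>     ggplot(sessions,aes(x=sessions)) + geom_histogram(binwidth=25)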

Change the theme to something more minimal

>     ggplot(sessions,aes(x=date,y=sessions)) +
+         geom_line() +
+         theme_minimal()

Add a horizontal red line to see which days are better and worse than the mean

>     avg <- mean(sessions$sessions)
>     ggplot(sessions,aes(x=date,y=sessions)) +
+         geom_line() +
+         geom_hline(yintercept=avg, size=5, color="red") +
+         theme_minimal()

Change the axis labels, make them bigger and add a chart title

>     ggplot(sessions,aes(x=date,y=sessions)) +
+         geom_line() +
+         theme_minimal() +
+         xlab("Date") +
+         ylab("Number of sessions") +
+         ggtitle("Sessions per day over two years") +
+         theme(axis.title.x = element_text(size = 20),
+               axis.title.y = element_text(size = 20),
+               plot.title = element_text(size = 25))
The grammar of graphics

ggplot implements an idea called "the grammar of graphics", which is a way of thinking about and describing charts.

Essentially there are three elements:

  1. The data you want to display
  2. Aesthetics - mappings from the data to visual properties such as position, colour and size
  3. Geometries - the shapes (points, lines, bars and so on) used to draw the aesthetics

For an example, consider this bar chart:

[bar chart: count of things per person, bars coloured and positioned by attribute]
There are three aesthetics here:
  1. The count of things - expressed on the y axis
  2. The attribute - expressed by the color and position of the bars
  3. The person - expressed by the position of the bars

I feel I am not explaining this very well - this is at least partly because it is a difficult and unusual concept. The ggplot book explains it in more detail and has tonnes of nice examples [the affiliate link is not mine - it is the author's own]. For something free that focuses more on the grammar of graphics concept and less on how ggplot implements it, check out the Layered Grammar of Graphics paper in PDF.
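One thing that might help: in ggplot the data/aesthetics mapping and the geometry really are separate layers, so you can build a plot object once and try different geometries on it. Using the sessions data frame from earlier:

>     p <- ggplot(sessions, aes(x=date, y=sessions))  # data + aesthetic mappings
>     p + geom_line()   # one geometry...
>     p + geom_point()  # ...or another, with the same mappings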

Anyway, the following example (using the mtcars data) will hopefully make it a little bit clearer how all this works in ggplot. It maps a few more aesthetics (colour and size).

>     ggplot(mtcars, aes(x=mpg,y=qsec, size=wt*1000, color=as.factor(cyl))) +
+        geom_point(alpha=0.8) +
+        xlab("Miles per gallon") +
+        ylab("Quarter mile time (s)") +
+        scale_color_discrete(guide = guide_legend(title = "Cylinders")) +
+        scale_size_continuous(guide = guide_legend(title = "Weight (lbs)")) +
+        theme_minimal()
Getting more help with ggplot

ggplot is a big library with a very broad scope, so it is often difficult to know what is and isn't possible, as well as which functions are provided for you. Thankfully, because ggplot has a fairly distinctive name, Googling for help works well. This is not true for R in general.

The official documentation is also excellent.
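From within R itself, the built-in help system is also worth knowing about:

>     ?geom_histogram            # help page for a single function
>     help(package = "ggplot2")  # an index of everything in the library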


These exercises use built-in data sets (like mtcars) so that everyone starts from the same place and it is possible for me to provide solutions. I encourage you to try similar plots on data pulled in from Google Analytics.

My guess is that you will find these exercises harder than the others because you have not been introduced to all the functions you will need to complete them.

  1. Using the mtcars data draw a boxplot showing the amount of horsepower for engines with differing numbers of cylinders.
    >     data(mtcars)
    >     ggplot(mtcars,aes(x=as.factor(cyl),y=hp)) + geom_boxplot()
  2. Add a point representing each car to the chart created in the previous task.
    >     box <- ggplot(mtcars,aes(x=as.factor(cyl),y=hp)) + geom_boxplot()
    >     # box + geom_point() will work, but geom_jitter is better
    >     box + geom_jitter(color="blue", alpha=0.4, size=5)
  3. Using the mtcars data set make a scatter plot (geom_point()) of the power to weight ratio against the quarter mile time.
    >     ggplot(mtcars,aes(x=hp/wt,y=qsec)) + geom_point()
  4. Change your plot from the last task into a faceted plot, faceted on the number of cylinders.
    >     ggplot(mtcars,aes(x=hp/wt,y=qsec, color=as.factor(cyl))) +
    +        geom_point(size=3, alpha=0.8) +
    +        facet_grid(. ~ cyl) +
    +        guides(color=FALSE)


R has good library support for many methods of forecasting. I suspect that it has the best and easiest support out of all the languages you might use for this.

The forecast package is the one we will use in this tutorial.

>     install.packages("forecast")

The examples in this section use the "sessions" data frame containing daily Google Analytics sessions generated earlier.

>     head(sessions)
##         date sessions
## 1 2015-09-22      169
## 2 2015-09-23      172
## 3 2015-09-24      162
## 4 2015-09-25      155
## 5 2015-09-26      103
## 6 2015-09-27      122

To start using the forecast package we must convert this data frame into a time series. For the purpose of this example we are interested in forecasts that take into account weekly seasonality (by which I mean the common phenomenon that some days of the week are normally better than others).

>     library(forecast)
>     # use 7 for the frequency because there are 7 observations
>     # per week.
>     sessionsts <- ts(sessions$sessions, frequency=7)

Once you have created a time series you can do interesting things very easily.

>     comp <- decompose(sessionsts)
>     # not interested in fancy ggplot here
>     # just do something simple
>     # the meaning of the resulting chart is explained below
>     plot(comp)

Above you see a collection of four charts.

  1. A plot of the raw data - nothing fancy here
  2. A plot of the underlying trend - this has the weekly seasonality and any outliers removed. It is way easier to see what is going on.
  3. The weekly seasonality
  4. The random bits of the data that don't fit anywhere else

This analysis assumes that the results you see are the sum of the underlying trend, the weekly seasonal factors and some randomness. These assumptions fit a lot of the metrics you see in web analytics.
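decompose uses this additive model by default. If your seasonal swings grow as the trend grows (common on fast-growing sites), a multiplicative decomposition may fit better; it is just an extra argument:

>     compmult <- decompose(sessionsts, type="multiplicative")
>     plot(compmult)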

A nice and simple forecasting model is Holt-Winters [by simple, I mean simple to use]. I don't want to get too deep into the maths here, but if forecasting is very important to you then you should have a look for yourself.

Building the forecast model is very easy:

>     forecastmodel <- HoltWinters(sessionsts)
>     plot(forecastmodel)