E-Analytica

Posted on February 6, 2013

One of the most popular posts on here is about analysing a social network to find its most influential community members: "The PageRank of #PPCChat participants". The code for that project was written in Python, but I've recently written a new version in R. R is a worse language in my opinion, but the number and variety of libraries is amazing, so it is worth learning.

Here is the R source code:

Start by loading some libraries: stringr for regex support and igraph to do the heavy lifting.

Also specify the location where the TSV file of tweet data is stored.

#+BEGIN_SRC R
library(stringr)
library(igraph)

location <- "/tmp/data/ppcchat.tsv"
#+END_SRC

Load the TSV file. stringsAsFactors=FALSE forces the tweets to be loaded as strings rather than as factors. A factor is R's data type for categorical (discrete) values.

Initialise an empty list of edges.

#+BEGIN_SRC R
# Read from the path defined above; the file has no header row
raw <- read.csv(location, header=FALSE, sep="\t", stringsAsFactors=FALSE)
edges <- c()
#+END_SRC
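To see what stringsAsFactors changes, here is a minimal sketch using two made-up rows passed in through the `text` argument instead of a file. The sample data and variable names are hypothetical and only there for illustration.

```r
# Two invented rows in a tab-separated shape (not real #PPCChat data)
sample_tsv <- "1\talice\thello @bob\n2\tbob\thi @alice"

as_strings <- read.csv(text = sample_tsv, header = FALSE, sep = "\t",
                       stringsAsFactors = FALSE)
as_factors <- read.csv(text = sample_tsv, header = FALSE, sep = "\t",
                       stringsAsFactors = TRUE)

class(as_strings$V3)  # character: safe to pass to regex functions
class(as_factors$V3)  # factor: stored as integer codes plus levels
```

Since R 4.0.0 the default is already stringsAsFactors=FALSE, but spelling it out keeps the script behaving the same on older installations.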

Now iterate through the list of tweets and build a list of edge pairs. raw$V3 contains the text of all the tweets; raw$V10 contains the username of the person tweeting.

#+BEGIN_SRC R
for (i in seq_along(raw$V3)) {
  # Extract the usernames mentioned in the tweet
  mentions <- unlist(str_extract_all(tolower(raw$V3[i]), "@[a-z0-9_]{2,15}"))
  for (j in seq_along(mentions)) {
    # Needed for when the parser borks
    if (raw$V10[i] != "" && substring(mentions[j], 2) != "") {
      # Edge from the tweeter to the mentioned user (leading "@" dropped)
      edges <- c(edges, c(tolower(raw$V10[i]), substring(mentions[j], 2)))
    }
  }
}
#+END_SRC
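The mention extraction can be tried on a single tweet in isolation. This is a small sketch with an invented tweet; the pattern is the same one used in the loop, and the variable names are just for illustration.

```r
library(stringr)

# Invented example tweet, not taken from the real data set
tweet <- "Thanks @Alice_01 and @bob for the #ppcchat tips"

# Lower-case first so the [a-z] character class catches every handle
mentions <- unlist(str_extract_all(tolower(tweet), "@[a-z0-9_]{2,15}"))

# Drop the leading "@" to get bare usernames
usernames <- substring(mentions, 2)
usernames
```

Note that the hashtag is ignored because the pattern only matches text starting with "@".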

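Turn this into a two-column edge matrix (one row per tweeter/mention pair) and create the graph.

#+BEGIN_SRC R
# edges is a flat vector of from/to pairs; reshape it to one pair per row
edgematrix <- t(matrix(edges, nrow=2))
# Newer igraph releases call this graph_from_edgelist()
g <- graph.edgelist(edgematrix)
#+END_SRC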
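The reshaping step is easy to misread, so here is a minimal base R sketch with three invented edges showing what the transpose does. Filling a two-row matrix column by column and then transposing is equivalent to filling a two-column matrix row by row.

```r
# Flat vector of from/to pairs, as built by the loop (made-up names)
edges <- c("alice", "bob", "alice", "carol", "bob", "alice")

# Fill by column into 2 rows, then transpose: one edge per row
edgematrix <- t(matrix(edges, nrow = 2))

# The same result, written as a row-wise fill
same_thing <- matrix(edges, ncol = 2, byrow = TRUE)

edgematrix
```

Each row now reads as "tweeter, mentioned user", which is the shape an edge-list constructor expects.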

I have found that you get far better results from this kind of thing when loops are removed from the graph. This means ignoring tweets where a person mentions themselves.

#+BEGIN_SRC R
# Zero the diagonal of the adjacency matrix to drop self-mentions
for (i in seq_len(vcount(g))) {
  g[i, i] <- 0
}
#+END_SRC
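As an alternative to the explicit loop, igraph's simplify() can strip self-loops in one call. This is a sketch of a swapped-in technique, not the approach used in this post, and it assumes a reasonably recent igraph where graph_from_edgelist() and which_loop() are available. The toy edges are invented.

```r
library(igraph)

# Toy edge list with one self-mention (bob -> bob)
el <- matrix(c("alice", "bob",
               "bob",   "bob",
               "bob",   "alice"), ncol = 2, byrow = TRUE)
g_toy <- graph_from_edgelist(el)

# Remove self-loops but keep repeated mentions between the same pair
g_clean <- simplify(g_toy, remove.multiple = FALSE, remove.loops = TRUE)

ecount(g_toy)    # edges before
ecount(g_clean)  # edges after the self-loop is dropped
```

Setting remove.multiple = FALSE matters here: the default would also collapse repeated mentions into a single edge, which changes the graph more than the loop-removal step above does.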

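Finally, calculate PageRank and return the top ten.

#+BEGIN_SRC R
# Newer igraph releases call this page_rank()
pr <- page.rank(g, directed=TRUE)
# head() avoids NA padding if the graph has fewer than ten users
topten <- head(sort(pr$vector, decreasing=TRUE), 10)
#+END_SRC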
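To get a feel for what the scores mean, here is a small sketch on an invented three-person graph. It assumes a recent igraph, where the functions are named graph_from_edgelist() and page_rank(); the names and edges are made up.

```r
library(igraph)

# alice and bob both mention carol; carol mentions alice; nobody mentions bob
el <- matrix(c("alice", "carol",
               "bob",   "carol",
               "carol", "alice"), ncol = 2, byrow = TRUE)
g_toy <- graph_from_edgelist(el)

pr_toy <- page_rank(g_toy, directed = TRUE)

# Scores are named by username and sum to 1
ranking <- sort(pr_toy$vector, decreasing = TRUE)
names(ranking)
```

carol comes out on top because two people mention her, alice follows because the top-ranked user mentions her, and bob is last because nobody mentions him.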