Twitter Network Analysis in R

Posted on February 6, 2013

One of the most popular posts on here is about analysing a social network to find its most influential community members: The PageRank of #PPCChat participants. The code for that project was written in Python, but I’ve recently written a new version in R. R is a worse language in my opinion, but the number and variety of its libraries is amazing, so it is worth learning.

Here is the R source code:

Start by loading some libraries: stringr for regex support and igraph to do the heavy lifting.

Also specify the location where the TSV file of tweet data is stored.

#+BEGIN_SRC R
library(stringr)
library(igraph)

location <- "/tmp/data/ppcchat.tsv"
#+END_SRC

Load the TSV file. ‘stringsAsFactors=FALSE’ forces the tweets to be loaded as strings rather than as factors. Factor is an R data type for categorical (discrete) values, which is the wrong representation for free text.

Initialise an empty vector of edges.

#+BEGIN_SRC R
# Read the tab-separated tweet data from the path defined above
raw <- read.csv(location, header=FALSE, sep="\t", stringsAsFactors=FALSE)
edges <- c()
#+END_SRC
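To see what ‘stringsAsFactors’ changes, here is a minimal sketch that reads the same two-row table both ways. The inline data and the names `tsv`, `as_factor` and `as_string` are made up for illustration. Note that since R 4.0.0 the default is already FALSE, so the argument mainly matters on older installations.

```r
# Two made-up rows: id, tweet text, author (tab-separated)
tsv <- "1\thello @bob\talice\n2\thi @alice\tbob"

# Same data, read once with factors and once with plain strings
as_factor <- read.csv(text = tsv, header = FALSE, sep = "\t", stringsAsFactors = TRUE)
as_string <- read.csv(text = tsv, header = FALSE, sep = "\t", stringsAsFactors = FALSE)

print(class(as_factor$V2))  # text column stored as a factor
print(class(as_string$V2))  # text column stored as character strings
```

String functions such as tolower() and regex matching are meant for character vectors, which is why the tweets are kept as strings here.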

Now iterate through the list of tweets and build a list of edge pairs. raw$V3 contains the text of all the tweets. raw$V10 contains the username of the person tweeting.

#+BEGIN_SRC R
for (i in seq_along(raw$V3)) {
  # Extract the usernames mentioned in the tweet
  mentions <- unlist(str_extract_all(tolower(raw$V3[i]), "@[a-z0-9]{2,15}"))
  for (j in seq_along(mentions)) {
    # Skip empty authors or mentions (needed for when the parser borks)
    if (raw$V10[i] != "" && substring(mentions[j], 2) != "") {
      # Append a (tweeter, mentioned user) pair, dropping the leading "@"
      edges <- c(edges, tolower(raw$V10[i]), substring(mentions[j], 2))
    }
  }
}
#+END_SRC
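The same extraction can also be written without explicit loops, because str_extract_all() accepts a whole vector of tweets and returns one character vector of matches per tweet. This is only a sketch on made-up data: the data frame `tweets` and the names `from`, `to` and `edge_pairs` are hypothetical, and the empty-string checks from the loop version are left out.

```r
library(stringr)

# Made-up tweets using the same column names as the real data
tweets <- data.frame(
  V3  = c("Great chat @Alice and @bob", "no mentions here", "thanks @carol"),
  V10 = c("Dave", "alice", "bob"),
  stringsAsFactors = FALSE
)

# One character vector of "@name" matches per tweet (empty if none)
mentions <- str_extract_all(tolower(tweets$V3), "@[a-z0-9]{2,15}")

# Repeat each author once per mention, then strip the leading "@"
from <- rep(tolower(tweets$V10), lengths(mentions))
to   <- substring(unlist(mentions), 2)

edge_pairs <- cbind(from, to)
print(edge_pairs)
```

Each row of `edge_pairs` is one (tweeter, mentioned user) pair, and the tweet without mentions contributes no rows.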

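Turn this into a two-column edge matrix (one row per edge, tweeter first and mentioned user second) and create the graph.

#+BEGIN_SRC R
# matrix() fills column by column, so build 2 rows and transpose
edgematrix <- t(matrix(edges, nrow=2))
g <- graph.edgelist(edgematrix)
#+END_SRC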
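The transpose step is easier to follow on a tiny example. The sketch below uses a made-up flat vector of three (from, to) pairs and only base R; it also shows that the same matrix can be built directly with byrow=TRUE.

```r
# Flat vector of made-up (from, to) pairs, in the order the loop appends them
edges <- c("dave", "alice", "dave", "bob", "bob", "carol")

# matrix() fills column-wise, so each column holds one pair; t() turns pairs into rows
edgematrix <- t(matrix(edges, nrow = 2))
print(edgematrix)

# Equivalent construction without the transpose
same <- matrix(edges, ncol = 2, byrow = TRUE)
print(identical(edgematrix, same))
```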

I have found that you get far better results from this kind of thing when self-loops are removed from the graph. This means ignoring tweets where a person mentions themselves.

#+BEGIN_SRC R
# Zero the diagonal of the adjacency matrix to drop every self-loop
for (i in seq_len(vcount(g))) {
  g[i,i] <- 0
}
#+END_SRC
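igraph also has a built-in way to drop self-loops: simplify() with remove.loops=TRUE. The sketch below is an alternative to the loop, not the method used in this post, and it runs on a made-up three-edge graph. remove.multiple=FALSE is set on the assumption that repeated mentions between the same two people should stay in the graph.

```r
library(igraph)

# Made-up edge list: alice mentions herself once, then alice and bob mention each other
el <- matrix(c("alice", "alice",
               "alice", "bob",
               "bob",   "alice"), ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(el, directed = TRUE)

# Remove self-loops only; keep any repeated edges
g_clean <- simplify(g, remove.multiple = FALSE, remove.loops = TRUE)

print(ecount(g))        # edges before
print(ecount(g_clean))  # edges after the self-mention is removed
```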

Finally, calculate PageRank and return the top ten.

#+BEGIN_SRC R
pr <- page.rank(g, directed=TRUE)
# pr$vector is a named vector of scores, one per username
topten <- sort(pr$vector, decreasing=TRUE)[1:10]
#+END_SRC
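As a sanity check, here is a small sketch of the ranking step on a made-up four-person graph in which everyone mentions carol and carol mentions alice. It uses page_rank(), the current igraph name for page.rank(), and head() rather than [1:10], since indexing past the end of a short vector pads the result with NA.

```r
library(igraph)

# Made-up mentions: three people mention carol, carol mentions alice
el <- matrix(c("alice", "carol",
               "bob",   "carol",
               "dave",  "carol",
               "carol", "alice"), ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(el, directed = TRUE)

pr <- page_rank(g, directed = TRUE)

# Named scores, highest first; head() copes with fewer than ten users
ranked <- sort(pr$vector, decreasing = TRUE)
print(head(ranked, 10))
```

carol comes out on top because she receives the most mentions, and alice comes second because her single mention comes from the highest-ranked user.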