One of the most popular posts on here analyses a social network to find its most influential community members: "The PageRank of #PPCChat participants". The code for that project was written in Python, but I’ve recently written a new version in R. R is a worse language in my opinion, but the number and variety of its libraries is amazing, so it is worth learning.
Here is the R source code:
Start by loading some libraries: stringr for regex support and igraph to do the heavy lifting.
Also specify the location where the TSV file of tweet data is stored.
#+BEGIN_SRC R
library(stringr)
library(igraph)

location <- "/tmp/data/ppcchat.tsv"
#+END_SRC
Load the TSV file. ‘stringsAsFactors=FALSE’ forces the tweets to be loaded as strings rather than as factors. A factor is R's data type for categorical values.
Initialise an empty vector of edges.
#+BEGIN_SRC R
# Read from the path defined above; the file has no header row
raw <- read.csv(location, header=FALSE, sep='\t', stringsAsFactors=FALSE)
edges <- c()
#+END_SRC
Now iterate through the list of tweets and build a list of edge pairs. raw$V3 contains the text of all the tweets. raw$V10 contains the username of the person tweeting.
#+BEGIN_SRC R
for (i in seq_along(raw$V3)) {
  # Extract the mentioned usernames from the tweet text
  # (Twitter handles may contain letters, digits and underscores)
  mentions <- unlist(str_extract_all(tolower(raw$V3[i]), "@[a-z0-9_]{2,15}"))
  for (m in mentions) {
    # Skip blank fields: needed for when the parser borks
    if (raw$V10[i] != "" && substring(m, 2) != "") {
      # Append one (author, mentioned user) pair, dropping the leading "@"
      edges <- c(edges, tolower(raw$V10[i]), substring(m, 2))
    }
  }
}
#+END_SRC
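Growing a vector with c() inside a loop gets slow on large files, so here is a rough vectorised sketch of the same extraction step. The sample tweets and author names are made up for illustration, and the pattern assumes handles may contain underscores; it is one possible alternative, not the code used for the original analysis.

```r
library(stringr)

# Hypothetical sample data standing in for raw$V3 (text) and raw$V10 (author)
tweets  <- c("Thanks @Alice and @bob_smith!", "no mentions here", "@carol see this")
authors <- c("Dave", "erin", "alice")

# One character vector of handles per tweet (empty if there are no mentions)
mentions <- str_extract_all(tolower(tweets), "@[a-z0-9_]{2,15}")

# Repeat each author once per mention, then drop the leading "@"
from <- rep(tolower(authors), lengths(mentions))
to   <- substring(unlist(mentions), 2)

edge_df <- data.frame(from = from, to = to, stringsAsFactors = FALSE)
print(edge_df)
```

Tweets with no mentions contribute zero rows, so no explicit length check is needed.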
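Turn this into a two-column edge matrix (one row per mention: tweeter, then mentioned user) and create the graph.
#+BEGIN_SRC R
# edges is a flat vector of pairs, so fill two rows column-wise and transpose
edgematrix <- t(matrix(edges, nrow=2))
g <- graph.edgelist(edgematrix)
#+END_SRC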
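To make the reshaping step concrete, here is a small sketch on a made-up edge vector. The usernames are hypothetical, and it uses graph_from_edgelist, the newer snake_case name that current igraph releases prefer over graph.edgelist.

```r
library(igraph)

# Hypothetical flat vector of (from, to) pairs
toy_edges <- c("dave", "alice",
               "dave", "bob",
               "alice", "carol")

# byrow = TRUE gives the same layout as t(matrix(toy_edges, nrow = 2))
toy_matrix <- matrix(toy_edges, ncol = 2, byrow = TRUE)

# Character matrices produce a named, directed graph by default
toy_graph <- graph_from_edgelist(toy_matrix)

cat("vertices:", vcount(toy_graph), "edges:", ecount(toy_graph), "\n")
```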
I have found that you get far better results from this kind of thing when self-loops are removed from the graph. This means ignoring tweets where a person mentions themselves.
#+BEGIN_SRC R
# Zero the diagonal of the adjacency matrix to delete every self-loop
for (i in seq_len(vcount(g))) {
  g[i,i] = 0
}
#+END_SRC
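As an alternative sketch, igraph's simplify() can drop self-loops in one call. The toy graph below is made up for illustration; remove.multiple = FALSE is set on the assumption that repeated mentions between the same two people should be kept.

```r
library(igraph)

# Hypothetical graph: a->b twice, b->a once, and one self-loop b->b
toy <- graph_from_edgelist(
  matrix(c("a", "b",
           "b", "b",
           "b", "a",
           "a", "b"), ncol = 2, byrow = TRUE)
)

# Remove self-loops only; keep parallel edges
no_loops <- simplify(toy, remove.multiple = FALSE, remove.loops = TRUE)

cat("edges before:", ecount(toy), "after:", ecount(no_loops), "\n")
```

Leaving remove.multiple at its default of TRUE would also merge repeated mentions into a single edge, which changes the PageRank input.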
Finally, calculate PageRank and return the top ten.
#+BEGIN_SRC R
pr <- page.rank(g, directed=TRUE)
# head() avoids NA padding if the graph has fewer than ten vertices
topten <- head(sort(pr$vector, decreasing=TRUE), 10)
#+END_SRC
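For a feel of what the scores look like, here is a minimal sketch on a made-up four-person graph. The usernames are hypothetical, and it uses page_rank, the newer name for page.rank in current igraph releases.

```r
library(igraph)

# Hypothetical mentions: everyone mentions alice, and alice mentions bob
toy <- graph_from_edgelist(
  matrix(c("dave",  "alice",
           "bob",   "alice",
           "carol", "alice",
           "alice", "bob"), ncol = 2, byrow = TRUE)
)

pr_toy <- page_rank(toy, directed = TRUE)

# Named numeric vector of scores, highest first
ranked <- sort(pr_toy$vector, decreasing = TRUE)
print(head(ranked, 10))
```

The scores form a probability distribution over the vertices, so they sum to one. Note that bob ranks well here despite a single mention, because that mention comes from the highest-scoring account.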