---
title: #TheDigitals Network Analysis in R
tags: twitter, R
---

My last post was about some simple techniques for analysing a Twitter network in R. I've been doing some more work on this with tweets from the #theDigitals hashtag. The hashtag is used by entrants in a competition to be an assistant judge at an awards show. You can see the current leaderboard here.

Two people, Barry Adams and EdwinBongo, have been removed from the leaderboard for having too many mentions from low-value accounts. The way both accounts no longer feature at all makes me think this was a manual action rather than automated filtering of what is and isn't important.

Let's see what we can discover:

library(stringr)
library(igraph)

location = "/tmp/data/thedigitals.tsv"

tweets <- read.csv(location, header=FALSE, sep='\t',stringsAsFactors=FALSE)

#key columns in the tweets table are:
# tweets$V3 - contains the tweet text
# tweets$V10 - contains the username of the tweeter

#edges is an accumulator for all the edges in the graph
edges = c()

for (i in 1:length(tweets$V3)) {
  #for each tweet get a list of accounts mentioned
  #just extract any string that matches a username regex
  mentions = unlist(str_extract_all(tolower(tweets$V3[i]),"@[a-z0-9_]{2,15}"))
  if (length(mentions)!=0) {
    for (j in 1:length(mentions)) {
      if(tweets$V10[i]!="" && substring(mentions[j],2)!="") { #needed for when parser borks
        #add the tweeter and the mentionee to the edge list
        edges=c(edges,c(tolower(tweets$V10[i]),substring(mentions[j],2)))
      }
    }
  }
}

#turn the edgelist into a matrix
edgematrix <- t(matrix(edges,nrow=2))
#turn the matrix into a graph
g <- graph.edgelist(edgematrix)

#remove self links
for (i in 1:length(g[,1])){
  g[i,i] = 0
}

#calculate the page rank
pr<-page.rank(g,directed=TRUE)

#to run the community detection we need a simple, undirected graph
g.simple = simplify(as.undirected(g))

#community detection
fg <- fastgreedy.community(g.simple)

#graph layout: can take a while
l <- layout.fruchterman.reingold(g.simple)

#remove labels
#without this all you see on the plot are labels
V(g.simple)$label <- NA

#plot the graph
plot(g.simple, vertex.color=fg$membership+1,vertex.frame.color=fg$membership+1,
     vertex.size=100*pr$vector, vertex.label.dist=0.1, layout=l)

This code is pretty much identical to the code in my last post, except that it runs a community detection algorithm called "fastgreedy". The first answer to this Stack Overflow question has a good explanation of how the algorithm works and what the other options are. I picked it because it is relatively quick and the results are good enough for what I am trying to show here.
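If you want to sanity-check that choice, one option is to compare modularity scores against another algorithm. A quick sketch using walktrap (this is just an illustration, not something the plot below depends on):

#sanity check: compare modularity against another community detection algorithm
wt <- walktrap.community(g.simple)
modularity(fg)
modularity(wt)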

When the graph is plotted, it looks something like this:

![The #theDigitals mention network, coloured by community](thedigitalsnetwork.png)

I have manually labelled some of the nodes.

Some quick notes on the plot:

  1. The nodes are sized proportional to their pagerank
  2. The bot networks used by BarryAdams and EdwinBongo are easy to pick out because they form distinct communities (a quick way to check a node's membership is sketched after this list)
  3. The separation isn't quite perfect, unless some of the bots have also been tweeting other competitors
  4. BarryAdams and EdwinBongo have a bit of overlap in their respective bot armies.
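For the curious, checking which community a given account landed in is a one-liner (the account name here is purely illustrative):

#look up the community assignment for one account (illustrative name)
fg$membership[which(V(g.simple)$name == "edwinbongo")]
#community sizes, for spotting the big clusters
table(fg$membership)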

How can we come up with a better way of ranking this network?

Pagerank obviously doesn't work here. Closed-off communities like the one containing agencyquotes operate as linkwheels, and the large number of spam bots inflates the pagerank of some accounts.
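A toy example shows the problem: a clutch of throwaway accounts all pointing at one node hands that node a big pagerank for free (graph.star here is just for illustration):

#toy example: ten "bot" nodes all mentioning node 1
star <- graph.star(11, mode="in")
round(page.rank(star, directed=TRUE)$vector, 3)
#node 1 hoovers up most of the pagerank without doing anything itself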

The main problem with using pagerank is that it gives the bots too much power. My first idea for fixing this was to give only a whitelist of accounts an initial pagerank value, then run the algorithm to see how this spreads through the network (this is basically how Google's TrustRank works). The problem with this approach is that the whitelist is a source of bias: whoever is on the whitelist has a big advantage over everyone who isn't.
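For what it's worth, recent versions of igraph's page.rank take a personalized argument that could express this idea. A minimal sketch, assuming a hypothetical whitelist of account names:

#hypothetical whitelist of trusted accounts
whitelist <- c("some_trusted_account", "another_trusted_account")
#restart probability concentrated on whitelisted accounts
#(only works if at least one whitelisted name is actually in the graph)
reset <- as.numeric(V(g)$name %in% whitelist)
pr.trust <- page.rank(g, directed=TRUE, personalized=reset)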

My next idea is to only allow nodes that have been mentioned by someone else to be part of the graph. In the data, the spam bots don't mention each other, so filtering this way removes their power whilst not raising the bar too high for legitimate users who want to win the competition; anyone who is going to win needs to get mentioned by someone else anyway.

#record each node's in-degree (the number of times it was mentioned)
V(g)$indegree <- degree(g, v=V(g), mode="in")
#drop everyone who was never mentioned by someone else
g.nobots <- delete.vertices(g, V(g)[indegree == 0])
#remove labels so the plot isn't a wall of usernames
V(g.nobots)$label <- NA
pr.nobots <- page.rank(g.nobots, directed=TRUE)
plot(g.nobots, layout=layout.fruchterman.reingold, vertex.size=100*pr.nobots$vector)

This gives something like this:

![The network with never-mentioned nodes removed](thedigitalsnobots.png)

Which I think is much more reasonable. And who are the top ten? Here goes, with each account's current leaderboard ranking in parentheses (a sketch of how to pull this ranking out of R follows the list):

  1. Econsultancy (not in top 100)
  2. Dan Barker (1)
  3. Debs Jowett (27)
  4. Gareth Westhead (not in top 100)
  5. Rene Power (52)
  6. Pete Handley (13)
  7. EdwinBongo (he does deserve to be here!)
  8. Graham Charlton (8)
  9. Craig Sullivan (21)
  10. Will Scott (2)
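
Something like the following recovers that ranking from the filtered graph (a sketch; not necessarily exactly how I produced the table above):

#top ten accounts by pagerank on the filtered graph
head(sort(pr.nobots$vector, decreasing=TRUE), 10)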
Authored by Richard Fergie