Miskatonic University Press

Seriation and the kayiwa-yo_bj vortex

code4lib r

After my last post about #c4l13 tweets I got a couple of tweets from Hadley Wickham pointing me at the R package seriation: it "will make it much easier to see clusters," he said. If Hadley Wickham recommended it I had to check it out.

I installed it in R:

> install.packages("seriation")

Then I read the documentation. Like a lot of R documentation, it looks rather forbidding and cryptic at the start (to me, at least), but all R documentation includes examples, and once you find the right thing to copy and paste and then fiddle with, you're on your way.

The seriate function is explained thusly: "Tries to find an linear order for objects using data in form of a dissimilarity matrix (two-way one mode data), a data matrix (two-way two-mode data) or a data array (k-way k-mode data)." Hadley mentioned clusters ... this says dissimilarity ... hmm.
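
To get a feel for what a "linear order" over a dissimilarity matrix buys you, here's a rough base R sketch. It makes up two clumps of points, shuffles them, and then reorders the distance matrix. The reordering uses the leaf order from hclust as a stand-in, which is an assumption for illustration and not what seriate does internally. The names pts, ord and m are made up too.

```r
# Toy data: two well-separated groups of points, shuffled so the groups are hidden.
set.seed(1)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))
pts <- pts[sample(nrow(pts)), ]

# dist() returns pairwise dissimilarities (Euclidean distance by default).
d <- dist(pts)

# Stand-in for seriate(): take the leaf order of a hierarchical clustering.
# This is NOT the seriation package's algorithm; it only illustrates
# what "a linear order for objects" means.
ord <- hclust(d)$order

# The same distance matrix, before and after reordering rows and columns.
m <- as.matrix(d)
op <- par(mfrow = c(1, 2))
image(m, main = "Shuffled", axes = FALSE)
image(m[ord, ord], main = "Reordered", axes = FALSE)
par(op)
```

In the right-hand panel the two clumps should show up as two blocks along the diagonal, which is the kind of structure a good ordering makes visible.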

I tried the example code, which uses the iris data set that's built into R.

> library(seriation)
> data("iris")
> x <- as.matrix(iris[-5])
> x <- x[sample(1:nrow(x)),]
> d <- dist(x)
> order <- seriate(d)
> order
object of class ‘ser_permutation’, ‘list’
contains permutation vectors for 1-mode data

  vector length seriation method
1           150             ARSA
> def.par <- par(no.readonly = TRUE)
> layout(cbind(1,2), respect = TRUE)
> pimage(d, main = "Random")

Unseriated iris data

> pimage(d, order, main = "Reordered")
> par(def.par)

Seriated iris data

Aha! Something interesting is going on there.

One of the other parts of the documentation is a Townships data set: "This data set was used to illustrate that the conciseness of presentation can be improved by seriating the rows and columns." Let's try the example code, just copying and pasting:

> data(Townships)
> pimage(Townships)

Unseriated townships data

Ho hum.

> order <- seriate(Townships, method = "BEA", control = list(rep = 5))
> pimage(Townships, order)

Seriated townships data

Hot diggity! If I could use this package to improve the stuff I did last time, fantastic.

I'm going to skip over two or three hours of fiddling with things and not understanding what was going on. Crucial to all of what comes next, especially in getting it to work with the ggplot2 package, is this gist of an example of seriation, which I found in a Google search. As often happens, my first attempt to get things working failed, so I created a very minimal example and went through the steps many times until it worked. I'm not sure what the problem was any more (probably something to do with not referring to a column of data properly), but once it worked, it was easy to apply the steps to the full data set.

That full data set is online, so you can copy and paste what follows and it should work. If you haven't installed the ggplot2, plyr and reshape2 packages (all created by Hadley Wickham, apparently an inexhaustible superhuman repository of Rness), you'll need to. Skip the first three lines if you have, but either way you'll need to load them in.

> install.packages("ggplot2")
> install.packages("plyr")
> install.packages("reshape2")
> library(ggplot2)
> library(plyr)
> library(reshape2)
> mentions.csv <- read.csv("http://www.miskatonic.org/files/20130223-tweets-mentioned.csv")
> head(mentions.csv)
        tweeter       mentioned
1   anarchivist mariatsciarrino
2   anarchivist        eosadler
3 tararobertson         ronallo
4     saverkamp        benwbrum
5     saverkamp mathieu_boteler
6     saverkamp          kayiwa
> mentions <- count(mentions.csv, c("tweeter", "mentioned"))
> head(mentions)
         tweeter     mentioned freq
1     3windmills         yo_bj    1
2    aaroncollie        kayiwa    1
3 aaronisbrewing tararobertson    1
4     abedejesus tararobertson    1
5       abugseye  bretdavidson    1
6       abugseye     cazzerson    1
> nrow(mentions)
[1] 2201

This mentions data frame is what we'll be working with. It's 2,201 lines long and tells how many times anyone using the #c4l13 hashtag mentioned anyone else.
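
If you want to see what that counting step does without plyr, here's a small base R sketch on made-up data. The toy data frame and its names (alice, bob, carol) are hypothetical, and table() is swapped in for count(), so treat it as an approximation of the same idea.

```r
# Hypothetical toy data in the same shape as mentions.csv: one row per mention.
toy <- data.frame(
  tweeter   = c("alice", "alice", "bob",   "alice", "carol"),
  mentioned = c("bob",   "bob",   "carol", "carol", "alice"),
  stringsAsFactors = FALSE
)

# table() cross-tabulates every tweeter/mentioned pair; as.data.frame() turns
# the result back into long form, including pairs that never occurred.
toy.counts <- as.data.frame(table(toy), responseName = "freq",
                            stringsAsFactors = FALSE)

# Drop the zero-count pairs so only mentions that actually happened remain.
toy.counts <- toy.counts[toy.counts$freq > 0, ]
toy.counts
```

Each remaining row is one tweeter/mentioned pair with the number of times it happened, which is the shape the mentions data frame has.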

> ggplot(mentions, aes(x=tweeter, y=mentioned)) +
+   geom_tile(aes(fill=freq)) +
+   theme(axis.text = element_text(size=4), axis.text.x = element_text(angle=90)) +
+   xlab("Who mentioned someone") + ylab("Who was mentioned") +
+   labs(title="People who mentioned other people (using the #c4l13 hashtag)")

Who mentioned whom

All right, folks. Let's seriate.

> mentions.cast <- acast(mentions, tweeter ~ mentioned, fill = 0)
Using freq as value column: use value.var to override.
> mentions.seriation <- seriate(mentions.cast, method="BEA_TSP")
> mentions$tweeter <- factor(mentions$tweeter, levels = names(unlist(mentions.seriation[[1]][])))
> mentions$mentioned <- factor(mentions$mentioned, levels = names(unlist(mentions.seriation[[2]][])))
> ggplot(mentions, aes(x=tweeter, y=mentioned)) +
+   geom_tile(aes(fill=freq)) +
+   theme(axis.text = element_text(size=4), axis.text.x = element_text(angle=90)) +
+   xlab("Who mentioned someone") + ylab("Who was mentioned") +
+   labs(title="People who mentioned other people (using the #c4l13 hashtag)")

Who mentioned whom, seriated

WHOA!

The acast command turns the three-column mentions data frame into a 433x339 matrix, with tweeter names as rows and mentioned names as columns. The value of the matrix at each cell is how many times the mentioned person was mentioned by the tweeter. We know that 3windmills mentioned yo_bj once, so there's a 1 at that spot in the matrix. We need this matrix to feed into the seriate command, which creates a special object:
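
Here's a rough base R sketch of that long-to-wide step on made-up data, in case the acast call feels like magic. It builds the matrix by hand instead of calling acast, and the toy names are hypothetical.

```r
# Hypothetical counts in the same three-column shape as the mentions data frame.
toy.counts <- data.frame(
  tweeter   = c("alice", "alice", "bob",   "carol"),
  mentioned = c("bob",   "carol", "carol", "alice"),
  freq      = c(2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# One row per tweeter, one column per mentioned person, filled with zeros.
rows <- sort(unique(toy.counts$tweeter))
cols <- sort(unique(toy.counts$mentioned))
toy.cast <- matrix(0, nrow = length(rows), ncol = length(cols),
                   dimnames = list(rows, cols))

# A two-column character matrix indexes cells by (row name, column name),
# so each freq lands in the cell for its tweeter/mentioned pair.
toy.cast[cbind(toy.counts$tweeter, toy.counts$mentioned)] <- toy.counts$freq
toy.cast
```

Pairs that never occurred stay at 0, which is what fill = 0 asks acast to do.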

> mentions.seriation
object of class ‘ser_permutation’, ‘list’
contains permutation vectors for 2-mode data

  vector length seriation method
1           433          BEA_TSP
2           339          BEA_TSP

We can poke into it a bit by working through the commands used above:

> head(mentions.seriation[[1]][])
       jaleh_f  NowWithMoreMe  WNYLIBRARYGUY andrewinthelib         p3wp3w
           186            301            422             23            306
     Hypsibius
           175
> head(unlist(mentions.seriation[[1]][]))
       jaleh_f  NowWithMoreMe  WNYLIBRARYGUY andrewinthelib         p3wp3w
           186            301            422             23            306
     Hypsibius
           175
> head(names(unlist(mentions.seriation[[1]][])))
[1] "jaleh_f"        "NowWithMoreMe"  "WNYLIBRARYGUY"  "andrewinthelib"
[5] "p3wp3w"         "Hypsibius"

This is a list of names ordered in a way that seriate has determined will make an optimal ordering. We force our data frame to use this ordering, and then we get the nicer chart.
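
The trick of forcing an ordering through factor levels is easy to miss, so here's a minimal sketch with made-up names. The new.order vector is a hypothetical stand-in for the names pulled out of the seriation object.

```r
# Hypothetical toy data.
toy <- data.frame(
  tweeter = c("alice", "bob", "carol"),
  freq    = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# By default a factor's levels are sorted alphabetically.
levels(factor(toy$tweeter))

# Hypothetical ordering, standing in for the names taken from a seriation result.
new.order <- c("carol", "alice", "bob")

# Setting the levels explicitly changes the order that plots and tables use,
# without changing any of the values in the column.
toy$tweeter <- factor(toy$tweeter, levels = new.order)
levels(toy$tweeter)
as.character(toy$tweeter)
```

The values in the column stay where they are; only the order of the levels changes, and that order is what decides where each name goes on a discrete axis.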

All right, that's pretty fine, but that's a lot of people and a lot of stuff going on. What if we dig into it a bit and simplify things? What if we pick a user and analyze the nexus of tweeting action around them? Let's start with a boring example: me.

> wdenton.tweets <- subset(mentions, (tweeter == "wdenton" | mentioned == "wdenton"))
> wdenton.nexus <- subset(mentions, tweeter %in% unique(c(as.vector(wdenton.tweets$tweeter), as.vector(wdenton.tweets$mentioned))))
> wdenton.cast <- acast(wdenton.nexus, tweeter ~ mentioned, fill = 0)
Using freq as value column: use value.var to override.
> wdenton.seriation <- seriate(wdenton.cast, method="BEA_TSP")
> wdenton.nexus$tweeter <- factor(wdenton.nexus$tweeter, levels = names(unlist(wdenton.seriation[[1]][])))
> wdenton.nexus$mentioned <- factor(wdenton.nexus$mentioned, levels = names(unlist(wdenton.seriation[[2]][])))
> ggplot(wdenton.nexus, aes(x=tweeter, y=mentioned)) +
+   geom_tile(aes(fill=freq)) +
+   theme(axis.text.y = element_text(size=3), axis.text.x = element_text(size=10, angle=90)) +
+   xlab("Who did the mentioning") + ylab("Who was mentioned") +
+   labs(title="The wdenton nexus of #c4l13 tweeting")

The wdenton #c4l13 tweet nexus

Blowing my mind!

What's going on here? First we pick out every instance where I either mention someone or someone mentions me.

> head(wdenton.tweets)
        tweeter mentioned freq
131         arg   wdenton    1
312   calvinmah   wdenton    1
812    ian_chan   wdenton    1
1099 lljohnston   wdenton    1
1466 nelltaylor   wdenton    1
1469   nihiliad   wdenton    1

Then we expand that to a tweet nexus (this is like superduping) by saying: take everyone I mentioned or who mentioned me (and me), and find everyone they mentioned.

> head(wdenton.nexus)
      tweeter    mentioned freq
128       arg      sabarya    1
129       arg     save4use    1
130       arg       tmasao    1
131       arg      wdenton    1
310 calvinmah       kayiwa    2
311 calvinmah PuckNorris19    1

Then we just do the same as we did before to seriate it and make a chart.
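
The two subsetting steps are easier to follow on a tiny made-up example. Everything here (the names, the counts) is hypothetical; only the shape of the two subset calls follows the commands above.

```r
# Hypothetical counts: who mentioned whom, and how often.
toy.counts <- data.frame(
  tweeter   = c("alice", "bob",   "bob",  "carol", "dave", "erin"),
  mentioned = c("bob",   "carol", "dave", "alice", "erin", "dave"),
  freq      = c(1, 2, 1, 1, 3, 1),
  stringsAsFactors = FALSE
)

# Step 1: rows where alice is on either side of a mention.
alice.tweets <- subset(toy.counts, tweeter == "alice" | mentioned == "alice")

# Step 2: everyone appearing in those rows (alice included) ...
alice.people <- unique(c(alice.tweets$tweeter, alice.tweets$mentioned))

# ... and every row in which one of those people did the mentioning.
alice.nexus <- subset(toy.counts, tweeter %in% alice.people)
alice.nexus
```

Note that dave ends up in the nexus as someone who was mentioned (by bob), but dave's own tweets are left out because dave never interacted with alice directly.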

There's something shared between this chart and the big one, and I call it the kayiwa-yo_bj vortex. yo_bj mentioned a lot of people (because Becky Yoose is a lively tweeter), and kayiwa was mentioned by a lot of people (because Francis Kayiwa was the chief conference organizer) and Becky mentioned Francis, so a common set of people emerges.
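
One way to check a hunch like this without a chart is to count how many different people each person mentioned and was mentioned by. This is a minimal sketch on made-up data, with hypothetical names (yo, kay and so on) standing in for real accounts.

```r
# Hypothetical counts, one row per distinct tweeter/mentioned pair.
toy.counts <- data.frame(
  tweeter   = c("yo",  "yo",  "yo",  "ann", "ben", "cat"),
  mentioned = c("kay", "ann", "ben", "kay", "kay", "ann"),
  freq      = c(2, 1, 1, 1, 3, 1),
  stringsAsFactors = FALSE
)

# Because each row is a distinct pair, counting rows per tweeter gives
# the number of different people that tweeter mentioned.
out.degree <- sort(table(toy.counts$tweeter), decreasing = TRUE)

# Likewise, rows per mentioned name give the number of different
# people who mentioned them.
in.degree <- sort(table(toy.counts$mentioned), decreasing = TRUE)

names(out.degree)[1]  # the busiest mentioner
names(in.degree)[1]   # the most widely mentioned
```

Run against the real mentions data frame, the same two table calls would show who sits at the top of each list.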

Let's see what happens if we look at yo_bj herself.

> yobj.tweets <- subset(mentions, (tweeter == "yo_bj" | mentioned == "yo_bj"))
> yobj.nexus <- subset(mentions, tweeter %in% unique(c(as.vector(yobj.tweets$tweeter), as.vector(yobj.tweets$mentioned))))
> yobj.cast <- acast(yobj.nexus, tweeter ~ mentioned, fill = 0)
Using freq as value column: use value.var to override.
> yobj.seriation <- seriate(yobj.cast, method="BEA_TSP")
> yobj.nexus$tweeter <- factor(yobj.nexus$tweeter, levels = names(unlist(yobj.seriation[[1]][])))
> yobj.nexus$mentioned <- factor(yobj.nexus$mentioned, levels = names(unlist(yobj.seriation[[2]][])))
> ggplot(yobj.nexus, aes(x=tweeter, y=mentioned)) +
+   geom_tile(aes(fill=freq)) +
+   theme(axis.text.y = element_text(size=3), axis.text.x = element_text(size=10, angle=90)) +
+   xlab("Who did the mentioning") + ylab("Who was mentioned") +
+   labs(title="The yo_bj nexus of #c4l13 tweeting")

The yo_bj #c4l13 tweet nexus

Fewer people who did the mentioning, but a lot of people getting mentioned, and roughly the same kind of shape as we saw before.

I did this a few more times on different people, then I realized that I was just repeating myself: I was running the same commands over and over, just starting with a different username. When that happens, you need to automate things. So I made a function.

> chart.nexus <- function (username) {
  # Compare in lower case so mixed-case names in the data (e.g. PuckNorris19) still match
  username <- tolower(username)
  # Every row where the user mentioned someone or was mentioned
  tweets <- subset(mentions, (tolower(tweeter) == username | tolower(mentioned) == username))
  # Every row where someone from that set did the mentioning
  tweets.nexus <- subset(mentions, tweeter %in% unique(c(as.vector(tweets$tweeter), as.vector(tweets$mentioned))))
  # Cast to a matrix, seriate it, and apply the new ordering to both columns
  tweets.cast <- acast(tweets.nexus, tweeter ~ mentioned, fill = 0)
  tweets.seriation <- seriate(tweets.cast, method="BEA_TSP")
  tweets.nexus$tweeter <- factor(tweets.nexus$tweeter, levels = names(unlist(tweets.seriation[[1]][])))
  tweets.nexus$mentioned <- factor(tweets.nexus$mentioned, levels = names(unlist(tweets.seriation[[2]][])))
  ggplot(tweets.nexus, aes(x=tweeter, y=mentioned)) +
    geom_tile(aes(fill=freq)) +
    theme(axis.text.y = element_text(size=3), axis.text.x = element_text(size=10, angle=90)) +
    xlab("Who did the mentioning") + ylab("Who was mentioned") +
    labs(title=paste("The", username, "nexus of #c4l13 tweeting"))
}
> chart.nexus("eosadler")

The eosadler #c4l13 tweet nexus

That's the tweet nexus around Bess Sadler, whose talk Creating a Commons was a highlight of the conference. (More about that and a couple of other talks soon, I hope.) You can see the kayiwa-yo_bj vortex.

Aaron Swartz was missed by us all. You can see the vortex around him, too.

> chart.nexus("aaronsw")

The aaronsw #c4l13 tweet nexus

Ed Summers wasn't there, but he was watching the video stream and chatting in IRC and on Twitter:

> chart.nexus("edsu")

The edsu #c4l13 tweet nexus

yo_bj doesn't show up among the mentioners on the x-axis, which surprised me. The kayiwa part of the vortex is evident, though.

Brewster Kahle was mentioned a few times in discussions relating to the Open Library. No vortex here, but you can see edsu tweeting at him a few times.

> chart.nexus("brewster_kahle")

The brewster_kahle #c4l13 tweet nexus

Having done all this, two things strike me: first that I should try Shiny for this, and second that some sort of graph would probably be a better representation of the connections between people. Still, I learned a lot, the charts are cool, and I coined the ridiculous phrase "the kayiwa-yo_bj vortex."