So we have a nice long file that records the details of 87,464 reference desk interactions since February 2011.
$ wc -l libstats.csv
87464 libstats.csv
$ head -5 libstats.csv
question.type,question.format,time.spent,library.name,location.name,initials,timestamp
4. Strategy-Based,In-person,5-10 minutes,Scott,Drop-in Desk,CC,02/01/2011 09:20:11 AM
4. Strategy-Based,In-person,10-20 minutes,Scott,Drop-in Desk,CC,02/01/2011 09:43:09 AM
4. Strategy-Based,In-person,5-10 minutes,Scott,Drop-in Desk,CC,02/01/2011 10:00:56 AM
3. Skill-Based: Non-Technical,Phone,5-10 minutes,Scott,Drop-in Desk,CC,02/01/2011 10:05:05 AM
Let’s look at it in R. First load the three libraries we’re going to need: lattice for graphics, Hadley Wickham’s plyr for data manipulation, and chron to help us with dates. Then load the CSV file into a data frame called libstats.
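> library(lattice)
> library(plyr)
> library(chron)
> # Since R 4.0.0 read.csv no longer turns strings into factors by default,
> # and levels() further down needs factors, so ask for them explicitly.
> libstats <- read.csv("libstats.csv", stringsAsFactors = TRUE)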
Each line in the file represents a single reference desk interaction. I want to analyze them by the week, so I add a column that says which week each interaction happened in. This seems to be a pretty ugly way of doing it, but it works. The week column gets filled with YYYY-MM-DD dates, each one the Monday of the week in question.
> libstats$week <- as.Date(cut(as.Date(libstats$timestamp, format="%m/%d/%Y %r"), "week", start.on.monday=TRUE))
> head(libstats,4)
                  question.type question.format    time.spent library.name
1             4. Strategy-Based       In-person  5-10 minutes        Scott
2             4. Strategy-Based       In-person 10-20 minutes        Scott
3             4. Strategy-Based       In-person  5-10 minutes        Scott
4 3. Skill-Based: Non-Technical           Phone  5-10 minutes        Scott
  location.name initials              timestamp       week
1  Drop-in Desk       CC 02/01/2011 09:20:11 AM 2011-01-31
2  Drop-in Desk       CC 02/01/2011 09:43:09 AM 2011-01-31
3  Drop-in Desk       CC 02/01/2011 10:00:56 AM 2011-01-31
4  Drop-in Desk       CC 02/01/2011 10:05:05 AM 2011-01-31
> up.to.week <- tail(levels(as.factor(libstats$week)), 1)
up.to.week is the most recent week date, and I’ll use it for labelling charts.
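If the cut one-liner feels opaque, here is a minimal sketch of what it does, run on three timestamps. The first is from the file; the other two are made up for illustration, and the names stamps, days, mondays and alt are mine, not part of the analysis. It also tries one possible alternative that subtracts the weekday number from each date. I have only checked that against these three dates, so treat it as an assumption rather than a drop-in replacement.

```r
# Three sample timestamps in the same format as the CSV file.
stamps <- c("02/01/2011 09:20:11 AM",   # a Tuesday (from the file)
            "02/06/2011 11:00:00 PM",   # a Sunday (made up)
            "02/07/2011 08:00:00 AM")   # a Monday (made up)

# as.Date only needs the date part; trailing characters are ignored.
days <- as.Date(stamps, format = "%m/%d/%Y")

# cut() bins the dates into weeks starting on Monday and labels each
# bin with the date of that Monday; as.Date() turns the labels back
# into real dates.
mondays <- as.Date(cut(days, "week", start.on.monday = TRUE))
print(mondays)

# Possible alternative: %u is the weekday number (Monday = 1), so
# subtracting (weekday - 1) days steps back to the Monday.
alt <- days - (as.integer(format(days, "%u")) - 1)
print(all(mondays == alt))
```

The Tuesday and the Sunday both land on 2011-01-31, and the following Monday starts a new week.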
levels lists the distinct values that a factor can take. The names of the library branches are a great example: there are eight different values for library.name throughout the 87,464 entries, one for each of our libraries plus one for an information desk that doesn’t do research help. (The Osgoode Hall Law School Library doesn’t record its reference statistics in this system, so it’s not here.)
> branches <- levels(libstats$library.name)
> branches
[1] "ASC"               "Bronfman"          "Frost"             "Maps"              "Scott"
[6] "Scott Information" "SMIL"              "Steacie"
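To see what levels is doing without the full data set, here is a small sketch on a hand-made factor. The vector branch and its contents are invented for illustration. It also shows that a plain character vector has no levels at all, which is why the column has to be a factor for the command above to return anything.

```r
# A made-up factor with repeated branch names.
branch <- factor(c("Scott", "Frost", "Scott", "Bronfman", "Scott"))

levels(branch)    # the distinct values, in sorted order
nlevels(branch)   # how many distinct values there are
table(branch)     # how often each value occurs

# The same values as plain strings: levels() has nothing to report,
# so use unique() instead (order of first appearance, not sorted).
plain <- as.character(branch)
is.null(levels(plain))
unique(plain)
```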
Let’s look at the statistics for the Bronfman library. Turns out there are 10,754 encounters recorded there.
> bronfman <- subset(libstats, library.name == "Bronfman")
> nrow(bronfman)
[1] 10754
We have a problem to solve before we can make a chart. Each line in the bronfman data frame records one desk encounter. We want to analyze things by the week. How do we aggregate a week’s worth of data into one number? We’ll use ddply, whose help file defines what it does as: “For each subset of a data frame, apply function then combine results into a data frame.”
A short example will help explain it. Make a data frame about some coloured clothing.
ddply(tmp, .(colour), nrow) means “look at the data frame called tmp, split it into groups by the distinct values in the colour column, and run the function nrow on each group to find out how many rows it has.” Using nrow here is a nice way of counting up how many of something there are, but if you were doing real statistics you might use mean or some other function.
> tmp <- data.frame(colour = c("red", "red", "green", "red", "blue"), item = c("shirt", "socks", "shirt", "socks", "socks"))
> tmp
  colour  item
1    red shirt
2    red socks
3  green shirt
4    red socks
5   blue socks
> ddply(tmp, .(colour), nrow)
  colour V1
1   blue  1
2  green  1
3    red  3
> ddply(tmp, .(item), nrow)
   item V1
1 shirt  2
2 socks  3
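The “split, apply, combine” idea behind ddply can be spelled out by hand in base R. This is only a rough sketch of the concept on the same tmp data frame, not how plyr is implemented; the names pieces, counts and by.colour are mine.

```r
tmp <- data.frame(colour = c("red", "red", "green", "red", "blue"),
                  item = c("shirt", "socks", "shirt", "socks", "socks"))

# Split: one small data frame per distinct colour.
pieces <- split(tmp, tmp$colour)

# Apply: run nrow on each piece to count its rows.
counts <- sapply(pieces, nrow)

# Combine: put the results back into a single data frame.
by.colour <- data.frame(colour = names(counts), V1 = unname(counts))
print(by.colour)
```

Swapping nrow for another function in the apply step is where something like mean would go, provided each piece has a numeric column to feed it.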
Back to our bronfman data frame. For each week I want to know how many questions of each question.type were asked:
> questions <- ddply(bronfman, .(question.type, week), nrow)
> head(questions)
    question.type       week V1
1 1. Non-Resource 2011-01-31 44
2 1. Non-Resource 2011-02-07 58
3 1. Non-Resource 2011-02-14 43
4 1. Non-Resource 2011-02-21 20
5 1. Non-Resource 2011-02-28 49
6 1. Non-Resource 2011-03-07 37
This new data frame has a count of how many type 1 questions were asked each week, then how many type 2 questions, and so on up to how many type 5 questions were asked each week.
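Grouping on two columns at once works the same way as grouping on one: every distinct pair of values gets its own count. Here is a small sketch of that using base R’s aggregate on four made-up rows. The data frame toy and its contents are invented, and aggregate stands in for ddply so the example runs without plyr.

```r
# Four made-up desk encounters across two weeks.
toy <- data.frame(
  question.type = c("1. Non-Resource", "1. Non-Resource",
                    "4. Strategy-Based", "1. Non-Resource"),
  week = as.Date(c("2011-01-31", "2011-01-31",
                   "2011-01-31", "2011-02-07"))
)

# Give every row a 1, then sum the 1s within each
# (question.type, week) pair to count the rows.
counts <- aggregate(V1 ~ question.type + week,
                    data = cbind(toy, V1 = 1), FUN = sum)
print(counts)
```

Only pairs that actually occur show up in the result: there is no row for type 4 in the second week, because aggregate does not fill in zero counts.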
xyplot does the trick for making a chart of this:
> xyplot(V1 ~ as.Date(week) | question.type, data = questions,
+   type = "h",
+   main = "Questions asked at Bronfman",
+   sub = paste("Feb 2011 to", up.to.week),
+   ylab = "Number of questions",
+   xlab = "Week",
+   par.strip.text = list(cex = 0.7)
+ )
Notice how nicely R figured out how to label the x and y axes. Because it knows that the
week column consists of dates, it was able to divide up the x-axis into three-month chunks. Beautiful.
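To try the formula without the real data, here is a minimal sketch on a made-up data frame with the same three columns. The counts in toy are invented purely so there is something to draw. The formula V1 ~ week | question.type reads as “plot V1 against week, with one panel per question.type.”

```r
library(lattice)

# Invented weekly counts for two question types over four weeks.
toy <- data.frame(
  question.type = rep(c("1. Non-Resource", "4. Strategy-Based"), each = 4),
  week = rep(seq(as.Date("2011-01-31"), by = "week", length.out = 4),
             times = 2),
  V1 = c(44, 58, 43, 20, 12, 15, 9, 11)
)

# type = "h" draws a vertical line from the axis up to each value.
p <- xyplot(V1 ~ week | question.type, data = toy,
            type = "h",
            ylab = "Number of questions",
            xlab = "Week")

# At the console the plot appears on its own; in a script it has
# to be printed explicitly.
print(p)
```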