Miskatonic University Press

Review of Statistics for Library and Information Services


A few months ago I read a review of Alon Friedman’s Statistics for Library and Information Services: A Primer for Using Open Source R Software for Accessibility and Visualization and was intrigued. It seemed like it would give me a good refresher on statistics while being grounded in the library world. I bought a copy and to my dismay found the first few chapters so poorly written and riddled with so many errors that I’m going to recycle it.

I do not recommend this book to anyone or for any collection. I don’t normally post negative reviews of books, but I saw so many errors in this one that I feel people need to know.

I was concerned as soon as I started reading chapter 1, but “1.7 Open Source Software” was where I got really worried:

The term open source often refers to something that can be modified because its gate is publicly accessible. In the context of software, open source means that the software code can be modified or enhanced by anyone. The open source movement began in the late 1970s, when two separate organizations promoted the idea of software that is available for anyone to use or modify. The first organization that aimed to create a free operating system was General Public License (GPL). The leading person behind this movement was Richard Stallman. The second organization was Open Source Initiative (OSI), under the leadership of Bruce Perens and Eric S. Raymond.

“Its gate”? Further: the GPL is a license, not an organization. Stallman has been working on and for free software (there’s a difference) since the seventies, but the Free Software Foundation wasn’t created until 1985. The Open Source Initiative began in 1998.

The next paragraph confuses R with its predecessor, S. The third paragraph begins:

R is similar to other programming languages, such as C, Java, and Perl, in that is helps people perform a wide variety of computing tasks by giving them access to various commands.

I don’t understand how that sentence could be written by anyone who actually programs.

Skipping ahead past the very introductory statistics stuff, which is confusing, let’s look at “4.3 Introduction of Basic Functionality in R.” It will mislead any reader.

For example on page 53, “4.3.2 Writing Functions” begins:

When you write an R function there are two things you should keep in mind: the arguments and the return value.

Certainly true! True in any language. This is not the time to introduce functions, however. It’s too early.

The book then gives this example:

> p <- c("p1", "p2", "p3", "p4")
> p
> [1] p1, p2, p3, p4

In reality this will look like:

> p <- c("p1", "p2", "p3", "p4")
> p
[1] "p1" "p2" "p3" "p4"

There are many, many code snippets in the book where the output is wrong and the formatting bad.

The section concludes:

We will encounter functions throughout the book. For example, there is a function named “+” that does simple addition. However, there are perhaps two thousand or so functions built in to R, many of which never get called by the user directly but serve to help out other functions.

This is the reader’s first introduction to functions!
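For contrast, here is a minimal sketch of what a first look at functions could cover: arguments go in, one value comes out. The name rect_area and its arguments are made up for illustration and are not from the book.

```r
# A function takes arguments and returns the value of its last expression.
rect_area <- function(width, height = 1) {
  width * height
}

rect_area(3, 4)   # both arguments given: 12
rect_area(3)      # height falls back to its default: 3

# "+" really is a function, and can be called like one using backticks.
three <- `+`(1, 2)
three             # 3
```

The backtick call shows what the book's remark about “+” means, which is the kind of thing a beginner needs to see rather than just be told.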

“4.3.4 The Return Value” says:

In R, the return value is a function that has exactly one return value. If you need to return more than one thing, you’ll need to make a list (or possibly a matrix, data.frame, or table display). We will discuss matrix, data.frame, and table display in chapter 17.

That first sentence is incorrect, and the section is utterly unhelpful.
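What the section was presumably trying to say is that a function returns a single object, and that several results can be bundled into a list. A minimal sketch, with a hypothetical function name and made-up numbers:

```r
# An R function returns exactly one object: the value of the last
# expression evaluated, or whatever is passed to return().
summarize_loans <- function(loans) {
  # Bundle several results into one named list.
  list(total = sum(loans), average = mean(loans))
}

result <- summarize_loans(c(10, 20, 30))
result$total     # 60
result$average   # 20
```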

Moving to the next section, let’s look at a few examples from “4.4 Introduction to Variables in R.”

On page 54, it says, “In any programming language, we often encounter seven types of data,” namely numeric (“decimal values, also known as numeric values”), integers, strings, characters, factors, fractions (“represents a part of a whole or any number with equal parts”) and “logical.” I offer this as an essay question in first-year computer science exams: “‘In any programming language, we often encounter seven types of data.’ Discuss.”

On page 55 there’s discussion of assigning variables. It says, “The most common operator assignment is <- and ==, the first assignment being the preferred way.” == is the equality test! This should be =!
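The difference is easy to show in three lines. This is a small sketch with made-up variable names:

```r
x <- 5           # the preferred assignment operator
y = 5            # also assignment when used at the top level
same <- x == y   # == is a comparison and returns a logical value
same             # TRUE
```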

Then assign() is introduced, though surely no beginning R user needs to know about it, and it’s introduced incorrectly. Here’s the example:

> assign("j", "4")
> j
[1] 4
>4

What is that trailing 4 doing there? I don’t know. Also, j is being made a string, but it’s shown here as an integer. The example should look like:

> assign("j", "4")
> j
[1] "4"
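A short sketch of what is actually going on, and of the one situation where assign() can be handy: when the variable name is built at run time. The branch_ names are made up for illustration.

```r
assign("j", "4")     # same effect here as j <- "4"
class(j)             # "character", because "4" was quoted

k <- as.numeric(j)   # convert explicitly to get a number
class(k)             # "numeric"

# assign() is occasionally useful when the name is constructed in code.
for (i in 1:2) assign(paste0("branch_", i), i * 10)
branch_1             # 10
branch_2             # 20
```

Even then a named list is usually the cleaner choice, which is another reason a beginner doesn't need assign() on page 55.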

Next it says variable names can contain underscores, which is correct and certainly useful to know, but the example won’t work. This is what it shows:

> bob2 <- 38_a
> bob2
[1] 38_a

This is what happens:

> bob2 <- 38_a
Error: unexpected input in "bob2 <- 38_"

That’s because 38_a is not a valid variable name. A bit of looking around turns up the documentation in ?make.names, which says, “A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words.” So a_38 is valid, but not 38_a.

Even if the example used a_38 it still wouldn’t work, because that variable hasn’t been defined yet:

> bob2 <- a_38
Error: object 'a_38' not found
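A version of the example that does run, as a sketch: define a_38 first, then use it. The make.names() calls show how R itself treats the two candidate names.

```r
a_38 <- 38           # define the variable first
bob2 <- a_38         # now the assignment works
bob2                 # 38

# make.names() turns any string into a syntactically valid name.
make.names("38_a")   # "X38_a": an X is prepended because it starts with a digit
make.names("a_38")   # "a_38": already valid, so unchanged
```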

Moving on to page 57, about characters, the book says, “A character is used to represent a single element such as a letter of the alphabet or punctuation. In order to convert objects into character values in R, we need to declare this value as character with the as.character() function.” The example is:

> x = as.character(0.14)
> x
[1] "0.14"
> class(x)
[1] "character"

That code works and is correct, but why change to using = for assignment instead of the usual <-? Also, “0.14” is not a “single element”! Also, there’s no point in using as.character there, because x <- "0.14" does the same thing; the function could be introduced later when one needs to convert some other data type to a string.

Furthermore, strings aren’t actually a different class from characters; a string of any length has the class character:

> class("foobar")
[1] "character"
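A small sketch of both points: one letter and many letters have the same class, and as.character() earns its keep when converting something that isn't already text. The variable names are made up.

```r
class("a")                     # "character": one letter
class("foobar")                # "character": many letters, same class
nchar("foobar")                # 6

count <- 42L                   # an integer
label <- as.character(count)   # convert it to text
label                          # "42"
is.character(label)            # TRUE
```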

In the section on fractions we see that fractions aren’t actually built into R, because they require a special package to use them. The instructions on how to install that package are incorrect and will cause an error. Somebody must have noticed that, because fractions have disappeared from the version on the web site. “‘In any programming language, we often encounter six types of data.’ Discuss.”

The section on logical variables says, “The logical value makes a comparison between variables and provides additional functionality by adding/subtracting/etc.” What?

This is the example (which should use <- for assignment):

> x = 1; y = 2 # sample value
> i> x + y
> i
[1] 3

What? That makes no sense. This is what happens when you run it:

> x = 1; y = 2
> i > x + y
Error: object 'i' not found

Was it meant to be something like this?

> 1 > x + y
[1] FALSE

On the web site it looks like this:

> x = 1; y = 2
> x > y
[1] FALSE

That runs and is correct, but if this was the original intent, how on earth did it get mangled into what’s in the book? Why not say 1 > 2 and see what happens?
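A minimal sketch of what an introduction to logical values could show instead: a comparison produces TRUE or FALSE, and the result can be stored like any other value. The name is_greater is made up.

```r
x <- 1; y <- 2

is_greater <- x > y    # a comparison returns a logical value
is_greater             # FALSE
class(is_greater)      # "logical"

1 > 2                  # FALSE
x + y > 2              # TRUE: the addition happens before the comparison
```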

All of that is just on pages 53–59, where the book is introducing the most basic aspects of a programming language. I didn’t go further.

What little I read of basic statistics was confusing and unhelpful. I didn’t bother to go further on that either.

The web site has different code on it, but I don’t see any notices about errata or corrections.

Some of the many faults of the book could have been avoided by using methods of reproducible research. It’s possible to write a book mixing text and code in R Markdown, so that the output printed on the page comes from actually running the code. Hadley Wickham and Garrett Grolemund did this in R for Data Science, an excellent book, and all of the source code is openly available.
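The underlying idea can be sketched in base R, without R Markdown itself: if the output shown to the reader is captured from a real run, it can't drift from what R prints. This is only an illustration of the principle, not how any particular book was produced.

```r
p <- c("p1", "p2", "p3", "p4")

# Capture exactly what R prints, instead of typing the output by hand.
shown <- capture.output(print(p))
shown   # the single line: [1] "p1" "p2" "p3" "p4"
```

Tools such as R Markdown do this for every code chunk in a document, which is why the hand-typed output on page 53 would never have survived.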

Anyone in LIS looking to learn statistics and R is advised to look elsewhere. I will post recommendations as I find better books.