Miskatonic University Press

Anonymous Wikipedia edits from the House of Commons

r wikipedia

Excitement has been justifiably high the last few days because of a rash of Twitter bots that post a short note whenever an anonymous change is made to Wikipedia from an IP number belonging to a government. It started with @parliamentedits, then Ed Summers wrote anon and used it to set up @congressedits (don’t miss his blog post about it, or the new Wikipedia entry). Then things exploded.

(This is a long post. Good background music while reading is Listen to Wikipedia.)

As always with Ed’s programs, anon is easy to install and to use and comes under a free license, so if you’re a bit handy with the command line then you’re bound to be able to get it working without too much trouble. That helped the rapid flourishing of similar Twitter bots.

A quick note about how it works: anon uses wikichanges to listen to IRC channels spewing notifications every time a Wikipedia is edited, which is a great way of getting this information out. If you go to #en.wikipedia you’ll see something like this:

Screenshot of en.wikipedia
Little use to a human, but you can write a program to parse each line.
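
As a rough illustration of what parsing each line could look like, here is a minimal Ruby sketch. It is not how anon or wikichanges actually do it. The line layout (title in double square brackets, optional flags, a diff URL, the user between two asterisks, a size change in parentheses, then the edit summary, all wrapped in IRC colour codes) is my assumption about the feed, and the sample line, its URL and its 192.0.2.7 address are made up for the example.

```ruby
# Assumed shape of one feed line once IRC colour codes are removed:
#   [[Title]] FLAGS URL * user * (+delta) comment
LINE_PATTERN = /
  \A\[\[(?<title>.+?)\]\]\s+     # page title in double square brackets
  (?:(?<flags>[!A-Z]+)\s+)?      # optional flags such as M (minor) or N (new)
  (?<url>\S+)\s+\*\s+            # diff URL, then an asterisk separator
  (?<user>.+?)\s+\*\s+           # account name or IP address
  \((?<delta>[+-]\d+)\)\s*       # size change in bytes
  (?<comment>.*)\z               # edit summary, possibly empty
/x

# Remove IRC colour codes: byte 0x03 followed by an optional colour number.
def strip_colours(line)
  line.gsub(/\x03(?:\d{1,2}(?:,\d{1,2})?)?/, "")
end

# Returns a hash describing the change, or nil if the line does not match.
def parse_change(line)
  m = LINE_PATTERN.match(strip_colours(line).strip)
  return nil unless m

  { title: m[:title], flags: m[:flags].to_s, url: m[:url],
    user: m[:user], delta: m[:delta].to_i, comment: m[:comment] }
end

# Hypothetical sample line, built by hand for this sketch.
sample = "\x0314[[\x0307Small Dead Animals\x0314]]\x034 \x0310 " \
         "\x0302https://en.wikipedia.org/w/index.php?diff=2&oldid=1\x03 " \
         "\x035*\x03 \x0303192.0.2.7\x03 \x035*\x03 (+91) \x0310example summary\x03"

change = parse_change(sample)
puts "#{change[:user]} changed #{change[:title]} by #{change[:delta]} bytes"
```

If the real feed differs from the assumed layout, only LINE_PATTERN should need adjusting.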


Nick Ruest (we work together at York University Libraries) used anon to set up @gccaedits, which monitors anonymous edits from Government of Canada IP numbers. He got a huge amount of interest from Canadian media, which was great to see. Some links:

The Dean Del Mastro vandalism is unacceptable but also rather amusing, like the change of profession from “Auto Dealer” to “Dealer of Used Cars with Bent Frames, Perjurer.” (Del Mastro has more to worry about than Wikipedia edits, in any case. As the article says truthfully, “After being charged by Elections Canada with falsifying election documents and knowingly exceeding the Election spending limit, he resigned from the Conservative caucus. Del Mastro faces charges of violating the Canada Elections Act and up to five years in prison with a $5,000 fine.”) Browse the full editing history of the page yourself.

Why it’s anonymous and why it will stop

I didn’t know why these bots were only reporting anonymous edits. It’s because those are the only ones trackable to an IP number. That’s in Wikimedia’s privacy policy, which I’d never read before. (Nor On privacy, confidentiality and discretion, which is good.)

From Your Public Contributions:

When you make a contribution to any Wikimedia Site, including on user or discussion pages, you are creating a permanent, public record of every piece of content added, removed, or altered by you. The page history will show when your contribution or deletion was made, as well as your username (if you are signed in) or your IP address (if you are not signed in). We may use your public contributions, either aggregated with the public contributions of others or individually, to create new features or data-related products for you or to learn more about how the Wikimedia Sites are used.

Unless this Policy says otherwise, you should assume that information that you actively contribute to the Wikimedia Sites, including personal information, is publicly visible and can be found by search engines. Like most things on the Internet, anything you share may be copied and redistributed throughout the Internet by other people. Please do not contribute any information that you are uncomfortable making permanently public, like revealing your real name or location in your contributions.

And from Account Information and Registration:

[I]f you contribute without signing in, your contribution will be publicly attributed to the IP address associated with your device.

But if you do create an account and sign in, your account name is available but your IP number isn’t. Wikipedia keeps it private for a little while, in case of abuse, then throws it out. You can edit pages pseudonymously and unidentifiably if you create an account and take a bit of care. That’s good.

All this explained to me why it’s only anonymous edits that are being reported: there is no way to know who else on Parliament Hill is editing Wikipedia. (Well, no easy and obvious way for regular people—setting aside CSEC and government IT, who can track anything they want.)
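
In practice the test a bot has to make is simple: does the editor name parse as an IP address, and if so, does it fall inside a range being watched? Here is a small Ruby sketch of that check. The ranges are documentation-only addresses (RFC 5737) standing in for real ones, and the method names are my own, not anything from anon.

```ruby
require "ipaddr"

# Hypothetical watched ranges: documentation addresses, not real government networks.
WATCHED_RANGES = [
  IPAddr.new("192.0.2.0/24"),
  IPAddr.new("198.51.100.0/25")
].freeze

# Returns an IPAddr when the editor name is an IP address (an anonymous edit),
# or nil when it is an account name.
def editor_ip(user)
  IPAddr.new(user)
rescue IPAddr::Error
  nil
end

# True only for anonymous edits coming from one of the watched ranges.
def watched?(user)
  ip = editor_ip(user)
  !ip.nil? && WATCHED_RANGES.any? { |range| range.include?(ip) }
end

%w[192.0.2.7 203.0.113.9 Glaisher].each do |user|
  puts "#{user}: #{watched?(user) ? 'report' : 'ignore'}"
end
```

An edit made from the same office by someone signed in to an account never reaches the range check, which is the whole point.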

That’s why I bet all the anonymous Wikipedia editing from government offices is going to dry up. Word will get around, staffers will be admonished, politicians will huff and puff … and then anyone wanting to do anything sneaky will use a sock puppet or tether their phones or go to a café. Unless they identify themselves or someone does detective work, we’ll have no idea who they really are.

Another approach to all of this is to begin not with an incomplete set of editors but with a complete set of pages and tweet every time any one of them is edited. anon does this now too, and @congresseditors is one result. Nice work from Ed and all the others who helped!

One last key quote from the privacy policy:

Whatever you post on Wikimedia Sites can be seen and used by everyone.

Good advice about everything online, in fact.

Historical anonymous edits

I like Twitter bots as ambient ways of keeping up with what’s happening, but this deluge of information blasting out was all going too fast for me. I move more slowly. Besides, anon is written in CoffeeScript and node.js, neither of which I’m good at. I began to wonder: what about past edits? What did they show? Aha! This was something I could tackle with Ruby and then make charts in R, which is more my speed. Here’s what I did.

Information about edits made by accounts is available in a nice human-readable way. For example, Special:Contributions/ shows the most recent changes made by someone at, the IP number that made that Dean Del Mastro edit. (Notice the information box at the bottom that points out this is an IP user. “Registering also hides your IP address,” it reminds. The whois link tells you more about who owns this IP number: “CDAGOVN - Government Telecommunications and Informatics Services, CA.”)

But for this I used the MediaWiki API, in particular Usercontribs: “Gets a list of contributions made by a given user, ordered by modification time.” The API lets you get that Special:Contributions information and a lot more through variables specified in a URI. For example:

  • [Get title, timestamp, sizediff for 50 most recent changes on en.wikipedia.org made by](https://en.wikipedia.org/w/api.php?action=query&list=usercontribs&ucuser= timestamp sizediff&format=json) (JSON)
  • [Get title, timestamp, sizediff for 50 most recent changes on en.wikipedia.org made by](https://en.wikipedia.org/w/api.php?action=query&list=usercontribs&ucuser= timestamp sizediff&format=xml) (XML)

The XML version will probably be more readable in a browser.
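
To show how such a URI is put together, here is a minimal Ruby sketch. The parameter names (action, list, ucuser, uclimit, ucprop, format) are the API's own; the method name, its defaults and the 192.0.2.7 placeholder address are assumptions made for the example.

```ruby
require "uri"

# Build a usercontribs query for one user (or IP number) on one Wikipedia.
def usercontribs_uri(user, lang: "en", limit: 50, props: %w[title timestamp sizediff])
  uri = URI("https://#{lang}.wikipedia.org/w/api.php")
  uri.query = URI.encode_www_form(
    action: "query",
    list: "usercontribs",
    ucuser: user,
    uclimit: limit,
    ucprop: props.join("|"),   # the API separates multiple values with a pipe
    format: "json"
  )
  uri
end

# 192.0.2.7 is a documentation-only placeholder address.
uri = usercontribs_uri("192.0.2.7")
puts uri
```

Passing the result to something like Net::HTTP.get(uri) would fetch the response; the sketch stops at building the URI so it can be run without touching the network.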

At a command line you can use curl to get the content and then feed it into either jsonlint or xmllint, depending on the format you want. With jsonlint the command is:

curl "https://en.wikipedia.org/w/api.php?action=query&list=usercontribs&ucuser=|timestamp|sizediff&format=json" | jsonlint | more

Which begins:

{
  "query-continue": {
    "usercontribs": {
      "uccontinue": "20140424143003|605611011"
    }
  },
  "query": {
    "usercontribs": [
      {
        "userid": "0",
        "user": "",
        "ns": 0,
        "title": "Pierre-Hugues Boisvenu",
        "timestamp": "2014-07-16T14:09:05Z",
        "sizediff": -660
      },
      {
        "userid": "0",
        "user": "",
        "ns": 3,
        "title": "User talk:",
        "timestamp": "2014-07-15T18:09:34Z",
        "sizediff": 35
      },
      {
        "userid": "0",
        "user": "",
        "ns": 0,
        "title": "Small Dead Animals",
        "timestamp": "2014-07-15T18:02:03Z",
        "sizediff": 91
      },

Perfect for reading into a program and munging.
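
Here is a minimal sketch of that reading-in step in Ruby. The sample is a complete document shaped like the response excerpt above, with the user values left empty. The method name is mine, and treating uccontinue as the value to send with the next request is my reading of the "query-continue" block. Newer MediaWiki versions return a "continue" object instead, so take this as a sketch for the format shown here only.

```ruby
require "json"

# A complete document shaped like the response excerpt in the post.
RESPONSE = <<~RESPONSE
  {
    "query-continue": {"usercontribs": {"uccontinue": "20140424143003|605611011"}},
    "query": {"usercontribs": [
      {"userid": "0", "user": "", "ns": 0, "title": "Pierre-Hugues Boisvenu",
       "timestamp": "2014-07-16T14:09:05Z", "sizediff": -660},
      {"userid": "0", "user": "", "ns": 3, "title": "User talk:",
       "timestamp": "2014-07-15T18:09:34Z", "sizediff": 35},
      {"userid": "0", "user": "", "ns": 0, "title": "Small Dead Animals",
       "timestamp": "2014-07-15T18:02:03Z", "sizediff": 91}
    ]}
  }
RESPONSE

# Returns the list of edits plus the continuation value (nil on the last page).
def parse_contribs(json_text)
  data = JSON.parse(json_text)
  edits = data.dig("query", "usercontribs") || []
  continue_from = data.dig("query-continue", "usercontribs", "uccontinue")
  [edits, continue_from]
end

edits, continue_from = parse_contribs(RESPONSE)
edits.each do |edit|
  puts format("%s %+6d %s", edit["timestamp"], edit["sizediff"], edit["title"])
end
puts "more results available from #{continue_from}" if continue_from
```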

That was just for one IP number. What about for a range? For that, I wrote contributions-by-ip.rb. Given ranges of IP numbers, it runs through each one and queries 37 different Wikipedias (different languages) to find any changes, and then it dumps the results to a comma-separated value file.
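
This is not the script itself, just a sketch of the loop as described: walk the range, ask each Wikipedia about each address, write one CSV row per edit. The three language codes stand in for the 37, the addresses are documentation-only placeholders, and the fetch step is passed in as a lambda (a fake one here) so the sketch runs without the network.

```ruby
require "csv"
require "ipaddr"

LANGS = %w[en fr de].freeze # stand-ins; the real script checks 37 Wikipedias
FIELDS = %w[user lang title timestamp pageid revid parentid sizediff].freeze

# Yield every address from first to last, inclusive, as a string.
def each_ip(first, last)
  (IPAddr.new(first)..IPAddr.new(last)).each { |ip| yield ip.to_s }
end

# fetch is anything that responds to call(ip, lang) and returns a list of
# edit hashes, for example a wrapper around the usercontribs API.
def contributions_csv(first, last, fetch)
  CSV.generate do |csv|
    csv << FIELDS
    each_ip(first, last) do |ip|
      LANGS.each do |lang|
        fetch.call(ip, lang).each do |edit|
          csv << FIELDS.map { |field| field == "lang" ? lang : edit[field] }
        end
      end
    end
  end
end

# Fake fetcher with one made-up edit, for illustration only.
fake_fetch = lambda do |ip, lang|
  next [] unless ip == "192.0.2.2" && lang == "en"

  [{ "user" => ip, "title" => "Example page", "timestamp" => "2005-01-24T22:20:01Z",
     "pageid" => 1, "revid" => 3, "parentid" => 2, "sizediff" => 9 }]
end

csv_text = contributions_csv("192.0.2.1", "192.0.2.3", fake_fetch)
puts csv_text
```

Keeping the fetcher separate from the loop also makes it easy to add a pause between requests or to cache responses.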

I ran it on the House of Commons ranges that Nick has listed: ["", ""] (and everything in between). While running it looks like this:

Contributions script running
Each dot is a Wikipedia, each asterisk is an edit found.

Here’s some of the output:

user,lang,title,timestamp,pageid,revid,parentid,sizediff
,en,Noam Chomsky,2005-01-24T22:20:01Z,21566,9653831,9624290,9
,en,Willie Adams,2005-02-04T20:19:34Z,705884,13096944,9946154,1
,en,Don Boudria,2005-04-12T15:39:03Z,479215,12458768,12210018,320
,en,Helena Guergis,2005-04-29T12:01:39Z,1415882,12987067,12971149,-237
,en,Gerry Ritz,2005-05-02T21:50:03Z,1831626,13179278,0,130
,en,Gerry Ritz,2005-05-03T15:07:31Z,1831626,13197551,13179278,194
,en,Jeremy Harrison,2005-05-03T15:24:57Z,1834907,13197734,0,334
,en,Gatineau (electoral district),2005-05-16T18:41:04Z,1745109,17398215,13794806,0

The full file is on GitHub for now.

I like CSV files. They just sit there, solid, unchanging, comfortable, approachable, usable in any tool or language, unencumbered by licenses. CSV is the old sofa of data formats.

Examining with R

Let’s load in some of the usual handy libraries, read the CSV, format a couple of date things to make things easier later, and then have a beginning look at the data. There are 4485 edits recorded, we can see. From how many IPs?

> library(dplyr)
> library(lubridate)
> library(ggplot2)
> contributions <- read.csv("~/src/wikipedia.edits/house-of-commons.csv")
> contributions$date <- as.Date(contributions$timestamp)
> contributions$month <- floor_date(contributions$date, unit="month")
> str(contributions)
  [Useful output deleted, but str is always a good command to run]
> contributions %>% group_by(user) %>% summarize(count = n()) %>% arrange(desc(count))
Source: local data frame [4 x 2]

            user count
1  1624
2  1624
3   624
4   613

Four IPs! What!? Only four IPs at the House of Commons doing the editing? No others? It’s surprising, but reasonable, especially given that we don’t know how the government or House IT department runs things. Surely the House has more than 255 IP numbers for internal use. These four could be the public-facing gateways. (Incidentally, is parl153.parl.gc.ca; the name for parl203 exists, but there’s no parl155 or parl205.)

Let’s see which pages were the most edited.

> contributions %>% group_by(title) %>% summarize(count = n()) %>% arrange(desc(count))
Source: local data frame [1,965 x 2]

                                                                    title count
1                                                List of Star Trek novels   101
2  Endorsements for the Liberal Party of Canada leadership election, 2006    65
3                                                 John McKay (politician)    59
4                                                     Partition of Quebec    55
5                                            Steven Fletcher (politician)    51
6                       Liberal Party of Canada leadership election, 2006    49
7                                                           Gatineau Park    39
8                                                          Jamie Nicholls    36
9                                  Concours Eurovision de la chanson 2009    35
10                                                       Pierre Poilievre    34

List of Star Trek novels!? This page has been edited 101 times by one or more anonymous users at the House of Commons? It’s been edited more than any other page? Let’s look closer. When were these edits made?

> contributions %>% filter(title == "List of Star Trek novels") %>% group_by(date) %>% summarize(count = n()) %>% arrange(date)
Source: local data frame [3 x 2]

        date count
1 2009-12-23    55
2 2009-12-24    37
3 2012-03-05     9

92 edits on 23 and 24 December 2009. Christmas Eve. Imagine it.

Christmas tree

Some poor schnook has to go into work, but there’s nothing going on. The House isn’t sitting. The MPs have all gone back to their ridings. Ottawa is filled with Christmas lights and people shopping and planning for the holidays. Every senior civil servant is at home. But meanwhile our Star Trek-loving staffer has to go in and sit around the office, maybe catching up on complaints from constituents or editing some committee submission. It’s boring. The day is dragging. It’s snowy outside. “The hell with this,” our staffer says. “I’m going to fix up that list of Star Trek novels on Wikipedia.” Our staffer digs into the work for an hour or two, enjoys it, and that evening checks over copies of books at home to make more improvements the next day.

I have no problem with this. This is perfectly acceptable. We’ve all done it. As Ed said in his post:

I wrote this post to make it clear that my hope for @congressedits wasn’t to expose inanity, or belittle our elected officials. The truth is, @congressedits has only announced a handful of edits, and some of them are pretty banal. But can’t a staffer or politician make a grammatical change, or update an article about a movie? Is it really news that they are human, just like the rest of us?

I am going to delete these edits from the data set so they don’t obscure the rest of the analysis.

> contributions <- contributions %>% filter(title != "List of Star Trek novels")

Edits by language

Let’s see which languages are being edited. Turns out it’s almost all English, some French, then a little bit of German, Spanish, Italian, Portuguese, Ukrainian and Slovak. We’ll make a list of all the non-English/French pages that were edited.

> contributions %>% group_by(lang) %>% summarize(count = n()) %>% arrange(desc(count))
Source: local data frame [8 x 2]

  lang count
1   en  3768
2   fr   570
3   de    19
4   es    13
5   it     5
6   pt     4
7   uk     4
8   sk     1
> contributions %>% filter(lang != "en", lang != "fr") %>% group_by(lang, title) %>% summarize(count = n()) %>% arrange(desc(count))
Source: local data frame [25 x 3]
Groups: lang

   lang                             title count
1    de              Brand von Washington     6
2    de                     Steven Blaney     6
3    it                    Isabelle Morin     5
4    es                  Dólar canadiense     2
5    es                   Reyes Católicos     2
6    es                  Sangría (bebida)     2
7    es Ámbito metropolitano de Barcelona     2
8    pt                            Lisboa     2
9    pt                      Offer Nissim     2
10   uk           Маневич Абрам Аншелович     2
11   de          28. Kanadisches Kabinett     1
12   de                    Cotard-Syndrom     1
13   de                     Dory Funk Jr.     1
14   de                       Mark Carney     1
15   de Once Upon a Time  Es war einmal     1
16   de                  Ravished Armenia     1
17   de                 Sarah Knox Taylor     1
18   es                La Raya (frontera)     1
19   es                          Las once     1
20   es                    Marcela Guerra     1
21   es                     Simón Bolívar     1
22   es        Terremoto de Haití de 2010     1
23   sk                            Kanada     1
24   uk          Ляшко Олександр Павлович     1
25   uk     Обговорення користувача:NickK     1

Interesting mix there. Brand von Washington (de) sounds like a Pynchon character, but it’s the Burning of Washington in 1814, a popular feature of Canadian history. Steven Blaney is a Conservative MP; Isabelle Morin is an NDP MP. Sangria is a delicious wine-based drink. The one edit on Kanada at sk.wikipedia.org is a correction to say we’re a constitutional monarchy and that the United States and France (via Saint Pierre and Miquelon) are our neighbours. That’s a good update.

What are the most edited pages on the French language Wikipedia?

> contributions %>% filter(lang == "fr") %>% group_by(title) %>% summarize(count = n()) %>% arrange(desc(count))
Source: local data frame [265 x 2]

                                    title count
1  Concours Eurovision de la chanson 2009    35
2                          Jamie Nicholls    34
3                          Isabelle Morin    13
4                  Senators de Binghamton    11
5                              Laurin Liu    10
6         Jeux olympiques d'hiver de 2018     9
7              Nouveau Parti démocratique     9
8  Concours Eurovision de la chanson 2008     8
9                           Maria Mourani     8
10                Céline Hervieux-Payette     7

Jamie Nicholls, Isabelle Morin and Laurin Liu are all NDP MPs. Maria Mourani was expelled last year from the Bloc Québécois and now sits as an independent. Céline Hervieux-Payette is a Liberal senator. They are all from Quebec.

Concours Eurovision de la chanson 2009 is the 2009 Eurovision Song Contest. Looks like there was one fan who was making minor edits over nine months. That’s just a Star Trek novels thing.

Enough of that, let’s make a chart.

Charts of all edits

> ggplot(contributions %>% group_by(month) %>% summarise(count=n()), aes(x=month, y=count)) + geom_bar(stat="identity") + geom_vline(xintercept = as.numeric(as.Date(c("2006-01-23", "2008-10-14", "2011-05-01"))), linetype = "dashed") + labs(title="Anonymous Wikipedia edits from House of Commons IPs", x="Lines show election dates", y="")
All edits
Most anonymous edits were done during the first Conservative minority, 2006–2008.

Now let’s break it down by IP number:

> ggplot(contributions %>% group_by(month, user) %>% summarise(count=n()), aes(x=month, y=count)) + facet_grid(user ~ .) + geom_bar(stat="identity") + geom_vline(xintercept = as.numeric(as.Date(c("2006-01-23", "2008-10-14", "2011-05-01"))), linetype = "dashed") + labs(title="Anonymous Wikipedia edits from House of Commons IPs", x="Lines show election dates", y="")
Edits broken down by IP number
Why so quiet,

Hmm. Why are the edit counts so different for the four IP addresses? How are they allocated? I would like to know.

Edits since the 2011 election

Marking elections made me wonder how the editing has been since the last one, when the Conservatives won a majority. It turns out Jamie Nicholls has the most edited page(s), so let’s find out exactly where and when those edits were made.

> contributions %>% filter(date > "2011-05-01") %>% group_by(user, title) %>% summarize(count = n()) %>% arrange(desc(count))
Source: local data frame [636 x 3]
Groups: user

             user                 title count
1        Jamie Nicholls    26
2      Douglas Kinsella    15
3        Isabelle Morin    13
4       Chris Warkentin    12
5            Don Davies    11
6          Yonah Martin    11
7        Corneliu Chisu     9
8 Montreal City Council     9
9         Robert Sopuck     9
10          Charlie Watt     8
..            ...                   ...   ...
> contributions %>% filter(date > "2011-05-01", title == "Jamie Nicholls") %>% group_by(user, date, lang) %>% summarize(count = n()) %>% arrange(date)
Source: local data frame [10 x 4]
Groups: user, date

             user       date lang count
1 2011-08-06   fr     3
2 2011-08-06   fr     1
3 2011-08-17   en     1
4 2011-08-17   en     1
5 2013-09-13   fr    14
6 2013-09-23   fr     5
7 2013-11-14   fr     4
8 2013-11-20   fr     3
9 2014-01-20   fr     3
10 2014-03-28   fr     1
Looks like 13 September 2013 was a busy day on Jamie Nicholls (fr). We can use the revisions API to get [a list of all of the changes that day, with username, timestamp and comment](https://fr.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Jamie%20Nicholls&rvstart=2013-09-12T00:00:00Z&rvend=2013-09-14T00:00:00Z&rvdir=newer&rvlimit=20&rvprop=timestamp|user|comment). There was an edit war going on, I think with defending Nicholls: one comment says, “L’auteur ajoute de l’information trompeuse à propos de Jamie Nicholls en utilisant des références qui ne soutiennent pas le caractère critique de ses propos,” which Google translates as “The author adds misleading information about Jamie Nicholls using references that do not support the critical nature of his remarks.” There is some back and forth.
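
For anyone who wants to build that kind of revisions query for other pages and dates, here is a small Ruby sketch. The parameters (prop, titles, rvstart, rvend, rvdir, rvlimit, rvprop) are the ones in the link; the method name, the one-day window and the added format=json are my own choices for the example.

```ruby
require "date"
require "uri"

# Build a revisions query covering a single UTC day for one page.
def revisions_uri(title, day, lang: "fr", limit: 20)
  start = Date.parse(day)
  uri = URI("https://#{lang}.wikipedia.org/w/api.php")
  uri.query = URI.encode_www_form(
    action: "query",
    prop: "revisions",
    titles: title,
    rvstart: "#{start}T00:00:00Z",
    rvend: "#{start + 1}T00:00:00Z",   # midnight at the start of the next day
    rvdir: "newer",                    # oldest first, so rvstart comes before rvend
    rvlimit: limit,
    rvprop: "timestamp|user|comment",
    format: "json"
  )
  uri
end

day_uri = revisions_uri("Jamie Nicholls", "2013-09-13")
puts day_uri
```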

The 15 Douglas Kinsella edits are worth a look. He’s the father of Warren Kinsella, a lobbyist and Liberal back room operator, disliked by many people, which is no doubt why there’s vandalism.

> contributions %>% filter(title == "Douglas Kinsella") %>% group_by(user, date, lang) %>% summarize(count = n()) %>% arrange(desc(count))
Source: local data frame [1 x 4]
Groups: user, date

            user       date lang count
1 2013-05-24   en    15
The [listing of edits that day](https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Douglas%20Kinsella&rvstart=2013-05-24T00:00:00Z&rvend=2014-05-25T00:00:00Z&rvdir=newer&rvlimit=20&rvprop=timestamp|user|comment) shows an edit war between and Glaisher. It starts with vandalizing the page, then it goes back and forth with the page being repaired over and over until it is marked protected to prevent more vandalism from the House of Commons IP.

Every Wikipedia user also has a Talk page, where other people can talk to the user. User_talk: has a May 2013 section with a complaint, a warning and then a message saying the account was blocked from editing for 48 hours. There’s another complaint in June from another user: “Please note that you are not permitted to overwrite Wikipedia’s articles about Members of Parliament with their own self-penned and unreferenced biographies.”

Here are the talk pages for all four IPs. They’re all marked as being shared IPs, and they’ve all been blocked for vandalism or come very close.

  • User_talk: May 2013: “You have been blocked from editing for a period of 48 hours for your disruption caused by edit warring and violation of the three-revert rule at Douglas Kinsella.”
  • User_talk: May 2009: “This is the only warning you will receive for your disruptive edits.”
  • User_talk: July 2014: “You may be blocked from editing without further warning the next time you vandalize Wikipedia, as you did at Small Dead Animals.”
  • User_talk: May 2011: “Please stop adding inappropriate external links to Wikipedia, as you did to Ontario. It is considered spamming and Wikipedia is not a vehicle for advertising or promotion.”

Back to the recent edits. The Jamie Nicholls edits were defending against an anti-Nicholls user; the Kinsella changes were pure vandalism. What about the next two most edited pages?

> contributions %>% filter(title == "Isabelle Morin") %>% group_by(user, date, lang) %>% summarize(count = n()) %>% arrange(desc(count))
Source: local data frame [12 x 4]
Groups: user, date

             user       date lang count
1 2014-03-05   fr     7
2 2014-03-05   en     6
3 2012-05-15   it     3
4 2012-05-15   fr     2
5 2012-05-23   fr     2
6 2012-05-23   en     2
7 2012-05-15   it     1
8 2012-05-16   it     1
9 2012-05-23   en     1
10 2012-05-15   en     1
11 2012-05-16   fr     1
12 2012-05-23   fr     1
> contributions %>% filter(title == "Chris Warkentin") %>% group_by(user, date, lang) %>% summarize(count = n()) %>% arrange(desc(count))
Source: local data frame [4 x 4]
Groups: user, date

            user       date lang count
1 2013-03-19   en     5
2 2013-03-21   en     3
3 2012-10-18   en     2
4 2012-10-19   en     2

The history for Isabelle Morin (fr) for 5 March 2014 shows edits being reverted for being “trop promotionnel et modifié par le bureau même de la députée” (“too promotional and edited by the MP’s own office”). The history for Chris Warkentin shows the same: one edit by was reverted with the comment “Reverting back to remove another copy and paste bio from a HOC [House of Commons] IP. Copy and pastng online bios is not how Wikipedia articles are written.”

I’m not going to dig into all this any more here, especially since we’ve already seen the Dean Del Mastro vandalism. There’s much more analysis that could be done, but it seems to be the case that recent edits are generally not helpful.

On the other hand, there really are very few of them.

> library(scales)
> ggplot(contributions %>% filter(date > "2011-05-01") %>% group_by(month, user) %>% summarise(count=n()), aes(x=month, y=count)) + facet_grid(user ~ .) + geom_bar(stat="identity") + labs(title="Anonymous Wikipedia edits from House of Commons IPs since 1 May 2011", x="", y="") + scale_x_date(labels = date_format("%b %Y"))
Edits since 1 May 2011
Very few since the last election.

That is not many edits at all. Given everything happening a) at the House of Commons and b) at Wikipedia, it’s minuscule. The vandalism is unacceptable but given what the Conservative government is doing to the country it’s trivial.

Stop being idiots

I have nothing at all against anonymous edits on Wikipedia. I make them myself. And I don’t know how government IT runs its networks in the House of Commons, where those IPs are allocated or how they’re shared or why only those four IPs are seen on Wikipedia.

But to those people in the House doing the vandalizing: stop being idiots. Set up a real account and be more constructive. You know you can’t get away with it any more. Be helpful, whether that means editing pages about Canadian subjects (in English, French or other languages) or Star Trek novels or something else.

Wikipedia’s data access is wonderful

I’d never used the MediaWiki API before. It’s a bit awkward but the openness and accessibility of the data is wonderful. Everything is there! All of the data is available, for free, for download and use. Fantastic! People like to use Twitter for data mining and research, understandably, but read its terms of service and privacy policy. Twitter just wants to make money. Wikipedia is open. It makes an incalculable difference.

I know there’s a lot of academic research on Wikipedia but I’ve hardly looked at any. I will now.

I mentioned the Special:Contributions pages before. You can use them to check on what each of the four IPs is doing:

Norwegian edits

Jari Bakken (@jarib) did something similar to this for Norway: Anonymous Wikipedia edits from the Norwegian parliament and government offices. Looks like they’re more active over there than here.

Chart of Norwegian edits

He did this using Wikipedia data dumps.

I just saw that he also pulled Anonymous Wikipedia edits from the Government of Canada using Google BigQuery’s Wikipedia Revision History (which goes from 2002–2010). Nice! See Anonymous Wikipedia edits from around the world for more.


Through ARIN I found an IP range apparently belonging to the Communications Security Establishment Canada, Canada’s signals intelligence spy agency, our equivalent of the NSA or GCHQ: - I checked those IPs and only found two anonymous edits: