Miskatonic University Press

Access 2009, Thursday #5: Dan Chudnov, Repository Development at the Library of Congress


A lively, interesting and inspiring talk from Dan Chudnov, as always. He gets right to the good stuff and leaves out the dull bits about how committees mandated their initial governance models. Peter Zimmerman blogged this talk. Here are my notes:

He works in the Repository Development Group at LC: 30 people, with dedicated developers, QA, systems operations and project management. They are part of OSI, the Office of Strategic Initiatives. Their job: "Capture the digital artifact, register at Copyright Office, pass it along into the digital collection for registration, cataloging, indexation and preservation."

World Digital Library: wdl.org

Partners from all over the world, with data and metadata coming in in all kinds of ways. It's also all available to people all over the world in their own languages: /ru/ for Russian, /zh/ for Chinese, /ar/ for Arabic.

Huge load on the site when it went live, with 9000 requests a second. Bigger than any web thing LC had done before. Built on Solaris, Apache, Nginx, Django, etc.

Clean URIs that won't change. Static pages, which helps global edge caching. They used Akamai to keep access working; they couldn't have kept up with all that demand themselves.

Chronicling America: chroniclingamerica.loc.gov

Started with about 140,000 US newspaper title records. All of the data there is freely downloadable. Whole issues. 100 TB of data, growing quickly, with just 16 states so far. Petabytes in a few years. Built on Solaris, Apache, MySQL, Solr, Django. Again with the clean URIs, but lots of local page caching because it has many more pages that are each used a lot less.

They use BagIt, and in fact helped put it together. It's like a packing slip for data. Works across space, systems, organizations, and time. Bagger, a GUI for making Bags: http://sf.net/projects/loc-xferutils/ This was the first open-source licensed software that LC has released.
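To make the "packing slip" idea concrete for myself, here is a minimal sketch of a bag: payload files under data/, a bagit.txt declaration, and a manifest of checksums that the receiver can check. This is not LC's code and not Bagger. The layout follows the BagIt spec as later published (version 1.0), and the function names make_bag and verify_bag are my own, hypothetical ones.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write payload files under data/, plus bagit.txt and a SHA-256 manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_name, content in sorted(payload.items()):
        target = data_dir / rel_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest line: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{rel_name}")
    # The bag declaration identifies the directory as a bag.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> bool:
    """Return True if every file listed in the manifest exists and matches."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue  # tolerate blank lines, e.g. an empty payload
        digest, rel_path = line.split(maxsplit=1)
        file_path = bag_dir / rel_path
        if not file_path.is_file():
            return False
        if hashlib.sha256(file_path.read_bytes()).hexdigest() != digest:
            return False
    return True


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "example-bag"
        make_bag(bag, {"issue-1/page-1.txt": b"front page text"})
        print(verify_bag(bag))
```

The sender runs something like make_bag, ships the directory however it likes (disk, network, rsync), and the receiver runs something like verify_bag. That is the "across space, systems, organizations, and time" part: the manifest travels with the data.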

They get 30,000 books in each day, lots of newspapers, etc: their mailroom is a fascinating place to watch, he says.

Showed Transfer UI, an inventory/workflow tool they use internally to manage all of the stuff coming in for Chronicling America.

Registering and depositing materials for copyright: They hope to support eDeposit with these tools. Also "Deposit Demand," taking in basically databases of journals.

How does all this stuff get incorporated digitally into the collection? "Does anybody know what that means? I don't know what it means either."

Traditional approach: make catalogue records, make an exhibit site. Cost of integrating all of this is high. Building them costs money, maintaining them costs money. But cost of consistent web strategies is low.

Linked data: Use URIs as names for things. Use HTTP URIs. Provide useful information. Include links to other URIs. http://w3.org/DesignIssues/LinkedData.html

Two sites doing this:

http://id.loc.gov/ LCSH on the web for free. Embodies linked data principles. View source to see alternate formats: RDF/XML, N-Triples, JSON. They are also linked at the bottom of the page. "At this URI is a concept with a precise meaning." Now there is a standard way to refer to an LCSH heading, instead of strings. All freely downloadable too. This was also new for LC.
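The "view source" trick can be done by a program too. Here is a rough sketch that pulls the alternate-format links out of a page's head. The sample HTML is made up by me for illustration (it is not copied from id.loc.gov, and the paths in it are hypothetical); the only assumption is that alternates are advertised with the usual link rel="alternate" convention.

```python
from html.parser import HTMLParser

# Hypothetical page, standing in for a concept page that advertises
# machine-readable versions of itself.
SAMPLE_PAGE = """
<html><head>
<title>Example concept</title>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rdf+xml" href="/concepts/example.rdf">
<link rel="alternate" type="application/json" href="/concepts/example.json">
</head><body><h1>Example concept</h1></body></html>
"""


class AlternateLinkParser(HTMLParser):
    """Collect (media type, href) pairs from <link rel="alternate"> tags."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel can hold several space-separated tokens.
        rels = (attr.get("rel") or "").lower().split()
        if "alternate" in rels and attr.get("href"):
            self.alternates.append((attr.get("type") or "", attr["href"]))


def find_alternates(html: str) -> list[tuple[str, str]]:
    parser = AlternateLinkParser()
    parser.feed(html)
    return parser.alternates


if __name__ == "__main__":
    for media_type, href in find_alternates(SAMPLE_PAGE):
        print(media_type, href)
```

A harvester could fetch a page, run something like find_alternates on it, and follow whichever format it prefers, with no separate API to learn.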

[ ] Idea: Get a dump of all items in our collection and their subject headings. Download LCSH. Compare and see what we have that has invalid or non-authoritative subject headings.
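A first pass at that idea might look something like this. It assumes the hard parts are already done: the catalogue dump has been reduced to a mapping of record IDs to heading strings, and the LCSH download has been reduced to a set of authorized labels. Both of those shapes, the names normalize and find_unauthorized, and the normalization rules (trailing period, case, whitespace) are my own guesses, not anything from the talk.

```python
import unicodedata


def normalize(heading: str) -> str:
    """Fold away differences that usually do not matter when matching headings."""
    text = unicodedata.normalize("NFKC", heading)
    text = " ".join(text.split())  # collapse runs of whitespace
    return text.rstrip(".").casefold()  # drop trailing period, ignore case


def find_unauthorized(
    records: dict[str, list[str]], authorized: set[str]
) -> dict[str, list[str]]:
    """Map each record ID to the headings that have no match in the authorized set."""
    known = {normalize(h) for h in authorized}
    problems = {}
    for record_id, headings in records.items():
        unmatched = [h for h in headings if normalize(h) not in known]
        if unmatched:
            problems[record_id] = unmatched
    return problems


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    catalogue = {
        "b1001": ["Cats--Fiction.", "Kittens and stuff"],
        "b1002": ["Dogs"],
    }
    lcsh_labels = {"Cats--Fiction", "Dogs"}
    print(find_unauthorized(catalogue, lcsh_labels))
```

Anything this reports is only a candidate for review: a heading built with subdivisions can be valid without appearing as a string in the downloaded file.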

At Chronicling America, view source to see OCR information and a resourcemap. Look at the resourcemap. OAI-ORE aggregation. "A constellation of stars in the sky." Linked data, reusing other vocabularies. All exposed on the web. This was new for LC too.

Really interesting thing about this app: the web is the API. You visit the page, you're using the API. The API documentation is just a bunch of links to pages on the site.

If we all do this in all our apps, we have: distributed conceptual integration. The web is a universal collection. LC puts its artifacts on its web, we put ours on our web, it all fits together and there we go.

He summarizes: "Content that scales on the way in. Apps that scale on the way out. Movage movage movage. Transfer, inventory, workflow. BagIt. FLOSS. Free data you can use. Web of data, available and useful."