Access 2009, Friday #7: Bess Sadler, Blacklight: Findability for Your Whole Collection

Bess Sadler was back up, talking about Blacklight. Peter Zimmerman took notes on this and the next talk, which were both short and shared one 45-minute slot. See the video.

"If your interface requires instructions, it needs to be redesigned." ---Dan Rubin

Bad things about their old catalogue:

  • lack of relevance ranking
  • lack of permanent URLs/RSS
  • siloing of collections
  • lack of object type-appropriate behaviour
  • inability to respond to user requests and suggestions

Data sources they want to bring all together:

  • library catalogue
  • IR
  • theses and dissertations
  • Google Books
  • library digitization projects
  • departmental digitization projects
  • faculty research output
  • archival finding aids
  • licensed journals and databases

"Solr is the anti-silo."

"The 30,000 steamboats problem." How do you make sense of piles of stuff that are all the same and that overwhelm all the other results?

They use Cucumber to define and test how their relevancy rankings should work, giving specific cases and the expected results. When her librarians send her bug reports about rankings they send them in Cucumber format!!

On the home page, they assume that the most relevant result is an exact match on the title. In their special music search that's not true, so they boost the importance of names. Music students have such special needs that they did special stuff for them.
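The kind of expectation such a ranking test states can be sketched in plain Ruby. This is not how Blacklight or Solr actually score results: the records, the boost weights and the `score` and `rank` helpers are all invented for illustration. It only shows how changing per-field boosts flips which record comes first.

```ruby
# Hypothetical records; not real catalogue data.
RECORDS = [
  { id: 1, title: "Steamboats of the Mississippi", name: "Twain, Mark" },
  { id: 2, title: "Symphony no. 5", name: "Beethoven, Ludwig van" },
  { id: 3, title: "Beethoven", name: "Solomon, Maynard" }
].freeze

# Each field that contains the query adds that field's boost to the score.
def score(record, query, boosts)
  q = query.downcase
  boosts.sum do |field, weight|
    record[field].downcase.include?(q) ? weight : 0
  end
end

# Ids of matching records, highest score first (ties broken by id).
def rank(query, boosts)
  RECORDS
    .map { |r| [r[:id], score(r, query, boosts)] }
    .select { |_, s| s > 0 }
    .sort_by { |id, s| [-s, id] }
    .map(&:first)
end

DEFAULT_BOOSTS = { title: 10, name: 2 }.freeze  # assumed weights: title wins
MUSIC_BOOSTS   = { title: 2, name: 10 }.freeze  # assumed weights: names win

default_order = rank("beethoven", DEFAULT_BOOSTS)
music_order   = rank("beethoven", MUSIC_BOOSTS)

puts default_order.inspect   # prints [3, 2]
puts music_order.inspect     # prints [2, 3]
```

A test in the Cucumber style would then state, for a given query and interface, which record is expected at the top.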

http://projectblacklight.org

Plugin structure allows for local customizations without forking.

Access 2009, Friday #6: Roy Tennant, Inspecting the Elephant: Characterizing the Hathi Trust Collection

Roy Tennant talked about the Hathi Trust and the work he'd done hacking around with the metadata. As interesting a project as it is and as his hacking was, it was overshadowed by the video he did to go with Stan Rogers's song White Collar Holler, which many people bought from iTunes while it was still playing up at the front. Watch the video of the talk to see it.

Access 2009, Friday #5: Dorothea Salo, Grab a Bucket! It's Raining Data!

I've been reading Dorothea Salo online for years and it was nice to finally meet her and to see her in action. She gave a very good talk. Watch the video of her talk to get the full benefit. One thing I liked about her talk was that not only did she not put a lot of words up on her slides, just big interesting pictures, but she'd reuse the pictures and come back to one she'd shown before, but say new things about the related subject. As her talk went on we saw some pictures several times, each connected to some particular topic, and the visuals helped us connect what she was saying to what she'd said before about the topic. Interesting technique.

Here are my notes, and as usual Peter Zimmerman blogged it copiously.

Exciting time to be a digital librarian: there's a whole new world of stuff out there.

Says she'd been cast as the Cassandra of Open Access. But she's definitely in favour of it---hard to be against something that is an unambiguous good. But actually working at it shows it's not easy. We're asking for something for nothing from faculty members. And the same kind of foresight and designing and planning is what she's seeing in the data world. Is she now the Cassandra of data curation?

Says she's going to point out a lot of problems about data curation, but she's all in favour of it.

Focus on one problem: fit between content and container.

What do we know about research data? There's a lot of it. Even if we admit that a lot of big projects (LHC) will take care of their own stuff, we have a lot left over to handle. Can we?

"Data are there to be interacted with."

We'll need to get rid of all technical barriers around reusing data. "Different kinds of data have different kinds of affordances." They get used for different purposes.

"Data are wildly diverse" so you need different places to put them, depending on their nature. Two photographs are the same, no matter the content. But a book encoded in TEI and a book that's a collection of page scans need to be treated differently. Example of a scientist who takes lots of images of the same cell and then builds a 3D model of it: they need to be all tied together in a sensible way and they won't all fit into, eg, DSpace.

"Data are already out there."

All the data that we already have is sitting around in different silos and it's all in danger. We made our own mess.

Lots of data is analog (eg handwritten lab notebooks) that needs to be digitized. Can we scale up to that? Today, probably not.

"Data are project-based." Example: http://exploringthehyper.net/, a web-based thesis. How to preserve that? [dchud and vphill said in IRC: Fire up Heritrix.]

"Data are sloppy." If our systems only accept nice clean data then it won't handle real world stuff.

"Data aren't standardized."

Our big bucket: The digital library. Another big bucket: The institutional repository. But you can't just put an IR in place and then expect it to work. "There is no magic pixie dust for digital curation."

Why keep calling things digital as though it's something special? How do you then brand a digital collection if you think that digital is normal?

We built digital libraries same as print ones: careful selection and attention. Eg. Naskapi Lexicon at LAC.

Made a point about cyberinfrastructure going to the people with the money, but I missed it while checking IRC.

Archives don't keep everything. They throw a lot out. So we can't keep every bit of research data that comes our way. How do we decide? How do we balance giving good service to faculty and not swamping ourselves with garbage?

"Production is a Taylorist's dream." So you end up only digitizing the stuff that easily fits your production line model. Might have great images but no finding aids. Digital libraries specialize themselves by data types for efficiency's sake. That's a problem when more data comes in: it's diverse in nature, it doesn't fit into our Fordist model for handling it.

How do we handle it when the user's technology and ours don't match? Eg the web-based thesis. The best you could do with that site is to take on its whole technology stack and then future-proof it. That's a lot of work, a ton of work.

She's scared to death that so many librarians don't realize all this is a problem. They just don't realize what's involved and what it'll mean to handle it all.

"Everything about how we cost out and budget a project will have to change."

What if you don't have a Taylorist production model? You just hack something together for a given project, for now, not for the future? You have a silo. Many digital libraries are silos. Example: Decameron Web.

"Project silos are not really part of the web." Article called them "cabinets of curiosities."

Some librarians talk about the "context of an object" as though if you take something out of its context everything is broken. She says context is fluid. Context is constantly being built and rebuilt. We should expose our digital objects so they can be used and reused in new contexts: that's not decontextualization, that's REcontextualization.

Presentation is content-specific. Books browse differently than maps, etc.

We've lost a lot of projects already to all these problems. Mellon-funded digital humanities projects have disappeared completely.

What about IRs? "Institutional" is becoming a problem. If we want to bring stuff into our IR, we need to prove a link to our institution. The more tenuous the link, the more work involved, no matter how worthy the project.

We can take things into our IR that are final, static and immutable: but to researchers, that stuff is dead. By the time it's in that state, they're done with it.

IRs terrible at dealing with diverse kinds of data. Not everything is a PDF of a research paper.

People see the ugly interfaces we have (DSpace, Fedora) and run away.

"Any metadata you want ... as long as it's key-value pairs." "Do anything you want ... as long as it's download." No way to deal with real-world data.

Summarizing: "We need bigger, better buckets. Silos are both necessary and unacceptable. We have a lot of modeling to do. And meta-modeling. We have a lot of code to write. We can't code or model in isolation. Fedora is the new world. But Fedora must change. Focus on the start ... not so much on the finished product. Solr brings it all together."

Hopes she can stop being the Cassandra of data curation and instead become its Clio.

Access 2009, Friday lightning talks

All of the lightning talks at Access 2009 on the morning of Friday 2 October were good. The video recording has them all, I think, or you can read Peter Zimmerman's notes. I didn't write down much.

Dalhousie's WorldCat Local had problems with the FRBR implementation and they want some changes made to suit academic libraries. For example, if there are multiple editions grouped together, it shows at the top the one with the most holdings in WorldCat libraries. They always want the most recent edition to show, especially in the sciences.
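The difference between the two display rules is small enough to show in a few lines of Ruby. The `Edition` struct and the years and holdings counts below are made up; this is only a sketch of the two choices, not anything from WorldCat Local itself.

```ruby
# Hypothetical editions of one work; holdings counts are invented.
Edition = Struct.new(:year, :holdings)

editions = [
  Edition.new(1998, 412),
  Edition.new(2004, 230),
  Edition.new(2009, 57)
]

# Behaviour described above: lead with the most widely held edition.
most_held = editions.max_by(&:holdings)

# What Dalhousie wants: lead with the most recent edition.
most_recent = editions.max_by(&:year)

puts most_held.year     # prints 1998
puts most_recent.year   # prints 2009
```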

Bess Sadler showed some great GIS stuff.

My York University colleague Ali Sadaqain talked about our implementation of VuFind.

Access 2009, Thursday #9 - Stevan Harnad, Grasping What is Already Within Immediate Reach: Universal Open Access Mandates

Stevan Harnad wasn't there, so he sent a screencast of him giving the talk over his slides. It worked quite well, and was lightened by a couple of interludes where he got up to let his cat in or out. We could hear it meow. After the video, he came up on screen in a live video feed and answered questions. He couldn't see us but did very well considering. The videos are online.

Peter Zimmerman blogged the talk. Here are my notes.

Defined open access. Free, open access.

To what? Essential: all 2.5 million articles published each year in all 25,000 peer-reviewed journals, in all scholarly and scientific disciplines, worldwide.

Open access does this. 25% - 250% greater research impact. Lawrence (2001) in Nature showed big citation advantage for open access computer science papers. Later work showed same result in all other disciplines.

Golden way: Publishers convert their journals to OA. Depends on publishers.

Green way: Researchers deposit all their published articles into their own institution's repository. Depends only on research community. Research community can't make publishers change to Golden, but they can change themselves to Green.

EPrints (openarchives.org) is OAI harvestable.

But IRs are a necessary but not sufficient condition for green OA. Most of them are empty. Need to have mandates that say everyone does OA.

U Southampton does it. Showed how their repository is growing. World's first green OA mandate.

"You self-archive unto others as you would have others self-archive unto you."

Self-archiving mandate is natural extension of publishing mandate, for the web era.

Vast majority of people willing to self-archive, but don't bother to go to the trouble unless it's mandated.

57 university mandates so far, 4 in US, three faculty-level and MIT's university-wide one.

EUA recommended mandate for all 791 universities in 46 countries, but not adopted yet.

Discussed green OA self-archiving advocacy and how to show its advantages to people.

Mandate should require: deposit of all articles; in the institutional repository; immediately upon acceptance of publication.

(look for Which Green OA Mandate is Optimal? and other links)

63% of journals already endorse green OA self-archiving.

EPrints has a solution for the others: button so you can request a copy for research purposes. ID/OA mandate. Immediate Deposit + Open Access.

"Don't succumb to Gold Fever. The fastest and surest road to OA is Green."

Three points about why to use OA. Maximize research output. Measure and reward research output. Collect, showcase, manage a permanent record of that research output.

Access 2009, Thursday #7-8: Roy Tennant vs Mike Rylander

I think my numbering scheme for Access talks doesn't match anyone else's. Hard cheese, I'm afraid. This was a twofer, first with Roy Tennant in OCLC vendor mode and then Mike Rylander in Equinox Software (of Evergreen fame) vendor mode. Peter Zimmerman blogged both talks.

They both described the same kind of thing, a distributed ILS out on the Internet and not running locally, but Rylander's vision was people building it all themselves and owning it all themselves. This vibed much better with me than the OCLC idea, though the latter is certainly more likely and will be really interesting to see.

Roy Tennant, ILS In the Sky with Diamonds

Putting the ILS in the sky means "moving library data and applications to the network level at web scale." Moving to network level: going to the cloud. Explained cloud computing. Advantages and disadvantages. Amazon.

OLE Project. oleproject.org. Showed workflow diagrams, the point at which their Mellon funding ended and they don't have more funding. What next? Unknown. [This was clarified in IRC by someone involved with the project who said the first round of funding ended; the second is to come.]

eXtensible Catalog, out of Rochester.

Libraries are held back by: too many systems to support, too much invested in maintenance, a fragmented web presence, lost opportunities for leveraging data.

Putting an ILS on the network: boring. What can they do to use the infrastructure they have?

1,212,383 libraries worldwide. Transactions: 166,041,975,140 per year, or 5,265 per second.
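The per-second figure follows from the yearly one, assuming a 365-day year. A quick check in Ruby:

```ruby
transactions_per_year = 166_041_975_140
seconds_per_year = 365 * 24 * 60 * 60   # 31,536,000, ignoring leap years

per_second = transactions_per_year / seconds_per_year.to_f
puts per_second.round   # prints 5265
```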

They say they could do that with a handful of commodity servers. How?

There's data, public and private. Users. Libraries. Services.

Next-gen ILS: Do all library functions. Scalable. 100% web-based.

He stressed how much cheaper this will be. Selling the product. Showed timelines for what they're doing. Targeting a rollout in 2011 for this new thing.

Mike Rylander, Open Source ILS

Open source matters. Open data matters.

People threaten their vendors by saying they'll move to Evergreen. He's cool with that.

Cloud computing: "The use of any computing resource that is not mine, AND that I don't have to manage." "Learning not to waste computing resource."

Evergreen. SOA, SaaS-ish. PaaS-ish.

Could they scale Evergreen up to run a community-owned, community-run, community-maintained Platform-as-a-Service cloud?

Access 2009, Thursday #5: Dan Chudnov, Repository Development at the Library of Congress

A lively, interesting and inspiring talk from Dan Chudnov, as always. He gets right to the good stuff and leaves out the dull bits about how committees mandated their initial governance models. Peter Zimmerman blogged this talk. Here are my notes:

He works in the Repository Development Group at LC. 30 people, with dedicated developers, QA, systems operations, project management. They are part of the OSI, Strategic Initiatives. Their job: "Capture the digital artifact, register at Copyright Office, pass it along into the digital collection for registration, cataloging, indexation and preservation."

World Digital Library: wdl.org

Partners from all over the world, with data and metadata coming in from all over in all kinds of ways. Also it's all available for people all over the world: /ru/ for Russian, /zh/ for Chinese, /ar/ for Arabic.

Huge load on the site when it went live, with 9000 requests a second. Bigger than any web thing LC had done before. Built on Solaris, Apache, Nginx, Django, etc.

Clean URIs that won't change. Static pages, which helps global edge caching. They used Akamai to keep access working; they couldn't have kept up with all that demand themselves.

Chronicling America: chroniclingamerica.loc.gov

Started with about 140,000 US newspaper title records. All of the data there is freely downloadable. Whole issues. 100 TB of data, growing quickly, at just 16 states so far. Petabytes in a few years. Built on Solaris, Apache, MySQL, Solr, Django. Again with the clean URIs, lots of local page caching because it has many more pages that are each used a lot less.

They use BagIt, in fact helped put it together. It's like a packing slip for data. Works across space, systems, organizations, and time. Bagger, a GUI for making Bags: http://sf.net/projects/loc-xferutils/ This was the first open-source licensed software that LC has released.
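To make the packing slip idea concrete, here is a rough sketch of a minimal bag in plain Ruby. It is not LC's Bagger or transfer tools. It assumes the layout from the later published BagIt 1.0 specification (a `bagit.txt` declaration, a `data/` payload directory and a checksum manifest), and the `make_bag` and `bag_valid?` helpers and the file names are mine.

```ruby
require "digest"
require "fileutils"
require "tmpdir"

# Write a minimal bag: declaration, payload directory, checksum manifest.
def make_bag(bag_dir, payload)
  data_dir = File.join(bag_dir, "data")
  FileUtils.mkdir_p(data_dir)

  # Payload files live under data/.
  payload.each do |name, content|
    File.write(File.join(data_dir, name), content)
  end

  # The bag declaration (version string assumed, see the note above).
  File.write(File.join(bag_dir, "bagit.txt"),
             "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")

  # The manifest is the packing slip: one "checksum path" line per file.
  lines = payload.keys.sort.map do |name|
    digest = Digest::SHA256.file(File.join(data_dir, name)).hexdigest
    "#{digest}  data/#{name}"
  end
  File.write(File.join(bag_dir, "manifest-sha256.txt"), lines.join("\n") + "\n")
end

# Recompute each checksum and compare it with the manifest entry.
def bag_valid?(bag_dir)
  manifest = File.join(bag_dir, "manifest-sha256.txt")
  File.readlines(manifest, chomp: true).all? do |line|
    expected, path = line.split(/\s+/, 2)
    file = File.join(bag_dir, path)
    File.file?(file) && Digest::SHA256.file(file).hexdigest == expected
  end
end

bag_dir = Dir.mktmpdir("bag")
make_bag(bag_dir, { "page1.txt" => "first page\n", "page2.txt" => "second page\n" })
puts bag_valid?(bag_dir)   # prints true
```

Whoever receives the directory can rerun the check without knowing anything about the system that produced it, which is the point of the packing slip.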

They get 30,000 books in each day, lots of newspapers, etc: their mailroom is a fascinating place to watch, he says.

Showed Transfer UI, an inventory/workflow tool they use internally to manage all of the stuff coming in for Chronicling America.

Registering and depositing materials for copyright: They hope to support eDeposit with these tools. Also "Deposit Demand," taking in basically databases of journals.

How does all this stuff get incorporated digitally into the collection? "Does anybody know what that means? I don't know what it means either."

Traditional approach: make catalogue records, make an exhibit site. Cost of integrating all of this is high. Building them costs money, maintaining them costs money. But cost of consistent web strategies is low.

Linked data: Use URIs as names for things. Use HTTP URIs. Provide useful information. Include links to other URIs. http://w3.org/DesignIssues/LinkedData.html

Two sites doing this:

http://id.loc.gov/ LCSH on the web for free. Embodies linked data principles. View source to see alternate formats: RDF/XML, Ntriples, JSON. Also linked at bottom too. "At this URI is a concept with a precise meaning." Now there is a standard way to refer to an LCSH heading, instead of strings. All freely downloadable too. This was also new for LC.

[ ] Idea: Get a dump of all items in our collection and their subject headings. Download LCSH. Compare and see what we have that has invalid or non-authoritative subject headings.
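A first pass at that idea might look like the following. Everything here is a stand-in: the `authorized` set would really be built from the LCSH download and `items` from a catalogue export, and the record ids and headings are invented. It uses exact string matching, which is an assumption; real headings would need normalizing first.

```ruby
require "set"

# Hypothetical stand-in for headings loaded from the LCSH download.
authorized = Set["Cider", "Fruit wines", "Steamboats"]

# Hypothetical stand-in for a catalogue export: record id => subject headings.
items = {
  "b1001" => ["Cider", "Apple juice drinks"],
  "b1002" => ["Steamboats"],
  "b1003" => ["Fruit wine"]
}

# For each record keep only the headings not found in LCSH,
# then drop records with nothing left to report.
unmatched = items
  .transform_values { |headings| headings.reject { |h| authorized.include?(h) } }
  .reject { |_, headings| headings.empty? }

unmatched.each { |id, headings| puts "#{id}: #{headings.join('; ')}" }
```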

At Chronicling America, view source to see OCR information and a resourcemap. Look at the resourcemap. OAI-ORE aggregation. "A constellation of stars in the sky." Linked data, reusing other vocabularies. All exposed on the web. This was new for LC too.

Really interesting thing about this app: the web is the API. You visit the page, you're using the API. The API documentation is just a bunch of links to pages on the site.

If we all do this in all our apps, we have: distributed conceptual integration. The web is a universal collection. LC puts its artifacts on its web, we put ours on our web, it all fits together and there we go.

He summarizes: "Content that scales on the way in. Apps that scale on the way out. Movage movage movage. Transfer, inventory, workflow. BagIt. FLOSS. Free data you can use. Web of data, available and useful."

Access 2009, Thursday #2: Richard Akerman - Will We Command Our Data?

Richard Akerman blogs at Science Library Pad. Peter Zimmerman blogged this talk too.

My notes from his Access 2009 talk:

Talking about data. Storage has gotten incredibly amazing. Can store huge amounts of data in tiny spaces. Data floating around in the cloud doesn't seem real, but it is, and it takes a lot of energy and hardware to store it: electricity, air conditioning, etc. Carbon emissions.

Four sources of data: research data, government data, library data, personal data.

Research data: lots done about storing it, giving access to it.

Government data: has really opened up in the last year.

Library data: in catalogues, access logs, id.loc.gov, etc.

Personal data: easy to share your GPSed location, all other personal data, on line.

Everything about sharing data is getting easier: the value of it (more can be done with it by others), the ease of it, and the level of it.

OECD agreed: data from publicly-funded research should be released to the public. One reason this isn't controversial is that publishers aren't in, were never in, the business of publishing data. So data is an easy way to get into open access.

Toronto statement on prepublication data sharing: http://www.nature.com/nature/journal/v461/n7261/full/461168a.html

Open up the data before any papers based on it come out. Say, I'm going to write about this, but go ahead and use this data however you want.

In libraries: Berkeley Accord (March 2008). Basic rights to access to data in library systems. All vendors but one (Innovative Interfaces?) signed on. Though how well have they implemented it?

Personal data. WIRED cover feature "Living By Numbers" and personal data tracking (July 2009).

Why libraries? Advocates, exemplars, experts.

If lots of data is made available, how is it made findable? Need solid metadata and classification to make it easy for people to find, otherwise it's just a big mess of numbers.

http://datacite.org/ DOIs for data

NRC/CISTI: Gateway to Scientific Data Sets http://cisti-icist.nrc-cnrc.gc.ca/eng/services/cisti/scientific-data/data-sets/

Crown copyright in Canada makes it hard to give away government data (which is another example of how stupid it is), but this project is on: GeoGratis: http://geogratis.cgdi.gc.ca/

How can libraries connect to their patrons?

  • LibraryThing's free covers
  • Open Library
  • Talis Connected Commons
  • MESUR
  • id.loc.gov

APIs vs raw data. APIs: always serve up latest data, control over access, tracking/stats, complex functionality. Raw data: unconstrained access, not limited by API, no metadata.

Book about recording of personal data: http://totalrecallbook.com/.

Access 2009, Thursday #1: Cory Doctorow

The first talk at Access 2009, early in the morning of Thursday 1 October, was by Cory Doctorow. Like all the talks it was recorded, and for fun I'll embed the video here.

You can also just download the audio.

Peter Zimmerman blogged this talk and all the others. My notes:

Old visions of networks and Internet as just hyper versions of what we already had, with more TV and movie stars.

"We have forever traded quality and reliability for price, access, and customizability."

"Content isn't king, conversation is." That's why the telecommunications industry is bigger than the entertainment industry.

Telcos calling the shots and setting the laws now, which means trouble.

Discussion of culture and industry and ownership. Rules vary from country to country. Parody right in US but not Canada/UK. South African laws inhibit making alternate versions of even out of copyright books for the blind. Search engines probably illegal in Europe.

Regulatory system for big companies, making lots of copies of things with big machines, being shoehorned to fit regular people who can't get through a day without copying. His examples: finding NHS information about what it means when a kid has little pink polka-dots, birthday calls from relatives, all affected by copyright law. Obama doing fireside chats on YouTube, regulated by copyright.

Copyright continues to be made as industrial rules for industrial players. Copyright should regulate what industries do, not what you and I do.

Cory likes copyright, but he doesn't want the same rules he uses with his publisher to apply to his readers.

Copyright's purpose is to ensure that the largest number of people have the most amount of participation.

Librarians have powerful voices to speak out about this: very respected by everyone, our goal of universal access to all human knowledge is a fundamental human goal, everyone knows we're not getting rich on it, so we speak as an "unimpeachable force for moral good."

On a question about a phony trade war with China and how much we buy from them: "Our factories can't be converted back from executive lofts."

Shell isn't an oil company, it's an IT company that moves oil. Shell without the Internet is "just a hole in the ground with some guns around it." McDonald's isn't a hamburger company, it's an IT company that sells hamburgers.

The future is more IT and better supply chains.

"The coin through which you level up in the great game that is academe is citation."

My Hackfest report

This morning at Access during the Hackfest report I went on after Bess Sadler, who talked about the work she and Lisa Goddard did figuring out Ross Singer's marc2rdf-modeler. Here's what I said.

  • I listed the four things about Linked Data.
  • I said RDF is about subject-predicate-object relationships. "This conference hasName Access" and "This conference hasLocation Charlottetown." Not scintillating conversation but a straightforward way of stating facts.
  • If you do all this then you end up being part of the big web of data illustrated on the Linked Data page. I need some data in that web for OpenFRBR, from Freebase, id.loc.gov, etc. How can I get to it with Ruby?
  • Let's do an example. Take the LCSH authority record for Cider. I have the URI for that ...
  • and I have the URI for broader term thanks to SKOS ...
  • so how can I use Ruby to go from the URI for Cider to the URI for its broader term? I have the subject (Cider), the predicate (has broader term), and URIs for both. How do I get the URI for the actual broader term? I can see with my eye it's "Fruit wines" but I want to do it in Ruby.
  • I did it using Ruby bindings for Redland.

It took me all afternoon but eventually I figured this out:

$ irb
irb(main):001:0> require 'rdf/redland'
=> true

irb(main):002:0> cider_uri = Redland::Uri.new("http://id.loc.gov/authorities/sh85025940#concept")
=> #<Redland::Uri:0xb7c58854 @uri=#<SWIG::TYPE_p_librdf_uri_s:0xb7c5882c>>

irb(main):003:0> broader_uri = Redland::Uri.new("http://www.w3.org/2004/02/skos/core#broader")
=> #<Redland::Uri:0xb7c54074 @uri=#<SWIG::TYPE_p_librdf_uri_s:0xb7c5404c>>

irb(main):004:0> prefLabel_uri = Redland::Uri.new("http://www.w3.org/2004/02/skos/core#prefLabel")
=> #<Redland::Uri:0xb7c4f6f0 @uri=#<SWIG::TYPE_p_librdf_uri_s:0xb7c4f6c8>>

irb(main):005:0> storage=Redland::TripleStore.new("hashes",  "test", "new='yes',hash-type='memory',dir='.'")
=> #<Redland::TripleStore:0xb7c4a448 @store=#<SWIG::TYPE_p_librdf_storage_s:0xb7c4a3f8>,@store_type="hashes", @name="test">

irb(main):006:0> model=Redland::Model.new(storage)
=> #<Redland::Model:0xb7c47a68 @model=#<SWIG::TYPE_p_librdf_model_s:0xb7c47a18>, @store=#<Redland::TripleStore:0xb7c4a448 @store=#<SWIG::TYPE_p_librdf_storage_s:0xb7c4a3f8>, @store_type="hashes", @name="test">, @statements=[]>

irb(main):007:0> parser=Redland::Parser.new("raptor", "", nil)
=> #<Redland::Parser:0xb7c44200 @context=nil, @parser=#<SWIG::TYPE_p_librdf_parser_s:0xb7c441c4>, @idents=[]>

irb(main):008:0> model.size
=> 0

irb(main):009:0> parser.parse_into_model(model, cider_uri)
=> 0

irb(main):010:0> model.size
=> 14

irb(main):011:0> model.object(cider_uri, prefLabel_uri).to_s
=> "Cider@en"

irb(main):012:0> object_uri = model.object(cider_uri, broader_uri).uri
=> #<Redland::Uri:0xb7c399b8 @uri=#<SWIG::TYPE_p_librdf_uri_s:0xb7c39990>>

irb(main):013:0> object_uri.to_s
=> "http://id.loc.gov/authorities/sh85052187#concept"

irb(main):014:0> parser.parse_into_model(model, object_uri)
=> 0

irb(main):015:0> model.size
=> 27

irb(main):016:0> model.object(object_uri, prefLabel_uri).to_s
=> "Fruit wines@en"

With that, I move from one piece of linked data to another, parse it, and pick out the piece of information I want. Now everything in the web of linked data is available to me.
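The subject-predicate-object lookup at the heart of that session can also be sketched without Redland or a network connection. The three triples below are copied from the session above (with the language tags dropped); the `TRIPLES` array and the `object_for` helper are made up for illustration and are not part of any RDF library.

```ruby
SKOS = "http://www.w3.org/2004/02/skos/core#"

# A few triples from the session above, as [subject, predicate, object].
TRIPLES = [
  ["http://id.loc.gov/authorities/sh85025940#concept", SKOS + "prefLabel", "Cider"],
  ["http://id.loc.gov/authorities/sh85025940#concept", SKOS + "broader",
   "http://id.loc.gov/authorities/sh85052187#concept"],
  ["http://id.loc.gov/authorities/sh85052187#concept", SKOS + "prefLabel", "Fruit wines"]
].freeze

# Return the first object for a given subject and predicate, or nil.
def object_for(subject, predicate)
  triple = TRIPLES.find { |subj, pred, _| subj == subject && pred == predicate }
  triple && triple[2]
end

cider = "http://id.loc.gov/authorities/sh85025940#concept"

# Follow the "broader" link, then ask the target for its label.
broader = object_for(cider, SKOS + "broader")
puts object_for(broader, SKOS + "prefLabel")   # prints Fruit wines
```

The real work Redland does is fetching and parsing the RDF at each URI; once the triples are in the model, moving from Cider to its broader term is just this kind of lookup.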
