Miskatonic University Press

Access 2009, Friday #5: Dorothea Salo, Grab a Bucket! It's Raining Data!


I've been reading Dorothea Salo online for years and it was nice to finally meet her and to see her in action. She gave a very good talk. Watch the video of her talk to get the full benefit. One thing I liked was that she didn't put a lot of words up on her slides, just big interesting pictures, and she'd reuse them: she'd come back to a picture she'd shown before and say new things about the related subject. As her talk went on we saw some pictures several times, each connected to some particular topic, and the visuals helped us connect what she was saying to what she'd said before about the topic. Interesting technique.

Here are my notes, and as usual Peter Zimmerman blogged it copiously.

Exciting time to be a digital librarian: there's a whole new world of stuff out there.

Says she'd been cast as the Cassandra of Open Access. But she's definitely in favour of it---hard to be against something that is an unambiguous good. But actually working at it shows it's not easy. We're asking for something for nothing from faculty members. And the same kind of foresight and designing and planning is what she's seeing in the data world. Is she now the Cassandra of data curation?

Says she's going to point out a lot of problems about data curation, but she's all in favour of it.

Focus on one problem: fit between content and container.

What do we know about research data? There's a lot of it. Even if we admit that a lot of big projects (LHC) will take care of their own stuff, we have a lot left over to handle. Can we?

"Data are there to be interacted with."

We'll need to get rid of all technical barriers around reusing data. "Different kinds of data have different kinds of affordances." They get used for different purposes.

"Data are wildly diverse" so you need different places to put them, depending on their nature. Two photographs are the same, no the content. But a book encoded in TEI and a book that's a collection of page scans need to be treated differently. Example of scientist who takes lots of images of the same cell and then builds a 3D model of it: they need to be all tied together in a sensible way and they won't all fit into eg DSpace.

"Data are already out there."

All the data that we already have is sitting around in different silos and it's all in danger. We made our own mess.

Lots of data is analog (eg handwritten lab notebooks) that needs to be digitized. Can we scale up to that? Today, probably not.

"Data are project-based." Example: http://exploringthehyper.net/, a web-based thesis. How to preserve that? [dchud and vphill said in IRC: Fire up Heretrix.])

"Data are sloppy." If our systems only accept nice clean data then it won't handle real world stuff.

"Data aren't standardized."

Our big bucket: The digital library. Another big bucket: The institutional repository. But you can't just put an IR in place and then expect it to work. "There is no magic pixie dust for digital curation."

Why keep calling things digital as though it's something special? How do you then brand a digital collection if you think that digital is normal?

We built digital libraries same as print ones: careful selection and attention. Eg. Naskapi Lexicon at LAC.

Made a point about cyberinfrastructure going to the people with the money, but I missed it while checking IRC.

Archives don't keep everything. They throw a lot out. So we can't keep every bit of research data that comes our way. How do we decide? How do we balance giving good service to faculty and not swamping ourselves with garbage?

"Production is a Taylorist's dream." So you end up only digitizing the stuff that easily fits your production line model. Might have great images but no finding aids. Digital libraries specialize themselves by data types for efficiency's sake. That's a problem when more data comes in: it's diverse in nature, it doesn't fit into our Fordist model for handling it.

How do we handle it when the user's technology and ours don't match? Eg the web-based thesis. The best you could do with that site is to take on its whole technology stack and then future-proof it. That's a lot of work, a ton of work.

She's scared to death that so many librarians don't realize all this is a problem. They just don't realize what's involved and what it'll mean to handle it all.

"Everything about how we cost out and budget a project will have to change."

What if you don't have a Taylorist production model? You just hack something together for a given project, for now, not for the future? You have a silo. Many digital libraries are silos. Example: Decameron Web.

"Project silos are not really part of the web." Article called them "cabinets of curiosities."

Some librarians talk about the "context of an object" as though if you take something out of its context everything is broken. She says context is fluid. Context is constantly being built and rebuilt. We should expose our digital objects so they can be used and reused in new contexts: that's not decontextualization, that's REcontextualization.

Presentation is content-specific. Books browse differently than maps, etc.

We've lost a lot of projects already to all these problems. Mellon-funded digital humanities projects have disappeared completely.

What about IRs? "Institutional" is becoming a problem. If we want to bring stuff into our IR, we need to prove a link to our institution. The more tenuous the link, the more work involved, no matter how worthy the project.

We can take things into our IR that are final, static and immutable: but to researchers, that stuff is dead. By the time it's in that state, they're done with it.

IRs are terrible at dealing with diverse kinds of data. Not everything is a PDF of a research paper.

People see the ugly interfaces we have (DSpace, Fedora) and run away.

"Any metadata you want ... as long as it's key-value pairs." "Do anything you want ... as long as it's download." No way to deal with real-world data.

Summarizing: "We need bigger, better buckets. Silos are both necessary and unacceptable. We have a lot of modeling to do. And meta-modeling. We have a lot of code to write. We can't code or model in isolation. Fedora is the new world. But Fedora must change. Focus on the start ... not so much on the finished product. Solr brings it all together."

Hopes she can stop being the Cassandra of data curation and instead become its Clio.