code4lib

Code4Lib North notes

(I sent something like this to the code4lib mailing list.)

About 40 of us were in Kingston, ON last Friday for Code4Lib North. It was a great day! I had a really good time and I hope everyone else did too.

Thursday there were about twenty people hanging out in the afternoon, and most went off for dinner and some ended up at The Sleepless Goat, a hippie cafe/restaurant, for dessert. I don't think much hacking went on, but the hanging out was good.

Friday we started at 9. There were 10 twenty-minute talks, and before lunch we did an Ask Anyone session (like Dan Chudnov did at Code4Lib in Asheville). Queen's University generously provided lunch. When we came back we had nine or ten lightning talks, the last three talks, and then broke up and some BOF sessions happened. The library shut at 4:30 and a group headed out for dinner while others went back home.

I knew some of the people there but there were lots of new faces. The talks were all very interesting. I went first and was, I'm afraid, insufficiently awake.

Walter Lewis spoke for himself and Art Rhyno about linked data and old Kingston newspapers in Our Ontario. MJ Suhonos's location-aware mytpl.ca had people oohing and ahhing when it showed that the nearest copy in Toronto of a certain book was at a branch in the very east end of the city (Kingston being 250 km east of Toronto). Alan Harnum talked about Toronto Public Library's use of Endeca, and attributed some of its features to a level 20 wizard.

Glen Newton's visualization of domains of knowledge in scientific journals was eye-opening. John Miedema gave a summation of OpenBook, his WordPress plugin that he's weaning from development, and Eric Palmitesta gave a great tutorial on XQuery and Exist. Nasser Saleh talked about Coagmento, a collaborative browsing/research tool.

I don't have the details of the nine or ten lightning talks, but there was a wide mix of people and subjects, and everyone had the Code4Lib spirit. Fifty interesting minutes, really well done.

Thanks again to Queen's University Library (it's a lovely campus, with very nice libraries) and Wendy Huot for organizing everything there. (Wendy led a BOF at the end of the day, a design critique of some new home page designs she'd been working on. That was a good session, the kind of thing that really helps anyone working on library web sites.) I think it was a really fun and informative day, and I hope everyone else felt the same.

Code4Lib 2010: Wednesday 24 February

Emily Lynema, Iterative Development: Done Simply

Problems: You have too much work to do. Priorities change frequently. Requirements change. No business analysts. Emergencies happen. "IT black box" where no-one outside IT knows what's going on inside.

Agile development as opposed to the waterfall method.

Scrum: product owner/scrum master/team. Artifacts: product backlog, sprint backlog. 2-4 week cycle. Plan, commit to certain things and estimate. Daily scrum for fifteen minutes each day. What have you done since yesterday? What will you do today? What problems might you have? After sprint, sprint review and retrospective.

Case study at NCSU Libraries. They use iterative process loosely based on Scrum. JIT planning and documentation. Collaboration with customers. Joint project ownership.

They use JIRA, Confluence. Sprint planning: Google Docs + JIRA.

Sprint planning: use one week to plan across multiple projects. Day 1: overview of next 3-6 months. Prioritize. She does up a Google Docs spreadsheet with weeks as columns and projects as rows, and puts in everything she knows about what should be done.

Days 2-5: Meet with product owners for each prioritized project.

A release in JIRA = product iteration.

Day 6: Sprint planning. Reprioritize based on estimate and time available.

Development: Working on it all. Daily meetings. Weekly review.

Challenges: Multiple small projects within a cycle. Not traditional for Agile. Lack of documented requirements: what are the user stories and when do you need them? "Teams of librarians work slowly." Prioritization difficult for library staff. Testing: how to automate; no QA experts. Simultaneously handle support and development.

Outcomes: Projects working well. Keeping to six-week cycles keeps everything in line. 31 releases across six projects in 2009. Increased flexibility.

Agile for All. Succeeding with Agile

Bess Sadler, Vampires vs. Werewolves: Ending the War Between Developers & Sysadmins with Puppet

Developers say: Those sysadmins keep me from doing my job! My job is to write new software and add features and do cool stuff and get it into production.

Sys admins say: Those developers! My job is to keep things running and build trust and make sure things are reliable.

But if they're arguing and systems go down or new features don't get added, the angry villagers show up with pitchforks.

Innovation is about risk. You don't take risks with people you don't trust. Let go of the anger.

Testing. They use NAGIOS to watch their projects, not just uptimes: searches into systems, etc. They set up tests and use NAGIOS to run them automatically, so when there's an upgrade they can see automatically what works and what doesn't.
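
A Nagios check is, by convention, a small program that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Here is a minimal Python sketch of the kind of search check described above. The function names, the thresholds, and the idea of passing in a search function are my own assumptions for illustration, not what they actually run.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_search(run_search, query, min_hits=1, warn_seconds=2.0):
    """Run one search and turn the result into a Nagios status.

    run_search is a hypothetical callable that takes a query string and
    returns the number of hits; in real use it would call the catalogue.
    """
    start = time.monotonic()
    try:
        hits = run_search(query)
    except Exception as exc:  # any failure to search counts as critical
        return CRITICAL, f"SEARCH CRITICAL - {query!r} failed: {exc}"
    elapsed = time.monotonic() - start

    if hits < min_hits:
        return CRITICAL, f"SEARCH CRITICAL - {query!r} returned {hits} hits"
    if elapsed > warn_seconds:
        return WARNING, f"SEARCH WARNING - {query!r} took {elapsed:.1f}s"
    return OK, f"SEARCH OK - {query!r} returned {hits} hits"


# Stand-in search function so the sketch runs on its own.
status, message = check_search(lambda q: 42, "nature")
print(message)
```

A real plugin would finish by exiting with `status` so Nagios can pick up the result.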

Write docs. Link to the doc from where the problems are seen (in NAGIOS for example).

Hudson, continuous integration tool.

Puppet for release management.

Naomi Dushay, Willy Mene, and Jessie Keck, I Am Not Your Mother: Write Your Test Code

They use Hudson to manage automated testing. Looks nice. Made me feel guilty because I never write tests. But then I'm just hacking on stuff myself, mostly.

Selenium: Firefox plugin to automate browsing, good for testing.

Summary of test-driven development.

Uses rSpec, Cucumber, and the Rails testing environment.

Jessie Keck ran Blacklight on his laptop. Showed an rSpec test. He knew it would fail. Ran it and it failed. Edited code to change something, reran the test, and it passed.

Cucumber. It has features that consist of scenarios. Showed an example of using this to test that something was on the home page. It wasn't, test failed, code changed, test rerun, test passed.
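
The demos were in rSpec and Cucumber. As a stand-in, here is the same fail-then-pass cycle sketched with Python's unittest; the `home_page` function and its content are made up for illustration.

```python
import unittest


def home_page():
    """Hypothetical stand-in for rendering the home page."""
    # Before the fix this returned a page without the search form,
    # so the test below failed; adding the form made it pass.
    return "<h1>Library</h1><form id='search'></form>"


class HomePageTest(unittest.TestCase):
    def test_search_form_is_on_home_page(self):
        self.assertIn("id='search'", home_page())


# Run the test case directly rather than through the command line.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(HomePageTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```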

Types of tests: Unit tests. Integration tests. Black box/functional/acceptance testing.

Other web testing tools: WebRat, Watir.

Chris Beer, Media, Blacklight, and Viewers Like You

Media archives. "Anatomy of a film clip."

PBCore. Fedora. Blacklight. Solr. jQuery. MySQL. Lighttpd.

http://tinyurl.com/c4l-pbcore

They did their own video player, which handles scrolling simultaneous transcripts.

http://github.com/cbeer/ave-sync

Ian Walls, Becoming Truly Innovative: Migrating from Millennium to Koha

Had a full datestamp to the minute on his title slide.

University's security people pulled their server offline during a retirement party; the library noticed when their ILS went down. Thought about moving to Koha.

To migrate: bib/auth data, patron, checkouts, holds, serials issues, acquisitions. Patron data not hard to get out and ingest. Bib data harder. A number of export methods just wouldn't work, but they did get it working.

Explained everything about how they'd done it, and gave some advice on what to do if you're doing it. It went pretty smoothly, from the sounds of it, and there were no all-nighters.

Dan Chudnov (facilitating), Ask Anything!

Worked very well. I didn't have any questions to ask, or answers to give, but it was great to watch and I think everyone enjoyed it.

Naomi Dushay and Jessie Keck, A Better Advanced Search

Stanford advanced search

Use cases: author + title, e.g. "mozart sonata 21"

Personal name in art: could be author, subject, additional author, etc.

Combining multiple facets: find books and videos, stuff in Spanish and English, stuff at this and that libraries.

People like Boolean.

Context-specific advanced search, eg for music.

One pattern: "any of these words" "all of these words" "none of these words"

Or: Title, Author, Subject, fields.

Or: multiline form where you can pick modifiers and fields (like WebCat).

Notice on Stanford's form: Keyword isn't the first. Keyword searching wasn't high demand in use cases, so they didn't put it up front. "Subject terms" is their lingo to imply controlled vocabulary.

They had a problem searching across multiple fields from one search box, because weightings didn't figure into it. Solution: localparams in Solr.

Documented here:

http://www.stanford.edu/people/~ndushay/code4lib2010/advSearchSolrQueries.pdf
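
As far as I could follow, the idea is that each box on the form becomes its own dismax subquery, with its own field weightings, and the subqueries are joined with Boolean operators. Here is a rough Python sketch that builds such a query string using Solr's `_query_` nested-query syntax and LocalParams. The field names and boosts are hypothetical, not Stanford's real configuration; the PDF above has the real thing.

```python
from urllib.parse import urlencode


def nested_dismax(field, terms):
    """One dismax subquery using Solr LocalParams.

    qf=$qf_<field> is parameter dereferencing: Solr looks up a request
    parameter (or configured default) of that name for the weightings.
    """
    escaped = terms.replace("\\", "\\\\").replace('"', '\\"')
    return f'_query_:"{{!dismax qf=$qf_{field}}}{escaped}"'


def advanced_query(clauses, operator="AND"):
    """Join per-field subqueries, e.g. author + title."""
    parts = [nested_dismax(field, terms) for field, terms in clauses]
    return f" {operator} ".join(parts)


q = advanced_query([("author", "mozart"), ("title", "sonata 21")])
params = urlencode({
    "q": q,
    # Hypothetical field names and boosts, for illustration only.
    "qf_author": "author_search^20 author_unstem_search^50",
    "qf_title": "title_search^20 title_unstem_search^50",
})
print(q)
```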

This got quite technical and I don't know much about Solr.

Challenges in UI:

Multi-select facets. Make user easily aware of current facet selections. Integration with UI: Faceting. Search breadcrumbs.

Actionable facets in search results.

Cary Gordon, What's New in Drupal 7

He was filling in at the last minute for two Danes who couldn't make it because of weather-caused airplane delays.

  1. Make the most frequent tasks easy and less frequent tasks achievable.
  2. Design for 80%.
  3. Privilege the content creator.
  4. ?

Very complete update on everything that's new and changed in Drupal 7. It's in alpha now but I'll upgrade when it's ready.

Andreas Orphanides, Cory Lown, and Emily Lynema, Enhancing Discoverability With Virtual Shelf Browse

Why do a virtual shelf browse? Universal behaviour. We all browse shelves, but shelves are going away. Users like recommendations.

NOTE TO SELF: Add call numbers for all our electronic resources that just say ELECTRONIC right now. That's a big failing. Then do a virtual shelf browse.

http://www2.lib.ncsu.edu/catalog/record/NCSU1764762

Data model goals. Browse arbitrary number of titles around known item in call number order. Include online + all locations. Support browse searching, partial and non-matches. Browse by title, not by item. Forgiving call number searching.

Cron dumps out data in delimited text, ingest into DB, call number index in MySQL.
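
The core of the data model, browsing a window of titles around a known item in call number order, can be sketched in a few lines of Python. This is an in-memory illustration under my own assumptions (a sorted list of made-up, already-normalized keys), not their MySQL implementation.

```python
import bisect


def shelf_window(shelf, sort_key, before=2, after=2):
    """Return titles around a call number, in shelf order.

    shelf is a list of (sort_key, title) tuples already sorted by key.
    If sort_key is not on the shelf, the window is centred on where it
    would be filed, which gives results for partial or non-matches.
    """
    keys = [key for key, _ in shelf]
    pos = bisect.bisect_left(keys, sort_key)
    start = max(0, pos - before)
    end = min(len(shelf), pos + after + 1)
    return [title for _, title in shelf[start:end]]


# Made-up, pre-normalized keys; one row per title, not per item.
shelf = [
    ("PR 4034 E5", "Emma"),
    ("PR 4034 P7", "Pride and Prejudice"),
    ("PR 4034 S4", "Sense and Sensibility"),
    ("PR 4662 M5", "Middlemarch"),
    ("PS 1305 A1", "Adventures of Huckleberry Finn"),
]
print(shelf_window(shelf, "PR 4034 S4", before=1, after=1))
```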

Front-end goals: Access to infinite shelf. Interactive visual browsing experience. Design cues from Google Books. High performance. Satisfy patrons and staff.

They used jQuery, Thickbox, jCarousel, SimpleTip.

Problems: DOM is slow. Three plugins = trouble. Remote servers = latency roulette. Too much Ajax = browser bottleneck. IE is bad.

Future includes virtual browsing across other dimensions of likeness.

http://www.lib.ncsu.edu/dli/projects/virtualshelfindex/

Naomi Dushay and Jessie Keck, How to Implement A Virtual Bookshelf With Solr

Showed how shelf browse works in their system. They're starting simple, vertical listing on left-hand side.

Described how they normalize call numbers, standardize things, but of course the data is often ugly or messed up or breaks rules.

What if you have a 40-volume encyclopedia? Don't want to go through 40 books. Can lop off volume number etc. Various kinds of lopping done.

In a mix of different classification schemes, need to separate them so that some archival or thesis stuff starting with E isn't mixed in with American history in LC's E.
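
To make the normalizing, lopping and scheme-separating concrete, here is a much-simplified Python sketch. The regular expressions, padding widths and scheme prefixes are my assumptions; the real rules they described handle far more cases and far uglier data.

```python
import re

# Very rough LC shape: class letters, class number, optional decimal, rest.
LC_PATTERN = re.compile(r"^([A-Z]{1,3})\s*(\d+)(?:\.(\d+))?\s*(.*)$")
VOLUME_SUFFIX = re.compile(r"\s+(?:V|VOL|NO|PT)\.?\s*\d+.*$", re.IGNORECASE)


def lop(call_number):
    """Drop a trailing volume/part designation so a multi-volume set
    collapses to a single shelf entry."""
    return VOLUME_SUFFIX.sub("", call_number.strip())


def shelf_key(call_number, scheme="LC"):
    """Build a string that sorts in shelf order.

    The scheme prefix keeps, say, local thesis numbers starting with E
    from interleaving with LC class E. Anything that does not parse
    sorts after the well-formed numbers instead of raising an error.
    """
    match = LC_PATTERN.match(lop(call_number).upper())
    if not match:
        return f"{scheme} ~{call_number.strip().upper()}"
    letters, number, decimal, rest = match.groups()
    # Zero-pad the class number so "E 99" files before "E 185" as strings.
    key = f"{scheme} {letters:<3}{int(number):04d}.{decimal or '0'} {rest}"
    return key.rstrip()


print(shelf_key("E99 .C5"), "|", shelf_key("E185 .A2"))
print(shelf_key("AE5 .E363 v.12"))
```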

Naomi explained all about the detailed parts of how they got this to work. Very useful; we can use this ourselves at York. Too much for me to take in during the talk, but if we implement this then a talk like this is exactly what we need.

Interface: jQuery. Animating left/right browsing he did in 10-15 lines of jQuery, with no plugins.

There's a lot of work involved in shelf browsing, apparently.

Lightning Talks

LibX Update, Godmar Back

http://libx.org/chrome/

Showed the new UI they did for Chrome. Looks nice.

LibX is nice software.

How to build a Virtual Bookshelf Without Solr (or MySQL) - Maccabee Levine

http://tinyurl.com/virtualbookshelf

Instead of loading stuff into a database or Solr, which requires IT department support, you can use the ILS API as it is. Simple way of doing it yourself, if your ILS has a web services API, which Voyager does.

VIVO, an interdisciplinary national network - Paul Albert

Semantic web way of connecting/relating people, grants, subjects, etc.

http://vivo.cornell.edu/

Looks very interesting. Could we use this at York? OCUL?

http://vivoweb.org/

WolfWalk, two ways - Jason Casden

WolfWalk

iPhone app. Geolocation-aware way of showing images from special collections about campus history.

Ran into trouble when Apple's lawyers and NCSU lawyers couldn't agree on the App Store contract, so they did a web-based mobile app that will work everywhere, which he recommends.

Custom metasearch widgets - Alex Smith

http://xerxes.calstate.edu/

Node.js development - Gabriel Farrell

Node.js

Super-lively on GitHub: http://github.com/ry/node

Catalog Auto-suggest using SOLR - Jill Sexton

https://docs.google.com/present/view?id=dcz7k2rb_59xzgz36fg

http://search.lib.unc.edu/

Problem: Use of external index for library catalog limits access to authority data while searching.

So: do an autosuggest feature using library authority data.

Example search: starting to type "the big lebowski" or "dickinson emily" (notice fields on right-hand side).

This has caused a big increase in subject searches and auto-suggest search queries are used a lot, the logs show.
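
Their auto-suggest runs on Solr. Just to illustrate the underlying idea of prefix matching against a sorted list of authority headings, here is a small in-memory Python sketch with made-up headings.

```python
import bisect


def suggest(headings, prefix, limit=5):
    """Return up to `limit` headings that start with the typed prefix.

    headings must be a sorted list of lower-cased strings; binary search
    finds the first candidate, so lookups stay fast as the list grows.
    """
    prefix = prefix.lower()
    start = bisect.bisect_left(headings, prefix)
    matches = []
    for heading in headings[start:start + limit]:
        if not heading.startswith(prefix):
            break
        matches.append(heading)
    return matches


# A few made-up headings standing in for exported authority data.
headings = sorted([
    "dickens, charles",
    "dickinson, emily",
    "dickinson, emily -- correspondence",
    "the big lebowski",
    "the big sleep",
])
print(suggest(headings, "dickinson"))
```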

EmeraldView, a PHP frontend for Greenstone - Yitzchak Schaffer

Greenstone is a "digital library solution."

Kill the Search Button - Michael Nielsen, Jørn Thøgersen [facilitated by Roy Tennant]

http://developer.statsbiblioteket.dk/kill/code4lib

You Heard It Here First... - Roy Tennant

Roy announced the new OCLC Innovation Lab. (Mike Teets, Tip House, Rob Koopman.) Be interesting to see what comes out of that.

File Information Tool Set (FITS) - Spencer McEwen

http://fits.googlecode.com/

JavaScript E-book Reader -- Eric Palmitesta

Showed the ScholarsPortal Ebooks interface. They wrote an ebook reader for it, because existing ones weren't good enough.

Faceted browse on the cheap - Tom Keays

He set up a collection of books in RefWorks, then exported to BibTeX, then ran it through Babel to turn that into JSON. Smart!

SIMILE stuff from MIT is excellent.

Code4Lib 2010: Tuesday 23 February

More notes from Code4Lib 2010, some brief, some so brief as to be nonexistent.

Cathy Marshall: People, Their Digital Stuff, and Time: Opportunities, Challenges and Life-Logging Barbie

Three or four things to think about when we mix people, stuff and time. Ruminations about personal digital archiving. "Feral ethnography."

"People rely on benign neglect as a de facto stewardship technique and collection policy."

"Personal digital archiving != archiving a personal digital collection."

Lots of laughing. Quite funny.

One can keep everything. Why might one do that? Don't know an item's future worth. It's hard thankless work to delete. "Filtering and searching can locate the gems among the gravel."

"It's easier to keep than to cull." Loss as a means of culling collections.

Personal scholarly archives as an example. One researcher who'd thought he'd have lots of stuff, but now has little, only things since 2001, mostly PDFs.

"Implication: not all long-term stores need to perform with the same level of reliability."

Use-based heuristics help assess value.

Second point. No single preservation technology/repository/etc. will win the battle for your stuff.

Instead of centralizing, we'll be knitting together stores and services. Mobile devices, email, web sites, etc.

No single archive. The catalogue is the answer. Some things (dental records, high school photos, tax records) SHOULD be in different places. You can find them later when you need them.

Third point: Forget about digital originals or reference copies.

People have local copies of images etc. that they think of as the master copies, but there's a lot of useful added metadata in the online versions (eg Flickr) that makes it more valuable.

Example of a picture of an enormous catfish; the image has been resized and rescaled and had the quality changed, and appears in lots of different places online in different formats, with lots of different explanations: it's a record-size fish, it's this, it's that, it was caught by X, it was caught by Y, it was caught here, it was caught there.

Where are the tools that will let us harvest the metadata that copies have grown? Where's the search tool for gathering copies, not deduping them?

Fourth: Given 1-3 there will be some interesting opportunities to take a fresh look at searching and browsing.

Techniques for re-encounter: stable personal geography; value-based organization; better presentation of surrogates.

Possible interfaces to do all this: faceted browsing, eg LifeBits approach. Annotated timeline (also LifeBits). Hard to do.

Bottom-up efforts: lots of digitization happening, policies, tools, practices. Personal archiving as cottage industry. SALT project at Stanford.

New opportunistic uses of massed data. She did a study of photos in Flickr of the same thing, a mosaic in Milan. Superstition: you stand on it, on the bull, spin on it, and then take a picture and post it online. In the pictures you can actually see the mosaic being eroded over time!

Links:

http://www.csdl.tamu.edu/~marshall/, http://research.microsoft.com/~cathymar/

Blog: http://ccmarshall.blogspot.com/

Twitter: http://twitter.com/ccmarshall

Jeremy Frumkin and Terry Reese, Cloud4Lib

Idea: Cloud4Lib = an open library platform.

Lots of different projects out there doing different or the same things. Need to glue them all together, put all the Code4Lib work and energy into one thing. "Enable libraries to truly and collaboratively build and use common infrastructure." Development efforts should enhance an entire platform, not just one piece of it all.

http://cloud4lib.org

They set up a wiki. Interesting: some Amazon EC2 servers where people can experiment. Sponsored by someone or other.

Use cases, mentioning University of Hobbitville

Ross Singer, The Linked Library Data Cloud: Stop Talking and Start Doing

Tim Berners-Lee's Four Rules of Linked Data

Library linked data cloud was amazingly empty but has been growing slowly. id.loc.gov, LIBRIS and DBPedia.

Ross matched up lcsubjects.org and http://marccodes.heroku.com/gacs/.

Chronicling America.

VIAF connects to DBPedia. Can search with SRU.

Ross made LinkedLCCN

He also hacked on VuFind to add RDFa. http://dilettantes.code4lib.org/vufind/ Explore button.

mirlyn.lib.umich.edu Bill Dueber added links to his VuFind.

TODO

  • Agree on data models. FRBR or something like it. Aboutness vs isness.
  • More linked data available from very common identifiers
  • More linkages to resources outside the library domain. Who will do that? How? Tools.
  • Sustainability and preservation

Good talk. People inspired.

Harrison Dekker, Do-It-Yourself Cloud Computing with Apache and R

R. Rapache.

Good blog about R that I follow: Revolutions

Rosalyn Metz and Michael B. Klein, Public Datasets in the Cloud

Infrastructure as a Service: Amazon EC2

Platform as a Service: Google Apps, Heroku

Software as a Service: Zoho, Google Docs

Not talking about data you can download eg in a CSV.

Did a video of setting up an EC2 instance (took seconds), and attached (mounting) a volume to it (the volume being a big set of census data in this case). Very cool.

Socrata. (Expensive.)

Did an example of loading census data into Google Fusion Tables. Really wild stuff. 200 GB dataset copied into place and ready to be analyzed in three minutes. Looked like great tools for analyzing the data, visualizing, cross-tabulating, etc.

Michael Klein talked about issues and problems with all of this. Authority, provenance, preservation, access, etc.

Pushing Library Data. Secondary uses of it: research, testing, unified indexing. Not just bib data: anonymized borrowing data, etc.

Karen Coombs, 7++ Ways to Improve Library UIs with OCLC Web Services

  1. Crosslisting print and electronic materials. Can't see all the formats all in one thing. Use WorldCat Search API to see if you have the print version of the book and add record to ebook record.

  2. Linking to libraries nearby in case a book is out.

  3. Providing journal table of contents. Use xISSN to see if feed of recent TOCs is available.

  4. Peer review indicators.

  5. Provide information about authors. Use Identities and Wikipedia API to insert author information into dialogue box in UI.

  6. Link to free online versions of books. Get OCLCnum, then use ?'s APIs to see if they have it and then link to it.

  7. Adding similar items on same screen. Use DDC classification. Makes a lot more sense than just a Title keyword match like VuFind does.

  8. Bonus: Creating a mobile catalogue.

http://librarywebchic.net/mashups/

Blog post: New York Times Mashups

Jennifer Bowen, Taking Control of Library Metadata & Websites Using the eXtensible Catalog

Extensible Catalog

Four components that can be used individually or together.

  1. User interface built on Drupal. Faceting. FRBRized. Customizable search interface.

Metadata has been restructured in a new way, FRBRy, but she didn't have time to get into that.

In search results there's a place to show the matching text (from record or whatever) so that users know why they're getting the result they did. They found in research that users want that.

Web forms to let you create custom search boxes, for just journals, databases, etc. (Widget-maker, I guess.) Showed how they automatically generate a DVD browser, with no special programming.

They've made about 20 Drupal modules.

  2. Metadata tools. Automated processing of large batches of metadata.

Metadata Services Toolkit. Harvest, process, dedupe, clean up, aggregate, synchronize metadata.

Nice.

Version tracking of metadata through changes, so you can track the history.

  3. Connectivity tools to match up XC and ILS.

NCIP and OAI. OAI Toolkit works with virtually any ILS that exports MARC. NCIP Toolkit lets XC talk to ILS auth, circ and patron services. Real-time.

http://www.screencast.com/users/eXtensibleCatalog

Anjanette Young and Jeff Sherwood, Matching Dirty Data---Yet Another Wheel

Goal: Ingest metadata and PDFs for ETDs from UMI into DSpace.

Matching.

Exact title. Find intersection of sets. Verify intersection with exact author. Shorten author names to remove punctuation etc.
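
My reading of those steps, as a Python sketch. The record layout (dicts with "id", "title" and "author") is hypothetical, and normalizing the titles as well as the authors is my assumption rather than something they said.

```python
import string


def normalize(text):
    """Lower-case, strip punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def match_records(umi_records, dspace_records):
    """Pair records whose normalized titles and authors both agree."""
    by_title = {}
    for rec in dspace_records:
        by_title.setdefault(normalize(rec["title"]), []).append(rec)

    pairs = []
    for rec in umi_records:
        # Titles present in both sets form the intersection.
        for candidate in by_title.get(normalize(rec["title"]), []):
            # Verify each title match against the author.
            if normalize(rec["author"]) == normalize(candidate["author"]):
                pairs.append((rec["id"], candidate["id"]))
    return pairs


umi = [{"id": "u1", "title": "Models of Data.", "author": "Smith, J. A."}]
dspace = [
    {"id": "d1", "title": "models of data", "author": "Smith, J A"},
    {"id": "d2", "title": "Models of Data", "author": "Jones, B."},
]
print(match_records(umi, dspace))
```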

Examples of titles and names that are tricky to match.

Levenshtein Edit Distance.

Reminded me of this poem by bp Nichol that's carved in the pavement of bp Nichol Lane:

A lake
A lane
A line
Alone

Similarity = 1 - dL / max (|s1|, |s2|)
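
To make the formula concrete, here is a plain Python sketch of the textbook dynamic-programming Levenshtein distance and the similarity score built on it, tried on the lines of the poem above. This is not the editdist package they used, just an illustration.

```python
def levenshtein(s1, s2):
    """Edit distance: the minimum number of single-character inserts,
    deletes and substitutions needed to turn s1 into s2."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(s1, s2):
    """1 - dL / max(|s1|, |s2|); two empty strings count as identical."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - levenshtein(s1, s2) / longest


poem = ["A lake", "A lane", "A line", "Alone"]
for a, b in zip(poem, poem[1:]):
    print(a, "->", b, levenshtein(a, b), round(similarity(a, b), 2))
```

Each of the first two steps in the poem is a single substitution; the last step takes two edits (drop the space, change one letter).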

Fuzzy query in Solr/Lucene uses Levenshtein distance.

Reduce search space. Identify stop words. Throw out common words (eg "models" and "data" in their diss titles).
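
One way this could work, sketched in Python: treat any word that shows up in a large share of the titles as a stop word, then compare titles only on their distinctive words. The 50% cutoff and the function names are my assumptions, not their method.

```python
from collections import Counter


def find_stop_words(titles, max_share=0.5):
    """Words that appear in more than max_share of the titles."""
    counts = Counter()
    for title in titles:
        counts.update(set(title.lower().split()))
    cutoff = max_share * len(titles)
    return {word for word, n in counts.items() if n > cutoff}


def signature(title, stop_words):
    """The distinctive words of a title, used to pick candidate matches."""
    return frozenset(title.lower().split()) - stop_words


titles = [
    "models of climate data",
    "models of fish data",
    "data models for trees",
    "a history of maps",
]
stop_words = find_stop_words(titles)
print(sorted(stop_words), sorted(signature(titles[1], stop_words)))
```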

Got a bit lost here.

Jaro-Winkler Algorithm. (Solr spellchecker uses it.) Works best for short strings. Developed by US Census.
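
For comparison with Levenshtein, here is a Python sketch of the standard Jaro and Jaro-Winkler definitions as I understand them: count characters that match within a sliding window, penalize transpositions, then give a bonus for a shared prefix of up to four characters. It is an illustration, not the code from Solr or from their project.

```python
def jaro(s1, s2):
    """Jaro similarity between 0.0 (nothing alike) and 1.0 (identical)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters only count as matching if they are this close together.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)
    matched1 = []
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    if not matched1:
        return 0.0
    matched2 = [s2[j] for j in range(len(s2)) if used[j]]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(a != b for a, b in zip(matched1, matched2)) / 2
    m = len(matched1)
    return (m / len(s1) + m / len(s2) + (m - transpositions) / m) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro score plus a bonus for a common prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


print(round(jaro("MARTHA", "MARHTA"), 4), round(jaro_winkler("MARTHA", "MARHTA"), 4))
```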

Code: pypi editdist

http://bit.ly/ZGSmF String Comparison Tutorial

What they were looking for but was released after they'd done all their own work: MarcXimiL: The Bibliographic Similarity Analysis Framework

Slides: http://www.slideshare.net/ghostmob/matching-dirty-data

Ryan Scherle and Jose Aguera, HIVE: A New Tool for Working With Vocabularies

HIVE = Helping Interdisciplinary Vocabulary Engineering. Jose Aguera wasn't here.

http://ils.unc.edu/mrc/hive/

Site: http://hive.nescent.org:9090/

Code: http://hive-mrc.googlecode.com/

http://datadryad.org/

David Kennedy and David Chandek-Stark, Metadata Editing---A Truly Extensible Solution

Duke University Libraries Digital Collections.

Python, Django, Yahoo! Grids CSS, jQuery.

http://library.duke.edu/trac/dc/wiki/Trident

http://tridentproject.org/, http://blog.tridentproject.org/

Lightning Talks

UW Forward - Steve Meyer

UW Forward uses Blacklight.

Search for 'psychology' and they recommend the psychology subject librarian. They use WorldCat API to get links to Wikipedia entries for authors. Challenges: 14 Voyager ILS instances in the U Wisconsin system! Serials licensed differently at different campuses. They're having the various problems such projects have, and he'd like to talk to people with similar ones.

MODS4Ruby & Opinionated XML - Matt Zumwalt

Prezi presentation was a bit zoomy, but lively.

http://yourmediashelf.com/blog/

The Digital Archaeological Record - Matt Cordial

The Digital Archaeological Record

Beta application

Archaeologists can submit data encoded in whatever way they want, and then connect it to other data.

Example: fauna, searching in a square in SW United States. Integrate data tables.

Hydra: Blacklight + ActiveFedora + Rails - Willy Mene

Stanford + U Virginia + Hull.

Hydrangea Project next: Blacklight, ActiveFedora, Shelver, in Rails.

Why CouchDB? - Benjamin Young

Data gets lonely. Often depends on APIless app. Web apps a bit better. Open source apps better but data can be in RDBMSes.

CouchDB

Listed advantages to using it. Portable standalone apps. Imagine as CouchDB apps: openlibrary.org. 3.5 gig dump now. If it supported replication you could get updates and parts of it.

Subject guides: ad-hoc.

couch.io does hosting and you can get one free database.

Data integrity (cheap, fast, and easy) - Gwen Exner

HathiTrust Large Scale Search update - Tom Burton-West

http://www.hathitrust.org/blogs/large-scale-search/

5.4 million full-text books. Full-text search of each, average response time of 3 seconds. They're using Solr. Big-scale stuff.

EAD and MARC Sitting in a Tree: D-R-U-P-A-L - Mark Matienzo

http://www.slideshare.net/anarchivist/ead-and-marc-sitting-in-a-tree-drupal

When he was at NYPL they migrated to Drupal from static content, ColdFusion, other stuff. Snazzy new site: http://nypl.org/

http://nypl.org/find-archival-materials/

http://nypl.org/shrew/b16185699/mods.xml, http://nypl.org/shrew/b16185699/marcxml.xml

EZproxy Wondertool - Paul Joseph

He's at UBC. Had a bunch of EZproxy work to do and did a thing that made all his work easier.

HathiTrust APIs - Albert Bertram

http://www.lib.umich.edu/two-over-threehundred/code4lib/

Repository of MARC Abominations - Simon Spero

Lovecraft meets MARC. Building a test set of the eldritch MARC records from non-Euclidean geometries. Send to marcthulhu@ibiblio.org.

Mystery Meat - Joe Atzberger

Stephen Abram's anti-open-source FUD. Sirsi's Workflows client ships with tar.exe and gzip.exe but does NOT come with a copy of the GPL, which its license says it must. Does the Software Freedom Law Center know about this?

Fuwatto Search - Masao Takaku

http://fuwat.to/worldcat, Twitter: http://twitter.com/tmasao

Code4Lib 2010: Monday 22 February

My brief notes on what I saw at Code4Lib 2010 in Asheville, North Carolina. I had a great time at the conference and am glad I went. My thanks to Kevin Clarke, Jodi Schneider, and all the other organizers and volunteers.

See also:

Preconference 1: Solr White Belt

Bess Sadler mostly ran through this Solr tutorial, elaborating with lots of examples from Blacklight work and personal experience. There were about 50 people in the room and everyone got Solr installed and working.

We tried:

  • requesting particular fields
  • highlighting
  • looking at how Solr parses a search
  • facets
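
For reference, each of those corresponds to a standard Solr request parameter (`fl`, `hl`, `debugQuery`, `facet`). Here is a small Python sketch that builds the URLs; the host, port and field names are assumptions based on a stock local install, not anything specific to the tutorial.

```python
from urllib.parse import urlencode

# Assumed local Solr URL; adjust for your own install.
SOLR_SELECT = "http://localhost:8983/solr/select"


def solr_url(query, fields=None, highlight=None, facets=None, debug=False):
    """Build a select URL using standard Solr request parameters."""
    params = [("q", query)]
    if fields:
        params.append(("fl", ",".join(fields)))   # return only these fields
    if highlight:
        params += [("hl", "true"), ("hl.fl", highlight)]
    if facets:
        params.append(("facet", "true"))
        params += [("facet.field", name) for name in facets]
    if debug:
        params.append(("debugQuery", "true"))     # shows the parsed query
    return SOLR_SELECT + "?" + urlencode(params)


print(solr_url("nature", fields=["id", "title"], facets=["format"], debug=True))
```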

Useful to know is Stanford's Solr config which gives their relevancy weightings:

<str name="qf">
title_245a_unstem_search^100000
title_245a_search^75000
vern_title_245a_search^75000
title_245_unstem_search^50000
...

That solves the Nature Problem. Putting in Nature used to bring up a book called Naturalism, but now the unstemmed match comes up a lot higher, so Nature will be the top result.
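
A toy Python illustration of why the higher unstemmed boost fixes this. The two boosts are copied from the qf lines above; the five-letter "stemmer" and the take-the-best-field scoring are crude stand-ins of my own, and real Solr scoring involves a lot more than this.

```python
# Boosts copied from the first two qf lines above; everything else is a toy.
BOOSTS = {"title_245a_unstem_search": 100000, "title_245a_search": 75000}


def toy_stem(word):
    """Crude stand-in for a real stemmer: keep the first five letters."""
    return word.lower()[:5]


def toy_score(query, title):
    """Score a title by the best-boosted field in which the query matches."""
    matched = [0]
    if query.lower() == title.lower():        # unstemmed field: exact word
        matched.append(BOOSTS["title_245a_unstem_search"])
    if toy_stem(query) == toy_stem(title):    # stemmed field: same stem
        matched.append(BOOSTS["title_245a_search"])
    return max(matched)


for title in ("Naturalism", "Nature"):
    print(title, toy_score("Nature", title))
```

Both titles match in the stemmed field, but only Nature also matches in the unstemmed one, so it comes out on top.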

Very good morning. Bess did a fine job and everyone learned a lot of practical, hands-on stuff.

Preconference 2: Hacker 201 with Dan Chudnov

Dan Chudnov had done Hacking 101 in the morning, which used Processing as a way of learning programming and good programming habits. The afternoon session was to use Python and pymarc to hack on MARC, but there was some trouble with everyone in the room getting all the right things installed.

I wasn't too interested in hacking on MARC in Python, so I worked on OpenFRBR and ended up making some good progress through the afternoon and into the evening. I counted that as a hacking success.

Dan talked about the thirty-minute rule: if you're stuck on a problem for 30 minutes, take a break. A good rule. Here's another I'd forgotten: start with the simplest thing and then make it more complicated. I knew I wanted to deal with data from five or six sources in three different formats (JSON, RDF, XML). Different fields from different sources would mean different things and there would be different relations between them.

I was getting dismayed at how complicated this was turning out to be. Finally I realized, "I don't need to get all that working right off the bat. I just need to get one source working. I'll ignore everything else for now and add it later."

And that worked very well. By doing just one thing everything fell into place for me and I got it working quickly and enjoyably. That was nice.

Code4Lib North

My first post to the Code4Lib web site is Code4Lib North, an announcement of a new Code4Lib chapter for Ontario, Quebec, and nearby regions of the United States. Wendy Huot of Queen's University and I wanted to get this started, so we did.

My Code4Lib 2010 t-shirt

Here's my entry in the Code4Lib 2010 t-shirt contest:

Code4Lib 2010 t-shirt contest entry

branchhours

I set up a Github repository for branchhours, a short Ruby script I wrote to help show library branch opening hours on the home page. I set up a bunch of Google calendars, one for each branch, and use them to hold the hours the branch is open or an indicator it's closed. It's not in production yet, but it seems to work OK as it is, so I thought I'd mention it. I'll post more about it when it goes into use.

If the idea of using Google Calendar to show library branch opening hours on a home page seems familiar, perhaps that's because you read Using Google Calendar to Manage Library Website Hours by Andrew Darby (Code4Lib Journal #2). It's a simpler variation on what he did, in Ruby instead of PHP and without using a database or showing reference librarian information.

isbn2marc is back

My Ruby script isbn2marc is back. I thought I needed to use the advice in Configuring .htaccess to ignore specific subfolders but it turned out the Options Indexes line in /src/.htaccess was choking things and causing permissions errors. Beats me why, but at least it's fixed. Sorry about the disruption.

Code4Lib 2009 notes

I went to Code4Lib2009 in late February and had a fantastic time, not just because Jodi Schneider and I gave a talk called What We Talk About When We Talk About FRBR. The whole conference was a blast and it was great to hang out with people I knew and meet new people.

Ed Summers posted his conference notes, Jonathan Rochkind doesn't like the Code4Lib award idea Eric Morgan put forward, Karen Coombs posted notes, Jay Luker did a cool IRC channel timeline, Terry Reese did ten things to take away from it, Jon Phipps wasn't enchanted with it all.

Here are my notes, pretty much raw. I didn't take notes on the linked data pre-conference, the morning of which was excellent. The afternoon got a little too loose and I got distracted with some annoying Ruby on Rails problem.

Monday 24 February 2009

Roy Tennant: Introduction

Mark Matienzo: How to Meet People and Have Fun at Code4LibCon

Stefano Mazzocchi: A Bookless Future for the Libraries?

TODO: Listen to Jon Udell interview

Interesting talk. Began by talking about how marginal costs of communication have been dropping over history, from cave drawings to clay tablets to the printing press to electronic publishing.

Pros to this, but cons now, too, like a "degraded consumption experience:" low screen resolution, batteries required, poor network access.

Business models are being disrupted. Institutions like libraries are being disrupted. Almost-zero marginal costs are here to stay.

Libraries vs museums of books. Non-unique books can all go online in electronic versions. (Unique books still require special attention.)

No more shelves. Nearly infinite storage space. (He's getting into a lot of "the book is dead" stuff---tide turning against him in IRC.)

Do we still need metadata? (Shouts of "yes!"). Does metadata have to be made by people or can it be done by computers with statistical analysis?

Information is fragmented. Can the library mindset still work across a spectrum of such fragmented information, hyperlinks, journals, networks? Networks of relational assertions = the web of data.

He asked a few questions ("who's the youngest 2008 Academy Award winner") that were basically trivia questions. Where would we find the answers? His answer: Freebase. He found the answer to that question easily enough, but his talk ran into trouble on the next one when he tried to show how Freebase made the answer easy to find and the whole thing failed due to network slowness. He lost some of the audience here.

Showed FMDB, a granular linked-data version of IMDB data. Showed a translator. Showed http://typewriter.freebaseapps.com/, where they crowdsource specifying unspecified data. Showed the Genderizer (!?).

[A few days later Mazzocchi posted Post-Mortem of a Dissonant Keynote: he went into the #code4lib channel logs to see what people were saying while he was talking. That took guts. Great post.]

Anders Soderback: Why Libraries Should Embrace Linked Data

[I talked to Anders for a while after this. Very nice fellow, and Libris is a great piece of work.]

Libris, the Swedish national union catalogue.

Less technical version of the talk he gave the day before. Martin Malmsten was supposed to be here but couldn't make it. He talked about the web, that it's social, that it's a network, etc. He spent a while getting to the actual catalogue. When he did, he showed the basics of it: the RDF version, the frbr-related, different representations, etc.

Their blog: http://blog.libris.kb.se/semweb/

Ross Singer, Like a Can Opener Through Your Data Silo: Simple Access Through AtomPub and Jangle

http://jangle.org/

Problems with library systems APIs: not very good. OAI-PMH etc. Atom + REST = AtomPub, an Atom publishing model. Ed Summers pointed out a couple of years ago how well this would work. Google, Microsoft, IBM, WordPress, Drupal, Movable Type, etc., all use it.

Jangle applies a common data model to library information through AtomPub. Four kinds of entities: resources, items, actors (users etc.), collections.

Jangle vocabulary: http://jangle.org/vocab/

Ross is building connectors to various systems so that it doesn't matter what kind of ILS you're using (ideally) you can always talk to it with Jangle.

Godmar Back complimented the idea and the importance of interoperability.

Glen Newton, LuSQL: (Quickly and Easily) Getting Your Data from your DBMS into Lucene

Glen's from CISTI in the Digital Library Research Group. They have 8.5 million full-text articles and want to index them all to make them easy to search. They use Lucene but not Solr; instead they use LuSQL. He showed how to do some queries with LuSQL, talked about how the indexing was done by Lucene, showed some output.

Terence Ingram, RESTafarian-ism at the National Library of Australia

My attention was caught when he said they used VuFind and asked how many people here used it. Laughs when Andrew Nagy stuck up his hand. Ingram said they just needed something so had a quick look around and chose VuFind. Very casual. All those people who look to the NLA as a guide in their own choice of VuFind will be surprised!

Birkin James Diana, The Dashboard Initiative

http://library.brown.edu/dashboard/info/

Showed some widgets that make a dashboard for showing all of the activity going on at a library, with widgets for circ, ILL activity, other kinds of things. Has some nice visualizations in there using some kind of Google visualizer. People in channel also mentioned Flot. The code is available for download. Open source!

Lunch

Ed Summers and Michael Giarlo, Open Up Your Repository with a SWORD

http://swordapp.org/

"SWORD is a lightweight protocol for depositing content from one location to another. It stands for Simple Web-service Offering Repository Deposit and is a profile of the Atom Publishing Protocol (known as APP or ATOMPUB)." Lets you deposit into repositories without worrying exactly what kind of repository it is: Fedora, DSpace, whatever. Ed explained it and Mike showed some examples. Looks easy to use and very useful if you need to do what it does.
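Since SWORD is a profile of AtomPub, a deposit boils down to an HTTP POST of a package to a collection URL. Here is a minimal sketch of what such a request might look like. It only builds the request and does not send it. The collection URL, file name and package bytes are made up for illustration, and the exact headers a given repository expects are an assumption on my part, not something shown in the talk.

```python
from urllib.request import Request


def build_deposit_request(collection_url, package_bytes, filename):
    """Build (but do not send) an AtomPub-style POST depositing a package.

    The headers are an assumption: Content-Type describes the package,
    Content-Disposition and Slug both suggest a name for the new resource.
    """
    return Request(
        collection_url,
        data=package_bytes,
        method="POST",
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": f"filename={filename}",
            "Slug": filename,
        },
    )


# Hypothetical collection URL and placeholder package contents.
req = build_deposit_request(
    "http://repository.example.org/sword/collection",
    b"placeholder zip bytes",
    "thesis.zip",
)
print(req.get_method(), req.full_url)
```

The same request should work whether the repository behind the URL is Fedora, DSpace or something else, which is the point of the protocol.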

Mark Matienzo, How I Failed to Present on Using DVCS for Archival Metadata

He proposed to talk about how DVCS would work for archival metadata but found it was too complicated. Slide: "I failed, epically." He looked at bzr and git but then went with Mercurial. Problem: It works on line diffs (for code) but EAD is in XML. How do you diff the XML? There are various tools, but in the end the whole thing turned into a complicated mess.
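To illustrate the line-diff problem, here is a small sketch (not from the talk) comparing two serializations of the same EAD-like fragment. A line diff reports changes, while comparing canonicalized XML shows they are the same document. The element and attribute values are invented for the example, and it assumes Python 3.8 or later for xml.etree.ElementTree.canonicalize.

```python
import difflib
from xml.etree.ElementTree import canonicalize

# Two serializations of the same made-up EAD-like component:
# attribute order and whitespace differ, the content does not.
original = '<c id="c1" level="file"><unittitle>Letters</unittitle></c>\n'
reformatted = (
    '<c level="file" id="c1">\n'
    '  <unittitle>Letters</unittitle>\n'
    '</c>\n'
)

# A line-based diff, like the one a DVCS shows, reports a change.
line_diff = list(
    difflib.unified_diff(
        original.splitlines(), reformatted.splitlines(), lineterm=""
    )
)

# Canonical XML sorts attributes; strip_text drops the indentation
# whitespace, so the two versions compare equal.
same_xml = canonicalize(original, strip_text=True) == canonicalize(
    reformatted, strip_text=True
)

print(len(line_diff) > 0, same_xml)
```

Canonicalizing before committing is one possible workaround, but it does nothing for real edits that touch many lines, which is probably part of why the whole thing got messy.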

Godmar Back, LibX 2.0

LibX

Summarized LibX 1.0. A toolbar is great, but what about emerging technology trends (mashups, SOA) and educational trends (online tutorials, social tagging, visualizations)?

In 2.0, librarians can create Libapps (?) and users can add them into pages where they want. LibX runs/houses them. Libapps are made from reusable modules.

How to make a Libapp? A module is JS plus a metadata description, a Libapp is a group of modules, and a Package is a folder of Libapps.

Modules: Named at a URL, published with AtomPub. They use tuples in JSON. { isbn: "074322670" }

Example: user goes to ACM site. There's a LibX button. Click on it and a YouTube video shows up, with person giving some help! E.g. Annette Bailey.

He showed the code to implement that, just a bit of JS. Simple. Some other bits of code needed, but all pretty easy.

LibX 2.0: been rewritten in full browser-independent OO. about:libx gives built-in documentation. They have unit tests and it's all hot updatable (?).

Roles in the LibX 2.0 world: developers, adapters, user community.

LibX community repository will be built, to hold modules, packages, etc. Coming over the next two years.

Gradual transition from LibX 1.5, not a huge major release.

Kevin Clarke and John Fereira, Djatoka for Djummies

Kevin gave an overview of Djatoka, then John did some demos.

Could we use this for the big maps that are in Maps? Instead of Zoomify? Lots of overhead, though---runs on Java, requires Tomcat, etc. But worth a look.

Breakout sessions

LibX 2.0, Zope3/Grok/Plone, Fedora, Solr, Jangle. I went upstairs and napped.

Lightning Talks

David Lindahl, XC

Quick overview of XC, but he lost me by showing a couple of short cartoons while he was talking, which distracted me.

Casey Bisson, Scriblio

The internal data model is improved and now you can do original cataloguing in it. It's a digital library out of the box now, too, and people are using it for that. Some kind of tie-in to LibraryThing's Common Knowledge, too.

Mark Matienzo, enjoysthin.gs

Showed this new social bookmarking site.

Emily Lynema, E-Matrix

ERM she's working on at NCSU. FRBRy data model. E-resources can be "narrowly related" or "core" to a subject/discipline. Interesting! Gradations of relevancy to a program or subject. List them on subject guides.

Eric Lease Morgan, Alex4

http://infomotions.com/sandbox/alex4/

Goal: facilitate a person's ongoing liberal arts education.

Geoffrey Bilder, "Cool URIs Must Die."

He's from CrossRef. They're fighting linkrot. "Persistence isn't a technical issue, it's a social issue."

John Law, Summon

http://www.serialssolutions.com/summon/. Searches everything a library has: physical, virtual, etc. One search box.

Erik Hatcher, LucidFind

http://www.lucidimagination.com/search/

They're searching everything on lucene.apache.org: mailing lists, bugs, etc.

Mike Taylor (Index Data), Making Distributed Configuration Simple with the Torus

Metasearch engine: http://indexdata.com/pazpar2/

They've done a way of simply specifying exactly where you want to search and what fields you want to show in the results. Translucent Record Store = Torus

Jakub Skoczen, also from Index Data, on how Torus is implemented

Michael Klein and Jonathan Brinley on zoia's FOAF support

Fun!

Andy Ashton, Biblio

He's at the Scholarly Technology Group at Brown. One of many projects called Biblio. Bibliography project.

Naomi Dushay, VuFind at Stanford

http://searchworks.stanford.edu/

Related results. Showed FROGS OF AUSTRALIA page. OK. LITTLE BEAR, good. ASSAULT WITH A DEADLY DONUT: bad!

searchworks-test.stanford.edu: They did a shelf-browse across all of the libraries, including storage. They show it in a right nav, which is a good idea. Should use this.

Mike Beccaria, Zoom Zoom Zoom

Paul Smith's College: paulsmiths.edu They've scanned in old yearbooks and put them in ContentDM. Didn't like it.

Showed Microsoft's Deep Zoom. Looked cool. Worth checking---for Maps? He put a whole yearbook into one image and then put it into the system, so you can zoom from a full overview right into something very small.

Photosynth: took pictures of the stacks and showed how you can move around them and zoom in on spines. Could link to the catalogue?

Dan Chudnov, BagIt

http://digitalpreservation.gov/ uses BagIt.

Wednesday 25 February 2009

Sebastian Hammer, Index Data

Talked about Index Data and their history. Got into books and libraries in general. Whither libraries, with Google? They're digitizing things better than we are. Libraries could be swallowed by whipper-snapper technology.

Why (Local) Libraries: bearers and preservers of cultural heritage; conveyers of authoritative information; supporters of learning and research; pillars of democracy

"Even if we end up dying, we can't go quietly."

Libraries need to stay local but also come together: consortia, a group of libraries working together on web sites, turn into a "super-robot."

Quoted Lorcan Dempsey on "stitching costs." Most so-called APIs are loyalty schemes rather than interoperability devices. Standardization is hard/boring but essential for collaboration. Need more collaboration and working together, while keeping our business models. Systems and organizations need to surrender our data freely. Library hackers must become advocates within their orgs.

"MARC is a joke. Z39.50 is for old people. SRU is dumb. NCIP ... forget about it."

Tim McGreary, OLE Project: A New Frontier

OLE is "working to redefine the business processes for libraries."

They want to be "flexible, adaptable, community-developed," they want "improvement beyond the ILS." Format and resource agnostic. No need for a separate ERM. This is not just an ILS replacement: special collections, video, DRM, everything. They want a service-oriented architecture.

Community-based software development and governance. Raise the Library system to the enterprise level. Complement human interaction.

He reviewed the project timeline. Ah, project timelines.

They're using the National Library of Australia Services Framework. (What is this?)

In March: OLE core services will be defined. "Final design document" to be published in July, and then another group will do stuff.

Bess Sadler, Blacklight

http://blacklightopac.org/, http://blacklightrubyforge.org/

Give up on the idea that we can do a single interface that will work well for everyone. Different kinds of students have different needs.

Top question at music ref desk: I play violin, my boyfriend plays piano. What can we play together? Their old system wasn't indexing this information. Not everyone would want to search for that, but music students would. Should let them do it. Another q: can I find things by era?

They have a music portal: new books on home page are all music books.

Special behaviour for special objects: any musical recording, they go to Musicbrainz, get the unique ID, then go to other services to get metadata that key on that ID.

"College students are broke and they like music. If we are spending lots of money on music and not letting them know about it, that's a fail."

Blacklight runs on Rails. Uses Rails Engines so it's easier to keep up with the code base but make local changes.

TODO: marc4j is now much better ... look it up.

TODO: Install Blacklight and try it out. Look at Rails code.

Joshua Ferraro, biblios.net

He's from LibLime. Explained about their business. Open Library, Open Data Commons License.

Look at: extJS

"biblios.net is LibLime's free browser-based cataloguing service." 35 million freely-licensed records.

TODO: Make OpenFRBR query biblios.net (requires biblios.net account)

Look at: bws.biblios.net: APIs that let you interact with the database

He downloaded a MARCXML record from Blacklight and loaded it up into Biblios, using curl at the command line.

They're working with ONIX records, too, but keeping them separate from MARC because they're made with different rules.

Authority files are available. Useful?

This is a great project. York should be uploading everything there.

Chris Catalfa, Adding Functionality to Biblios with CouchDB

UI: extjs.com + jquery.org

1. Implementing an editor for Dublin Core. He showed a bit of XML that constructs an editor. (Faint code, couldn't read.)

2. Then they use CouchDB on the backend.

He was showing a lot of code snippets and a lot of unreadable XML.

Toke Eskildsen, Complete Faceting

Summa

Lunch

Erik Hatcher, The Rising Sun: Making the Most of Solr Power

10.
9. Performance
8. Memory
7. Query parser: default is to show the raw query parser
6. Data import: bring data into Solr with Data Import Handler, Solr Cell, CSV, LuSQL, APIs.
5. Request handlers: leverage Solr's configurability
4. Solr and IR toolkit
3. LocalSolr
2. Faceting: can do multi-select faceting
1. The interface is the app. (Solritas, Velocity)
0. Community
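On the multi-select faceting point in the list above: as far as I know this is done in Solr by tagging a filter query and excluding that tag when computing the facet, so the counts for unselected values in the same facet don't drop to zero. A rough sketch of building such a query string follows. The field name "format", the search term and the helper function are all made up, and nothing here is from Erik's slides.

```python
from urllib.parse import urlencode


def multiselect_facet_params(query, facet_field, selected_values):
    """Build Solr parameters for a multi-select facet on one field.

    The filter is tagged with {!tag=...} and the facet excludes that tag
    with {!ex=...}, so facet counts ignore the user's own selection.
    """
    tag = f"{facet_field}_tag"  # hypothetical tag name
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.field", f"{{!ex={tag}}}{facet_field}"),
    ]
    if selected_values:
        ored = " OR ".join(f'"{value}"' for value in selected_values)
        params.append(("fq", f"{{!tag={tag}}}{facet_field}:({ored})"))
    return params


# Hypothetical search: "frogs", with Book and Map both ticked.
params = multiselect_facet_params("frogs", "format", ["Book", "Map"])
query_string = urlencode(params)
print(query_string)
```

The resulting string would be appended to a select URL on whatever Solr server you run.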

Lucid Imagination: they do support and training for Solr.

TODO: Look for screencast Erik Hatcher did showing off Solr and how to get it working.

Chris Shoemaker, FreeCite: An Open Source Free-text Citation Parser

http://freecite.library.brown.edu/

API URL: POST to http://freecite.library.brown.edu/citations/create

JSON responses implemented today. Give FreeCite an unformatted citation and it will give you back a formatted citation. Nice piece of work ... except it gives back almost-OpenURL?
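A sketch of what calling that API URL might look like. It only builds the POST request and does not send it. The form parameter name "citation" and the Accept header used to ask for JSON are assumptions on my part, not something I wrote down during the talk, and the citation string is just sample input.

```python
from urllib.parse import urlencode
from urllib.request import Request

FREECITE_URL = "http://freecite.library.brown.edu/citations/create"


def build_freecite_request(citation):
    """Build (but do not send) a POST asking FreeCite to parse one citation.

    Assumptions: the form field is named "citation" and JSON is requested
    through the Accept header.
    """
    body = urlencode({"citation": citation}).encode("utf-8")
    return Request(
        FREECITE_URL,
        data=body,
        method="POST",
        headers={"Accept": "application/json"},
    )


# Sample unformatted citation, used only as input text.
req = build_freecite_request(
    "Mann, Thomas. The Oxford Guide to Library Research. Oxford University Press."
)
print(req.get_method(), req.full_url)
```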

Richard Wallis, Squeezing More from the OPAC

People may like facets and relevance, but they also want links to Amazon and Google Books and other places outside the catalogue.

JUICE: Showed injecting Javascript into pages on the fly, in the browser, so that when looking at something in a catalogue, it added links to WorldCat/Google Books/LibraryThing/Open Library/etc., either as text or as images.

liblists.sussex.ac.uk/lists/v3029.html

Inheriting CSS of the page means things look the same.

Just a matter of pasting in a bit of Javascript, built on a template, looks pretty easy to build.

Sean Hannan, Freebasing for Fun and Profit

Intro to Freebase. They slurp up info from Wikipedia, Musicbrainz, etc., plus people can also edit it.

REST API: http://api.freebase.com/

Example: embeddable snippet of code that links Academy Awards data to entries in your own catalogue. Hmm! I could use this.

Example code: http://code4lib.mrdys.user.dev.freebaseapps.com/

Lightning Talks

Lots of good talks here. I was a bit distracted through some of them because I was getting people to sign the release forms and was keeping track of who had signed what.

Chris Fitzpatrick: thinkbase? Check it out.

Thursday 26 February 2009

Ian Davis, If You Love Something ... Set It Free

The Semantic Web has fundamentally changed how people, computers and information interact, not because of the semantics, but because of the web and how easy it now is to connect information.

Conjecture 1: Data outlasts code. Therefore open data is more important than open source.

Conjecture 2: There is more structured data in the world than unstructured.

Conjecture 3: Most of the value in our data will be unexpected and unintended. Therefore we should engineer for serendipity.

"The goal is not to build a web of data. The goal is to enrich lives through access to information."

"Technology grows exponentially, but society adapts linearly."

Warning: We should exchange and share information, but we must be careful that there are people involved, and some data should be kept private and protected. But most of it doesn't need to be.

Ed Corrado, The Open Platform Strategy: What It Means for Library Developers

Marshall Breeding: TCO (total cost of ownership) of open source is roughly equivalent to commercial. But a big ARL found that open source software would cost 30% (?) more than commercial over a few years (cite: Grant?).

Talked about Ex Libris's Open Platform program.

Adam Soroka, A Modern Open Webservice-Based GIS Infrastructure

Chris Beer and Courtney Michael, Visualizing Media Archives

Cool demo built on the conference FOAF data: http://ratherinsane.com/~chris/c4l09/index2.php

Lightning Talks

Christopher Morgan, http://bookgenius.org/beta

Shows visual structure of subject headings. Try "evolution." He likes Thomas Mann's book on library research and emphasis on use of LCSH.

Ross Shanley-Roberts, Extracting Data from III with Expect

Rosalyn Metz

Showed how easy it is to set up a server in Amazon's EC2. Really only takes a few minutes! Not cheap, though. Great talk!

Heikki Levanto

Another Index Data guy talking about their stuff. This guy talked about regexes and parsing web pages.

Mikkel Erlandsen, Summa

Jonathan Rochkind, Umlaut

Richard Wallis, Open Catalogue Crawling Protocol

Xiaoming Liu, checking links and URIs

TODO: look at Firefox plugins: LinkChecker, LinkEvaluator.

TODO: look at command-line: linkchecker
