Tuesday, December 17, 2013

British Library releases a fascinating data set

I recently saw an article about the British Library releasing over a million images from historic texts to the public domain. Their stated goal was to have the public slice and dice and remix the data to see what we can get out of it. How could I resist?

I wanted to do something with this data, so I figured I would start with what I know best: creating a RESTful interface that lets me and others easily build on top of the data using any language that provides an HTTP library (all of them).

The first step was to take the image metadata, which the British Library kindly put up on GitHub as TSV files, and import it into a database to allow easy integration. MongoDB is my go-to database and it fits this kind of data well, so I wrote a handy-dandy Python script to consume the TSV files and upload them to MongoDB. I love Python for tasks like this: the code is clean and powerful, and I was able to get it written and tested in a couple of hours.
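For anyone who wants to try the same import, here is a minimal sketch of that kind of script. It is not the script I actually ran. The column names in the sample, the `bl` database, the `images` collection and the local MongoDB URI are all placeholders I made up for illustration, and it assumes the third-party `pymongo` driver is installed.

```python
import csv
import io


def tsv_to_documents(handle):
    """Yield one dict per TSV row, keyed by the header line."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Drop empty cells so sparse metadata stays sparse in the database.
        yield {key: value for key, value in row.items() if key and value}


def batched(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def load_into_mongo(path, uri="mongodb://localhost:27017", batch_size=1000):
    """Insert every row of one TSV file; returns the number of rows inserted."""
    from pymongo import MongoClient  # third-party driver, imported lazily

    # "bl" and "images" are hypothetical database/collection names.
    collection = MongoClient(uri)["bl"]["images"]
    total = 0
    with open(path, newline="", encoding="utf-8") as handle:
        for batch in batched(tsv_to_documents(handle), batch_size):
            collection.insert_many(batch)
            total += len(batch)
    return total


if __name__ == "__main__":
    # Tiny in-memory sample with made-up columns, so no database is needed.
    sample = io.StringIO("flickr_id\ttitle\tdate\n1\tA Map\t1850\n2\tA Ship\t\n")
    for doc in tsv_to_documents(sample):
        print(doc)
```

Inserting in batches rather than row by row keeps the number of round trips to MongoDB down, which matters with over a million rows. Every value is stored as a string here; converting dates or IDs to richer types would be a reasonable next step.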

Next I tackled the actual REST app to expose this data. There are countless options for implementing REST apps these days. I am hoping this app can serve as the backend for lots of people to create lots of interesting apps of their own, so I need something performant that can handle concurrency and scale elegantly. To my mind, the desire for modern concurrency eliminates about 99% of the REST development frameworks built on top of Python, Ruby, Node, etc. Pretty much my only option was something running on the JVM. I'm not a masochist, so a pure Java implementation was not an option. Scala and Clojure are really the best fit for the kind of app I wanted to create. Since Clojure is still fairly new to me and I wanted to get this up and running as soon as possible, I went with the familiar stack of Scala and the fantastic Spray REST library, which leverages Akka for its sophisticated concurrency model. To integrate with MongoDB I decided to use ReactiveMongo, both for the obvious benefit of non-blocking IO and as an opportunity to try a new tool.

I was able to implement this project with minimal code, and I am hugely impressed with all of the tools I chose. The entire Scala ecosystem has matured so much since I began using it several years ago. You can see the code in its current form, along with API docs, here:

https://github.com/ctcarrier/bl-rest

I have this app deployed to Heroku, so feel free to integrate with it for your own apps. As of this writing I only have two simple endpoints implemented, but I will add more as necessary. I want to create a simple webapp front end to allow people to tag the photos by answering simple questions. I am thinking of storing the raw data in a graph datastore like Neo4j, which is new to me, so if you have any experience or ideas on that front please get in touch.
