Digital scholarship blog

Enabling innovative research with British Library digital collections


Tracking exciting developments at the intersection of libraries, scholarship and technology.

31 October 2014

2014 Off the Map Competition Winners Announced at GameCity9 Festival


Last night was the award ceremony at Nottingham Contemporary art gallery for the Off the Map 2014 competition, a partnership project with GameCity and Crytek. Now in its second year, Off the Map challenges UK Higher Education students to make videogames based on British Library collection items using Crytek's CRYENGINE software. Furthermore, for 2014, the competition had a gothic theme to accompany the British Library's current exhibition Terror and Wonder: The Gothic Imagination, which is open until Tuesday 20 January 2015 and is well worth a visit.

I've created a video, which you can see below, showing flythrough footage of last year's winning entry from Pudding Lane Productions, De Montfort University, Leicester. It also gives details of the 2014 gothic sub-themes and shows flythrough clips from this year's shortlisted entries.



The jury were impressed by the quality and creativity of the submitted entries, so there was passionate debate over deciding the 2014 shortlist! Third place went to Team Shady Agents from the University of South Wales in Newport with their Edgar Allan Poe-inspired game Crimson Moon. Second place went to Team Flying Buttress from De Montfort University, who created a visually rich interpretation of Dracula's Whitby.

I was delighted that British Library Chief Executive Roly Keating announced the winning entry: Nix, created by Jackson Rolls-Gray, Sebastian Filby and Faye Allen from the University of South Wales. Using Oculus Rift, a revolutionary virtual reality headset for 3D gaming, it challenges players to reconstruct Fonthill Abbey by collecting hidden and moving glowing orbs in a spooky underwater world. You can see a flythrough of their game below:



My colleague Tim Pye, curator of Terror and Wonder and a member of this year's Off the Map jury, said: “The original architectural model of Fonthill Abbey is currently on display in Terror and Wonder.  What is so impressive about the Nix game is the way in which it takes the stunning architecture of the Abbey, combines it with elements from its troubled history and infuses it all with a very ghostly air.  The game succeeds in transforming William Beckford’s stupendously Gothic building into a magical, mysterious place reminiscent of the best Gothic novels.”

Nix also impressed fellow jury member Scott Fitzgerald, Crytek's CRYENGINE Sandbox Product Manager, who said: “With the theme of Fonthill Abbey, the winning team took the fantasy route and twisted the story into something fresh and completely different. The mechanics used to progress through the game and the switching between the two realities make a very interesting experience for the player.”

I'd like to thank this year's jury members: Tim Pye, Tom Harper, Kim Blake and Scott Fitzgerald. I also want to thank all of this year's Off the Map participating teams. Far from being a terror, it has been a delight to follow the students' work via their blogs and YouTube channels.

Plans are currently underway for the third competition: "Alice's Adventures Off the Map", which will be launched at the British Library on Monday 8 December 2014, at one of the Digital Research team's Digital Conversation events. If you would like to come along to find out more, book here.


Stella Wisdom

Curator, Digital Research


30 October 2014

British Library Digital Scholarship Training Programme: a round-up of resources you can use


The British Library Digital Scholarship Training Programme provides hands-on practical training for British Library staff, delivered as one-day on-site workshops covering topics from communicating collections and cleaning up data to command line programming and geo-referencing. Since launching in November 2012, over 250 individual members of staff have attended one or more sessions, with over 60 course days delivered.

We've blogged about the programme before (see '50th Anniversary!'), and the more we go around talking about it (most recently at Digital Humanities 2014 and Data Driven: DH in the Library) the more we hear from librarians, curators, academics, and other professionals in the cultural sector looking to build similar programmes and looking to learn from our model.

Although the British Library Digital Scholarship Training Programme is an internal programme, we've made efforts over the last year to release bits of the programme externally. In lieu of having a central home for these outputs, this post collates all those bits of the programme that have floated out onto the open web, usually under generous licences.

Crowdsourcing in Libraries, Museums and Cultural Heritage Institutions

Mia Ridge leads this course for us. Notes, links, and references relating to the course are on her blog.

Data visualisation for analysis in scholarly research

Again, Mia Ridge leads this course for us. Notes, links, and references relating to the course are on her blog.

Information Integration: Mash-ups, APIs and the Semantic Web

Owen Stephens leads this course for us. Both his slides and the hands-on exercise he developed for the course are available on his blog and licensed under a Creative Commons Attribution 4.0 International License.

Programming in Libraries

There is a great deal of cross-over between this course and two lessons I wrote for the Programming Historian with Ian Milligan: Introduction to the Bash Command Line and Counting and mining research data with Unix. Both lessons are licensed under a Creative Commons Attribution 2.0 Generic License.

Managing Personal Digital Research Information

This course is led by Sharon Howard, and the bulk of it covers Zotero. Sharon developed a wiki resource for the course attendees to work through, which was subsequently released under a Creative Commons Attribution-ShareAlike 3.0 Unported License as A Zotero Guide.


James Baker

Curator, Digital Research



This post is licensed under a Creative Commons Attribution 4.0 International License.

22 October 2014

Victorian Meme Machine - Extracting and Converting Jokes


Posted on behalf of Bob Nicholson.

The Victorian Meme Machine is a collaboration between the British Library Labs and Dr Bob Nicholson (Edge Hill University). The project will create an extensive database of Victorian jokes and then experiment with ways to recirculate them over social media. For an introduction to the project, take a look at this blog post or this video presentation.

1 - intro image

In my previous blog post I wrote about the challenge of finding jokes in nineteenth century books and newspapers. There’s still a lot of work to be done before we have a truly comprehensive strategy for identifying gags in digital archives, but our initial searches scooped up a lot of low-hanging fruit. Using a range of keywords and manual browsing methods we quickly managed to identify the locations of more than 100,000 gags. In truth, this was always going to be the easy bit. The real challenge lies in automatically extracting these jokes from their home-archives, importing them into our own database, and then converting them into a format that we can broadcast over social media.

Extracting joke columns from the 19th Century British Library Newspaper Archive – the primary source of our material – presents a range of technical and legal obstacles. On the plus side, the underlying structure of the archive is well-suited to our purposes. Newspaper pages have already been broken up into individual articles and columns, and the XML for each of these articles includes an ‘Article Title’ field. As a result, it should theoretically be possible to isolate every article with the title “Jokes of the Day” and then extract them from the rest of the database. When I pitched this project to the BL Labs, I naïvely thought that we’d be able to perform these extractions in a matter of minutes – unfortunately, it’s not that easy.
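To illustrate the idea of isolating articles by their title field, here is a minimal Python sketch. It is not the archive's actual schema or our extraction code: the element names (`issue`, `article`, `title`, `text`) and the sample content are invented for illustration, and the title matching (ignoring case and a trailing full stop) is an assumption about how titles might vary.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the real archive's element names and layout
# are not described in this post, so these are placeholders.
SAMPLE_XML = """
<issue>
  <article>
    <title>Jokes of the Day</title>
    <text>First column of gags.</text>
  </article>
  <article>
    <title>Police Intelligence</title>
    <text>Court reports.</text>
  </article>
  <article>
    <title>JOKES OF THE DAY.</title>
    <text>Second column of gags.</text>
  </article>
</issue>
"""


def normalise_title(title):
    """Lower-case a title and drop surrounding whitespace and trailing full stops."""
    return title.strip().rstrip(".").strip().lower()


def find_articles_by_title(xml_text, wanted_title):
    """Return the text of every article whose normalised title matches."""
    root = ET.fromstring(xml_text)
    wanted = normalise_title(wanted_title)
    matches = []
    for article in root.iter("article"):
        title = article.findtext("title", default="")
        if normalise_title(title) == wanted:
            matches.append(article.findtext("text", default="").strip())
    return matches


joke_columns = find_articles_by_title(SAMPLE_XML, "Jokes of the Day")
print(joke_columns)
```

Matching on a normalised title rather than the exact string is a design choice: newspaper column headings are often printed in capitals or with a closing full stop, so an exact comparison would probably miss some columns.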

Marking up a joke with tags

The archive’s public-facing platform is owned and operated by the commercial publisher Gale Cengage, who sells subscriptions to universities and libraries around the world (UK universities currently get free access via JISC). Consequently, access to the archive’s underlying content is restricted when using this interface. While it’s easy to identify thousands of joke columns using the archive’s search tools, it isn’t possible to automatically extract all of the results. The interface does not provide access to the underlying XML files, and images can only be downloaded one-by-one using a web browser’s ‘save image as’ button. In other words, we can’t use the commercial interface to instantly grab the XML and TIFF files for every article with the phrase “Jokes of the Week” in its title.

The British Library keeps its own copies of these files, but they are currently housed in a form of digital deep-storage that researchers cannot access directly and in which it is extremely cumbersome to discover content. In order to move forward with the automatic extraction of jokes we will need to secure access to this data, transfer it onto a more accessible internal server, and custom-build an index that allows us to search the full text of the articles and their titles. We could then extract all of the relevant text, along with the image files showing the areas of the newspaper scans from which the text was derived.
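The kind of index described above could, in its simplest form, look something like the following Python sketch. This is only an illustration under stated assumptions: the article records, their ids and the image file names are all invented, and a real index over the full archive would need a proper search engine rather than an in-memory dictionary.

```python
import re
from collections import defaultdict


def tokenise(text):
    """Split text into lower-case words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(articles):
    """Map each word to the set of article ids whose title or text contains it."""
    index = defaultdict(set)
    for article_id, article in articles.items():
        for word in tokenise(article["title"] + " " + article["text"]):
            index[word].add(article_id)
    return index


def search(index, query):
    """Return the sorted ids of articles containing every word in the query."""
    words = tokenise(query)
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Hypothetical records: ids, text and image file names are placeholders.
ARTICLES = {
    "a1": {"title": "Jokes of the Day", "text": "A pun about a baker.", "image": "a1.tif"},
    "a2": {"title": "Shipping News", "text": "Arrivals of the day at Liverpool.", "image": "a2.tif"},
    "a3": {"title": "Jokes of the Day", "text": "A comic story.", "image": "a3.tif"},
}

index = build_index(ARTICLES)
hits = search(index, "jokes day")
# Look up the scan that belongs to each matching article.
image_files = [ARTICLES[article_id]["image"] for article_id in hits]
print(hits, image_files)
```

Keeping the image file name alongside each record means that a text search also leads back to the relevant area of the newspaper scan.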

All of this is technically possible, and I’m hopeful that we’ll find a way to do it in the next stage of the project. However, given the limited time available to us we decided to press ahead with a small sample of manually extracted columns and focus our attention on the next stages of the project. This manually created sample will be of great use in future, as we and other research groups can use it to train computer models, which should enable us to automatically classify text from other corpora as potentially containing jokes that we would not have been able to find otherwise.

For our sample we manually downloaded all of the ‘Jokes of the Day’ columns published by Lloyd’s Weekly News in 1891. Here’s a typical example:

2 - joke column

These columns contain a mixture of joke formats – puns, conversations, comic stories, etc – and are formatted in a way that makes them broadly representative of the material found elsewhere in the database. If we can find a way to process 1,000 jokes from this source, we shouldn’t have too much difficulty scaling things up to deal with 100,000 similar gags from other newspapers.    

Our sample of joke columns was downloaded as a set of JPEG images. In order to make them keyword searchable, transform them into ‘memes’, and send them out over social media we first need to convert them into accurate, machine-readable text. We don’t have access to the existing OCR data, but even if this were available it wouldn’t be accurate enough for our purposes. Here’s an example of how one joke has been interpreted by OCR software:

3 - OCR comparison

Some gags have been rendered more successfully than this, but many are substantially worse. Joke columns often appeared at the edge of a page, which makes them susceptible to fading and page bending. They also make use of unusual punctuation, which tends to confuse the scanning software. Unlike newspaper archives, which remain functional even with relatively low-quality OCR, our project requires 100% accuracy (or something very close) in order to republish the jokes in new formats.

So, even if we had access to OCR data we’d need to correct and improve it manually. We experimented with this process using OCR data taken from the British Newspaper Archive, but the time it took to identify and correct errors turned out to be longer than transcribing the jokes from scratch. Our volunteers reported that the correction process required them to keep looking back and forth between the image and the OCR in order to correct errors one-by-one, whereas typing up a fresh transcription was apparently quick and straightforward. It seems a shame to abandon the OCR, and I’m hopeful that we’ll eventually find a way to make it usable. The imperfect data might work as a stop-gap to make jokes searchable before they are manually corrected. We may be able to improve it using new OCR software, or speed up the correction process by making use of interface improvements like TILT. However, for now, the most effective way to convert the jokes into an accurate, machine-readable format is simply to transcribe directly from the image.
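One way to make the accuracy requirement discussed above more concrete is to compare an OCR rendering with a manual transcription, character by character. The Python sketch below does this with the standard Levenshtein edit distance. It is an illustration rather than a tool we have used on the project, and the two sample strings in the usage line are invented, not taken from the archive.

```python
def edit_distance(a, b):
    """Levenshtein distance: the fewest single-character insertions,
    deletions and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_accuracy(transcription, ocr_text):
    """Share of the transcription's characters that the OCR got right,
    measured as 1 minus (edit distance / transcription length)."""
    if not transcription:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(transcription, ocr_text)
    return max(0.0, 1 - errors / len(transcription))


# Invented example strings, for illustration only.
print(character_accuracy("Why is a dog like a tree?", "Wby is a d0g like a tree ?"))
```

A score like this could also help decide, column by column, whether correcting the OCR is likely to be quicker than transcribing from scratch.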