THE BRITISH LIBRARY

Digital scholarship blog

Enabling innovative research with British Library digital collections


22 October 2014

Victorian Meme Machine - Extracting and Converting Jokes


Posted on behalf of Bob Nicholson.

The Victorian Meme Machine is a collaboration between the British Library Labs and Dr Bob Nicholson (Edge Hill University). The project will create an extensive database of Victorian jokes and then experiment with ways to recirculate them over social media. For an introduction to the project, take a look at this blog post or this video presentation.


In my previous blog post I wrote about the challenge of finding jokes in nineteenth-century books and newspapers. There’s still a lot of work to be done before we have a truly comprehensive strategy for identifying gags in digital archives, but our initial searches scooped up a lot of low-hanging fruit. Using a range of keywords and manual browsing methods we quickly managed to identify the locations of more than 100,000 gags. In truth, this was always going to be the easy bit. The real challenge lies in automatically extracting these jokes from their home-archives, importing them into our own database, and then converting them into a format that we can broadcast over social media.

Extracting joke columns from the 19th Century British Library Newspaper Archive – the primary source of our material – presents a range of technical and legal obstacles. On the plus side, the underlying structure of the archive is well-suited to our purposes. Newspaper pages have already been broken up into individual articles and columns, and the XML for each of these articles includes an ‘Article Title’ field. As a result, it should theoretically be possible to isolate every article with the title “Jokes of the Day” and then extract them from the rest of the database. When I pitched this project to the BL Labs, I naïvely thought that we’d be able to perform these extractions in a matter of minutes – unfortunately, it’s not that easy.
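
To give a sense of what that extraction would involve, here is a minimal sketch in Python. It assumes, purely for illustration, that each article is a standalone XML file with a <title> element; the element names, directory layout, and function name here are hypothetical, and the real archive’s schema will differ.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def find_joke_articles(archive_dir, wanted_title="Jokes of the Day"):
    """Yield the paths of article XML files whose title contains wanted_title."""
    for xml_path in Path(archive_dir).glob("**/*.xml"):
        try:
            root = ET.parse(xml_path).getroot()
        except ET.ParseError:
            continue  # skip malformed files rather than halting the sweep
        title = root.find(".//title")  # hypothetical element name
        if title is not None and title.text and wanted_title.lower() in title.text.lower():
            yield xml_path

# Hypothetical directory of per-article XML files
for path in find_joke_articles("newspaper-articles/"):
    print(path)
```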

[Image: marking up a joke with tags]

The archive’s public-facing platform is owned and operated by the commercial publisher Gale Cengage, which sells subscriptions to universities and libraries around the world (UK universities currently get free access via JISC). Consequently, access to the archive’s underlying content is restricted when using this interface. While it’s easy to identify thousands of joke columns using the archive’s search tools, it isn’t possible to extract all of the results automatically. The interface does not provide access to the underlying XML files, and images can only be downloaded one-by-one using a web browser’s ‘save image as’ button. In other words, we can’t use the commercial interface to instantly grab the XML and TIFF files for every article with the phrase “Jokes of the Week” in its title.

The British Library keeps its own copies of these files, but they are currently housed in a form of digital deep-storage that researchers cannot access directly and within which content is extremely cumbersome to discover. In order to move forward with the automatic extraction of jokes, we will need to secure access to this data, transfer it onto a more accessible internal server, and custom-build an index of the full text of articles and titles, so that we can extract all of the relevant text files along with the images showing the areas of the newspaper scans from which the text was derived.
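
As an illustration of that indexing step, here is a minimal sketch using the FTS5 full-text extension that ships with most standard builds of Python’s sqlite3 module. The (title, body, image_path) layout and both function names are assumptions made for the example, not the archive’s actual structure.

```python
import sqlite3

def build_index(db_path, articles):
    """Index an iterable of (title, body, image_path) tuples for full-text search."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS articles "
        "USING fts5(title, body, image_path UNINDEXED)"
    )
    con.executemany("INSERT INTO articles VALUES (?, ?, ?)", articles)
    con.commit()
    return con

def joke_columns(con):
    """Return the title and scan image of every article whose title matches."""
    query = "SELECT title, image_path FROM articles WHERE articles MATCH ?"
    return con.execute(query, ('title : "jokes of the day"',)).fetchall()
```

Once such an index exists, pulling out every matching column and its corresponding scan region becomes a single query rather than a manual trawl.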

All of this is technically possible, and I’m hopeful that we’ll find a way to do it in the next stage of the project. However, given the limited time available to us, we decided to press ahead with a small sample of manually extracted columns and focus our attention on the next stages of the project. This manually created sample will also be of great use in future: we and other research groups can use it to train computer models, which should enable us to automatically flag text in other corpora as potentially containing jokes that we would not otherwise have been able to find.
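
As a rough illustration of what that training might look like, here is a minimal sketch using scikit-learn, with placeholder texts and labels standing in for the real sample. The TF-IDF-plus-Naive-Bayes pipeline is one plausible approach, not the project’s settled method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data: the real sample would supply hundreds of examples.
texts = ["A pun-laden exchange between a husband and wife ...",
         "Report of yesterday's proceedings in Parliament ..."]
labels = [1, 0]  # 1 = joke column, 0 = anything else

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)

# Probabilities near 1.0 flag likely joke columns in an unseen corpus.
candidates = ["He asked her why the sea was angry ..."]
print(model.predict_proba(candidates)[:, 1])
```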

For our sample we manually downloaded all of the ‘Jokes of the Day’ columns published by Lloyd’s Weekly News in 1891. Here’s a typical example:

[Image: a ‘Jokes of the Day’ column]

These columns contain a mixture of joke formats – puns, conversations, comic stories, etc – and are formatted in a way that makes them broadly representative of the material found elsewhere in the database. If we can find a way to process 1,000 jokes from this source, we shouldn’t have too much difficulty scaling things up to deal with 100,000 similar gags from other newspapers.    

Our sample of joke columns was downloaded as a set of JPEG images. In order to make them keyword searchable, transform them into ‘memes’, and send them out over social media, we first need to convert them into accurate, machine-readable text. We don’t have access to the existing OCR data, but even if it were available it wouldn’t be accurate enough for our purposes. Here’s an example of how one joke has been interpreted by OCR software:

[Image: OCR output compared with the original joke]

Some gags have been rendered more successfully than this, but many are substantially worse. Joke columns often appeared at the edge of a page, which makes them susceptible to fading and page bending. They also make use of unusual punctuation, which tends to confuse the scanning software. Unlike a newspaper archive, which remains functional even with relatively low-quality OCR, our project requires 100% accuracy (or something very close) in order to republish the jokes in new formats.
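
One simple way to quantify that requirement is to score OCR output against a hand-made transcription. The sketch below uses only Python’s standard library, with invented example strings; the near-100% target described above corresponds to a similarity score of (very nearly) 1.0.

```python
from difflib import SequenceMatcher

def ocr_accuracy(ocr_text, ground_truth):
    """Character-level similarity between OCR output and a manual transcript."""
    return SequenceMatcher(None, ocr_text, ground_truth).ratio()

# Invented strings, standing in for real OCR output and its transcription.
ocr = "Wh.t is the diff3rence between a hen and a kitohen?"
truth = "What is the difference between a hen and a kitchen?"
print(round(ocr_accuracy(ocr, truth), 3))  # well below the ~1.0 that republication needs
```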

So, even if we had access to OCR data we’d need to correct and improve it manually. We experimented with this process using OCR data taken from the British Newspaper Archive, but identifying and correcting errors turned out to take longer than transcribing the jokes from scratch. Our volunteers reported that the correction process required them to keep looking back and forth between the image and the OCR in order to correct errors one-by-one, whereas typing up a fresh transcription was apparently quick and straightforward. It seems a shame to abandon the OCR, and I’m hopeful that we’ll eventually find a way to make it usable. The imperfect data might work as a stop-gap to make jokes searchable before they are manually corrected. We may be able to improve it using new OCR software, or speed up the correction process by making use of interface improvements like TILT. However, for now, the most effective way to convert the jokes into an accurate, machine-readable format is simply to transcribe directly from the image.
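
As a rough sketch of how that stop-gap might work, fuzzy keyword matching can find a modern search term inside noisy OCR text. The example below uses only Python’s standard library; the text is invented and the cutoff would need tuning against real data.

```python
from difflib import get_close_matches

def fuzzy_contains(query, ocr_text, cutoff=0.8):
    """True if any token in the noisy OCR text is close enough to the query word."""
    tokens = ocr_text.lower().split()
    return bool(get_close_matches(query.lower(), tokens, n=1, cutoff=cutoff))

# Invented OCR text with typical character-substitution errors.
noisy = "A husband's vlew of marr1age, as to1d to our correspondent"
print(fuzzy_contains("marriage", noisy))  # True, despite the OCR errors
```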

16 October 2014

Curious Roads to Cross: British Library, Burning Man and the art of David Normal


California-based artist David Normal will be talking about how he used images from the British Library’s Flickr Commons ‘one million images’ release as inspiration for artwork he created for the Burning Man festival. The talk takes place tomorrow, Friday 17 October, between 1500 and 1600, in the Chaucer Suite of the British Library Conference Centre, London. Places are very limited; if you are interested in attending, see below for booking information.

[Image: Crossroads of Curiosity at Burning Man, Nevada, 25 August to 1 September 2014]

With a special interest in 19th-century illustration, David created the ‘Crossroads of Curiosity’, which was on display from 25 August – 1 September 2014 at the festival in Nevada.

David recently blogged about his work on the British Library’s Digital Scholarship blog.

David will bring large prints of his work, talk about each painting, and focus on specific details of the work. He will also explain the production process, right through to the piece’s de-installation at Burning Man.

Booking information

Don't miss out on this fantastic opportunity to see how the British Library's digital content is being used to inspire artists. If you are interested in attending, please email digitalresearch@bl.uk with the subject 'David Normal: BL and Burning Man' no later than 1100 on Friday 17 October 2014.

10 October 2014

Introducing Paper Machines


In the welcome surroundings of the refurbished Institute of Historical Research, Jo Guldi (Brown University) kicked off the 2014 Autumn Term programme of the IHR Digital History Seminar. In town to discuss The History Manifesto, her new open access book co-authored with David Armitage, Guldi's talk ranged from the public role of historians, the Digital Humanities and new models of publishing to impending environmental catastrophe, the need for deep history, and data processing tools that can help citizens and scholars alike overcome the problems of modern bureaucracy. To see how Guldi wove all these threads together, you'll need to watch the video below. Here I just want to tease out, in no particular order, a few of the threads that stuck in my mind, threads that pertain to most, if not all, digital history projects that pass through the seminar.

Tools as provocations: Paper Machines is a research tool. But it is also a provocation, an experiment in using large swathes of information to inform historical research in the longue durée, a vantage point that, the tool's makers argue, historians do not adopt often enough. The tool, in short, is the argument.

What we need now: As we sit on the precipice of environmental catastrophe, does it not behove us to think about what digital projects we need? Do we want digital projects that analyse art for art's sake, that recapitulate old research paradigms and do not address problems of a wider, public relevance?

Hypothesis generation: At the heart of Paper Machines is hypothesis generation. It allows the scholar to take a vast paper archive, facet that archive, make visualisations, and select where to read closely. How that macro-to-micro scaling changes the history that is written, and how scholarly debates mature to integrate the inevitable discrepancies between interpretations made at these scales, are challenges historians must re-engage with.

Being bold about method: Works that change the focus of disciplines usually open their accounts by stating 'you missed this because your method was wrong'. Digital history can and should do the same; it can and should be bold about how it comes to its conclusions rather than hide the methods, ways, and means that underpin its particular take on historical phenomena.

My partial, incomplete, CC BY notes on the seminar are available on GitHub Gist.

The next Digital History seminar, 'Interrogating the archived UK web: Historians and Social Scientists Research Experiences', will take place on 4 November and a full listing of Autumn Term seminars is available on the IHR Website.

James Baker

Curator, Digital Research

@j_w_baker

This was originally posted on the IHR Digital History Seminar blog.



This work is licensed under a Creative Commons Attribution 4.0 International License.