THE BRITISH LIBRARY

Digital scholarship blog

Enabling innovative research with British Library digital collections

Introduction

Tracking exciting developments at the intersection of libraries, scholarship and technology.

16 May 2013

On metadata and cartoons


I love cartoons. And few collections of cartoons excite me more than those held by the British Cartoon Archive. Thanks to some meticulous cataloguing its digital archive is a pleasure to explore, so it seemed fitting to me that the BCA was chosen to host a 'Digitising the Image' workshop on 15 May as part of the AHRC-funded Going Digital doctoral training programme. This programme includes events at The Courtauld Institute of Art, Goldsmiths (University of London), the Open University, and the Universities of East Anglia, Essex, Kent, and Sussex, and runs until the end of July this year. I was invited along to this particular event to talk about how archives of digital images can be used in research, and I chose to focus on how metadata can provide novel opportunities for discovering large corpora of digital images - if perhaps through a less appealing door than by going directly to the cartoons themselves (slides here). The rest of the day covered creating images, file types, publishing images, copyright, and metadata, and provided an excellent opportunity to reflect on how these skills - and, perhaps even more importantly, the knowledge that these skills can be acquired - can be brought to wider audiences. Going Digital is a good start to this process, but it represents only the first tentative steps towards fully integrating 'the digital' into how budding historians, art historians and literary critics are trained in higher education.

 

Yes it is... 'Metadata is a love note to the future': photograph courtesy of Flickr user sarah0s / Creative Commons Licensed


So, back to metadata and cartoons. A few weeks before the event I asked the BCA to provide me with a dump of metadata. Quite wisely they came back with some sample .xml which - after some tests - I realised I could do something with at a technical level. I was also advised that the metadata was strongest for the 1960s and 1970s. This then became my focus and having received the full dataset I set about doing some quick and dirty transformations and visualisations for demonstration purposes (warning: quick and dirty are the operative words).

The content includes nearly 400,000 lines of data, with date, title, subject, author and various archival data. After doing a little cleaning of the 'Date' field - and where necessary some judicious removing - in Open Refine, I poked around the data for useful fields (I'll admit that plenty more cleaning could be done). By far the most interesting were the 'Title' field - which holds free text of any inscriptions within the cartoon - and the 'Subject' field - containing text entered by the BCA team in order to categorise the cartoon (so for a single cartoon the list of subjects might include 'backgardens', 'budgerigars', 'pigs', 'ballet', 'typewriters'). It is this latter field which makes the collection such a rich resource for researchers.
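For readers who would rather script this step than use Open Refine, here is a minimal sketch in Python. It is not the workflow described above. The XML layout, the `record` element name and the sample values are assumptions made for illustration; only the 'Date', 'Title' and 'Subject' field names come from the dataset as described.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: the real export's structure is not documented here.
SAMPLE_XML = """
<records>
  <record><Date>1965-03-02</Date><Title>Stop! Police!</Title><Subject>police</Subject></record>
  <record><Date>c. 1971</Date><Title>Any new strikes?</Title><Subject>strikes; unions</Subject></record>
  <record><Date>undated</Date><Title>Untitled</Title><Subject>pigs</Subject></record>
</records>
"""

def extract_year(raw):
    """Pull a four-digit year (1800-2099) out of a messy date string, or None."""
    match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", raw or "")
    return int(match.group(1)) if match else None

def load_records(xml_text):
    """Parse the dump and keep only rows with a usable year."""
    rows = []
    for rec in ET.fromstring(xml_text.strip()).iter("record"):
        year = extract_year(rec.findtext("Date"))
        if year is None:
            continue  # drop rows whose date cannot be recovered
        rows.append({
            "year": year,
            "title": rec.findtext("Title", default=""),
            "subject": rec.findtext("Subject", default=""),
        })
    return rows

rows = load_records(SAMPLE_XML)
for row in rows:
    print(row["year"], row["title"], "|", row["subject"])
```

Dropping undatable rows outright is the bluntest possible choice; a more careful pass would keep them aside for manual review.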

In order to force the data into Voyant - perhaps the easiest data discovery tool for newcomers to get to grips with - I had to sort the data by date and then remove the date column to create an artificial chronology: not ideal, but necessary as Voyant can only handle text, not text vs. date. A fudged solution also had to be found to get the data into Zotero for use in Paper Machines. I wanted to demonstrate topic modeling given recent discussions on the subject in the Journal of Digital Humanities, yet getting the data into an easy-to-use tool such as Paper Machines proved troublesome: converting the data to BibTeX made Zotero (on top of which Paper Machines sits) fall over, so instead I crudely chopped the textual data into annual text files for the years 1960 to 1979 and uploaded them for comparison. Again not ideal at all, but it got the point across, for at the event I was able to demonstrate manipulating the data in these tools live: risky perhaps, but if my object was for the audience to understand the power of the tools (which it was!) then static slides wouldn't do. And what more than justified the risk was the evident enthusiasm in the room for the tools and for the fresh discoveries this type of data-driven analysis can enable. More evidence then - if any were needed - that doing trumps reading/hearing/seeing when it comes to encouraging critical tool use.
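The 'crudely chopped' annual files can be approximated in a few lines of Python. This is a sketch under assumptions, not the script used for the workshop: it assumes each row has already been reduced to an ISO-style date string plus its text, and the sample rows are made up for illustration.

```python
from collections import defaultdict

def group_text_by_year(records, start=1960, end=1979):
    """Sort rows by date, then collect their text under each year in range."""
    by_year = defaultdict(list)
    for date, text in sorted(records, key=lambda r: r[0]):
        year = int(date[:4])  # assumes dates begin with a four-digit year
        if start <= year <= end:
            by_year[year].append(text)
    return {year: "\n".join(texts) for year, texts in by_year.items()}

# Hypothetical sample rows: (date, text of the 'Title' field)
sample = [
    ("1979-03-02", "I dunno Denis - if these strikes go on"),
    ("1960-01-15", "Stop! Police!"),
    ("1985-06-01", "Outside the range"),
    ("1960-11-30", "Police Station"),
]

annual = group_text_by_year(sample)
for year, text in sorted(annual.items()):
    # In practice each string would be written out to a file such as 1960.txt
    print(year, len(text.splitlines()), "line(s)")
```

Joining all of the values in year order, rather than writing separate files, gives the single date-sorted text with the date column removed that Voyant needs.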

At this point you might be thinking, what did I actually discover in the data? In a sense I discovered what I expected to discover (and not for the first time). The themes of the cartoons in the corpus track the politics of the day, with for example clusters of words around 'Maggie' and 'Conservative' growing to a crescendo by the end of the 1970s. Equally expected, but nonetheless of interest, is the observation that textual content within cartoons during the same period tended toward natural language, with words such as 'british', 'harold', 'christmas' and 'strike' marginal (see below).

Word cloud of 'Title' field for data exported from British Cartoon Archive database for years between 1960 and 1979 (dataset, Voyant)

A more nuanced discovery, and one which I think suggests the potential both of the data and of the method, is revealed by comparing visualisations of the 'Title' field and of the 'Title' and 'Subject' fields combined. In the latter case, the subjects overwhelm the titles. This is to be expected: as the subjects are chosen by curators of the data at the point of digitisation we might expect these entries to form clusters and to reuse categories. Hence although the addition of the 'Subject' field to the 'Title' only increased the number of unique words from 30,621 to 33,178, it increased the total from 660,981 words to 1,208,082 and the count for the most frequent word from 2,877 occurrences of "it's" to 12,000 of "party" (note: all counts correct after the application of standard stop words - with a few manual additions - to the data).
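The three figures compared here (unique words, total words, most frequent word) are straightforward to compute. The sketch below shows one way in Python. It is not how Voyant counts internally; the tokenising rule, the tiny stop word list and the sample strings are all assumptions for illustration.

```python
import re
from collections import Counter

# Illustrative only: a real run would use a full stop word list.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "on"}

def tokenise(text):
    """Lower-case words, keeping internal apostrophes (e.g. "it's")."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def summarise(texts, stop_words=STOP_WORDS):
    """Return (unique words, total words, most frequent (word, count))."""
    counts = Counter(
        word for text in texts for word in tokenise(text)
        if word not in stop_words
    )
    top = counts.most_common(1)[0] if counts else None
    return len(counts), sum(counts.values()), top

# Hypothetical sample values for the two fields
titles = ["Any new strikes?", "We're not against pay strikes mate"]
subjects = ["strikes unions party", "unions party pay"]

print("Title only:     ", summarise(titles))
print("Title + Subject:", summarise(titles + subjects))
```

Even on this toy sample the pattern in the post shows up: adding subjects grows the total word count much faster than the vocabulary, because catalogue terms are reused.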

Word trend graph of 'Title' and 'Subject' fields for data exported from British Cartoon Archive database for years between 1960 and 1979 (dataset, Voyant)

Word trend graph of 'Title' field for data exported from British Cartoon Archive database for years between 1960 and 1979 (dataset, Voyant)

This additional data also changes the trends within the corpus. So whilst comparing 'police', 'unions', and 'strikes' in the Subject+Title corpus shows 'police' and 'strikes' as occurring with relatively equal frequency over time (or across the length of the text), when we look at only the text within the cartoons 'police' occurs with far greater frequency across the period (see above). What is going on here demonstrates the value of capturing implied meaning in metadata as opposed to merely inscribed text. The word 'police' is simply more likely to appear in cartoons: think of stock phrases such as "Stop! Police!" (and derivations thereof) or the appearance of the words 'Police Station' above or around the door of a building. Words such as 'unions' and 'strikes' are more likely on the other hand to only appear in natural speech: "Who's still out? Any new strikes?", "We're not against pay strikes mate", "I dunno Denis - if these strikes go on". So whereas the word 'police' and the theme of policing might appear together, the theme of striking and unions is more likely to be implied within a cartoon and is then more available for this sort of corpus analysis when that implied meaning has been captured and translated into text.
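A word trend of this kind boils down to a relative frequency per time slice. The following Python sketch computes occurrences per 1,000 words for each year. It is only an approximation of the idea, not Voyant's own calculation, and it assumes the text has already been grouped into one string per year; the sample strings reuse phrases quoted above.

```python
import re

def relative_frequency(word, annual_texts):
    """Return {year: occurrences of `word` per 1,000 tokens in that year}."""
    trend = {}
    for year, text in sorted(annual_texts.items()):
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
        trend[year] = 1000 * tokens.count(word) / len(tokens) if tokens else 0.0
    return trend

# Hypothetical two-year corpus
annual_texts = {
    1960: "Stop! Police! Police Station",
    1979: "Any new strikes? We're not against pay strikes mate",
}

police = relative_frequency("police", annual_texts)
strikes = relative_frequency("strikes", annual_texts)
for year in sorted(annual_texts):
    print(year, round(police[year], 1), round(strikes[year], 1))
```

Running the same function once on 'Title' text alone and once on 'Title' plus 'Subject' text is enough to reproduce the kind of comparison made in the two graphs.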

In the case of the BCA 1960s and 1970s collections this capturing of implied meaning was undertaken by paid experts. Today some of this sort of work can be outsourced to a crowd of volunteers: our own Picturing Canada project is an excellent example of how this could work for digital images. In a future post I will discuss with Nick Hiley, Head of the British Cartoon Archive, the challenges of creating high-quality descriptive metadata in an era where crowdsourcing is so in vogue.

James Baker

@j_w_baker

15 May 2013

Remembering the Great War - the Harold Ward Letters


As part of our activities with Europeana 1914 – 1918 to remember the 100th anniversary of the outbreak of WW1, we have recently digitised a fascinating collection of letters sent by Captain Harold Ward to his wife Louise Ward and son Kenneth Martin Ward between 1917 and 1918. This collection, lent to the BL by Captain Ward’s granddaughter for digitisation, comprises some 260 items including 9 field service post cards, written on paper of various shapes, sizes, colours and conditions. In his correspondence, Captain Ward gives a very poignant account of life at the front, where he was serving with the 2/4th and 2/5th Lincolnshire Battalions, offering a vivid image of the everyday life of his soldiers on the battlefields and the rough conditions created by war.

But perhaps one of the most striking aspects of the correspondence is the way Captain Ward expresses his personal view of the battlefields: the description of his experiences is always accompanied by expressions of hope of returning home and of love for his family, as we can see in the letter presented here, which was written to his son in July 1917. Reading Captain Ward's letters, rather than trying to understand the past through a mere description of events, one has the feeling of approaching history from a highly personal and human perspective, as if transported to the very moment when these events were taking place. The material is indeed a great resource for researchers and the general public interested in learning more about the Great War, especially for those keen to understand WW1 from the point of view of those fighting in the trenches. The full correspondence is available at http://bit.ly/10F80nU

 


Letter sent by Captain Harold Ward to his son Kenneth in July 1917

07 May 2013

Improved access to newspapers: The Europeana Newspapers Project


Image source: National Library of Estonia

This is a brief post to highlight the activities of The Europeana Newspapers Project (ENP), a network of 18 partners (and 11 associated partners) working together to make more than 18 million digitised newspaper pages (including 10 million pages of full-text content) available via the Europeana ecosystem of online services, with aggregation carried out by The European Library.

The project will improve discoverability of content through the application of refinement methods for Optical Character Recognition (OCR), Optical Layout Recognition (OLR), Named Entity Recognition (NER) and Page Class Recognition. It also addresses the challenges around quality evaluation for automatic refinement technologies, transformation of local metadata to the Europeana Data Model (EDM), and metadata standardisation in close collaboration with stakeholders from the public and private sector.

Demonstrations of the evaluation tools, OLR, NER tagging and the role of ground truth will take place at the ENP's first dissemination workshop on refinement and quality assessment at the University Library Svetozar Markovic, Belgrade, 13-14 June.

The British Library is a networking partner in the ENP and will be hosting an information day and a dissemination workshop in 2014.

For further information about the project, visit its website http://bit.ly/17WNlir and follow Europeana Newspapers on Facebook and @eurnews on Twitter.