Working with news data

The British Library has a vast news collection. We have some 60 million newspaper issues (around 450 million pages) dating from the 1620s to the present day, 60,000 television and radio news programmes from 2010 onwards, and we are archiving over 1,000 UK news websites on a regular basis. Just as the newspaper industry is moving into other media in a cross-platform world, so we are following suit in how we archive news, so that we can offer the best possible research service in the future.

News collections at the British Library

To make such a vision work we have to get the data right. The Library's Explore catalogue works well for finding a volume of newspapers to be delivered to a researcher's desk, but is not readily open to any sort of content analysis of the news collection. Our different news collections (newspapers, web, TV and radio) come together via Explore, but not easily so, because of the different ways in which they are held and described: most of our newspaper records are at title level, the TV and radio news records are at programme level, while the web archive operates best at page level. We are some way off from presenting a unified news collection, and could be doing so much more to serve new kinds of research enquiry by taking a more data-driven approach to our news holdings.

These needs were the drivers behind a workshop on 7 September 2015, co-organised by BL Labs and the Library's News & Moving Image team, entitled Working with news data across different media. This brought together researchers, developers and content owners to look at ways in which changes in archive news data management can be of benefit to researchers. The event was part of an ongoing process from BL Labs looking at how the Library's digital collections can be made available for researchers, but was the starting point for a discussion we need to have with researchers and content managers as to how best to pursue an archive news data strategy.

The day began with an introduction to the Library's digital collections and the work of BL Labs by Mahendra Mahey. Luke McKernan, Lead Curator News & Moving Image, then gave a talk on the Library's news collections. He outlined what the Library has to offer researchers at present in terms of news data for onsite analysis:

2 million 19th century British newspaper pages (XML, page images)
UK television news data 2010 onwards – EPG (Electronic Programme Guide) data for 45,000 programmes, subtitles (XML) for c.25,000 programmes, some speech-to-text files for 2011 broadcasts (XML)
UK radio news data 2010 onwards – EPG data for 15,000 programmes, some speech-to-text files for 2011 broadcasts (XML)
a possible selection of Web news data

Additionally, there are selected data and page images from The Financial Times. The Financial Times is partnering with the British Library to make its historical archive available on a royalty-free basis for academic research purposes. Any researcher interested in taking advantage of this should contact Luke McKernan for further information.

The British Library is also planning to make available title-level records for all 34,000 newspaper titles that it holds as open data. We will have more news on this initiative in due course.

There are goals beyond these that the Library could strive for. What about an open news dataset shared with other institutions? What about an archive news data model to bring together such collections? And how about the ultimate aim of having all of our news collections identified at issue rather than title level? That will be a huge undertaking, but the goal must be for us to be able to offer to future users a digital picture of what happened in any one place at any one time, contributing to an overall 'news' picture. This would mean not just what was reported in a local newspaper on any one day, but what people from that locality heard, read or saw that helped make up their understanding of the world. That's how we gather our news today; it is also a model for understanding how news has maybe always operated, certainly how news archives can be approached in their totality.
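As a rough illustration of what a shared archive news data model might look like, here is a minimal Python sketch. It is not the Library's schema: the `NewsItem` class, its field names and the sample records are all hypothetical, chosen only to show how records held at different levels (issue, programme, page) could be queried together by place and date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class NewsItem:
    """Hypothetical common record for one news item in any medium."""
    medium: str            # "newspaper", "tv", "radio" or "web"
    source: str            # newspaper title, channel or website name
    published: date        # issue, broadcast or crawl date
    place: Optional[str]   # locality served, if known
    level: str             # granularity of the original record


def items_for(items, place, day):
    """Return every item, in any medium, for one place on one day."""
    return [i for i in items if i.place == place and i.published == day]


# Illustrative records only; these are not real catalogue entries.
items = [
    NewsItem("newspaper", "Example Gazette", date(2015, 9, 7), "Leeds", "issue"),
    NewsItem("tv", "Example Regional News", date(2015, 9, 7), "Leeds", "programme"),
    NewsItem("web", "example-news.test", date(2015, 9, 7), "Leeds", "page"),
    NewsItem("radio", "Example Radio", date(2015, 9, 8), "Leeds", "programme"),
]

same_day = items_for(items, "Leeds", date(2015, 9, 7))
print([i.medium for i in same_day])
```

The point of the sketch is simply that once every record carries a date and a place, the medium becomes one attribute among several rather than a separate silo.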

Laughing at Victorian jokes

A number of short presentations then followed, from projects either using the Library's news collections or with whom we have collaborated on news-related initiatives:

Glen Robson of the National Library of Wales spoke about implementing the IIIF image format for their public domain newspapers, which could lead to cross-institutional sharing of newspaper collections by using this standardised image retrieval framework.
Dr Katrina Navickas of the University of Hertfordshire, a BL Labs competition winner, talked about developing her winning idea, the 'Political Meetings Mapper', which is using automated processes to identify meetings of the Chartist movement in 19th century newspapers.
Dr Bob Nicholson, Edge Hill University, winner of the 2014 BL Labs competition, spoke about the Victorian Meme Machine project, tracking down jokes in Victorian newspapers and mapping these automatically to contemporary images. He called for focussed datasets rather than just presenting digitised newspapers in their entirety, and for newspaper data to be linked out to other forms of data.
Martin Stabe, Head of Interactive News at the Financial Times, spoke on ways in which the newspaper's archive could be opened up for research. The newspaper is taking bold steps in exploring alternative ways of opening up its archives beyond the tried-and-tested subscription models.
Melvin Wevers, PhD student within the Translantis project at the University of Utrecht, introduced the Texcavator tool, which is being used to analyse the Dutch National Library's digital newspaper collection to study Dutch public discourse, and which has also been applied to some UK newspapers (including sample data from the Financial Times).
Ian Tester, Director of Partner Products at Findmypast Ltd, who manage the British Newspaper Archive of digitised newspapers from the British Library, spoke on the diverse research opportunities that the archive now provides, with an emphasis on the many kinds of book now being published that have made often unexpected use of the digital archive.
Mark Flashman and Michael Satterthwaite from the BBC Rewind project spoke about the ways the BBC is applying innovative digital applications to digital storytelling and opening up news archives through projects such as the World Service Radio Archive, News Timeliner and Your Story. They stressed the importance of achieving good things with small amounts of data first, and of working for 'good enough' results rather than perfection.
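For readers unfamiliar with IIIF, mentioned in the first presentation above, the appeal for cross-institutional sharing is that any compliant server answers image requests using the same URL pattern: identifier, region, size, rotation, then quality and format. The sketch below builds such URLs in Python following the IIIF Image API 3.0 syntax. The server address and page identifier are hypothetical placeholders, not real National Library of Wales or British Library endpoints.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL from its path segments."""
    # The identifier must be URL-encoded, including any slashes it contains.
    ident = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and page identifier, for illustration only.
BASE = "https://iiif.example.org/image"
full_page = iiif_image_url(BASE, "newspaper/1848-04-10/page-3")
thumbnail = iiif_image_url(BASE, "newspaper/1848-04-10/page-3", size="200,")
print(full_page)
print(thumbnail)
```

Because the pattern is the same everywhere, a tool written against one library's newspaper images should, in principle, work against another's by changing only the base address and identifiers.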

The workshop then divided up into four groups to consider four questions which could help us shape how we develop things next. They were:

What’s the best way to get the most out of hack events? What have people learned, what are the issues, the best way to overcome them and how to get the most from them?
What is the best way to work across a heterogeneous collection of news data, with particular focus on the data available from the British Library (though not exclusively)? What are the challenges and how can we get over them?
How might the British Library most usefully work with third parties to get the best out of news data? What are the issues and challenges?
What do researchers want from news data? What are the issues and the challenges?

We're still working on assimilating the answers to those questions as we start to shape our new data plans, including further such events. Our thanks to everyone who attended the workshop and who supplied such stimulating and useful contributions. The next step will be a news hackathon, which we will be hosting on November 16th at the British Library in London. More news on this will be published soon.

Digital scholarship blog
