THE BRITISH LIBRARY

Digital scholarship blog

Enabling innovative research with British Library digital collections


25 March 2015

Enabling Complex Analysis of Large Scale Digital Collections


Jisc have announced the projects that have been funded through their Research Data Spring programme. One of those chosen is 'Enabling Complex Analysis of Large Scale Digital Collections', a project led by Melissa Terras (Professor of Digital Humanities, UCL) in collaboration with the British Library Digital Research team.

Research Data Spring aims to find new technical tools, software, and service solutions that will improve researchers' workflows and the use and management of their data. Following an invitational sandpit event in Birmingham last month aimed at encouraging co-design, 'Enabling Complex Analysis of Large Scale Digital Collections' was chosen from over 40 proposed projects to proceed to a three-month development phase.

Our rationale for the project is that substantial sums have been spent digitising heritage collections and that - as well as being objects that can be presented online for research and public use and reuse - digitised heritage collections are data. The problem, of course, is that non-computationally trained scholars often don't know what to ask of large quantities of data, rarely have access to high performance computing facilities, and struggle to find the exemplar workflows they need. As a consequence, support from content providers for this category of work is often ad hoc, and substantial investment in it is difficult to justify.

'Enabling Complex Analysis of Large Scale Digital Collections' aims to address this fundamental problem by extending research data management processes in order to enable novel research and a deeper understanding of emerging research needs. In the initial three-month pilot period we will index a collection of circa 60,000 public domain digitised books (see 'A Million First Steps') at UCL Research IT Services and work with a small number of researchers to turn their research questions into computational analyses. The outputs from each research scenario - including derived data, queries, documentation, and indicative visualisations - will be made available as citeable, CC-BY workflow packages suitable for teaching, self-learning, and reuse. Moreover, these workflows will deepen understanding of complex, poorly structured, and heterogeneous humanities data and the questions researchers could ask of that data, highlighting through use cases the potential for process and service development in the cultural sector. Details of the proposed work for after the initial three-month phase are in the Figshare document embedded above.

We are also delighted that two other projects with British Library involvement have been funded through Research Data Spring. 'Unlocking the UK's thesis data through persistent identifiers' will investigate integrating ORCID personal identifiers and DataCite DOIs into our ever-growing and unique UK thesis collection. 'Methods for Accessing Sensitive Data', otherwise known as AMASED, will adapt and implement DataSHIELD technology in order to (legally) circumvent key copyright, licensing, and privacy obstacles preventing analysis of digital datasets in the humanities and academic publishing. The British Library will supply the same circa 60,000 public domain digitised books to this project to test the extension of DataSHIELD to textual data.

James Baker

Curator, Digital Research

@j_w_baker


20 March 2015

Texcavator in Residence


This is a guest post by Melvin Wevers, Utrecht University

As part of a three-week research stay at the British Library, I looked at whether and how the British historical newspaper collection could be incorporated into my own research project. I am a PhD candidate within the Translantis research program based at Utrecht University in the Netherlands. The program uses digital humanities tools to analyze how the United States has served as a cultural model for the Netherlands in the long twentieth century. A sister project, Asymmetrical Encounters, based in Utrecht, London, and Trier, looks at similar processes within a European context. Our main sources are Dutch newspapers held by the National Library of the Netherlands (KB).

My research at the British Library served two main goals: first, to investigate how the British newspaper data could be incorporated into my project's research tool Texcavator; second, to analyze to what extent the newspapers can be used for historical research with computational methods such as full-text search, topic modeling, and named entity recognition.

Texcavator allows researchers to search through newspaper archives using full-text search and more advanced search strategies such as wildcard and fuzzy searching. It also lets researchers create timelines and word clouds, and enrich documents with named entity annotators or sentiment mining modules. Finally, the tool has an export function that allows the researcher to create subcorpora for use in other tools and analytical software such as Mallet or Voyant Tools. Texcavator uses Elasticsearch (ES) as its search and analytics engine. ES stores documents as JSON, and these need to follow a particular schema in order to work within Texcavator; the schema includes information on newspaper title, date of publication, article type, newspaper type, and spatial distribution.
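As an illustration, one article in the index might look something like the sketch below. The post does not give Texcavator's actual field names, so the ones here are assumptions covering the metadata it lists.

```python
# A minimal sketch of one newspaper article as an Elasticsearch JSON
# document. Field names are illustrative, not Texcavator's real schema.
article = {
    "paper_title": "Pall Mall Gazette",        # newspaper title
    "date": "1887-06-21",                      # date of publication
    "article_type": "news",                    # e.g. news, advertisement
    "paper_type": "evening",                   # kind of newspaper
    "place_of_distribution": "London",         # spatial distribution
    "ocr_text": "Full OCR-ed text of the article ...",
}
```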

The newspaper data on the servers of the British Library uses an XML schema. To parse the XML files into JSON, I used a Perl script that mapped the schema onto the one Texcavator expects and converted each file into a JSON document. A Python script then enabled me to batch index the files into an Elasticsearch index. Next, I installed Texcavator and configured it to communicate with that index. This shows that it is fairly easy to load the BL newspaper data into an ES index that can be queried.
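A minimal sketch of what such a batch-indexing script could look like, using the bulk helper from the official elasticsearch Python client; the index name and the directory of converted JSON files are assumptions, not details from the post:

```python
import json
from pathlib import Path

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def actions(json_dir: str, index: str = "bl_newspapers"):
    """Yield one bulk action per converted JSON article."""
    # "converted_json/" is a hypothetical output directory for the
    # XML-to-JSON conversion step described above.
    for path in Path(json_dir).glob("*.json"):
        with open(path, encoding="utf-8") as fh:
            doc = json.load(fh)
        yield {"_index": index, "_source": doc}

# Send the documents to Elasticsearch in batches.
bulk(es, actions("converted_json/"))
```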

After this, I set out to determine whether the historical newspapers are suited to historical analysis. One of the challenges of working with this archive is the poor quality of the digitized texts: many articles are barely legible because they contain a lot of gibberish produced by the optical character recognition (OCR). Even so, using Texcavator these newspapers still prove useful for historical research. Because Texcavator can combine the OCR-ed texts with the images of the articles, the researcher can use keyword searches to find (some of the) articles that contain a given word and then read them from the images: the OCR facilitates the searching, and the images are used to close-read the articles. Smart queries using wildcards, as well as ES optimization, can improve the precision and recall of searching.
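For example, fuzzy and wildcard queries can catch common OCR misreadings of a search term. A minimal sketch with the elasticsearch Python client, reusing the assumed index and field names from above:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fuzzy match: tolerate up to two character edits, so OCR errors such
# as "parliamcnt" or "parliarnent" still match the intended term.
fuzzy = {"fuzzy": {"ocr_text": {"value": "parliament", "fuzziness": 2}}}

# Wildcard match: catch inflected forms and truncated OCR output.
wildcard = {"wildcard": {"ocr_text": "parliam*"}}

for query in (fuzzy, wildcard):
    # Note: older client versions take body={"query": query} instead.
    hits = es.search(index="bl_newspapers", query=query, size=10)
    for hit in hits["hits"]["hits"]:
        print(hit["_source"]["paper_title"], hit["_source"]["date"])
```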

The texts can also be cleaned up by, for instance, removing all stop words, or words with a very low frequency (these often turn out to be bad OCR). After cleaning up the texts, techniques such as topic modeling and named entity recognition can still be applied. The OCR quality of some newspapers in the archive (such as the Pall Mall Gazette) has already been improved. Using the ES search index, I exported this specific newspaper into a subcorpus, which I tagged with location entities and ran through the topic modeling engine Mallet. I am using this corpus to analyze the international outlook between 1860 and 1900. For more on this, see my paper proposal "Reporting the Empire" for DH Benelux.
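A minimal sketch of such a cleaning step in Python, assuming the subcorpus has already been exported as plain-text documents; the stop word list and frequency threshold are illustrative choices, not the ones used in the project:

```python
from collections import Counter

# Illustrative stop word list; in practice a fuller list (e.g. from
# NLTK) would be used.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "that", "is"}

def clean_corpus(docs: list[str], min_freq: int = 5) -> list[list[str]]:
    """Tokenize naively, then drop stop words and very rare tokens
    (the rare ones are frequently OCR errors)."""
    tokenized = [doc.lower().split() for doc in docs]
    freqs = Counter(tok for doc in tokenized for tok in doc)
    return [
        [tok for tok in doc if tok not in STOP_WORDS and freqs[tok] >= min_freq]
        for doc in tokenized
    ]
```

The cleaned token lists can then be written out in whatever format the downstream tool expects, such as Mallet's one-document-per-line input.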

One of the first steps in working with the data at the British Library is making it available to researchers in an index that allows for the creation of derived datasets, based either on specific queries or on metadata such as newspaper title or spatial distribution. The ability to create derived datasets is a necessary step within the larger digital humanities workflow. The datasets can then be processed using freely available text mining tools and visualization libraries such as D3.js or Gephi. I explained this particular approach to digital humanities further in a talk I gave at the UCL DH seminar on the 25th of February. The slides of my presentation "Doing Digital History" can be found on Slideshare.
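A minimal sketch of building such a derived dataset, using the scan helper from the elasticsearch Python client to stream every article from one newspaper within a date range; the index name, field names, and the ".keyword" sub-field (the client's default dynamic mapping) are assumptions carried over from the earlier sketches:

```python
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")

# All Pall Mall Gazette articles between 1860 and 1900.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"paper_title.keyword": "Pall Mall Gazette"}},
                {"range": {"date": {"gte": "1860-01-01",
                                    "lte": "1900-12-31"}}},
            ]
        }
    }
}

# Stream every matching document and write the subcorpus to disk,
# one JSON record per line, ready for downstream text mining or
# visualization pipelines.
with open("pall_mall_gazette_1860_1900.jsonl", "w", encoding="utf-8") as out:
    for hit in scan(es, index="bl_newspapers", query=query):
        out.write(json.dumps(hit["_source"]) + "\n")
```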

17 March 2015

BL Labs Competition and Awards Roadshow 2015


Mahendra Mahey, Manager of BL Labs
Closing date for Competition: Thursday 30th of April, 2015
Closing date for Award: Monday 14th of September 2015

The 2015 BL Labs Competition has been launched for the third time, and again we want researchers to submit their ideas for projects that highlight the Library's digital collections. Winners work in residence with the Labs team to make their ideas real. Please help us spread the word!

Previous finalists of the BL Labs Competition have helped us learn more about our digital collections and what can be done with them.

We have seen an amazing range of creative and innovative ideas in the entries from 2013 and 2014, and we look forward to seeing even more in 2015! Winners will be chosen by Friday 29th of May 2015.

In addition, we are launching the new 2015 BL Labs Awards for outstanding work that has already been completed using British Library digital content. We are looking for examples in the categories of Research, Creativity, and Entrepreneurship. Shortlisted candidates will be informed by Monday 12th October 2015.

Competition winners will showcase their work, and Award winners will be announced, at the third Labs Symposium on Monday 2 November 2015 in the British Library Conference Centre.

We are organising a number of roadshows around the country to promote the competition. For more information and to register, please see:

Contact us at labs@bl.uk or visit http://labs.bl.uk/Events