16 November 2015
BL Labs Awards (2015): Research category Award winning project
The winners of the British Library Labs Awards were announced at the British Library Labs Symposium, held on Monday 2nd November 2015, at the British Library. The Awards were launched in 2015 by the British Library Labs team in order to formally recognise outstanding and innovative work that has been created using the British Library’s digital collections and content.
This year, the Awards honoured projects within three key categories: Research, Creative/Artistic and Entrepreneurship. The winner of the Research Award (2015) was “Combining Text Analysis and Geographic Information Systems to investigate the representation of disease in nineteenth-century newspapers”, submitted by the Spatial Humanities project at Lancaster University: Paul Atkinson, Ian Gregory, Andrew Hardie, Amelia Joulain-Jay, Daniel Kershaw, Catherine Porter and Paul Rayson. A video presentation from Ian about the entry is available here.
The project examines the London-based newspaper The Era (1838–1900, constituting over 377 million words), which has been digitised and made available by the British Library. It uses an innovative and varied selection of qualitative and quantitative methods to determine how the Victorian era discussed and portrayed disease, both temporally and spatially.
The Award was accepted at the Symposium by Ian Gregory, Professor of Digital Humanities at Lancaster University, on behalf of the rest of the Spatial Humanities project team.
Below, Ian’s guest blog discusses the award-winning project for us:
Lancaster University’s Spatial Humanities: Texts, GIS, Places is a European Research Council funded project concerned with understanding how we can analyse the geographies in large corpora while remaining sensitive to the subtleties and nuances within the texts. It does this by combining techniques from Geographical Information Systems (GIS) and corpus linguistics to create a set of techniques we call Geographical Text Analysis (GTA). GIS is effectively a mapping and database technology that is typically used with quantitative sources. Corpus linguistics is concerned with analysing large textual collections using a combination of quantitative and qualitative approaches. The project has been developing these techniques and applying them to studies concerned with Lake District literature and nineteenth-century social history. Doing this requires a large and highly interdisciplinary team, currently Paul Atkinson (an historian), Ian Gregory (digital humanities), Andrew Hardie (linguistics), Daniel Kershaw (computer science), Amelia Joulain-Jay (linguistics), Catherine Porter (geography) and Paul Rayson (computer science).
One of the major challenges facing the team has been incorporating the British Library’s Nineteenth Century Newspapers collection. This consists of two million newspaper pages from 49 series of papers, most of which run continuously for the whole of the nineteenth century. Our best estimate is that it contains over 30 billion words. The sheer volume of material presents significant challenges, not least that even stripping out the unnecessary mark-up to make the texts suitable for analysis required computing power that was only practical using parallel processing on a Hadoop cluster. A second challenge is that, as with many other historical sources, the newspapers were digitised using Optical Character Recognition (OCR) technology, in which the computer attempts to convert a scanned image into digital letters and words. Being newsprint, the original text is frequently of poor quality, so this is an error-prone process. We have evaluated a range of post-OCR correction methods and found one to be promising. We have also explored the extent to which OCR error affects analytic results. We are particularly interested in a technique called collocation, which effectively asks what words are found near to a search-term, allowing us to understand what themes are associated with other themes. We have been able to show that, with certain caveats, the OCR quality of the newspapers collection does not undermine the effectiveness of collocation analysis.
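To make the idea of collocation concrete, here is a minimal sketch in Python. It is an illustration only, not the project's actual tooling: the function name `collocates`, the window size and the sample sentence are all assumptions made for this example, and a real study would work on a full corpus and apply a statistical association measure rather than raw counts.

```python
import re
from collections import Counter


def collocates(text, node, window=5):
    """Count the words found within `window` tokens of each occurrence of `node`.

    Hypothetical helper for illustration; raw counts only, no significance test.
    """
    # Very simple tokeniser: lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the search-term occurrence itself
                counts[tokens[j]] += 1
    return counts


# Invented sample text, not taken from The Era.
sample = ("The war with Russia continued. "
          "Russia and France signed a treaty after the war.")
result = collocates(sample, "russia", window=3)
print(result.most_common(5))
```

With a window of three words, ‘war’ is counted as a collocate of ‘russia’ in this sample while ‘treaty’, which falls outside the window, is not. Widening the window trades precision for recall.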
The diagram above shows the frequency of mentions of two countries, France and Russia, in one newspaper, The Era. The spikes in the graph may suggest that much of the interest in these countries was driven by wars and crises such as the Crimean War in the 1850s, and the Franco-Prussian and Russo-Turkish wars of the 1870s.
Combining collocation with semantic tagging, in which words are classed according to their meaning, allows us to test this idea. The graphs above show the collocations between the two country names and the word ‘war’, and all words in semantic class ‘G3’, words associated with war. They show that although ‘war’ does co-occur with these countries, it does so no more than 10% of the time. Further collocation analyses can be used to show what other themes are associated with the countries in these periods.
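The ‘no more than 10% of the time’ style of figure can be thought of as the share of a search-term's occurrences that have at least one word from a semantic class nearby. The sketch below shows that calculation under stated assumptions: the word set `WAR_CLASS` is a tiny invented stand-in for the ‘G3’ class (a real semantic tagger assigns classes to words in context), and the function name, window size and sample text are hypothetical.

```python
import re

# Invented stand-in for a semantic class of war-related words (assumption).
WAR_CLASS = {"war", "army", "battle", "troops", "siege"}


def share_near_class(text, node, class_words, window=5):
    """Return the fraction of occurrences of `node` that have at least one
    word from `class_words` within `window` tokens on either side."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = 0
    total = 0
    for i, token in enumerate(tokens):
        if token != node:
            continue
        total += 1
        # Words before and after the occurrence, excluding the occurrence itself.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if any(word in class_words for word in context):
            hits += 1
    return hits / total if total else 0.0


# Invented sample text, not taken from The Era.
sample = ("The troops left France in spring. "
          "A new opera from France opened at the theatre. "
          "France sent a famous actress to London. "
          "Reports of battle reached France yesterday.")
share = share_near_class(sample, "france", WAR_CLASS, window=3)
print(f"{share:.0%} of mentions occur near a war-related word")
```

Counting occurrences rather than collocate tokens means that a single mention surrounded by several war-related words still counts only once, which keeps the percentage interpretable.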
We are interested in the representation of local places in The Era, as well as countries. A technique known as geoparsing allows us to identify place-names in the text and allocate coordinates to them. The map above shows the places – mainly towns and cities – associated with a range of common nineteenth-century diseases. Being able to link between the map and the underlying text allows us to understand how patterns vary from place to place. For example, the mentions of disease in India tend to be associated with newspaper reports on the deaths of individual colonial officials and soldiers. The pattern for Egypt, by contrast, is driven by personal testaments in medical advertisements by people who claim to have used a particular medicine whilst living there.
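As a rough illustration of what geoparsing involves, the sketch below looks up place-names in a small gazetteer and pairs them with disease terms found in the same sentence. Everything here is an assumption made for the example: the three-entry `GAZETTEER` with approximate coordinates, the `DISEASES` list, the same-sentence rule and the function name. A real geoparser also has to recognise place-names in context and disambiguate between places that share a name.

```python
import re

# Tiny illustrative gazetteer: place-name -> approximate (latitude, longitude).
GAZETTEER = {
    "london": (51.51, -0.13),
    "manchester": (53.48, -2.24),
    "liverpool": (53.41, -2.99),
}

# Illustrative list of disease search-terms (assumption).
DISEASES = {"cholera", "smallpox", "typhus"}


def geoparse_disease_mentions(text):
    """Return (disease, place, lat, lon) records for each disease term that
    shares a sentence with a gazetteer place-name."""
    records = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        places = [t for t in tokens if t in GAZETTEER]
        diseases = [t for t in tokens if t in DISEASES]
        for place in places:
            lat, lon = GAZETTEER[place]
            for disease in diseases:
                records.append((disease, place, lat, lon))
    return records


# Invented sample text, not taken from The Era.
sample = ("Cholera has appeared in Liverpool. "
          "The theatre in London reopened. "
          "Smallpox is reported at Manchester.")
points = geoparse_disease_mentions(sample)
for record in points:
    print(record)
```

Each record carries both the coordinates needed for mapping and enough information to trace the point back to its sentence, which is what makes it possible to move between the map and the underlying text.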
This global geography is, however, dominated by references to places in England and the map above shows this in greater detail. This spatial depiction of disease mentions not only allows us to explore the temporal geography of newspaper interest in different diseases, it also allows for a comparison with other patterns and information such as those found in official reports and statistics.
A key point of this work is that research with digital sources in the humanities is not a simple two-stage process in which a source is digitised and then findings appear. Digitisation has been criticised as being expensive and producing problematic results. Both criticisms are true; however, the response should be neither to give up in despair nor to carry on regardless, ignoring the problems. Instead, issues such as OCR quality and its impacts will present significant research challenges for many years to come. It is important that the humanities play a key role in responding to these challenges. Beyond this, effectively exploiting the content within large digital sources requires much more than simple browsing and keyword searching. Research into developing new methodologies, or adapting existing ones to make them more appropriate to the humanities, is essential. These need to allow and encourage the combination of the computer’s ability to summarise patterns in large volumes of data with the more traditional humanities skills of understanding subtlety and nuance in documents written by humans. Finally, while these stages present many possibilities, they are of little use unless applied research follows at the end. Getting to the applied stage can be a long journey requiring significant investment, interdisciplinary expertise and changing working practices. If followed, however, this journey will lead both to new knowledge about how to make full use of the digital sources that are ever more pervasive, and to major new contributions to our understanding of the past.