The Newsroom blog

News about yesterday's news, and where news may be going

2 posts categorized "Text mining"

11 January 2017

Analysing the past

There are exciting changes happening in how we use newspapers to study the past. After decades in which the use of newspapers in research meant leafing through volumes or scrolling through microfilms, digitisation made millions of newspapers more readily searchable and far more widely available. But now digitisation has taken us to the next stage in development, which is using the data generated by the digitisation process to look at history on a grand scale. We are moving into the era of big data newspaper studies.

From the University of Bristol study: People in history. (A)  famous personalities by occupation using all extracted entities associated with a Wikipedia entry; (B)  the probability that a given reference to a person is to a male or a female person

Big data newspaper studies have come about through a combination of large-scale digital resources and a growth in analysis tools. Most will be aware of OCR (optical character recognition), the mechanism by which archival texts can be converted into machine-readable texts: the computer takes what it sees as an image (i.e. the arrangement of letters on a page) and matches the shapes to letters that it knows. It is an imperfect science, because OCR can struggle to work with older typefaces and deteriorating page originals, but levels of accuracy continue to improve as new OCR software is developed, and the results are generally satisfactory - that is, most of the time a researcher will find what they are looking for, if it is there to be found.

But added to this are software tools that can extract further sense from the raw data set that is generated by OCR. The field of what is called Natural Language Processing, by which computers come to understand human text and speech, includes the extraction of keywords, or named entities, and the matching of these to controlled lists of terms (such as DBpedia), further mapped to geographic areas and time periods, which enables researchers to undertake controlled, thematic analysis of large historical datasets. Our archive of words yields patterns of behaviour with much to tell about our past selves.
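As a rough illustration of matching extracted keywords to a controlled list, here is a minimal Python sketch. The term list, the sample sentence and the function name `extract_entities` are all invented for the example; a real project would draw on a resource such as DBpedia and on far more capable entity recognition than simple word lookup.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary: lower-case surface form -> canonical label.
# A real study would use a much larger list (for example, one derived from DBpedia).
CONTROLLED_TERMS = {
    "gladstone": "William Ewart Gladstone",
    "disraeli": "Benjamin Disraeli",
    "bristol": "Bristol",
}

def extract_entities(text, vocabulary=CONTROLLED_TERMS):
    """Count mentions of controlled terms in a piece of text, ignoring case."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(vocabulary[t] for t in tokens if t in vocabulary)

# Invented sample sentence standing in for OCR output.
sample = "Mr. Gladstone spoke at Bristol; Gladstone was answered by Disraeli."
counts = extract_entities(sample)
```

Because every mention is mapped to the same canonical label, counts from different articles, or different archives, can be added together and compared.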

This is the theme of a major project undertaken by the Intelligent Systems Laboratory at the University of Bristol, led by Professor Nello Cristianini. As described in their paper 'Content analysis of 150 years of British periodicals', the project worked on a corpus of newspapers digitised from the British Library's collection by family history company Findmypast for the British Newspaper Archive website. The figures involved are huge. The project analysed 28.6 billion words from 35.9 million articles contained in 120 UK regional newspapers over the period 1800-1950, which they calculate forms 14% of all regional newspapers published in the UK over the period.

The project then used this study to explore changes in culture and society, determined by changes in the language. It looks at changes in values, political interests, the rise of 'Britishness' as a concept, the spread of technological innovations, the adoption of new communications technologies (the telegraph, telephone, radio, television etc), changing discussion of the economy, and social changes such as mentions of men and women, the growth in human interest news and the rising importance of popular culture. It is the stuff of multi-volume histories of the past, boiled down to eye-catching graphs.

This does not mean that we throw away those multi-volume histories, however. The researchers are at pains to point out that such data analysis is an inexact science, with many caveats needed to explain how the entities have been arrived at and with what degree of caution they should be treated. The data derived from such tools can only work where it is supported by traditional studies, to gain the richer understanding of what happened. The machines may have taken the natural language of humans and converted it into data, but the results need to be converted back into human language to offer real understanding.

So it is that some of the project's findings may seem obvious. We could have guessed beforehand that the newspaper archive would show an increase in discussion of popular culture subjects, that politicians are more likely to achieve notoriety within their lifetimes than scientists, or that there was a rise in coverage of the Labour Party from the 1920s onwards. But the analyses reinforce through data what we have previously inferred through study, while discoveries such as the term 'British' overtaking the term 'English' at the end of the 19th century, or the decline in terms associated with 'Victorian values' - such as 'duty', 'courage' and 'endurance' - call for new studies to explore these things further.
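To give a feel for how a comparison such as 'British' versus 'English' might be computed, here is a minimal Python sketch that measures each term's share of all words per decade. It is not the Bristol project's method; the function name `relative_frequency_by_decade` and the toy articles are invented purely for illustration.

```python
import re
from collections import Counter, defaultdict

def relative_frequency_by_decade(articles, terms):
    """articles: iterable of (year, text) pairs.
    Returns {decade: {term: share of all words in that decade}}."""
    wanted = {t.lower() for t in terms}
    totals = Counter()            # total words seen per decade
    hits = defaultdict(Counter)   # per-decade counts of the wanted terms
    for year, text in articles:
        decade = (year // 10) * 10
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[decade] += len(tokens)
        for tok in tokens:
            if tok in wanted:
                hits[decade][tok] += 1
    return {
        d: {t: hits[d][t] / totals[d] for t in sorted(wanted)}
        for d in sorted(totals)
        if totals[d]
    }

# Invented toy data, not real newspaper text.
toy = [
    (1851, "the english fleet and the english court"),
    (1858, "a british ship"),
    (1901, "the british empire and british trade"),
    (1905, "an english garden"),
]
trend = relative_frequency_by_decade(toy, ["British", "English"])
```

Dividing by the total number of words in each decade matters because the archive holds far more text for some periods than others; raw counts alone would mostly reflect how much was digitised.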

The project is at pains to point out the importance of using newspaper archives. Previously we have had big data analyses of millions of historical books, most familiar through the Google Ngram Viewer. This has caused controversy among some scholars, because of the unevenness of coverage of topics in books, and the limitations of merely counting words and making them searchable again. Opening up newspaper archives for comparable analysis widens the amount of content available, arguably with greater reliability overall, and now with tools to make analysis that much more scientific. The use of controlled terms will also enable analysis across different datasets - so, books and newspapers, but also other news forms, as subtitle extraction and speech-to-text technologies now start to make our television and radio archives available for similar and shared analytical studies. Our big data is only going to get bigger.

There are limitations to this use of newspaper archives. The quality of OCR varies not only according to the original newspaper, but according to the microfilm where this has been used instead of print. Digitisation is quicker and cheaper this way than digitising from print, but older microfilm can be photographically poor, leading to inferior OCR (though there are promising tools appearing for improving poor OCR). The British Newspaper Archive is made up mostly of UK regional newspapers, because the main nationals have often been digitised by their current owners and are available separately. How different was the discourse in newspapers based in London from those around the rest of the country? That has to be the subject of another major study.

One of the better jokes from the Victorian Meme Machine project

The British Library has been engaged in its own big data analyses of newspaper archives. BL Labs is an initiative designed to support and inspire the public use of the British Library’s digital collections and data in exciting and innovative ways. It has facilitated several studies of British historical topics through the digital newspaper archive. These include Bob Nicholson of Edge Hill University's study of jokes in Victorian newspapers, with the concept of the Victorian Meme Machine (automatically matching jokes to an archive of contemporary images); Katrina Navickas of the University of Hertfordshire's mapping of nineteenth century protest; and Hannah-Rose Murray of University of Nottingham's tracing of black abolitionists in 19th century Britain. A major user of our newspaper data is M.H. Beals of Loughborough University, who is researching how ideas travel across the historical news media, creating new insights through understanding newspaper archives as structured data.

Such projects are just the start. The availability of large-scale newspaper archives in digital form, and the data derived from such archives, enables us both to seek answers to traditional questions more quickly, and to start asking new kinds of questions. The latter is the great challenge that newspaper data offers. We need to come up with new questions, because the technology enables us to do so, and because it may question what we previously thought that we knew. As the data from their archives becomes more readily available, and more easily usable by the non-data specialist, so we will find that we have only just started to read the newspapers. We are going to find that they have much more yet to tell us.

19 September 2014

St Pancras Intelligencer no. 34

Your blogger has been away on his holidays, now returned refreshed, so this edition of the St Pancras Intelligencer is a leisurely look back at some of the news items about news that caught our eye over the past three weeks.

Newspaper front pages show a divided Scotland: Mashable collects the memorable newspaper front pages from Thursday 18 September 2014, the day of the Scottish independence referendum.

Yes comes out on top amid more than 7 million tweets on #indyref, Twitter reveals: And demonstrating the limited value of using Twitter as a gauge of overall public opinion, The Drum reveals that pro-Scottish Independence came out on top according to social media.

Source confidentiality is 'in peril' and needs 'urgent action' to combat state spying: Alan Rusbridger, editor of The Guardian, came to the British Library and spoke on the urgent need to protect journalists' sources:

This whole thing that's supposedly sacred to journalists about confidentiality of sources is in peril. And that requires urgent action by journalists to make sure they understand the technologies that will enable them to communicate.

Press Gazette reports.

Accuracy, independence and impartiality: A Reuters Institute for the Study of Journalism report on how editorial standards are maintained in a digital age, focussing on three 'legacy organisations' (the Guardian, the New York Times, and the BBC) and three digital outlets (Quartz, BuzzFeed, and Vice News). 

Designer or journalist: Who shapes the news you read in your favorite apps?: Really interesting piece from Nieman Journalism Lab on who has influence over how news apps look.

Can news literacy grow up?: Thoughts from Lindsay Beyerstein at Columbia Journalism Review on the "critical-thinking skills necessary to discern what is trustworthy in this churning informational stew".

Here comes the papers: After a year in which we closed down our former newspaper library at Colindale and began populating the new store at Boston Spa, the British Library is ready to make print newspapers available again for researchers. Some will be available from the end of September; the remainder in November. Our blog post has the details.

Yep, BuzzFeed is building a games team: BuzzFeed is getting into games development, as TechCrunch reports.

How robots consumed journalism: An intriguing short history of the involvement of robots in news production, starting in the 1770s with Swiss watchmaker Pierre Jaquet-Droz who built “The Writer,” a 6,000-part automated doll that could be mechanically programmed to write with a quill. And for robots writing the news now (they're growing in number), there's this sobering Guardian piece: The journalists who never sleep (and one of the programs covered is called Quill).

The newsonomics of the Washington Post and New York Times network wars: Ken Doctor at Nieman Journalism Lab reviews the competition between the two titles through digital networks and niche print products.

Sir Alan Moses says IPSO is not Leveson-compliant but insists that it will be independent: The Press Complaints Commission closed on 8 September, to be replaced with the Independent Press Standards Organisation (IPSO). The head of the new regulator tells Press Gazette that it will live up to the first word in its name.

NewsCorp: Google is a 'platform for piracy': NewsCorp has written to the European Commission to complain that Google's huge scale puts newspapers and news sites at a disadvantage.

The death of the political interview: Newsnight editor Ian Katz writes for the Financial Times on how the political interview has gone wrong and what might be done to change things.

The dizzying decline of Britain’s local newspapers: do you want the bad news, or the good news?: Ian Burrell at The Independent says print circulation figures for regional newspapers suggest they are facing imminent extinction, but sees some reasons for optimism in the rise in online audiences and associated revenues.

How to download bulk newspaper articles from Papers Past: One for the techies out there - software developer Conal Tuohy shows how to extract bulk data from the excellent Papers Past site of New Zealand historical newspapers, and to apply data mining tools to uncover patterns in the articles.

Do people remember news better if they read it in print?: Thought-provoking piece on news consumption, from The Atlantic.

Guardian building Guardian Space at King's Cross: The Guardian is renovating a 30,000 square foot space - Guardian Space - to host live activities at King's Cross. So, just around the corner from the British Library and its Newsroom. Hello there.