The Newsroom blog: Digital scholarship

31 March 2022

Visualising Victorian News

Back in 2016 we wrote a post on this blog entitled News is Beautiful. It looked at the art of infographics and data visualisations in explaining the news of today. How interesting it would be, we speculated, if the infographic artists of today could work with data from historical newspapers. What would the results look like and what would we learn from them? Six years on, we have some answers.

Advocates of Freedom infographic by Ciaran Hughes

On 1 April 2022 a small exhibition opens at the British Library's St Pancras site, entitled Visualising Victorian News. Inspired by the questions we raised back in 2016, a project was established as part of the Library's Heritage Made Digital programme to commission three infographic designers to work with data derived from nineteenth-century British newspapers digitised by the British Library to illustrate significant news themes from the Victorian era. The three artists we commissioned were Tiziana Alocci, Ciaran Hughes and Erik Nylund.

The work began in 2018 with the intention of exhibiting the results in the summer of 2020. Covid-19 put paid to such plans, but the extra two years turned out to be necessary for the learning process we needed to go through. Our original plan was to work from the raw text created by the process of digitising a newspaper (known as OCR, or Optical Character Recognition), extracting keywords to show patterns of development that we could ask the designers to express visually. It soon became clear that the raw text, though forming an essential component, was too impressionistic on its own and needed to be supported by data from other sources. We learned the importance of having a strong story; of having datasets that complemented and contrasted with each other, enabling comparisons to be made clear; and of the balance required between text, tables and image.

We learned that some stories that we would like to have told did not have the right datasets available. We learned that some datasets were of great interest as datasets, but did not necessarily produce satisfactory stories. Crucially, we learned the importance of working with researchers who had already used data in their work, or who had produced datasets as an output of their research. They could supply the materials needed and explain the themes and arguments that such data could best serve. Above all, we learned the importance of a productive, co-operative relationship with the designers, sharing ideas and understanding, through the process of building up a complex design, what would work best in telling the story.

Visualising Victorian News is the result. There are seven designs, on the themes of Abolitionism, Newspapers, Crime, War, Health, Machines and Tea. Each uses data from digitised newspapers, augmented by data from other sources, to illustrate these news themes. Each follows the original brief we gave to the designers, of looking striking from a distance, then being full of information for the viewer to discover as they get up close. Each design is accompanied by a panel naming the artist, researchers, data sources and other sources. The designs are on display in the upper ground area of our entrance hall. The exhibition is free, and runs to 21 August 2022, accompanying our major exhibition on British news, Breaking the News, which opens on 22 April and runs to the same date.

As it says on the introductory panel of the exhibition, when we digitise a historic object, we do not simply reproduce what the original looks like, but untap a wealth of new information from the data it provides. To digitise is to create new histories.

To accompany the exhibition there is an event being held at the Library on 26 April, 19:15-21:00, Beautiful News / Visualising Victorian News. This will bring together the three designers behind our exhibition and David McCandless, the 'king of infographics', whose bestselling books Information is Beautiful and Knowledge is Beautiful have been joined by his latest, very appropriate title, Beautiful News. There will be a special viewing of the exhibition beforehand, with the designers present, from 17:45 to 19:00.

Links:

Visualising Victorian News exhibition: https://www.bl.uk/events/visualising-victorian-news
Beautiful News event: https://www.bl.uk/events/beautiful-news-visualising-victorian-news
David McCandless's Beautiful News: https://informationisbeautiful.net/beautifulnews
Tiziana Alocci: https://www.tizianaalocci.com
Ciaran Hughes: https://www.ciaranhughes.design
Erik Nylund: http://eriknylund.se

Posted by Luke McKernan at 6:25 PM in Digital scholarship, Events, Exhibitions, Newspapers

07 September 2020

The news from Leeds

Announced as it was in the middle of March of this year, it is possible that not all may have read of the British Library's ambitions to extend its operations in some form through a new public space in Leeds. The government has made a £25 million commitment, as part of the West Yorkshire Devolution deal, to establish a British Library North in Leeds City Centre. Exploratory discussions are underway between Leeds City Council, the British Library and property developer CEG about the potential for the British Library to occupy the Grade I listed Temple Works site.

From a Tiziana Alocci infographic on the Crimean War

As part of this process, we have been working with various Leeds organisations and groups to explore shared interests through a programme of public events. One of these, the Leeds Digital Festival, takes place 21 September-2 October, and includes two events (among 294) that feature the British Library news collections. As we digitise more and more of our news collections, and as research applications of a digital news library continue to develop and challenge us, we are pleased to be able to showcase two particularly interesting events that emphasise creativity and new thinking.

AI and the Headline Archive (24 September, 12:00-13:00 - tickets still available)

As part of the Heritage Made Digital newspapers project, where we are digitising poor condition out-of-copyright newspapers, we are keen to share in imaginative ways of extracting and re-using the data. For this event we have been working with artists Tom Schofield, Sam Skinner and Nathan Jones from Torque Editions, who are using artificial intelligence and speed reading technology to explore aspects of our nineteenth-century newspaper collections, focussing on headlines and story titles. This event will discuss how new discoveries can be made about human-computer reading capacity and media flows by applying artistic and ‘hacker’ techniques to historical data.

Creating Captivating Data Visualisations (29 September 13:00-16:00 - sold out)

In May 2021 the British Library will be hosting a small exhibition on infographics on nineteenth-century themes, created out of newspaper data and other datasets. We have worked with three designers on this project, one of whom, the award-winning information designer Tiziana Alocci, will host this workshop. Together with the British Library's Lead Curator, News, Luke McKernan, Alocci will lead attendees through a hands-on, practical workshop in the creative process behind effective data visualisation, exploring best practices in the industry and how to make such work stand out. This project reflects our great interest in showing how historical news resources can be illuminated through current news applications, and in demonstrating creative applications of news data.

The Leeds development is one part of still larger plans to transform the British Library's existing site in the north of England, at Boston Spa in Yorkshire. Thanks to the Chancellor’s commitment, announced in the March budget, to invest up to £95 million, we will be able to renew and develop our Boston Spa site for the 21st century, securing its ability to store and make available our ever-growing national collection for generations to come. It is at Boston Spa that the majority of the nation's newspaper collection is held, in the National Newspaper Building.

Creating Captivating Data Visualisations has sold out already, but tickets are still available for AI and the Headline Archive, which is a free event. Do join us if you can, as we explore how today's technologies can make yesterday's news speak to us in new and exciting ways.

Posted by Luke McKernan at 4:34 PM in Digital scholarship, Events, Newspapers, Workshops

02 April 2019

Vaccination and the media - a 19th century debate

Conspiracy theories capture the public’s interest and imagination. It’s evident in the documentaries about flat-earthers on Netflix, BBC podcasts about the anti-vaccination movement, and the panic surrounding the ‘Momo challenge’. The anti-vaccination movement, in particular, has been getting a lot of coverage lately, because of high-profile sympathisers and the potential damage to society’s health. There’s a lot of public and media interest in understanding this very modern-seeming phenomenon. But conspiracy theories are not new and neither are anti-vaccination movements.

Vaccination stories from 19th century British newspapers

The 19th century had an anti-vaccination movement which organised meetings, wrote letters and even paid the fines of those convicted of refusing to have their children vaccinated. They wrote letters denouncing enforced vaccination, arguing that it was an encroachment by the government on civil liberties, and that the vaccination was as dangerous as, or more dangerous than, the disease it sought to prevent. They produced pamphlets and political cartoons. The movement spoke to fears about overreaching state power and technology encroaching on personal freedom and an imagined pastoral idyll. On the pro-vaccination side, the debate used science and statistics to prove that vaccines were necessary, and argued that they were compulsory because they ensured the safety of all, especially the weak.

It was long known that infecting patients with a mild dose of smallpox led to them developing resistance to the deadlier strains (apparently some places had a tradition of blowing powdered smallpox scabs up the noses of patients to inoculate them - another reason to be grateful for the advances of science). In 1796 Edward Jenner ‘discovered’ that those infected with cowpox (a very mild disease) also developed resistance to smallpox. He developed the world’s first vaccination: the word comes directly from the cowpox method used – vacca is the Latin word for cow. Jenner’s vaccine spread in popularity and was made compulsory in several European countries, including England in 1856. Children were to be vaccinated within six or seven months of birth, and a fine of up to £2 would be given in the event of failure. Failure to pay the fine could mean, eventually, a prison sentence. In 1867, another bill was introduced requiring re-vaccination after puberty. It was at this point that the anti-vaccination movement took hold.

The debate played out in the newspapers: there were articles and letters to the editor arguing both sides. The controversy even affected newspaper advertisements: entrepreneurs advertised ointments which supposedly eased the skin complaints of those recently vaccinated:

Nairnshire Telegraph and General Advertiser for the Northern Counties, 28th September 1859, via British Newspaper Archive

But the same advertisement is found in papers all over the country:

Coventry Evening Telegraph, 23 May 1892, via British Newspaper Archive

It’s hard to imagine a movement of this type existing without easy access to mass communication. Letters to the editor, for example, proved an easy way for those with fringe views to put their opinions on an even footing with more commonly-held opinions. This access to a platform allowed the movement to assume an authority it would not otherwise have had: access to the same media as mainstream material can make both sides of an argument appear equally valid, even when that isn’t the case.

But how big was the conversation about vaccines, exactly? Looking at a large sample of newspapers published over the period can give us some clues. This data is from a set of around 62,000 19th century newspaper issues held by The British Library and digitised with JISC funding. It’s a simple approach: counting the relative frequency of a word over time can give an idea of how important the topic was at any time, although it doesn’t tell us anything about why it was being discussed or in what way. It also misses out alternative spellings or mis-spellings. But it can help us to identify general trends.
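As a rough illustration of the counting approach described above, the following Python sketch works out how often a term appears per 1,000 words in each year. The function names and the three sample issues are invented for illustration; this is not the code or the data behind the figures discussed in this post.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relative_frequency_by_year(issues, term):
    """Return {year: occurrences of `term` per 1,000 words}.

    `issues` is an iterable of (year, text) pairs, one per newspaper issue.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in issues:
        words = tokenize(text)
        totals[year] += len(words)
        hits[year] += words.count(term)
    return {
        year: 1000 * hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year]  # skip years with no readable text
    }


# Invented sample data, standing in for OCR text from real issues.
sample_issues = [
    (1856, "The vaccination bill was read. Vaccination is now compulsory."),
    (1856, "Market prices and shipping news."),
    (1871, "A meeting against vaccination was held in Darlington."),
]

freq = relative_frequency_by_year(sample_issues, "vaccination")
```

Because the match is on the exact token, variant spellings and OCR errors are missed, which is the limitation noted above.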

Unsurprisingly, we see some spikes. There are some small spikes in mentions of vaccination at the time the compulsory bill was introduced in 1856, and again for the re-vaccination bill in 1867. The interest in vaccination itself doesn’t really come until about 10 years later: an interesting indication, perhaps, of the lag between the conversation about the disease taking off, and policy (in the form of a compulsory vaccination bill) being formed. The first real spike of interest is in the early 1870s, and here I think we can detect the anti-vaccination movement. The mentions of vaccinations in this second spike are more related to the debate – on both sides. There are times, around 1888 and again in about 1896, when mentions of the disease are not really followed by mentions of vaccination. These may be times when enthusiasm for anti-vaccination groups falls on account of fears for the disease itself.

The debates themselves played out on the pages of the regional and national newspapers. They were bitter, and echoed those of today. A reprinted letter from The Lancet sums up some of the frustration on the side of the pro-vaccination:

The members of this league have some “talents for mischief,” not from the facts which they adduce, which are too insignificant to be noticed, nor from the arguments which they employ, which if they were only addressed to reasoning minds, would assuredly be recognised as puerile and contemptible, but these gentlemen wield more powerful arguments in support of the cause which they advocate. These are the hackneyed appeals to the ‘liberty of the subject: the resistance to a tyrannous enactment, and the publication of “striking” and dreadful cases of disease, and even death, as the results of vaccination.

Then, as now, the scientific and medical communities were frustrated by arguments invoking more abstract ideas: those that appealed to emotion over reason.

The insensibility of many persons to the danger of smallpox, and to the value of vaccination as a preventive, appears to arise from two causes; of which one is total ignorance of the horrors of the past, and the other is scepticism as to the representations of those who are well informed.

The author of an article in the Edinburgh Medical Journal, Dr. John Gairdner, used historical arguments to appeal to reason. He searched the archives to produce a list of royal family members who had died from smallpox. The influence of the monarchy on ordinary people was also used in other ways to promote vaccination: In February 1871 the Manchester Evening News reported that “The Queen has been revaccinated and wishes it to be generally known”. Perhaps these more narrative-focused, non data-driven arguments were seen to have more influence than statistics.

The anti-vaccination side had three main tactics. First was picking statistics which supported their argument. Second was appealing to arguments about personal freedom. In 1882, one letter to the editor of the Derby Chronicle tried to reason that vaccination should not be compulsory because the disease didn’t affect those already vaccinated:

When Mr. Cotteman has proved that doctors have a moral right to scratch us with a pin from which evil effects may follow, he may be able to prove that they have a right to insist upon vaccination. Yet this would be superfluous, since vaccination is a protection in his estimation. The protected being safe, why compel objectors?

This argument, of course, overlooked those who were unable to get vaccinated for health reasons, or the small percentage on which the vaccination had no effect.

The third tactic was supplying anecdotal evidence of individual cases where the vaccine had disastrous consequences. A writer to the Leicester Chronicle wrote in to describe a child that had been recently vaccinated, saying that it had been ‘fine, fair and healthy looking’ but after vaccination was covered all over with sores, “so much so that it is repulsive to see the poor thing”.

These groups were often hyper-local. Groups like the ‘Darlington Anti-Vaccine League’ had regular meetings and advertised them in local papers. The debate played out in the pages of the regional papers, rather than through national, official channels.

We can use news data to get some insight into the changing perceptions of the word ‘vaccination’. These word clouds illustrate the words that most commonly appear in sentences with the term:

In 1856, the words are mostly related to the financial and administrative aspects of vaccination. Thirty years later, the most commonly associated words have become a mix of administrative-type words, and some terms which clearly relate to suspicion and controversy surrounding compulsory vaccination. The conversation in the newspapers about vaccinations clearly changed in the intervening years. Now vaccination is mentioned with ‘child’ and ‘children’. It doesn’t prove that the conversation was negative, but it does show that newspapers were commenting on the more human element of vaccination. It’s a personal as well as a public conversation.
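The counts that sit behind word clouds of this kind can be sketched in a few lines of Python: split the text into sentences, keep only those containing the term, and count the other words in them. The stopword list, the naive sentence splitter and the sample text below are all assumptions made for illustration, not the method actually used for the clouds described here.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "by", "on", "is", "was", "and", "to", "in"}


def cooccurring_words(text, term):
    """Count words that share a sentence with `term` (excluding stopwords)."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    counts = Counter()
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        if term in words:
            counts.update(
                w for w in words if w != term and w not in STOPWORDS
            )
    return counts


# Invented sample text.
sample_text = (
    "The vaccination fee was paid by the guardians. "
    "The guardians met on Monday. "
    "Every child must undergo vaccination. "
    "Vaccination of the child is compulsory."
)

counts = cooccurring_words(sample_text, "vaccination")
```

The resulting counts are what a word cloud would scale its words by; here 'child' appears in two of the matching sentences and so would be drawn largest.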

Compared to the fear of cholera, the attention given to smallpox by the newspapers was small, and despite spikes at the end of the century (when a ‘conscientious objector clause’ was inserted into a new vaccination bill), generally interest in the controversy surrounding vaccinations waned. What a good conspiracy really needs is air: studies have shown that the more we are exposed to an idea, the more likely it is we’ll believe it is true, regardless of the evidence we’re given. It’s possible that the anti-vaccination movement lost steam because it wasn’t being talked about in the newspapers any more.

The debate surrounding smallpox vaccination tells us something about the ways in which information and communication can be used to spark debates that previously would have stayed hidden. Cultural movements, however small, are often facilitated by the expansion of access to new technology (such as newspapers in the latter half of the 19th century, or the internet at the beginning of the 21st). When these technologies reach a critical mass, they expand the ‘public sphere’ to take in the viewpoints of the minority - even when those views cause us discomfort. Opposition to anti-vaxxers proved difficult: work like Gairdner’s book might have helped to counter the movement in a way that statistics themselves didn’t seem to. Time was the best opposition: in the long run, it seems that the movement against smallpox vaccination simply petered out. Smallpox vaccinations continued, and a worldwide programme led to the virtual eradication of the disease by 1980.

It may be surprising to see such strong opposition to vaccination in a world with such a terrible problem with disease. Today these diseases can seem far removed from our lives, but in the 19th century the evidence was so incredibly clear: smallpox infection rates plummeted in areas with vaccinations. People lived with the fear and threat of infectious diseases, and most families would have been affected, at some point, by diseases like smallpox. Despite this, there was still resistance to compulsory vaccination. Even in the face of overwhelming evidence, when the alternative was a very real chance of disfigurement or death, illogical viewpoints can take hold.

Without an outlet like a regional newspaper or Reddit forum, these fringe viewpoints can often stay buried. It’s only when a place is found for them to be debated that the ideas can really spread. Regional newspapers allowed the debate to reach all parts of the United Kingdom and helped the creation of hyper-local interest groups. Today, the internet allows for the spread of ideas to any part of the world, in a very short space of time. Fringe movements can reach a critical mass even though their number in any one area may be tiny. Do new technologies breed conspiracy theories? Is the debate related to the ease with which people can communicate over long distances, to a large group of people? Does the democratization of media bring together communities of like-minded individuals, and what consequences does this have for society? These are crucial questions of both the 19th century and our own.

“The race of mankind would perish”, wrote a correspondent to the Isle of Wight Observer in 1856,

did they cease to aid each other. From the time that the mother binds the child’s head, till the moment that some kind assistant wipes the death-damp from the brow of the dying, we cannot exist without mutual help. All, therefore, that need aid, have a right to ask it of their fellow-mortals; no one who holds the power of granting can refuse it without guilt.

Those in favour of vaccination would argue that herd immunity ensures the safety of all: claiming a right personally to refuse vaccination means increasing the danger to those who are unable to get protection through no fault of their own. The debate about personal freedom and public good still continues.

Links:

There has been some interesting work on disease done by researchers at Lancaster, mapping mentions of cholera in the 19th century: https://www.lancaster.ac.uk/fass/projects/spatialhum.wordpress/?page_id=652
Article about opposition to smallpox vaccination, with historical information about the disease: https://www.researchgate.net/publication/233898083_Smallpox_vaccination_and_opposition_by_anti-vaccination_societies_in_19th_century_Britain

Yann Ryan

Curator, Newspaper Data

Posted by Luke McKernan at 8:01 AM in Digital scholarship, Newspapers, Science

12 March 2019

News counts

How do I love thee? Let me %>% group_by (ways) %>% count()

Counting is very simple. We’ve been doing it for 50,000 years. One of the first things we learn as a child is how to count: before, or at the same time as, we learn the alphabet, we learn to count to ten. First we learn to count on our fingers, perhaps next we count on an abacus. Eventually we graduate to counting on a calculator or a computer. Computers are very good at it, too, which is useful. Give a computer some text, and it can really quickly count lots of things for you: things like the total number of words, the total number of characters or the number of unique words. Counting helps us do lots of useful things. Counting can help us to break codes or compress data. Samuel Morse counted the average frequency of letters in the English language and assigned the most frequent ones to shorter dot-dash combinations. Your computer is doing the same thing when it zips or unzips a file.
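To make that concrete, here is a minimal Python sketch of the counts mentioned above: characters, words, unique words, and the letter frequencies that a Morse-style code would be built from. The function names and the sample sentence are our own, chosen for illustration.

```python
import re
from collections import Counter


def basic_counts(text):
    """Return the character, word and unique-word counts for a text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {
        "characters": len(text),
        "words": len(words),
        "unique_words": len(set(words)),
    }


def letter_frequencies(text):
    """Count how often each letter appears, ignoring case and punctuation."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


sample = "The news of the day is the news of the hour."
stats = basic_counts(sample)
letters = letter_frequencies(sample)
```

In this sentence the most frequent letter is 'e', which is also the letter given the shortest signal, a single dot, in Morse code.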

Corpus analysis is the study of lots and lots of words of a particular type. Google N-Gram browser finds words or short phrases in millions of digitised books. EEBO N-Gram browser does the same for millions of transcribed texts from the 17th and 18th century. At this scale, simple counting becomes really powerful. Using these tools, researchers can count the frequency of words, which can be the starting point for understanding how words were used and how ideas gained or lost momentum over time. These tools count the relative frequency of words: how unusual is it to have this word here? Are there many more instances of a word appearing than one would expect from the usual frequency? Simply counting can tell us the importance of terms, ideas, concepts in particular texts, or at particular times.
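The question 'are there more instances than one would expect?' can be expressed as a simple ratio: take the rate at which a word occurs in a large reference collection, work out how many occurrences that rate would predict for the text in hand, and compare with what is actually there. The sketch below assumes this simple ratio and uses made-up token lists; the n-gram browsers mentioned above may well calculate things differently.

```python
def expected_count(term, reference_tokens, target_length):
    """Occurrences expected in a text of `target_length` words,
    if the term occurred at the same rate as in the reference tokens."""
    reference_rate = reference_tokens.count(term) / len(reference_tokens)
    return reference_rate * target_length


def overuse_ratio(term, target_tokens, reference_tokens):
    """Observed count divided by expected count (None if nothing is expected)."""
    expected = expected_count(term, reference_tokens, len(target_tokens))
    if expected == 0:
        return None  # the term never appears in the reference collection
    return target_tokens.count(term) / expected


# Made-up data: 'steam' is 2% of the reference words but 10% of the target.
reference_tokens = ["steam"] * 2 + ["filler"] * 98
target_tokens = ["steam"] * 5 + ["filler"] * 45

ratio = overuse_ratio("steam", target_tokens, reference_tokens)
```

A ratio well above 1 suggests the word is unusually prominent in the text being examined; a ratio near 1 suggests nothing out of the ordinary.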

We can divide things up and then count them: How many times did a particular phrase appear in a particular location? At a particular time? In a particular title?

We can count counts: How many titles were printed in a particular year, and how many words did each of those titles contain?

What else can we count? How about whole documents: how many newspapers were printed in the 19th century? How many titles? How many times was the word ‘Gladstone’ mentioned, vs ‘Disraeli’? Did mentions of ‘steam’ overtake mentions of ‘horse’? Counting can be a blunt tool, but it’s a starting point.
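Dividing things up and then counting them, and counting the counts, might look something like the following Python sketch. The records, titles and field names are placeholders invented for the example rather than real catalogue data.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_term_by(issues, term, field):
    """Count mentions of `term`, grouped by a field such as 'year' or 'title'."""
    counts = Counter()
    for issue in issues:
        counts[issue[field]] += tokenize(issue["text"]).count(term)
    return counts


def words_per_title_per_year(issues):
    """Return {year: {title: total words}}: which titles appeared, and their size."""
    table = defaultdict(Counter)
    for issue in issues:
        table[issue["year"]][issue["title"]] += len(tokenize(issue["text"]))
    return table


# Placeholder records, one per issue.
sample_issues = [
    {"title": "Title A", "year": 1868,
     "text": "Mr Gladstone spoke at length. Mr Disraeli replied."},
    {"title": "Title A", "year": 1874,
     "text": "Mr Disraeli formed a ministry."},
    {"title": "Title B", "year": 1868,
     "text": "Gladstone and Gladstone again."},
]

gladstone_by_year = count_term_by(sample_issues, "gladstone", "year")
disraeli_by_year = count_term_by(sample_issues, "disraeli", "year")
gladstone_by_title = count_term_by(sample_issues, "gladstone", "title")
size_table = words_per_title_per_year(sample_issues)
```

The same two functions answer 'when?', 'in which title?' and 'how much was printed?' simply by changing the field that the counts are grouped by.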

A couple of crude word searches using millions of pages of text from selected 19th century British newspapers

To take a concrete example: let’s do some counting on a single issue of one of the newspapers we’re digitising as part of our Heritage Made Digital project. We’ve taken the text of this issue and uploaded it to a web app called Voyant Tools. Voyant Tools takes text files and gives statistics and visualisations of the words within. What are the counts in this issue? This single issue has 29,734 words. It has 7,793 unique words, which could tell us something about the type of audience, or the ‘footprint’ of the author or title. What are the most common words?

Let’s quickly think about some of these words and their implications.

Mr tells us that news is, unsurprisingly, biased towards reporting about one gender.

Street, house and place are intriguing, if not surprising. News is so much about space and place. Without a sense of time, news ceases, really, to be news. Perhaps the same can be said about news and space?

Which leads into the next word: Jan (the abbreviated version of January). This is a newspaper from 6 January 1821. This, alongside Dec (December shortened), tells us something about the age of news. Would you expect more or fewer mentions of December once news is transmitted via telegraph? There’s also day and time. It’s unlikely these words would be so common in, say, a novel, or a scientific paper. Can counting tell us something about genre?

We can count the counts: Can the words be divided into categories and counted?

What does this tell us? Well, it probably tells us more about the makeup of each individual page than anything else. We could probably guess the front page by looking at its unique words. The front page was often mostly advertisements, and contact details would include words like street and Mr. It also confirms our belief that news is about information in space and time: clearly there’s a focus on place, time and people, in a way that would presumably not be so apparent in, say, a novel. If we counted the change in common words over time, we could get a picture of the changing makeup of the front page, as it moved from advertisements to headline news.
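Dividing words into categories and counting them page by page could be sketched as follows. The three categories reuse the common words picked out above (street, house, place; Jan, Dec, day, time; Mr), but the grouping itself and the two sample pages are our own invention, not the output of Voyant Tools.

```python
import re
from collections import Counter

# Hand-made categories, based on the common words discussed in this post.
CATEGORIES = {
    "place": {"street", "house", "place"},
    "time": {"jan", "dec", "day", "time"},
    "people": {"mr"},
}
WORD_TO_CATEGORY = {
    word: category
    for category, words in CATEGORIES.items()
    for word in words
}


def category_counts(text):
    """Count how many words on a page fall into each category."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        category = WORD_TO_CATEGORY.get(word)
        if category:
            counts[category] += 1
    return counts


# Invented page text: an advertisement-heavy front page and a news page.
pages = {
    1: "Mr Smith, High Street. Apply at the house of Mr Jones, Market Place.",
    4: "On the last day of Dec the mail arrived; by Jan 3 the news was known.",
}

per_page = {page: category_counts(text) for page, text in pages.items()}
```

On this toy example the front page is dominated by people and place words and the inside page by time words, the kind of page-by-page difference described above.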

Counting is a most natural human urge and one that can have very interesting outcomes. It’s a start for all sorts of interesting research: a way to make all sorts of (often wrong) assumptions. Because counting is dangerous. It attempts to put numbers on things that may not be enumerable. We may find our attempts at counting frustrated by the stubborn fuzziness of the world, stymied by our need to put order on disorder. Over the coming months we hope to show some of the interesting things that can be done with the millions of pages being digitised by Heritage Made Digital, and lots of this research will involve, at its core, counting.

In digital scholarship, it sometimes feels like there is a move away from counting to produce results. Machine learning seems at a great distance from a chart of the most-commonly used words in a bunch of text. But machine learning still often takes a simple count as its raw material. The ‘features’ (the attributes of things we feed machine learning algorithms to make predictions about those things) are often elements like the total count of words in a particular document, or the count of unique words. No matter how sophisticated these methods get, they still, in the end, rely on counting.
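As a small illustration of counts serving as 'features', the Python sketch below turns each document into a row of numbers: total words, unique words, and the count of each word in a chosen vocabulary. The function name, vocabulary and documents are assumptions made for the example; a real project would more likely use an established library for this step.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_features(documents, vocabulary):
    """Return one feature row per document:
    [total words, unique words, count of each vocabulary term...]."""
    rows = []
    for text in documents:
        words = tokenize(text)
        counts = Counter(words)
        rows.append(
            [len(words), len(counts)] + [counts[term] for term in vocabulary]
        )
    return rows


# Invented documents and vocabulary.
documents = [
    "Steam engines and steam ships.",
    "The horse fair opens today.",
]
vocabulary = ["steam", "horse"]

features = count_features(documents, vocabulary)
```

Rows like these are the sort of raw material a machine learning algorithm can be given in place of the text itself.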

Yann Ryan,

Curator, Newspaper Data

Posted by Luke McKernan at 2:12 PM in Digital scholarship, Newspapers

02 August 2018

Wanted - a curator for newspaper data

We are currently advertising for a Curator, Newspaper Data to join our news curatorial team. This is a fixed-term post until March 2020, based at our St Pancras site in central London. The post is being advertised as part of the British Library's Heritage Made Digital programme, a major part of which involves digitising 19th century British newspapers, with a special focus on newspapers in a poor or unfit condition.

We are looking for someone who will help us to apply data journalism thinking to this historical news material. The person we are looking for will be responsible for the analysis and creative interpretation of data derived from Heritage Made Digital and related British Library newspaper digitisation projects. They will prepare derived newspaper data sets and promote these for use by researchers. They will work with researchers to develop projects using newspaper data.

In particular we want them to help us produce stylish visualisations using historical newspaper data, working with third-party designers as necessary. A couple of years ago on the Newsroom blog we wrote about the art of the news visualisation, and how this particular branch of data science was helping to illuminate the themes behind the news. We also said that it would be a good idea if such thinking, and with such outputs, could be applied to historical news data. Now we want someone to put those thoughts into action.

The post-holder will need to have a strong background in computer science and data science. They will have experience of working with or developing tools for large content and data volumes, and an interest in nineteenth-century history and/or news and current affairs. It's a terrific opportunity for the right person. Information on how to apply is on the British Library's vacancies site. The deadline for applications is 9 September 2018.

Posted by Luke McKernan at 2:06 PM in Digital scholarship, Newspapers

07 August 2017

Help us make newspaper heritage digital

We are currently advertising for a Curator, Newspaper Digitisation to join our news curatorial team. This is a fixed-term post until March 2020, based at our St Pancras site in central London. The post is being advertised as part of a major new British Library undertaking, entitled Heritage Made Digital. As it says in the Library's recently-published Annual Report, the programme of work will include the digitisation of Indian printed books, key Ethiopic manuscripts, and fragile British newspapers from the 19th century.

Bound volumes in the National Newspaper Building at Boston Spa

The Heritage Made Digital programme is in its early stages of development, but our intention is to digitise over 1 million newspaper pages from print originals, complementing the digitisation of newspapers undertaken by Findmypast for the British Newspaper Archive, the greater part of which comes from microfilm copies.

Working with the News Curation and Heritage Made Digital teams, the Curator, Newspaper Digitisation will be responsible for the selection, description and curation of newspapers under the Heritage Made Digital programme. They will ensure that the newspapers selected for digitisation will match specific research needs, and will promote and interpret the digital newspaper collection for general and specialist audiences.

The post-holder will need to have a strong interest in historical newspapers and nineteenth-century history, with experience of working in an archive environment, backed up with good knowledge of research work in this field, and strong IT skills. It's a terrific opportunity for the right person. Information on how to apply is on the Library's vacancies site. The deadline for applications is 10 September 2017.

More information on Heritage Made Digital will be published in due course.

Posted by Luke McKernan at 4:00 PM in Digital scholarship | Permalink

11 January 2017

Analysing the past

There are exciting changes happening in how we use newspapers to study the past. After decades in which the use of newspapers in research meant leafing through volumes or scrolling through microfilms, digitisation made millions of newspapers more readily searchable and far more widely available. But now digitisation has taken us to the next stage in development, which is using the data generated by the digitisation process to look at history on a grand scale. We are moving into the era of big data newspaper studies.

From the University of Bristol study: People in history. (A) famous personalities by occupation using all extracted entities associated with a Wikipedia entry; (B) the probability that a given reference to a person is to a male or a female person

Big data newspaper studies have come about through a combination of large-scale digital resources and a growth in analysis tools. Most will be aware of OCR (optical character recognition), the mechanism by which archival texts are converted into machine-readable texts: the software takes what a computer sees as an image (i.e. the arrangement of letters on a page) and matches the shapes to letters that it knows. It is an imperfect science, because OCR can struggle to work with older forms of type and deteriorating page originals, but levels of accuracy continue to improve as new OCR software is developed, and the results are generally satisfactory - that is, most of the time a researcher will find what they are looking for, if it is there to be found.
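One common way to put a number on OCR accuracy is the character error rate: the number of single-character edits needed to turn the OCR output into a hand-corrected transcription, divided by the length of that transcription. The sketch below is only an illustration of that idea, not the method of any particular OCR package; the function names and the sample strings are our own assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edits per reference character; 0.0 means a perfect match."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical example: 'li' misread as 'h', a typical OCR confusion.
print(character_error_rate("Parliament", "Parhament"))
```

A rate like this only tells us how far the text is from a trusted transcription; it says nothing about whether the errors fall on the words a researcher is actually searching for.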

But added to this are software tools that can extract further sense from the raw data set generated by OCR. The field of what is called Natural Language Processing, by which computers come to understand human text and speech, includes the extraction of keywords, or named entities, and the matching of these to controlled lists of terms (such as DBpedia), further mapped to geographic areas and time periods, which enables researchers to undertake controlled, thematic analysis of large historical datasets. Our archive of words yields patterns of behaviour with much to tell about our past selves.
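To give a feel for what matching text to a controlled list involves, here is a deliberately simple sketch. It is not how any real Natural Language Processing toolkit works: the tiny term list, the entity labels and the function name are all hypothetical, and a real project would draw its terms from a resource such as DBpedia and cope with OCR errors and ambiguous names.

```python
import re
from collections import Counter

# Hypothetical controlled list: surface forms mapped to a canonical label.
CONTROLLED_TERMS = {
    "gladstone": "William_Ewart_Gladstone",
    "mr gladstone": "William_Ewart_Gladstone",
    "disraeli": "Benjamin_Disraeli",
    "manchester": "Manchester",
}


def extract_entities(text, terms=CONTROLLED_TERMS):
    """Count mentions of controlled terms in a piece of text."""
    # Normalise: lower-case and keep only runs of letters, joined by spaces.
    normalised = " ".join(re.findall(r"[a-z]+", text.lower()))
    counts = Counter()
    # Try the longest surface forms first and blank out each match,
    # so 'mr gladstone' is not counted again as 'gladstone'.
    for surface in sorted(terms, key=len, reverse=True):
        pattern = r"\b" + re.escape(surface) + r"\b"
        normalised, found = re.subn(pattern, " ", normalised)
        if found:
            counts[terms[surface]] += found
    return counts


sample = "Mr. Gladstone spoke at Manchester; Gladstone replied to Disraeli."
print(extract_entities(sample))
```

Mapping every variant spelling to one label is what makes the analysis 'controlled': counts for a person or place can then be compared across newspapers, years and even other datasets that use the same labels.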

This is the theme of a major project undertaken by the Intelligent Systems Laboratory at the University of Bristol, led by Professor Nello Cristianini. As described in their paper 'Content analysis of 150 years of British periodicals', the project worked on a corpus of newspapers digitised from the British Library's collection by family history company Findmypast for the British Newspaper Archive website. The figures involved are huge. The project analysed 28.6 billion words from 35.9 million articles contained in 120 UK regional newspapers over the period 1800-1950, which they calculate forms 14% of all regional newspapers published in the UK over the period.

The project then used this study to explore changes in culture and society, determined by changes in the language. It looks at changes in values, political interests, the rise of 'Britishness' as a concept, the spread of technological innovations, the adoption of new communications technologies (the telegraph, telephone, radio, television etc), changing discussion of the economy, and social changes such as mentions of men and women, the growth in human interest news and the rising importance of popular culture. It is the stuff of multi-volume histories of the past, boiled down to eye-catching graphs.

This does not mean that we throw away those multi-volume histories, however. The researchers are at pains to point out that such data analysis is an inexact science, with many caveats needed to explain how the entities have been arrived at and with what degree of caution they should be treated. The data derived from such tools can only work where it is supported by traditional studies, to gain the richer understanding of what happened. The machines may have taken the natural language of humans and converted it into data, but the results need to be converted back into human language to offer real understanding.

So it is that some of the project's results may seem obvious. We could have guessed beforehand that the newspaper archive would show an increase in discussion of popular culture subjects, that politicians are more likely to achieve notoriety within their lifetimes than scientists, or that there was a rise in coverage of the Labour Party from the 1920s onwards. But the analyses reinforce through data what we have previously inferred through study, while discoveries such as the term 'British' overtaking the term 'English' at the end of the 19th century, or the decline in terms associated with 'Victorian values' - such as 'duty', 'courage' and 'endurance' - call for new studies to explore these things further.
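A comparison such as 'British' against 'English' rests on relative frequency: how often a term appears in a given year as a share of all the words printed that year, so that years with more surviving newspapers do not dominate. The sketch below shows the arithmetic only. It assumes articles arrive as (year, text) pairs, which is our own simplification rather than the format used by the Bristol project, and the function name and sample sentences are invented for illustration.

```python
import re
from collections import Counter, defaultdict


def yearly_relative_frequency(articles, terms):
    """Return {year: {term: share of that year's words}} for the given terms."""
    wanted = {term.lower() for term in terms}
    totals = Counter()            # total words seen per year
    hits = defaultdict(Counter)   # per-year counts of the wanted terms
    for year, text in articles:
        words = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(words)
        for word in words:
            if word in wanted:
                hits[year][word] += 1
    # Skip years with no words at all to avoid dividing by zero.
    return {
        year: {term: hits[year][term] / totals[year] for term in sorted(wanted)}
        for year in sorted(totals)
        if totals[year]
    }


# Invented sample data, purely to show the shape of the result.
articles = [
    (1850, "The English fleet and the English coast"),
    (1900, "The British fleet and the British empire and English law"),
]
for year, shares in yearly_relative_frequency(articles, ["British", "English"]).items():
    print(year, shares)
```

Plotting these shares year by year gives the kind of graph in which one term can be seen overtaking another, though the caveats about OCR quality and coverage still apply to every point on the line.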

The project is at pains to point out the importance of using newspaper archives. Previously we have had big data analyses of millions of historical books, most familiar through the Google Ngram Viewer. This has caused controversy among some scholars, because of the unevenness of coverage of topics in books, and the limitations of merely counting words and making them searchable again. Opening up newspaper archives for comparable analysis widens the amount of content available, arguably with greater reliability overall, and now with tools to make analysis that much more scientific. The use of controlled terms will also enable analysis across different datasets - so, books and newspapers, but also other news forms, as subtitle extraction and speech-to-text technologies now start to make our television and radio archives available for similar and shared analytical studies. Our big data is only going to get bigger.

There are limitations to this use of newspaper archives. The quality of OCR varies not only according to the original newspaper, but according to the microfilm where this has been used instead of print. Digitisation is quicker and cheaper this way than digitising from print, but older microfilm can be photographically poor, leading to inferior OCR (though there are promising tools appearing for improving poor OCR). The British Newspaper Archive is made up mostly of UK regional newspapers, because the main nationals have often been digitised by their current owners and are available separately. How different was the discourse in newspapers based in London from those around the rest of the country? That has to be the subject of another major study.

One of the better jokes from the Victorian Meme Machine project

The British Library has been engaged in its own big data analyses of newspaper archives. BL Labs is an initiative designed to support and inspire the public use of the British Library’s digital collections and data in exciting and innovative ways. It has facilitated several studies of British historical topics through the digital newspaper archive. These include Bob Nicholson of Edge Hill University's study of jokes in Victorian newspapers, with the concept of the Victorian Meme Machine (automatically matching jokes to an archive of contemporary images); Katrina Navickas of the University of Hertfordshire's mapping of nineteenth-century protest; and Hannah-Rose Murray of the University of Nottingham's tracing of black abolitionists in 19th-century Britain. A major user of our newspaper data is M.H. Beals of Loughborough University, who is researching how ideas travel across the historical news media, creating new insights through understanding newspaper archives as structured data.

Such projects are just the start. The availability of large-scale newspaper archives in digital form, and the data derived from such archives, enables us both to seek answers to traditional questions more quickly, and to start asking new kinds of questions. The latter is the great challenge that newspaper data offers. We need to come up with new questions, because the technology enables us to do so, and because it may question what we previously thought that we knew. As the data from these archives becomes more readily available, and more easily usable by the non-data specialist, so we will find that we have only just started to read the newspapers. We are going to find that they have much more yet to tell us.

Links:

All of the regional newspapers used in the University of Bristol project are available at www.britishnewspaperarchive.co.uk (subscription site, free to use at British Library locations)
The paper 'Content analysis of 150 years of British periodicals' is freely available in PDF format from Proceedings of the National Academy of Sciences of the United States (PNAS)
Secondary data from the project, in the form of yearly n-grams and entities, is freely available to download from http://data.bris.ac.uk/data/dataset/dobuvuu00mh51q773bo8ybkdz
A book by Paul Gooding on the issues surrounding the digitisation and use of newspaper archives, Historic Newspapers in the Digital Age, is to be published shortly by Routledge

Posted by Luke McKernan at 4:46 PM in Digital scholarship, Digitisation, Newspapers, Text mining | Permalink

24 September 2015

Mining the FT

We're pleased to announce a partnership with the Financial Times to open up its archives to new kinds of research. The business news daily newspaper has been running since 1888, and has a wealth of information on national and international economic news, and in recent years reporting on general news, the arts and society. Its digital archive is available in the standard search-and-browse manner to institutional subscribers via Cengage Gale, but the newspaper is interested to explore different ways to make its archives available, with an emphasis on what can be done with its data.

The full digital archive runs 1888-2010 and comprises 903,029 pages from 37,464 print editions. However, the collaboration is starting off with a relatively small amount of content, which may expand later. The FT has agreed a licence which permits use of the data for academic research purposes, either onsite at the British Library or via controlled remote access.

Four complete sample years of FT page images (as JPEGs) and data (XML) are being made available to research teams: 1888, 1939, 1966 and 1991. The licence runs to the end of 2015, when we will review what has been learned and will see how access and use may be extended thereafter. So the sample years would be ideal for researchers developing data-driven projects who need some test content to scope future plans, or to test tools or applications that they may be developing.

Anyone who is interested should get in touch with Luke McKernan, Lead Curator News & Moving Image at the British Library, who can provide further details. Research teams may also be interested to take part in the Library's first news hackathon, scheduled for November 16th, which will include FT data alongside data derived from the Library's own news collection. More news on this will be published soon.

The collaboration with the Financial Times is one part of emerging plans for British Library news data. The structure of news content offers numerous opportunities for analysing, interrogating, visualising and rethinking what news archives are today, as well as creating new kinds of newspaper and other news media history. We held a news data workshop on September 7th, where we brought together researchers, developers and content owners to look at ways we might develop plans for news data that would best benefit researchers. There's a report on the workshop on our Digital Scholarship blog.

We hope to issue news on further news archive datasets that we can make available for research in the near future.

Posted by Luke McKernan at 10:59 AM in Digital scholarship, Metadata, Newspapers | Permalink
