THE BRITISH LIBRARY

Digital scholarship blog

Enabling innovative research with British Library digital collections

Introduction

Tracking exciting developments at the intersection of libraries, scholarship and technology.

20 July 2016

Dealing with Optical Character Recognition errors in Victorian newspapers

This is the second of two posts featuring speakers at an internal seminar on spatial humanities; it is by Amelia Joulain-Jay of Lancaster University. Let's hear from Amelia...

Have you browsed through the British Library’s Collection of Nineteenth Century Newspapers? Then you have probably searched for a word in an article, only to find that some instances of that word were highlighted, and not others. In the following article, for example (which comes from the 24th August 1833 edition of the Leeds Mercury), searching for ‘Magistrates’ (without 'fuzzy search') highlights one instance in the second paragraph, but misses the instance in the first paragraph.

Figure 1. Image snap of “COUNTY RATE”, Leeds Mercury, 24 Aug. 1833, British Library Newspapers (login may be required). [Last accessed 13 Jul. 2016]


That’s because what you see is a picture of the original source, and you (as a human) are able to read it. But the search engine is searching through OCR output – text generated by Optical Character Recognition (OCR) software which tries to guess what characters are represented on an image. The OCR output for the passage above actually looks like this:

COUNTY RATE tvtaN s s fl s Loud complaintst have been madc and we believe jstly of the unequal pressure of the County Rate ripon the differenrt townships and parishes of and it has In consequence been deter inmosl to make a general survey and to establisB a new scale of ment To this the trading and tnanufacturing interests of the Riding do not object tiorgfl tile effect will doubtless be to advance their assessmcnts in coparlison with those of the agricultural parhitras But we confess that it wa with setrprise we heard that any of the Mogistrates in holding their Courts for the assessment of the respective townships had reated them into secret tribunals and that they lad excluded from their sittings thoso wlto are mainly interested in ascertaining the principles which goreen the raluation of propertt and the full and fair develtpmemnt of which can alone rcuider the decislons of their Courts either satisfactory or permaneent The frank and manly example set by tire township of Leeds dorg h0onour to tbe parish officers and we must say wIthout wishling to give offence to those for swhoimt we feel nothing but respect that the line of conduct r sued by ithe Magistrates at Bradford on Btoaday last in excludintgi a parist officer from their Court swhen they knew that he was tire organ of tie towvnship hltich contributes most targely to this impost il the ltole Riding and when lie lasi explained to them in latigniagr srfaitiently courteous anid respectful that lie sotght only rltv crlsis of public jusrice requires a anuch ittore satisfnectory explanation than toas either given on Lhat tccasion or than ee apprehendl con be give n for adopting one of the roost objectionrble characteristics of the Court of the Holy lrquisition

Figure 2. OCR data for “COUNTY RATE”, Leeds Mercury, 24 Aug. 1833, British Library Newspapers.

You can read a lot of it, but there are errors, including the first occurrence of ‘Magistrates’, which is spelt ‘Mogistrates’.
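To see why an exact search misses ‘Mogistrates’ while a fuzzy search could still find it, here is a minimal Python sketch. It is only an illustration and not how the British Library Newspapers search engine works: the function name fuzzy_find and the 0.8 threshold are assumptions made for this example, and the similarity score is the one provided by Python's standard difflib module.

```python
from difflib import SequenceMatcher


def fuzzy_find(query, text, threshold=0.8):
    """Return the words in `text` whose similarity to `query` is at least `threshold`."""
    query = query.lower()
    hits = []
    for word in text.split():
        # ratio() is 1.0 for identical strings and falls towards 0.0 as they diverge
        score = SequenceMatcher(None, query, word.lower()).ratio()
        if score >= threshold:
            hits.append(word)
    return hits


# Two fragments copied from the OCR output in Figure 2
ocr_text = (
    "we heard that any of the Mogistrates in holding their Courts "
    "sued by ithe Magistrates at Bradford on Btoaday last"
)

# Exact matching only finds the correctly recognised instance
exact_hits = [word for word in ocr_text.split() if word == "Magistrates"]
# Fuzzy matching (hypothetical threshold) also picks up 'Mogistrates'
fuzzy_hits = fuzzy_find("Magistrates", ocr_text)

print(exact_hits)
print(fuzzy_hits)
```

Lowering the threshold would catch more heavily damaged forms, but it would also let in more words that merely look similar, so any threshold is a trade-off.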

Guessing what characters are in an image is not an easy task for computers, especially when the images are of historical newspapers which can be in varying states of conservation, and often contain complex layouts with columns, illustrations and different font types and sizes all on the same page.

So, how much of a problem is this, and can the errors be corrected?

This is what I have been investigating for my PhD project, as part of the Spatial Humanities project and in association with the Centre for Corpus Approaches to the Social Sciences.

In a nutshell: it’s not very easy to correct OCR errors automatically because errors can be very dissimilar to their correct form. In the passage above, for example, the phrase ‘language sufficiently courteous’ has become ‘latigniagr srfaitiently courteous’ in the OCR output. Normalization software (like spell-checkers) often assumes that errors and their corrections will have many letters in common (as if they were playing a game of anagrams), but as this example shows, that assumption is often incorrect. So how can OCR errors be corrected? One state-of-the-art commercial software package I tested, Overproof, uses a technique the designers call ‘reverse OCR’: basically, they compare images of correct words to the image of the source! It is a simple-sounding idea which turns out to work well; you can read more about it in 'Correcting noisy OCR: context beats confusion' (login may be required).
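One way to put a number on how dissimilar an error is from its correct form is edit distance: the number of single-character insertions, deletions and substitutions needed to turn one string into the other. The sketch below is written in Python under stated assumptions. Edit distance is a measure chosen here for illustration, it is not named in the article, and it has nothing to do with how Overproof works. The word pairs are taken from the passage in Figure 2.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


# (OCR output, correct form) pairs from the "COUNTY RATE" passage
pairs = [
    ("Mogistrates", "Magistrates"),   # a single substitution
    ("latigniagr", "language"),
    ("srfaitiently", "sufficiently"),
]

for ocr_form, correct_form in pairs:
    print(ocr_form, "->", correct_form, ":", levenshtein(ocr_form, correct_form), "edits")
```

A corrector that only considers candidates one or two edits away would repair ‘Mogistrates’ but could not reach ‘language’ from ‘latigniagr’, which is the kind of case described above.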

And how much of a problem are the errors? Well, it depends what you are using the texts for. Leaving aside the question of using search engines, and its 'traps for the unwary', if you are interested in analysing patterns of discourses in texts, the main problem you will face is that the errors are not distributed evenly throughout the texts. This makes it difficult to predict how the errors might affect the retrieval of a particular word/phrase you are interested in. But if you follow some common-sense advice, you can stay on safe ground:

  1. Don’t over-interpret absences. (In OCR’ed texts, something which is missing may simply be something which is irretrievable because it is affected by OCR errors.)
  2. Focus on patterns for which you can find many different examples: ‘real-word errors’ (errors which happen to coincide with a word which actually exists, such as ‘Prussia’ which becomes ‘Russia’ when the OCR misses out the ‘P’) do exist, but they do not normally occur very often. Keep an eye out for them, but if you form a hypothesis on the basis of many examples, you are on safe ground!

In conclusion, digitized historical texts may suffer from OCR errors. It is important to be aware of the issue, but do not let this hold you back from using such sources in your research – following some simple rules of thumb (such as not placing too much emphasis on absences and focussing on patterns for which there are many different examples) can keep you on safe ground.

12 July 2016

Ruby Dixon Work Experience

Posted by Ruby Dixon, currently a student at Graveney School and on work-experience at BL Labs.

Day 1: Monday 27/6/2016

Arriving at the British Library staff entrance.

My name is Ruby Dixon and I am 16 years old. I am currently a student at Graveney School in south London and I have just finished my GCSEs. Today I began my first ever work experience placement, which I am undertaking at the British Library. I have been placed with BL Labs in the Digital Scholarship department, where I am working with Mahendra Mahey (Project Manager of BL Labs) for the next two weeks. Following lunch, after I had completed a health and safety induction, Mahendra sat me down to tell me all about what BL Labs does. I have now discovered that BL Labs is all about making the digital collections of the British Library available to people to experiment with. This is achieved in various ways, such as finding some of these collections and putting them online, running competitions and awards, and working on projects. I also learnt some pretty astonishing statistics: for example, the British Library is home to at least one hundred and eighty million items (this is probably an underestimate) and only one to two percent of these items are digitised, although the number of digital collections is always increasing. As well as finding out more about BL Labs, I had the chance to attend my first work-related meeting, which looked at new ways to engage the public with the Library’s digital items; this was a new and positive experience for me. Overall, I had an interesting introduction to the British Library on my first day here.

BL Labs leaflet describing their role in the Library.

Day 2: Tuesday 28/6/16

Mahendra kicked off my second day by introducing me to the http://data.bl.uk website, which will enable some of the Library’s digital collections and datasets to be available for direct download. As part of this work, I began looking at the Single sheet digitisation project, checking that existing draft entries were consistent and making appropriate amendments. Later on, Mahendra went through the plan of what I would be getting stuck into during my work placement, and I was very excited to get started. For most of the day I was checking these pages and in total I managed to go through twenty-seven of them. Towards the end of the day, I began drafting this blog and here I am now, writing it. In conclusion, I felt that today I started to do some proper, professional work which I found very enjoyable. You may think that this is outlandish, but I also found it energising: a refreshing boost after many exams.

An example of the work I was doing: it shows part of one of the Single sheet digitisation sheets.

Day 3: Wednesday 29/6/16

During the first half of my day I began to work more on this very blog which is now cast upon your very eyes, editing bits here and there and adding new experiences in. This was essentially what my morning was made up of: drafting and re-drafting. Following lunch, I met Frances Bean (Programme Support Officer, St Pancras Transformed Programme, Operations Division) and learnt all about what the St Pancras Transformed project does. After being shown around and being told which parts of the project contribute directly to the Library, I was whisked away to a meeting about the Library’s soon-to-be new catering company. There were many more people at this meeting than at my first meeting, making me feel a little more important and professional, which was fun. Generally I had a good, hands-on day today, where I could really get down to doing some work and enjoy spending my time doing so.

The St Pancras Transformed logo.

Day 4: Thursday 30/6/16

The fourth day of my work placement arrived and I couldn’t wait to jump right into the working day. In the morning I met Karen Bradford and she told me about her job as a conservator, which, I learnt, includes protecting and conserving the physical items of the Library. I then had a tour around the conservation studio where I got to have a special sneak peek at some of the work that the conservators get up to, which was remarkably interesting. Then, in the second half of my day, I was with Ria Bartlett (Learning and Digital Programme Manager) in the Learning Centre. For the first bit of the afternoon I went along with a school/college group who had come to the Library for a Shakespeare workshop.

A conservator at work.

As they were being shown around the Shakespeare exhibition, I seized the opportunity to look around it myself. I thought it was fascinating and personally I found it very enjoyable too. Later on, I went to a talk which was based around the British Library’s sisterhood collection (more information can be found here: http://www.bl.uk/sisterhood). Ideas about feminism were discussed as well and the Women’s Liberation Movement and the Suffragettes were mentioned too, making it an interesting insight into the past and current lives of women.

The Women’s Liberation Movement protesting.

Friday 1/7/16:

Pacing into work this morning, a sense of enthusiasm stirred in the pit of my stomach, making me feel raring to go and tackle another day of work. When I got in, Hana (Project Officer of BL Labs) showed me how to upload competition entries to the Labs website, which I then went on to do by myself. I uploaded some of the 2016 Competition entries, one of which was ‘Existing in your Mind’ by Jeremiah Ambrose:

 
A snapshot of the entry 'Existing in your Mind' by Jeremiah Ambrose.

I uploaded two other entries as well and they can be accessed via the link above. Once I had finished this, I began to update the ‘Previous entries and ideas for the BL Labs Competition’ page on the Labs website, checking the text carefully for accuracy. Following lunch, I then did a review of my first week here, going through what my work placement has been like so far with Mahendra. I came to the conclusion that I was pleased with the progress that I had made this week and personally, I have found it really interesting and exciting. To be perfectly honest, however, I cannot quite believe that my first week is actually over!

Monday 4/7/16:

As I eagerly pulled the door to the staff entrance open, a slight shock of surprise hit me as I realised that this was already my second week of work experience. “That went quickly!” I silently thought to myself as I slid my staff pass over the reader, allowing me to enter the building. I must admit that is one of the things I really love: being able to access ‘staff only’ areas. Anyway, today I was working on different projects and floating between each one. First of all I started to update the digital collections on the Labs website, by adding collections that did not exist on the site, and checking that the openly licensed collections on the Digital Asset Register (DAR) (an internal document which lists the Library's Digital Collections) were also on the Labs site. Next, during the second half of my day after a tasty lunch, I began to create a collection of Finnish books out of the 65,000 digitised 19th-century ‘Microsoft’ books, which is proving to be an interesting task so far. This work involved looking at the spreadsheet listing the 65,000 books and filtering it to find books relating to Finland. I have actually written a separate blog post about my work on this project which can be accessed here.

One of the British Library Flickr Commons images found by searching ‘Finland’. This image can be viewed here.

Tuesday 5/7/16:

After arriving for my eighth day of work experience I was taken to a meeting which brought the Digital Scholarship team together and discussed the progress made in different areas. Hearing about the different roles of each person on the team was interesting as I learnt all the different jobs needed to make the Digital Scholarship team work. Once the meeting had finished I continued to update the Labs website, again by checking that the digital collections on the DAR could also be found on the site itself.

A screenshot of the Openly Licensed collections that can be found on the DAR.

During the second half of my day I continued with my Finnish project, developing various techniques to try to find items for the collection. Once I had finished doing this I started to organise my Finnish spreadsheet to make it clearer and neater. I also organised my Finnish blog post to make it look more presentable, as well as adding how I had continued with this project today. 

A sneak peek of part of my Finland blog post.

Thursday 7/07/16:

A sense of excitement rumbled deep inside my stomach this morning as I arrived at work and remembered what my plans were for the day. At the beginning of my day, I worked with a few members of the Learning team and I was able to observe a workshop run for a school group of pupils aged 9-10. They seemed very lively and bubbly and asked many questions, some of which I am pleased to say I could help answer. Afterwards, I met up with Karen again and she took me to a photo shoot. Elizabeth Hunter, Senior Imaging Technician, was photographing furniture from the 18th and 19th centuries that is going to be moved to Boston Spa, since it is not needed at St Pancras anymore. It was fascinating for me to watch how she had to set up the different angles she was taking images from and the way lighting affected the quality of the picture. Personally, I find photography very interesting and I am taking it as an A-level in the next academic year, but I do not actually have any experience in this field yet, so this was really a fantastic opportunity for me and a great insight into the art of photography. I even got to star in one of the photos!

A photograph of me with a 19th-century chair.

Overall I really enjoyed today and I felt that I was able to get really involved, making it a very interactive working day.

Friday 08/07/16:

Remembering that it is my last day as I yank the door open to the staff entrance for one last time, a cloud of gloom seems to hover over me. However, it soon passes as I realise that this is not a day to be miserable; it is a day to celebrate the wonderful time I have had at the British Library. With a refreshed smile cast upon my face, I head to my desk and set up my work for the final time. During the morning, I made some concluding changes to each of my blogs, one of which you will be reading now! You can access my other blog here. Overall, I have really enjoyed my work experience at the British Library because it has been very interesting and I have had a great insight into the world of work. It has also made me realise what a fabulous place the British Library is, as it is a lovely place not only to work, but to study and to socialise too. I would like to take this opportunity to thank everyone who has helped to provide me with a unique experience which I will forever remember, especially Mahendra, who has helped me a lot along my way with many different things.

Until the next time,

Saying my last farewells to the Library (for now) on 8/7/16.

Ruby

 

11 July 2016

Finding digitised books and images about Finland in a collection of 65,000 books

Posted by Ruby Dixon, currently a student at Graveney School and on work-experience at BL Labs.

Background

The ‘Microsoft’ books are 65,000 digitised volumes (about 22.5 million pages) which were published between 1789 and 1914; they were digitised in partnership with Microsoft. They cover a wide range of subject areas including philosophy, poetry, history and literature, and they include text produced by Optical Character Recognition (OCR) from the millions of pages.

In discussion with Mahendra Mahey, Project Manager of BL Labs, we explored making a ‘sub collection’ from this larger set which will hopefully help researchers in the future. After thinking about making a collection of ‘works of fiction’, ‘bibles’ or titles about ‘slavery’ I decided that identifying a collection of books about Finland would be the most interesting and realistic thing to do as part of my mini-project at the Library.

The collection I am creating will hopefully help a project that the Library might be working on which celebrates the 100th year of independence of Finland in 2017.

Facts about Finland

When starting this mini-project, I thought it would be wise to do some background research about Finland. I thought this would be a great way to put my GCSEs in Geography and History to use. Knowing more about the history and geography of Finland would help me in my ‘detective’ hunt through the collection of books. I would learn about important keywords I might need to use to help me identify relevant books in the digitised collection.

Here are some useful facts that you may not know about Finland:

  • Finland became an autonomous Grand Duchy within the Russian Empire on 29 March 1809.
  • Finland declared independence on 6 December 1917.
  • Finland joined the European Union on 1 January 1995.

These and more facts can be accessed online: https://en.wikipedia.org/wiki/Finland

A map showing Finland, taken from Wikipedia: https://en.wikipedia.org/wiki/Finland

This gave me a clue that there may in fact be several books in the collection in the Russian language that cover Finland, given that Finland was an autonomous part of the Russian Empire from 1809. Looking at the map of Finland, I also realised that bordering countries would most likely have books about Finland as well.

Approach

Analysing the collection spreadsheet 

A screenshot of a section of the spreadsheet containing 65,000 records of digitised books in the ‘Microsoft Books’ collection.

My first task was to examine the huge spreadsheet containing information about the 65,000 books in the collection.

There were several lines of ‘attack’ we could take in finding information about Finland in this collection, some of which involve using the ‘Filter’ function in Excel.

Screenshot from the Microsoft Books spreadsheet: 1. The 'Filter' function in Excel. 2. The filter has been applied to the language code for Finnish, ‘fin’.

We came up with the following strategy:

  1. Find words relating to 'Finland' in the Title field in the spreadsheet for the books.
  2. This task would have to be done in several languages as there are 28 languages listed in the language code field (column C). I decided I would prioritise English and languages of bordering nations around Finland and if I had time would look at the other languages too.
  3. I knew I would have to use Google Translate (https://translate.google.co.uk/) to find equivalent words relating to Finland in each language to help me with filtering.

In terms of thinking of what words I might use for the filtering, Mahendra suggested that it might be useful to create a word cloud about all things 'Finnish'; this might help me decide which words were the most important and to use first in filtering.

I used https://tagul.com/ and here is the word cloud I made using the Wikipedia page about Finland:

Word cloud created using Tagul, based on the Wikipedia page in English about Finland.

From this, we decided to use the following words (the number of words was limited due to time): Finland, Finnish, Helsinki and Finn.

We also filtered by language (Danish, Swedish, German, English, Finnish and Russian), using words related to Finland in each of those languages.
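The same kind of filtering could also be done outside Excel. Here is a minimal Python sketch of the idea, using only the standard csv module. It makes several assumptions: the column names 'Title' and 'Language' are hypothetical, the sample rows are invented purely for illustration (they are not records from the real spreadsheet), and the language codes other than ‘fin’ are assumed to be three-letter codes of the same style.

```python
import csv
import io

# The four English keywords chosen from the word cloud
KEYWORDS = ["finland", "finnish", "helsinki", "finn"]

# Invented sample rows, standing in for the real 65,000-row spreadsheet
SAMPLE_CSV = """Title,Language
A Summer Tour in Finland,eng
Poems of the Sea,eng
Helsingfors och dess omgifningar,swe
Suomen historia,fin
"""


def filter_titles(rows, keywords, languages=None):
    """Return titles containing any keyword, optionally limited to some language codes."""
    matches = []
    for row in rows:
        if languages is not None and row["Language"] not in languages:
            continue
        title = row["Title"].lower()
        # Case-insensitive substring match, similar to a 'contains' text filter
        if any(keyword in title for keyword in keywords):
            matches.append(row["Title"])
    return matches


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# English keywords applied to English-language titles only
english_hits = filter_titles(rows, KEYWORDS, languages={"eng"})

# Adding equivalent words from other languages widens the net
all_hits = filter_titles(rows, KEYWORDS + ["helsingfors", "suomen"])

print(english_hits)
print(all_hits)
```

In this made-up sample the English keywords alone miss the Swedish and Finnish titles, which is why equivalent words in each language are needed.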

Below is a summary table showing the number of books we found by filtering the 'Title' field in the spreadsheet for words related to 'Finland'.

Table 1. The table above shows the number of books I found using various filters in the digitised collection.

Please note that I didn’t have time to look further into the books we found in the non-English languages, as I am not a native speaker of any of them. More time would be needed to filter this collection. The spreadsheet is available here.

What is interesting, however, is that we know there are 582 books in the collection in the Russian language, details of which I sent to Katya Rogatchevskaia, Lead Curator of East European Collections. 

Images in the books about Finland

I learned how the images from the 'Microsoft' books were extracted and placed on The British Library’s Flickr page. This slide from a BL Labs presentation nicely summarises how it all happened: 

Taken from the BL Labs Slideshare account, http://www.slideshare.net/labsbl

More information is available from a blog post written by Ben O’Steen, Technical Lead of BL Labs, which explains this process in much more detail.

What I realised was that there must be images identified in these books which relate to Finland. Mahendra suggested that I first look at some work done by the Wikimedia community on trying to find maps within these images.

Wikimedia commons synoptic index

The Wikimedia Commons Synoptic Index for the Mechanical Curator images contains a really handy breakdown of the images by geographical place.

Image taken from British Library/Mechanical Curator collection/Synoptic index, Europe.

From this, I was able to find that there were 12 books that had been identified as having images which had something to do with Finland in them.

Image taken from Wikimedia Commons page.

This was a great way to start, but now I thought I would try the British Library’s Flickr Commons site to see if there were more images about Finland that had been tagged with Finland-related words.

British Library Flickr Commons

As of 07/07/16 there are 1,023,705 images on the British Library’s Flickr Commons page; a large proportion of these come from images snipped out of the digitised books that I have been working on.

The site has had an incredible 400,000,000-plus views, and users have tagged over 100,000 images with around 500,000 tags. I am really looking forward to seeing what the winners of the Labs Competition 2016 will do on their SherlockNet project, as they are hoping to tag all the images using computer code!

For now, I wanted to use the tags already there to see if I could find images relating to Finland.

Here is an example image which has several tags added, some of which relate to Finland:

Tags added to an example image on the British Library Flickr Commons page.

Here you can see tags such as ‘Finland’, ‘Suomi’ (Finnish for ‘Finland’), ‘Helsinki’, ‘Helsingfors’ (Swedish for ‘Helsinki’) etc. which have been added by Flickr users (grey tags). Please note that tags in white are those added automatically by Flickr itself.

I have summarised the images I have found on the British Library’s Flickr Commons collection below:

Keyword(s) used and link to BL Flickr Commons    Number of images found
Finland                                          917
Helsinki                                         18
Suomi                                            3
Suomen                                           418
Suomalaiset                                      15
Finns                                            42
Finnish                                          352
Gulf of Finland                                  43
Kulturbilder ur Finlands historie                1
Turku                                            3
Pori                                             4
Tampere                                          1
Kuopio                                           2
Hanko                                            177
Lapland                                          148
Suomenlinna                                      2
Kemi                                             1
Total                                            1997

Table showing links and number of British Library Flickr Commons images about Finland

What is clear from this initial research is that there are definitely more books with images about Finland than the 12 identified through Wikimedia Commons. Much more work will be needed on this. Also, I would recommend that all the images that I have found be downloaded so that they may be used for the Finnish Institute project.

In conclusion, I have enjoyed being able to participate in this project and have loved getting involved in some work on it. Although it has been relatively challenging, this new experience has been very interesting and I have definitely enjoyed spending my time on it. However, more time is certainly needed on this project to find more books in the 65,000-book collection, as I have only had a limited amount of time to spend on it. Furthermore, I would recommend that more words relating to Finland be found and used in several languages to filter the master spreadsheet, in order to add more books to the Finnish collection. Lastly, one other thing that could be done to develop this project even further is to work with the curators of other languages to help identify Finland-related books.

If you would like to find more sub collections in the Microsoft books collection, please email labs@bl.uk; the team would love to hear from you!

Tomorrow I will blog about my work experience at the library.