This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected].
It’s been more than eight years, in June 2015, since the British Library launched its crowdsourcing platform, LibCrowds, with the aim of enhancing access to our collections. The first project series on LibCrowds was called Convert-a-Card, followed by the ever-so-popular In the Spotlight project. The aim of Convert-a-Card was to convert print card catalogues from the Library’s Asian and African Collections into electronic records, for inclusion in our online catalogue Explore.
A significant portion of the Library's extensive historical collections was acquired well before the advent of standard computer-based cataloguing. Consequently, even though the Library's online catalogue offers public access to tens of millions of records, numerous crucial research materials remain discoverable solely through searching the traditional physical card catalogues. The physical cards provide essential information for each book, such as title, author, physical description (dimensions, number of pages, images, etc.), subject and a “shelfmark” – a reference to the item’s location. This information still constitutes the basic set of data to produce e-records in libraries and archives.
Card Catalogue Cabinets in the British Library’s Asian & African Studies Reading Room © Jon Ellis
The initial focus of Convert-a-Card was the Library’s card catalogues for Chinese, Indonesian and Urdu books – you can read more about this here and here. Scanned catalogue cards were uploaded to Flickr (and later to our Research Repository), grouped by the physical drawer in which they were originally located. Several of these digitised drawers became projects on LibCrowds.
Crowdsourcing Retroconversion
Convert-a-Card on LibCrowds included two tasks:
- Task 1 – Search for a WorldCat record match: contributors were asked to look at a digitised card and search the OCLC WorldCat database based on some of the metadata elements printed on it (e.g. title, author, publication date), to see if a record for the book already exists in some form online. If found, they select the matching record.
- Task 2 – Transcribe the shelfmark: if a match was found, contributors then transcribed the Library's unique shelfmark as printed on the card.
Online volunteers worked on Pinyin (Chinese), Indonesian and Urdu records, mainly between 2015 and 2019. Their valuable contributions resulted in lists of new records which were then ingested into the Library's Explore catalogue – making these items so much more discoverable to our users. For cards only partially matched with online records, curators and cataloguers had a special area on the LibCrowds platform through which they could address some of the discrepancies in partial matches and resolve them.
An example of an Urdu catalogue card
After much consideration, we have decided to sunset LibCrowds. However, you can see a good snapshot of it thanks to the UK Web Archive (with thanks to Mia Ridge and Filipe Bento for archiving it), or access its GitHub pages – originally set up and maintained by LibCrowds creator Alex Mendes. We have been using mainly Zooniverse for crowdsourcing projects (see for example Living with Machines projects), and you can see here some references to these and other crowdsourcing initiatives. Sunsetting LibCrowds provided us with the opportunity to rethink Convert-a-Card and consider alternative, innovative ways to automate or semi-automate the retroconversion of these valuable catalogue cards.
Text Recognition
As a first step, we were looking to automate the retrieval of text from the digitised cards using OCR/Machine Learning. As mentioned, this text includes shelfmark, title, author, place and date of publication, and other information. If extracted accurately enough, this text could be used for WorldCat lookup, as well as for enhancement of existing records. In most cases, the text was typewritten in English, often with additional information, or translation, handwritten in other languages. To start with, we’ve decided to focus only on the typewritten English – with the aspiration to address other scripts and languages in the future.
Last year, we ran some comparative testing with ABBYY FineReader Server (the software generally used for in-house OCR) and Transkribus, to see how accurately they perform this task. We trialled a set of cards with two different versions of ABBYY, and three different models for typewritten Latin scripts in Transkribus (Model IDs 29418, 36202, and 25849). Assessment was done by visually comparing the original text with the OCRed text, examining mainly the key areas of text which are important for this initiative, i.e. the shelfmark, author’s name and book title. For the purpose of automatically recognising the typewritten English on the catalogue cards, Transkribus Model 29418 performed better than the others – and more accurately than ABBYY’s recognition.
An example of a Pinyin card in Transkribus, showing segmentation and transcription
Using that as a base model, we incrementally trained a bespoke model to recognise the text on our Pinyin cards. We’ve also normalised the resulting text, for example removing spaces in the shelfmark, or excluding unnecessary bits of data. This model currently extracts the English text only, with a Character Error Rate (CER) of 1.8%. With more training data, we plan on extending this model to other types of catalogue cards – but for now we are testing this workflow with our Chinese cards.
Entities Extraction
Extracting meaningful entities from the OCRed text is our next step, and there are different ways to do that. One such method – if already using Transkribus for text extraction – is training and applying a bespoke P2PaLA layout analysis model. Such model could identify text regions, improve automated segmentation of the cards, and help retrieve specific regions for further tasks. Former colleague Giorgia Tolfo tested this with our Urdu cards, with good results. Trying to replicate this for our Chinese cards was not as successful – perhaps due to the fact that they are less consistent in structure.
Another possible method is by using regular expressions in a programming language. Research Software Engineer (RSE) Harry Lloyd created a Jupyter notebook with Python code to do just that: take the PAGE XML files produced by Transkribus, parse the XML, and extract the title, author and shelfmark from the text. This works exceptionally well, and in the future we’ll expand entity recognition and extraction to other types of data appearing on the cards. But for now, this information suffices to query OCLC WorldCat and see if a matching record exists.
One of the 26 drawers of Chinese (Pinyin) card catalogues © Jon Ellis
Matching Cards to WorldCat Records
Entities extracted from the catalogue cards can now be used to search and retrieve potentially matching records from the OCLC WorldCat database. Pulling out WorldCat records matched with our card records would help us create new records to go into our cataloguing system Aleph, as well as enrich existing Aleph records with additional information. Previously done by volunteers, we aim to automate this process as much as possible.
Querying WorldCat was initially done using the z39.50 protocol – the same one originally used in LibCrowds. This is a client-server communications protocol designed to support the search and retrieval of information in a distributed network environment. With an excellent start by Victoria Morris and Giorgia Tolfo, who developed a prototype that uses PyZ3950 and PyMARC to query WorldCat, Harry built upon this, refined the code, and tested it successfully for data search and retrieval. Moving forward, we are likely to use the OCLC API for this – which should be a lot more straightforward!
Curator/Cataloguer Disambiguation
Getting potential matches from WorldCat is brilliant, but we would like to have an easy way for curators and cataloguers to make the final decision on the ideal match – which WorldCat record would be the best one as a basis to create a new catalogue record on our system. For this purpose, Harry is currently working on a web application based on Streamlit – an open source Python library that enables the building and sharing of web apps. Staff members will be able to use this app by viewing suggested matches, and selecting the most suitable ones.
I’ll leave it up to Harry to tell you about this work – so stay tuned for a follow-up blog post very soon!