12 December 2024
Automating metadata creation: an experiment with Parliamentary 'Road Acts'
This post was originally written by Giorgia Tolfo in early 2023 then lightly edited and posted by Mia Ridge in late 2024. It describes work undertaken in 2019, and provides context for resources we hope to share on the British Library's Research Repository in future.
The Living with Machines project used a diverse range of sources, from newspapers to maps and census data. This post discusses the Road Acts, 18th century Acts of Parliament held at the British Library, as an example of some of the challenges in digitising historical records, and suggests computational methods for reducing some of the overhead of cataloguing Library records during digitisation.
What did we want to do?
Before collection items can be digitised, they need a preliminary catalogue record - there's no point digitising records without metadata for provenance and discoverability. Like many extensive collections, the Road Acts weren't already catalogued. Creating the necessary catalogue records manually wasn't a viable option for the timeframe and budget of the project, so with the support of British Library experts Jennie Grimshaw and Iris O’Brien, we decided to explore automated methods for extracting metadata from digitised images of the documents themselves. The metadata created could then be mapped to a catalogue schema provided by Jennie and Iris.
Given the complexity of the task, the project's timeframe, and the infrastructure and resources required, the agency Cogapp was commissioned to do the following:
- Export metadata for 31 scanned microfilms in a format that matched the required fields in a metadata schema provided by the British Library curators
- OCR (including normalising the 'long S') to a standard agreed with the Living with Machines project
- Create a package of files for each Act including: OCR (METS + ALTO) + images (scanned by British Library)
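As a rough illustration of the long S normalisation mentioned in the list above, the sketch below replaces the long s character (ſ, U+017F) with a round s. The post does not describe the standard actually agreed with Living with Machines, so the function name and the one-to-one replacement rule are assumptions, not Cogapp's method.

```python
# Minimal sketch: normalise the historical long s to a modern round s.
# Assumption: a plain character-for-character replacement is sufficient.
LONG_S = "\u017f"  # the character 'ſ'


def normalise_long_s(text: str) -> str:
    """Return text with every long s replaced by a round s."""
    return text.replace(LONG_S, "s")


sample = "Anno Regni ſeptimo Georgii III. Regis."
normalised = normalise_long_s(sample)
print(normalised)
```

A replacement like this only helps where the OCR engine has recognised the long s as such. It cannot repair cases where the long s has been misread as 'f' or another character.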
To this end, we provided Cogapp with:
- Scanned images of the 31 microfilm reels, named using the microfilm ID and the numerical sequential order of the frame
- The Library's metadata requirements
- Curators' support to explain and guide them through the metadata extraction and record creation process
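Because the images were named using the microfilm ID and the frame's sequence number, files can be grouped by reel and put back in frame order programmatically. The exact naming pattern is not given in this post, so the `<microfilm ID>_<frame number>.<extension>` pattern, the function name and the example filenames below are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) pattern: "<microfilm ID>_<frame number>.<extension>"
FRAME_NAME = re.compile(
    r"^(?P<reel>.+)_(?P<frame>\d+)\.(?:tif|tiff|jpg|jp2)$", re.IGNORECASE
)


def group_frames(filenames):
    """Group image filenames by reel, sorted by numeric frame order."""
    reels = defaultdict(list)
    for name in filenames:
        match = FRAME_NAME.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        reels[match["reel"]].append((int(match["frame"]), name))
    return {
        reel: [name for _, name in sorted(frames)]
        for reel, frames in reels.items()
    }


# Made-up filenames for illustration only
example = ["reelA_0010.tif", "reelA_0002.tif", "reelB_0001.tif", "notes.txt"]
grouped = group_frames(example)
print(grouped)
```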
Once all of this was in place, the process started. However, this is where we encountered our main problems.
First issue: the typeface
After some research and tests, we concluded that the typeface (or font, shown in Figure 1) is probably English Blackletter. However, at the time, OCR software - software such as ABBYY, Tesseract or Transkribus that uses 'optical character recognition' to transcribe text from digitised images - couldn't accurately read this font. Running OCR with a generic tool would inevitably produce poor, if not unusable, transcriptions. You can train 'models' for unrecognised fonts by manually transcribing a set of documents, but this can be time-consuming.
Second issue: the marginalia
As you can see in Figure 2, each Act has marginalia - additional text in the margins of the page.
This makes the task of recognising the layout of information on the page more difficult. At the time, most OCR software wasn't able to detect marginalia as separate blocks of text. As a consequence, these portions of text are often rendered inline, merged with the main text. Some examples showing how OCR software using standard settings interprets the page in Figure 2 are below.
OCR generated by ABBYY FineReader:
Qualisicatiori 6s Truitees;
Penalty on acting if not quaiified.
Anno Regni septimo Georgii III. Regis.
9nS be it further enaften, Chat no person ihali he tapable of aftingt ao Crustee in the Crecution of this 9ft, unless be ftall he, in his oton Eight, oj in the Eight of his ©Btfe, in the aftual PofTefli'on anb jogment oj Eeceipt of the Eents ana profits of tanas, Cenements, anb 5)erebitaments, of the clear pearlg Oalue of J?iffp Pounbs} o? (hall be ©eit apparent of some person hatiing such estate of the clear gcatlg 5ia= lue of ©ne hunb?eb Pounbs; o? poffcsseb of, o? intitieb unto, a personal estate to the amount o? Oalue of ©ne thoufanb Pounbs: 9nb if ang Person hcrebg beemeo incapable to aft, ihali presume to aft, etierg such Per* son (hall, so? etierg such ©ffcnce, fojfcit anb pag the @um of jTiftg pounbs to ang person o?
OCR generated by the open source tool Tesseract:
586 Anno Regni ?eptimo Georgi III. Regis.
Qualification
of Truttees;
Penalty on
Gnd be it further enated, That no. Per?on ?hall bÈ
capable of ating as Tru?tËe in the Crecution of thig
A, unle?s he ?hall be, in his own Right, 02 in the
Right of his Wife, in the a‰ual Pofe??ion and En. |
joyment 02 Receipt of the Rents and P2zofits of Lands,
Tenements, and hereditaments, of the clear pearly
Ualue of Fifty Pounds z o? hall be Deir Apparent of
?ome Per?on having ?uch Cfitate of the clear yearly Uga-
lue of Dne Hundred Pounds ; 02 po??e?leD of, 02 intitled
unto, a Per?onal E?tate to the Amount 02 Ualue of One
thou?and Pounds : And if any Per?on hereby deemed
acting if not incapable to ai, ?hall p2e?ume to ait, every ?uch Perz
qualified.
OCR generated by Cogapp (without any enhancement):
of Trusteesi
586
Anno Regni ſeptimo Georgii III. Regis.
Qualihcation and be it further enałted, That no perſon thall be
capable of aging as Trulltee in the Erecution of this
ad, unlefs he thall be, in his own Right, of in the
Right of his Wife, in the ađual Polellion and En:
joyment or Receipt of the Rents and Profits of Lands,
Tenements, and hereditaments, of the clear pearly
Ualue of ffifty pounds : oi thall be peir apparent of
ſome Perſon having ſuch Etate of the clear yearly Ua:
lue of Dne hundred Pounds; ou podeled of, od intitled
unto, a Perſonal Elate to the amount ou Ualue of Dne
Penalty on thouſand Pounds : and if any perſon hereby deemed
acting if not incapable to ad, thall preſume to ađ, every ſuch Per-
Qualified.
As you can see, the OCR transcription results were too poor to use in our research.
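One common way to put a number on OCR quality is the character error rate (CER): the edit distance between the OCR output and a trusted transcription, divided by the length of that transcription. The sketch below is illustrative only. The reference line is our own assumed reading of the page's running head (with the long s normalised), not a verified transcription, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Assumed reading of the running head (not a verified transcription)
reference = "Anno Regni septimo Georgii III. Regis."
# Running head as output by Tesseract above, without the page number
tesseract_line = "Anno Regni ?eptimo Georgi III. Regis."

print(character_error_rate(reference, tesseract_line))
```

The running head is set in a more legible type than the body text, so a score for this line would understate the error rate on the Blackletter passages.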
Changing our focus: experimenting with metadata creation
Time was running out fast, so we decided to adjust our expectations about text transcription, and asked Cogapp to focus on generating metadata for the digitised Acts. They have reported on their process in a post called 'When AI is not enough' (which might give you a sense of the challenges!).
Since the title page of each Act has a relatively standard layout, it was possible to train a machine learning model to recognise the title, year, place of publication, imprint and so on, and to produce metadata that could be converted into catalogue records. These were sent on to British Library experts for evaluation and quality control, and potential future ingest into our catalogues.
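To give a sense of what converting extracted fields into catalogue records might involve, the sketch below tidies a dictionary of extracted values and flags doubtful records for expert review. It is not Cogapp's pipeline or the Library's schema: the field names, the `ActRecord` class, the review rule and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActRecord:
    # Hypothetical fields, loosely based on those named in the post
    title: str
    year: Optional[int]
    place_of_publication: str
    imprint: str
    needs_review: bool = False


def to_record(fields: dict) -> ActRecord:
    """Tidy extracted values and flag records a curator should check."""
    title = " ".join(str(fields.get("title", "")).split())  # collapse whitespace
    try:
        year = int(str(fields.get("year", "")).strip())
    except ValueError:
        year = None  # the extracted year was not a clean number
    record = ActRecord(
        title=title,
        year=year,
        place_of_publication=str(fields.get("place", "")).strip(),
        imprint=str(fields.get("imprint", "")).strip(),
    )
    # Assumed rule: 18th century Acts should fall roughly within 1700-1800
    record.needs_review = not title or year is None or not (1700 <= year <= 1800)
    return record


# Placeholder values for illustration, not real extracted metadata
clean = to_record({"title": "An  Act for   repairing an example road",
                   "year": "1767", "place": "London", "imprint": "Example imprint"})
doubtful = to_record({"title": "An Act", "year": "17b7"})
print(clean)
print(doubtful)
```

Flagging rather than discarding doubtful records keeps human evaluation and quality control in the loop.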
Conclusion
This experiment was only partly successful in creating fully transcribed pages, but it explored the potential of producing the basis of catalogue records computationally, and was also an opportunity to test workflows for automated metadata extraction from historical sources.
Since this work was put on hold in 2019, advances in OCR features built into generative AI chatbots offered by major companies mean that a future project could probably produce good quality transcriptions and better structured data from our digitised images.
If you have suggestions or want to get in touch about the dataset, please email [email protected]