Digital scholarship blog

Enabling innovative research with British Library digital collections

Introduction

Tracking exciting developments at the intersection of libraries, scholarship and technology.

18 December 2024

The challenges of AI for oral history: key questions

Oral History Archivist Charlie Morgan sets out some key questions for oral historians thinking about AI, along with some examples of automatic speech recognition (ASR) tools in practice, in the first of two posts...

Oral history has always been a technologically mediated discipline and so has not been immune to the current wave of AI hype. Some have felt under pressure to ‘do some AI’, while others have gone ahead and done it. In the British Library oral history department, we have been adamant that any use of AI must align practically, legally and ethically with the Library’s AI principles (currently in draft form). While the ongoing effects of the 2023 cyber-attack have also stymied any integration of new technologies into archival workflows, we have begun to experiment with some tools. In September, I was pleased to present on this topic with Digital Curator Mia Ridge at the 7th World Conference of the International Federation for Public History in Belval, Luxembourg. Below is a summary of what I spoke about in our presentation, ‘Listening with machines? The challenges of AI for oral history and digital public history in libraries’.

The ‘boom’ in AI and oral history has mostly focussed on speech recognition and transcription, driven by the release of Trint (2014) and Otter (2016), but especially Whisper (2022). There have also been investigations into indexing, summarising and visualisation, notably from the Congruence Engine project. Oral historians are interested in how AI tools could help with documentation and analysis, but many also have concerns, including, but not limited to, ownership, data protection/harvesting, labour conditions, environmental costs, loss of human involvement, unreliable outputs and inbuilt biases.

For those of us working with archived collections there are specific considerations: How do we manage AI generated metadata? Should we integrate new technologies into catalogue searching? What are the ethics of working at scale and do we have the experience to do so? How do we factor in interviewee consent, especially since speakers in older collections are now likely dead or uncontactable?

With speech recognition, we are now at a point where we can compare different automated transcripts created at different times. While our work on this topic at the British Library has been minimal, future trials might help us build up enough research data to address the above questions.

Robert Gladders was interviewed by Alan Dein for the National Life Stories oral history project ‘Lives in Steel’ in 1991 and the extract below was featured on the 1993 published CD ‘Lives in Steel’.

The full transcripts for this audio clip are at the end of this post.

Sign Language

We can compare the human transcript of the first line with three automatic speech recognition (ASR) transcripts:

  • Human: Sign language was for telling the sample to the first hand, what carbon the- when you took the sample up into the lab, you run with the sample to the lab
  • Otter 2020: Santa Lucia Chelan, the sound pachala fest and what cabin the when he took the sunlight into the lab, you know they run with a sample to the lab
  • Otter 2024: Sign languages for selling the sample, pass or the festa and what cabin the and he took the samples into the lab. Yet they run with a sample to the lab.
  • Whisper 2024: The sand was just for telling the sand that they were fed down. What cabin, when he took the sand up into the lab, you know, at the run with the sand up into the lab

Gladders speaks with a heavy Middlesbrough accent and in all cases the ASR models struggle, but the improvements between 2020 and 2024 are clear. In this case, Otter in 2024 seems to outperform Whisper (‘The sand’ is an improvement on ‘Santa Lucia Chelan’ but it isn’t ‘Sign languages’), but this was a ‘small’ version of Whisper and larger models might well perform better.

One interesting point of comparison is how the models handle ‘sample passer’, mentioned twice in the short extract:

  • Otter 2020: Sentinel pastor / sound the password
  • Otter 2024: Salmon passer / Saturn passes
  • Whisper 2024: Santland pass / satin pass

While in all cases the models fail, this would be easy to fix. The aforementioned CD came with its own glossary, which we could feed into a large language model working on these transcriptions. Practically this is not difficult, but it raises some larger questions. Do we need to produce tailored lexicons for every collection? This is time-consuming work, so who is going to do it? Would we label an automated transcript in 2024 that makes use of a human glossary written in 1993 as machine generated, human generated, or both? Moreover, what level of accuracy are we willing to accept, and how do we define accuracy itself?
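
As a rough illustration of both ideas (and of one common way of defining accuracy), the sketch below uses only Python's standard library to nudge ASR output towards terms from a collection-specific glossary, and to score a transcript against a human reference using word error rate. The glossary entries and example strings are invented for illustration; this is not a description of any British Library workflow.

# Minimal sketch: correct ASR output against a glossary, then measure word error rate.
from difflib import get_close_matches

GLOSSARY = ["samplepasser", "sampling", "sintering"]  # invented example glossary terms

def apply_glossary(transcript, glossary=GLOSSARY, cutoff=0.85):
    """Replace any word that closely resembles a known glossary term."""
    corrected = []
    for word in transcript.lower().split():
        match = get_close_matches(word, glossary, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / len(r)

print(apply_glossary("the samplepassa and the first hand"))   # -> '... samplepasser ...'
print(wer("sign language was for telling the samplepasser",
          "the sand was just for telling the sand"))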

 

  • Samplepasser: The top man on the melting shop with responsibility for the steel being refined.
  • Sampling: The act of taking a sample of steel from a steel furnace, using a long-handled spoon which is inserted into the furnace and withdrawn.
  • Sintering: The process of heating crushed iron-ore dust and particles (fines) with coke breeze in an oxidising atmosphere to reduce sulphur content and produce a more effective and consistent charge for the blast furnaces. This process superseded the earlier method of charging the furnaces with iron-ore and coke, and led to greatly increased tonnages of iron being produced.
Sample glossary terms


17 December 2024

Open cultural data - an open GLAM perspective at the British Library

Drawing on work at and prior to the British Library, Digital Curator Mia Ridge shares a personal perspective on open cultural data for galleries, libraries, archives and museums (GLAMs) based on a recent lecture for students in Archives and Records Management…

Cultural heritage institutions face both exciting opportunities and complex challenges when sharing their collections online. This post gives common reasons why GLAMs share collections as open cultural data, and explores some strategic considerations behind making collections accessible.

What is Open Cultural Data?

Open cultural data includes a wide range of digital materials, from individual digitised or born-digital items – images, text, audiovisual records, 3D objects, etc. – to datasets of catalogue metadata, images or text, machine learning models and data derived from collections.

Open data must be clearly licensed for reuse, available for commercial and non-commercial use, and ideally provided in non-proprietary formats and standards (e.g. CSV, XML, JSON, RDF, IIIF).
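
As a toy example of what that looks like in practice, the snippet below reads a IIIF Presentation (version 2) manifest, which is simply JSON over HTTP, and prints the item and page labels. The URL is a placeholder rather than a real endpoint; substitute any institution's published manifest.

# Toy example: a IIIF Presentation 2.x manifest is plain, openly licensed JSON.
import requests

MANIFEST_URL = "https://example.org/iiif/some-item/manifest.json"  # placeholder URL

manifest = requests.get(MANIFEST_URL, timeout=30).json()
print(manifest.get("label"))                        # the item's human-readable title
for canvas in manifest["sequences"][0]["canvases"]:
    print("-", canvas.get("label"))                 # e.g. individual page labels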

Why Share Open Data?

The British Library shares open data for multiple compelling reasons.

Broadening Access and Engagement: by releasing over a million images on platforms like Flickr Commons, the Library has achieved an incredible 1.5 billion views. Open data allows people worldwide to experience wonder and delight with collections they might never physically access in the UK.

Deepening Access and Engagement: crowdsourcing and online volunteering provide opportunities for enthusiasts to spend time with individual items while helping enrich collections information. For instance, volunteers have helped transcribe complex materials like Victorian playbills, adding valuable contextual information.

Supporting Research and Scholarship: in addition to ‘traditional’ research, open collections support the development of reproducible computational methods including text and data mining, computer vision and image analysis. Institutions also learn more about their collections through formal and informal collaborations.

Creative Reuse: open data encourages artists to use collections, leading to remarkable creative projects including:

Animation featuring an octopus holding letters and parcels on a seabed with seaweed
Screenshot from Hey There Young Sailor (Official Video) - The Impatient Sisters

 

16 illustrations of girls in sad postures
'16 Very Sad Girls' by Mario Klingemann

 

A building with large-scale projection
The BookBinder, by Illuminos, with British Library collections

 

Some Lessons for Effective Data Sharing

Make it as easy as possible for people to find and use your open collections:

  • Tell people about your open data
  • Celebrate and highlight creative reuses
  • Use existing licences for usage rights where possible
  • Provide data in accessible, sustainable formats
  • Offer multiple access methods (e.g. individual items, datasets, APIs)
  • Invest effort in meeting the FAIR and, where appropriate, CARE principles

Navigating Challenges

Open data isn't without tensions. Institutions must balance potential revenue, copyright restrictions, custodianship and ethical considerations with the benefits of publishing specific collections.

Managing expectations can also be a challenge. The number of digitised or born-digital items available may be tiny in comparison to the overall size of collections. The quality of digitised records – especially items digitised from microfiche and/or decades ago – might be less than ideal. Automatic text transcription and layout detection errors can limit the re-usability of some collections.

Some collections might not be available for re-use because they are still in copyright (or are orphan works, where the creator is not known), were digitised by a commercial partner, or are culturally sensitive.

The increase in the number of AI companies scraping collections sites to train machine learning models has also given some institutions cause to reconsider their open data policies. Historical collections are more likely to be out of copyright and published for re-use, but they also contain structural prejudices and inequalities that could be embedded into machine learning models and generative AI outputs.

Conclusion

Open cultural data is more than just making collections available—it's about creating dynamic, collaborative spaces of knowledge exchange. By thoughtfully sharing our shared intellectual heritage, we enable new forms of research, inspiration and enjoyment.

 

AI use transparency statement: I recorded my recent lecture on my phone, then generated a loooong transcription on my phone. I then supplied the transcription and my key points to Claude, with a request to turn it into a blog post, then manually edited the results.

16 December 2024

Closing the language gap: automated language identification in British Library catalogue records

What do you do when you have millions of books and no record of the language they were written in? Collection Metadata Analyst Victoria Morris looks back to describe how she worked on this in 2020...

Context

In an age of online library catalogues, recording the language in which a book (or any other textual resource) is written is vital to library curators and users alike, as it allows them to search for resources in a particular language, and to filter search results by language.

As the graph below illustrates, although language information is routinely added to British Library catalogue records created as part of ongoing annual production, fewer than 30% of legacy records (from the British Library’s foundation catalogues) contain language information. As of October 2018, nearly 4.7 million records were lacking any explicit language information. Of these, 78% were also lacking information about the country of publication, so it would not be possible to infer language from the place of publication.

Chart showing language of content records barely increasing over time

The question is: what can be done about this? In most cases, the language of the resource described can be immediately identified by a person viewing the book (or indeed the catalogue record for the book). With such a large number of books to deal with, though, it would be infeasible to start working through them one at a time ... an automated language identification process is required.

Language identification

Language identification (or language detection) refers to the process of determining the natural language in which a given piece of text is written. The texts analysed are commonly referred to as documents.

There are two possible avenues of approach: using either linguistic models or statistical models. Whilst linguistic models have the potential to be more realistic, they are also more complex, relying on detailed linguistic knowledge. For example, some linguistic models involve analysis of the grammatical structure of a document, and therefore require knowledge of the morphological properties of nouns, verbs, adjectives, etc. within all the languages of interest.

Statistical models are based on the analysis of certain features present within a training corpus of documents. These features might be words, character n-grams (sequences of n adjacent characters) or word n-grams (sequences of n adjacent words). These features are examined in a purely statistical, ‘linguistic-agnostic’ manner; words are understood as sequences of letter-like characters bounded by non-letter-like characters, not as words in any linguistic sense. When a document in an unknown language is encountered, its features can be compared to those of the training corpus, and a prediction can thereby be made about the language of the document.
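
As a sketch of what such ‘linguistic-agnostic’ features look like (not the project's code), a title can be reduced to words and character n-grams in a few lines of Python; the example title is invented.

# Sketch: reduce a title to purely statistical features, i.e. words and character n-grams.
import re

def words(text):
    # a 'word' is just a run of letter-like characters bounded by non-letter-like characters
    return re.findall(r"\w+", text.lower())

def char_ngrams(text, n=3):
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

title = "Szerelem és más történetek"      # invented example title
print(words(title))                        # ['szerelem', 'és', 'más', 'történetek']
print(char_ngrams("szerelem"))             # ['sze', 'zer', 'ere', 'rel', 'ele', 'lem']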

Our project was limited to an investigation of statistical models, since these could be more readily implemented using generic processing rules.

What can be analysed?

Since the vast majority of the books lacking language information have not been digitised, the language identification had to be based solely on the catalogue record. The title, edition statement and series title were extracted from catalogue records, and formed the test documents for analysis.

Although there are examples of catalogue records where these metadata elements are in a language different to that of the resource being described (as in, for example, The Four Gospels in Fanti, below), it was felt that the assumption that they share the language of the resource was reasonable for the majority of catalogue records.

A screenshot of the catalogue record for a book listed as 'The Four Gospels in Fanti'

Measures of success

The effectiveness of a language identification model can be quantified by the measures precision and recall; precision measures the ability of the model not to make incorrect language predictions, whilst recall measures the ability of the model to find all instances of documents in a particular language. In this context, high precision is of greater value than high recall, since it is preferable to provide no information about the language of content of a resource than to provide incorrect information.

Various statistical models were investigated, with only a Bayesian statistical model based on analysis of words providing anything approaching satisfactory precision. This model was therefore selected for further development.

The Bayesian idea

Bayesian methods are based on a calculation of the probabilities that a book is written in each language under consideration. An assumption is made that the words present within the book title are statistically independent; this is obviously a false assumption (since, for example, adjacent words are likely to belong to the same language), but it allows application of the following proportionality:

$$P(D \text{ is in language } l \mid D \text{ has features } f_1, \dots, f_n) \;\propto\; P(D \text{ is in language } l) \prod_{i=1}^{n} P(\text{feature } f_i \text{ arises in language } l)$$

The right-hand side of this proportionality can be calculated based on an analysis of the training corpus. The language of the test document is then predicted to be the language which maximises the above probability.

Because of the assumption of word-independence, this method is often referred to as naïve Bayesian classification.

What that means in practice is this: we notice that whenever the word ‘szerelem’ appears in a book title for which we have language information, the language is Hungarian. Therefore, if we find a book title which contains the word ‘szerelem’, but we don’t have language information for that book, we can predict that the book is probably in Hungarian.

Screenshot of catalogue entry with the word 'szerelem' in the title of a book
Szerelem: definitely a Hungarian word => probably a Hungarian title

If we repeat this for every word appearing in every title of each of the 12 million resources where we do have language information, then we can build up a model, which we can use to make predictions about the language(s) of the 4.7 million records that we’re interested in. Simple!
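
A stripped-down version of that idea, with a toy training set of invented titles and none of the smoothing choices, controlled language list or scale of the real system, might look like this:

# Sketch of naive Bayesian language prediction from title words (toy, invented data).
from collections import Counter, defaultdict
from math import exp, log

training = [
    ("az igazi szerelem", "Hungarian"),
    ("szerelem és háború", "Hungarian"),
    ("a history of steel", "English"),
    ("the language of love", "English"),
]

word_counts = defaultdict(Counter)   # language -> word frequencies (the 'language buckets')
lang_counts = Counter()
for title, lang in training:
    lang_counts[lang] += 1
    word_counts[lang].update(title.lower().split())

vocabulary = {w for counts in word_counts.values() for w in counts}

def predict(title, alpha=0.01):
    """Return (most probable language, its normalised probability)."""
    log_scores = {}
    for lang, n_records in lang_counts.items():
        total_words = sum(word_counts[lang].values())
        score = log(n_records / sum(lang_counts.values()))      # prior P(language)
        for word in title.lower().split():                       # product of P(word | language)
            score += log((word_counts[lang][word] + alpha) / (total_words + alpha * len(vocabulary)))
        log_scores[lang] = score
    best = max(log_scores, key=log_scores.get)
    top = max(log_scores.values())
    probs = {lang: exp(s - top) for lang, s in log_scores.items()}   # normalise for thresholding
    return best, probs[best] / sum(probs.values())

print(predict("szerelem a városban"))   # -> ('Hungarian', ...) for this toy data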

Training corpus

The training corpus was built from British Library catalogue records which contain language information. Records recorded as being in ‘Miscellaneous languages’, ‘Multiple languages’, ‘Sign languages’, ‘Undetermined’ and ‘No linguistic content’ were excluded. This yielded 12,254,341 records, of which 9,578,175 were for English-language resources. Words were extracted from the title, edition statement, and series title, and stored in a ‘language bucket’.

Words in English, Hungarian and Volapuk shown above the appropriate language 'bucket'

Language buckets were analysed in order to create a matrix of probabilities, whereby a number was assigned to each word-language pair (for all words encountered within the catalogue, and all languages listed in a controlled list) to represent the probability that that word belongs to that language. Selected examples are listed in the table below; the final row in the table illustrates the fact that shorter words tend to be common to many languages, and are therefore of less use than longer words in language identification.

  • {Telugu: 0.750, Somali: 0.250}
  • aaaarrgghh: {English: 1.000}
  • aaavfleeße: {German: 1.000}
  • aafjezatsd: {German: 0.333, Low German: 0.333, Limburgish: 0.333}
  • aanbidding: {Germanic (Other): 0.048, Afrikaans: 0.810, Low German: 0.048, Dutch: 0.095}
  • نبوغ: {Persian: 0.067, Arabic: 0.200, Pushto: 0.333, Iranian (Other): 0.333, Azerbaijani: 0.067}
  • metodicheskiĭ: {Russian: 0.981, Kazakh: 0.019}
  • nuannersuujuaannannginneranik: {Kalâtdlisut: 1.000}
  • karga: {Faroese: 0.020, Papiamento: 0.461, Guarani: 0.010, Zaza: 0.010, Esperanto: 0.010, Estonian: 0.010, Iloko: 0.176, Maltese: 0.010, Pampanga: 0.010, Tagalog: 0.078, Ladino: 0.137, Basque: 0.029, English: 0.010, Turkish: 0.029}

Results

Precision and recall varied enormously between languages. Zulu, for instance, had 100% precision but only 20% recall; this indicates that all records detected as being in Zulu had been correctly classified, but that the majority of Zulu records had either been mis-classified, or no language prediction had been made. In practical terms, this meant that a prediction “this book is in Zulu” was a prediction that we could trust, but we couldn’t assume that we had found all of the Zulu books. Looking at our results across all languages, we could generate a picture (formally termed a ‘confusion matrix’) to indicate how different languages were performing (see below). The shaded cells on the diagonal represent resources where the language has been correctly identified, whilst the other shaded cells show us where things have gone wrong.

Language confusion matrix
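
In code, precision, recall and the confusion matrix can all be tallied directly from pairs of true and predicted labels; a rough sketch with invented data:

# Sketch: per-language precision and recall from (true, predicted) label pairs (invented data).
from collections import Counter

pairs = [("Zulu", "Zulu"), ("Zulu", None), ("Zulu", "English"),
         ("English", "English"), ("Hungarian", "Hungarian")]   # None = no prediction accepted

confusion = Counter(pairs)   # (true language, predicted language) -> count

def precision(lang):
    predicted = sum(c for (t, p), c in confusion.items() if p == lang)
    return confusion[(lang, lang)] / predicted if predicted else None

def recall(lang):
    actual = sum(c for (t, p), c in confusion.items() if t == lang)
    return confusion[(lang, lang)] / actual if actual else None

print(precision("Zulu"), recall("Zulu"))   # 1.0 and about 0.33 here: trustworthy but incomplete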

The best-performing languages were Hawaiian, Malay, Zulu, Icelandic, English, Samoan, Finnish, Welsh, Latin and French, whilst the worst-performing languages were Shona, Turkish, Pushto, Slovenian, Azerbaijani, Javanese, Vietnamese, Bosnian, Thai and Somali.

Where possible, predictions were checked by language experts from the British Library’s curatorial teams. Such validation facilitated the identification of off-diagonal shaded areas (i.e. languages for which predictions should be treated with caution), and enabled acceptance thresholds to be set. For example, the model tends to over-predict English, in part due to the predominance of English-language material in the training corpus, thus the acceptance threshold for English was set at 100%: predictions of English would only be accepted if the model claimed that it was 100% certain that the language was English. For other languages, the acceptance threshold was generally between 95% and 99%.
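
Applied to a prediction like the sketch above, an acceptance threshold is simply a final filter on the predicted probability, with English held to a stricter standard (the figures here are illustrative):

# Sketch: only accept a language prediction that clears its acceptance threshold.
THRESHOLDS = {"English": 1.00}      # stricter for the over-predicted language
DEFAULT_THRESHOLD = 0.97            # illustrative; the project generally used 95-99%

def accept(language, probability):
    return probability >= THRESHOLDS.get(language, DEFAULT_THRESHOLD)

print(accept("Hungarian", 0.98), accept("English", 0.99))   # True, False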

Outcomes

Two batches of records have been completed to date. In the first batch, language codes were assigned to 1.15 million records with 99.7% confidence; in the second batch, a further 1 million language codes were assigned with 99.4% confidence. Work on a third batch is currently underway, and it is hoped to achieve at least a further million language code assignments. The graph below shows the impact that this project is having on the British Library catalogue.

Graph showing improvement in the number of 'foundation catalogue' records with languages recorded

The project has already been well-received by Library colleagues, who have been able to use the additional language coding to assist them in identifying curatorial responsibilities and better understanding the collection.

Further reading

For a more in-depth, mathematical write-up of this project, please see a paper written for Cataloging & Classification Quarterly, which is available at: https://doi.org/10.1080/01639374.2019.1700201, and is also in the BL research repository at https://bl.iro.bl.uk/work/6c99ffcb-0003-477d-8a58-64cf8c45ecf5.

13 December 2024

Looking back on the Data Science Accelerator

From April to July this year an Assistant Statistician at the Cabinet Office and a Research Software Engineer at the British Library teamed up as mentee (Catherine Macfarlane, CO) and mentor (Harry Lloyd, BL) for the Data Science Accelerator. In this blog post we reflect on the experience and what it meant for us and our work.

Introduction to the Accelerator

Harry: The Accelerator has been around since 2015, set up as a platform to ‘accelerate’ civil servants at the start of their data science journey who have a project with a clear business need and a real willingness to learn. Successful applicants are paired with mentors from across the Civil Service who have experience in techniques applicable to the problem, working together one protected day a week for 12 weeks. I was lucky enough to be a mentee in 2020, working on statistical methods to combine different types of water quality data, and my mentor Charlie taught me a lot of what I know. The programme played a huge role in the development of my career, so it was a rewarding moment to come back as a mentor for the April cohort.

Catherine: On joining the Civil Service in 2023, I had the pleasure of becoming part of a talented data team that has motivated me to continually develop my skills. My academic background in Mathematics with Finance provides me with a strong theoretical foundation, but I am striving to improve my practical abilities. I am particularly interested in Artificial Intelligence, which is gaining increasing recognition across government, sparking discussions on its potential to improve efficiency.

I saw the Data Science Accelerator as an opportunity to deepen my knowledge, address a specific business need, and share insights with my team. The prospect of working with a mentor and immersing myself in an environment where diverse projects are undertaken was particularly appealing. A significant advantage was the protected time this project offered - a rare benefit! I was grateful to be accepted and paired with Harry, an experienced mentor who had already completed the programme. Following our first meeting, I felt ready to tackle the upcoming 12 weeks to see what we could achieve!

Photo of the mentee and mentor on a video call
With one of us based in England and the other in Scotland, virtual meetings were the norm. Collaborative tools like screen sharing and GitHub allowed us to work effectively together.

The Project

Catherine: Our team is interested in the annual reports and accounts of Arm’s Length Bodies (ALBs), a category of public bodies funded to deliver a public or government service.  The project addressed the challenge my team faces in extracting the highly unstructured information stored in annual reports and accounts. With this information we would be able to enhance the data validation process and reduce the burden of commissioning data from ALBs on other teams. We proposed using Natural Language Processing to retrieve this information, analysing and querying it using a Large Language Model (LLM).

Initially, I concentrated on extracting five features, such as full-time equivalent staff in the organisation, from a sample of ALBs across 13 departments for the financial year 22/23. After discussions with Harry, we decided to use Retrieval-Augmented Generation (RAG) to develop a question-answering system. RAG is a technique that combines LLMs with relevant external documents to improve the accuracy and reliability of the output. This is done by retrieving documents that are relevant to the questions asked and then asking the LLM to generate an answer based on the retrieved material. We carefully selected a pre-trained LLM while considering ethical factors like model openness.

RAG
How a retrieval augmented generation system works. A document in this context is a segmented chunk of a larger text that can be parsed by an LLM.
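
As a sketch of the pattern (not the project's code), the pipeline below segments a report into chunks, retrieves the chunks most similar to a question, and builds a prompt for an LLM. Retrieval here uses TF-IDF for brevity where the real system used semantic search, the figures in the chunks are invented, and ask_llm() is a placeholder rather than a real API.

# Minimal RAG sketch: retrieve relevant chunks, then ask an LLM to answer from them only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 'Documents' here are chunks of an annual report, pre-segmented so an LLM can parse them.
chunks = [
    "The organisation employed 1,234 full-time equivalent staff in 2022/23.",   # invented
    "Total expenditure for the financial year was 45 million pounds.",          # invented
    "The board met six times during the reporting period.",                     # invented
]
question = "How many full-time equivalent staff were employed?"

# 1. Retrieve: rank chunks by similarity to the question and keep the best ones.
vectoriser = TfidfVectorizer().fit(chunks + [question])
scores = cosine_similarity(vectoriser.transform([question]), vectoriser.transform(chunks))[0]
top_chunks = [chunks[i] for i in scores.argsort()[::-1][:2]]

# 2. Generate: ask the LLM to answer using only the retrieved material.
prompt = "Answer using only this context:\n" + "\n".join(top_chunks) + "\n\nQuestion: " + question

def ask_llm(prompt):
    raise NotImplementedError("call your chosen large language model here")  # placeholder

print(prompt)   # in practice: print(ask_llm(prompt))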

The first four weeks focused on exploratory analysis, data processing, and labelling, all completed in R, which was essential for preparing the data for input into the language model. The subsequent stages involved model building and evaluation in Python, which required the most time and focus. This was my first time using Python, and Harry’s guidance was extremely beneficial during our pair coding sessions. A definite highlight for me was seeing the pipeline start to generate answers!

To bring all our results together, I created a dashboard in Shiny, ensuring it was accessible to both technical and non-technical audiences. The final stage involved summarising all our hard work from the past 12 weeks in a 10 minute presentation and delivering it to the Data Science Accelerator cohort.

Harry: Catherine’s was the best planned project of the ones I reviewed, and I suspected she’d be well placed to make best use of the 12 weeks. I wasn’t wrong! We covered a lot of the steps involved in good reproducible analysis. The exploratory work gave us a great sense of the variance in the data, setting up quantitative benchmarks for the language model results drove our development of the RAG system, and I was so impressed that Catherine managed to fit in building a dashboard on top of all of that.

Our Reflections

Catherine: Overall this experience was fantastic. In a short amount of time, we managed to achieve a considerable amount. It was amazing to develop my skills and grow in confidence. Harry was an excellent mentor; he encouraged discussion and asked insightful questions, which made our sessions both productive and enjoyable. A notable highlight was visiting the British Library! It was brilliant to have an in-person session with Harry and meet the Digital Research team.

A key success of the project was meeting the objectives we set out to achieve. Patience was crucial, especially when investigating errors and identifying the root problem. The main challenge was managing such a large project that could be taken in multiple directions. It can be natural to spend a long time on one area, such as exploratory analysis, but we ensured that we completed the key elements that allowed us to move on to the next stage. This balance was essential for the project's overall success.

Harry: We divided our days between time for Catherine to work solo and pair programming. Catherine is a really keen learner, and I think this approach helped her drive the project forward while giving us space to cover foundational programming topics and a new programming language. My other role was keeping an eye on the project timeline. Giving the occasional steer on when to stick with something and when to move on helped (I hope!) Catherine to achieve a huge amount in three months. 

Dashboard
A page from the dashboard Catherine created in the last third of the project.

Ongoing Work

Catherine: Our team recognises the importance of continuing this work. I have developed an updated project roadmap, which includes utilising Amazon Web Services to enhance the speed and memory capacity of our pipeline. Additionally, I have planned to compare various large language models, considering ethical factors, and I will collaborate with other government analysts involved in similar projects. I am committed to advancing this project, further upskilling the team, and keeping Harry updated on our progress.

Harry: RAG, and the semantic rather than keyword search that underlies it, represents a maturation of LLM technology that has the potential to change the way users search our collections. Anticipating that this will be a feature of future library services platforms, we have a responsibility to understand more about how these technologies will work with our collections at scale. We’re currently carrying out experiments with RAG and the linked data of the British National Bibliography to understand how searching like this will change the way users interact with our data.

Conclusions

Disappointingly the Data Science Accelerator was wound down by the Office for National Statistics at the end of the latest cohort, citing budget pressures. That has made us one of the last mentor/mentee pairings to benefit from the scheme, which we’re both incredibly grateful for and deeply saddened by. The experience has been a great one, and we’ve each learned a lot from it. We’ll continue to develop RAG at the Cabinet Office and the British Library, and hope to advocate for and support schemes like the Accelerator in the future!

12 December 2024

Automating metadata creation: an experiment with Parliamentary 'Road Acts'

This post was originally written by Giorgia Tolfo in early 2023 then lightly edited and posted by Mia Ridge in late 2024. It describes work undertaken in 2019, and provides context for resources we hope to share on the British Library's Research Repository in future.

The Living with Machines project used a diverse range of sources, from newspapers to maps and census data. This post discusses the Road Acts, 18th century Acts of Parliament stored at the British Library, as an example of some of the challenges in digitising historical records, and suggests computational methods for reducing some of the overhead of cataloguing Library records during digitisation.

What did we want to do?

Before collection items can be digitised, they need a preliminary catalogue record - there's no point digitising records without metadata for provenance and discoverability. Like many extensive collections, the Road Acts weren't already catalogued. Creating the necessary catalogue records manually wasn't a viable option for the timeframe and budget of the project, so with the support of British Library experts Jennie Grimshaw and Iris O’Brien, we decided to explore automated methods for extracting metadata from digitised images of the documents themselves. The metadata created could then be mapped to a catalogue schema provided by Jennie and Iris. 

Due to the complexity, the timeframe of the project, the infrastructure and the resources needed, the agency Cogapp was commissioned to do the following:

  • Export metadata for 31 scanned microfilms in a format that matched the required fields in a metadata schema provided by the British Library curators
  • OCR (including normalising the 'long S') to a standard agreed with the Living with Machines project
  • Create a package of files for each Act including: OCR (METS + ALTO) + images (scanned by British Library)

To this end, we provided Cogapp with:

  • Scanned images of the 31 microfilm reels, named using the microfilm ID and the numerical sequential order of the frame
  • The Library's metadata requirements
  • Curators' support to explain and guide them through the metadata extraction and record creation process 

Once all of this was put in place, the process started; however, this is where we encountered the main problem.

First issue: the typeface

After some research and tests we came to the conclusion that the typeface (or font, shown in Figure 1) is probably English Blackletter. However, at the time, OCR software - software that uses 'optical character recognition' to transcribe text from digitised images, like Abbyy, Tesseract or Transkribus - couldn't accurately read this font. Running OCR using a generic tool would inevitably lead to poor, if not unusable, OCR. You can create 'models' for unrecognised fonts by manually transcribing a set of documents, but this can be time-consuming. 
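
For readers who want to try this on their own images, a generic Tesseract run via the pytesseract wrapper looks like the sketch below; the file name is a placeholder, and the lang parameter is where a custom-trained blackletter model would be supplied once one exists. This is an illustration, not the workflow used on the Road Acts.

# Generic OCR sketch with Tesseract via pytesseract (not the Road Acts workflow).
from PIL import Image
import pytesseract

image = Image.open("SPRMicP14_12_016.png")              # placeholder file name
text = pytesseract.image_to_string(image, lang="eng")   # swap in a custom traineddata name here
print(text)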

Image of a historical document
Figure 1: Page showing typefaces and layout. SPRMicP14_12_016

Second issue: the marginalia

As you can see in Figure 2, each Act has marginalia - additional text in the margins of the page. 

This makes the task of recognising the layout of information on the page more difficult. At the time, most OCR software wasn't able to detect marginalia as separate blocks of text. As a consequence, these portions of text are often rendered inline, merged with the main text. Some examples showing how OCR software using standard settings interprets the page in Figure 2 are below.

Black and white image of printed page with comments in the margins
Figure 2: Printed page with marginalia. SPRMicP14_12_324

 

OCR generated by ABBYY FineReader:

Qualisicatiori 6s Truitees;

Penalty on acting if not quaiified.

Anno Regni septimo Georgii III. Regis.

9nS be it further enaften, Chat no person ihali he tapable of aftingt ao Crustee in the Crecution of this 9ft, unless be ftall he, in his oton Eight, oj in the Eight of his ©Btfe, in the aftual PofTefli'on anb jogment oj Eeceipt of the Eents ana profits of tanas, Cenements, anb 5)erebitaments, of the clear pearlg Oalue of J?iffp Pounbs} o? (hall be ©eit apparent of some person hatiing such estate of the clear gcatlg 5ia= lue of ©ne hunb?eb Pounbs; o? poffcsseb of, o? intitieb unto, a personal estate to the amount o? Oalue of ©ne thoufanb Pounbs: 9nb if ang Person hcrebg beemeo incapable to aft, ihali presume to aft, etierg such Per* son (hall, so? etierg such ©ffcnce, fojfcit anb pag the @um of jTiftg pounbs to ang person o? 

 

OCR generated by the open source tool Tesseract:

586 Anno Regni ?eptimo Georgi III. Regis.

Qualification

of Truttees;

Penalty on

Gnd be it further enated, That no. Per?on ?hall bÈ

capable of ating as Tru?tËe in the Crecution of thig

A, unle?s he ?hall be, in his own Right, 02 in the

Right of his Wife, in the a‰ual Pofe??ion and En. |

joyment 02 Receipt of the Rents and P2zofits of Lands,

Tenements, and hereditaments, of the clear pearly

Ualue of Fifty Pounds z o? hall be Deir Apparent of

?ome Per?on having ?uch Cfitate of the clear yearly Uga-

lue of Dne Hundred Pounds ; 02 po??e?leD of, 02 intitled

unto, a Per?onal E?tate to the Amount 02 Ualue of One

thou?and Pounds : And if any Per?on hereby deemed

acting if not incapable to ai, ?hall p2e?ume to ait, every ?uch Perz

qualified.

 

OCR generated by Cogapp (without any enhancement)

of Trusteesi

586

Anno Regni ſeptimo Georgii III. Regis.

Qualihcation and be it further enałted, That no perſon thall be

capable of aging as Trulltee in the Erecution of this

ad, unlefs he thall be, in his own Right, of in the

Right of his Wife, in the ađual Polellion and En:

joyment or Receipt of the Rents and Profits of Lands,

Tenements, and hereditaments, of the clear pearly

Ualue of ffifty pounds : oi thall be peir apparent of

ſome Perſon having ſuch Etate of the clear yearly Ua:

lue of Dne hundred Pounds; ou podeled of, od intitled

unto, a Perſonal Elate to the amount ou Ualue of Dne

Penalty on thouſand Pounds : and if any perſon hereby deemed

acting if not incapable to ad, thall preſume to ađ, every ſuch Per-

Qualified.

 

As you can see, the OCR transcription results were too poor to use in our research.

Changing our focus: experimenting with metadata creation

Time was running out fast, so we decided to adjust our expectations about text transcription, and asked Cogapp to focus on generating metadata for the digitised Acts. They have reported on their process in a post called 'When AI is not enough' (which might give you a sense of the challenges!).

Since the title page of each Act has a relatively standard layout it was possible to train a machine learning model to recognise the title, year and place of publication, imprint etc. and produce metadata that could be converted into catalogue records. These were sent on to British Library experts for evaluation and quality control, and potential future ingest into our catalogues.

Conclusion

This experience, although only partly successful in creating fully transcribed pages, explored the potential of producing the basis of catalogue records computationally, and was also an opportunity to test workflows for automated metadata extraction from historical sources. 

Since this work was put on hold in 2019, advances in OCR features built into generative AI chatbots offered by major companies mean that a future project could probably produce good quality transcriptions and better structured data from our digitised images.

If you have suggestions or want to get in touch about the dataset, please email [email protected]

11 December 2024

MIX 2025: Writing With Technologies Call for Submissions

One of the highlights of our Digital Storytelling exhibition last year was hosting the 2023 MIX conference at the British Library in collaboration with Bath Spa University and the MyWorld programme, which explores the future of creative technology innovation.

MIX is an established forum for the discussion and celebration of writing and technology, bringing together researchers, writers, technologists and practitioners from around the world.  Many of the topics covered are relevant to work in the British Library as part of our research into collecting, curating and preserving interactive digital works and emerging formats.

Image text says MIX 2025 Writing With Technologies 2nd July 2025, with organisation logos underneath the text

As a new year draws near, we are looking forward to upcoming events. MIX will be back in Bath at the Locksbrook Campus on Wednesday 2 July 2025 and their call for submissions is currently open until early February. Organisers are looking for proposals for 15 minute papers/presentations or 5 minute lightning talks from technologists, artists, writers and poets, academic researchers and independent scholars, on the following themes:

  • Issues of trust and truth in digital writing
  • The use of generative AI tools by authors, poets and screenwriters
  • Debates around AI and ethics for creative practitioners
  • Emerging immersive storytelling practices

MIX 2025 will investigate the intersection between these themes, including the challenges and opportunities for interactive and locative works, poetry film, screenwriting and writing for games, as well as digital preservation, archiving, enhanced curation and storytelling with AI. Conference organisers are also welcoming papers and presentations on the innovative use of AI tools in creative writing pedagogy. The deadline for submissions is 5pm GMT on Monday 10 February 2025; if you have any enquiries, email [email protected].

As part of the programme, New York Times bestselling writer and publisher Michael Bhaskar, currently working for Microsoft AI and co-author of the book The Coming Wave: AI, Power and the 21st Century’s Greatest Dilemma, will appear in conversation.

To whet your appetite ahead of the next MIX you may want to check out the Writing with Technologies webinar series presented by MyWorld with Bath Spa University’s Centre for Cultural and Creative Industries and the Narrative and Emerging Technologies Lab. This series examines AI’s emerging influence across writing and publishing in various fields through talks from writers, creators, academics, publishing professionals and AI experts. The next webinar will be on Wednesday 22nd January 2025, 2-3pm GMT, discussing AI and Creative Expression; book your free place here.

26 November 2024

Working Together: The UV Community Sprint Experience

How do you collaborate on a piece of software with a community of users and developers distributed around the world? Lanie and Saira from the British Library’s Universal Viewer team share their recent experience with a ‘community sprint’... 

Back in July, digital agency Cogapp tested the current version of the Universal Viewer (UV) against Web Content Accessibility Guidelines (WCAG) 2.2 and came up with a list of suggestions to enhance compliance.  

As accessibility is a top priority, the UV Steering Group decided to host a community sprint - an event focused on tackling these suggestions while boosting engagement and fostering collaboration. Sprints are typically internal, but the community sprint was open to anyone from the broader open-source community.

Zoom call showing participants
18 participants from 6 organisations teamed up to make the Universal Viewer more accessible - true collaboration in action!

The sprint took place for two weeks in October. Everyone brought unique skills and perspectives, making it a true community effort.

Software engineers worked on development tasks, such as improving screen reader compatibility, fixing keyboard navigation problems, and enhancing element visibility. Testing engineers ensured functionality, and non-technical participants assisted with planning, translations and management.

The group had different levels of experience, which made it important to provide a supportive environment for learning and collaboration.  

The project board at the end of the Sprint - not every issue was finished, but the sprint was still a success with over 30 issues completed in two weeks.

Some of those involved shared their thoughts on the sprint: 

Bruce Herman - Development Team Lead, British Library: 'It was a great opportunity to collaborate with other development teams in the BL and the UV Community.'

Demian Katz - Director of Library Technology, Villanova University: 'As a long-time member of the Universal Viewer community, it was really exciting to see so many new people working together effectively to improve the project.'

Sara Weale - Head of Web Design & Development, Llyfrgell Genedlaethol Cymru - National Library of Wales: 'Taking part in this accessibility sprint was an exciting and rewarding experience. As Scrum Master, I had the privilege of facilitating the inception, daily stand-ups, and retrospective sessions, helping to keep the team focused and collaborative throughout. It was fantastic to see web developers from the National Library of Wales working alongside the British Library, Falvey Library (Villanova University), and other members of the Universal Viewer Steering Group.

This sprint marked the first time an international, cross-community team came together in this way, and the sense of shared purpose and camaraderie was truly inspiring. Some of the key lessons I took away from the sprint was the need for more precise task estimation, as well as the value of longer sprints to allow for deeper problem-solving. Despite these challenges, the fortnight was defined by excellent communication and a strong collective commitment to addressing accessibility issues.

Seeing the team come together so quickly and effectively highlighted the power of collaboration to drive meaningful progress, ultimately enhancing the Universal Viewer for a more inclusive future.'

BL Test Engineers: 

Damian Burke: 'Having worked on UV for a number of years, this was my first community sprint. What stood out for me was the level of collaboration and goodwill from everyone on the team. How quickly we formed into a working agile team was impressive. From a UV tester's perspective, I learned a lot from using new tools like Vercel and exploring GitHub's advanced functionality.'

Alex Rostron: 'It was nice to collaborate and work with skilled people from all around the world to get a good number of tickets over the line.'

Danny Taylor: 'I think what I liked most was how organised the sprints were. It was great to be involved in my first BL retrospective.'

Miro board with answers to the question 'what went well during this sprint?'

 

Positive reactions to 'how I feel after the sprint'
A Miro board was used for Sprint planning and the retrospective – a review meeting after the Sprint where we determined what went well and what we would improve for next time.

Experience from the sprint helped us to organise a further sprint within the UV Steering Group for admin-related work, aimed at improving documentation to ensure clearer processes and better support for contributors. Looking ahead, we're planning to release UV 4.1.0 in the new year, incorporating the enhancements we've made - we’ll share another update when the release candidate is ready for review.

Building on the success of the community sprint, we're excited to make these collaborative efforts a key part of our strategic roadmap. Join us and help shape the future of UV!

22 November 2024

Collaborating to improve usability on the Universal Viewer project

Open source software is a valuable alternative to commercial software, but its decentralised nature often leads to less than polished user interfaces. This has also been the case for the Universal Viewer (UV), despite attempts over the years to improve the user experience (UX) for viewing digital collections. Improving the usability of the UV is just one of the challenges that the British Library's UV team have taken on. We've even recruited an expert volunteer to help!

Digital Curator Mia Ridge talks to UX expert Scott Jenson about his background in user experience design, his interest in working with open source software, and what he's noticed so far about the user experience of the Universal Viewer.

Mia: Hi Scott! Could you tell our readers a little about your background, and how you came to be interested in the UX of open source software?

Scott: I’ve been working in commercial software my entire life (Apple, Google and a few startups) and it became clear over time that the profit motive is often at odds with users’ needs. I’ve been exploring open source as an alternative.

Mia: I noticed your posts on Mastodon about looking for volunteer opportunities as you retired from professional work at just about the time that Erin (Product Owner for the Universal Viewer at the British Library) and I were wondering how we could integrate UX and usability work into the Library's plans for the UV. Have you volunteered before, and do you think it'll become a trend for others wondering how to use their skills after retirement?

Scott: Google has a program where you can leave your position for 3 months and volunteer on a project within Google.org. I worked on a project to help California Forestry analyse and map out the most critical areas in need of treatment. It was a lovely project and felt quite impactful. It was partly due to that project that put me on this path.

Mia: Why did you say 'yes' when I approached you about volunteering some time with us for the UV?

Scott: I lived in London for 4 years working for a mobile OS company called Symbian so I’ve spent a lot of time in London. While living in London, I even wrote my book in the British Library! So we have a lot in common. It was an intersection of opportunity and history I just couldn’t pass up.

Mia: And what were your first impressions of the project? 

Scott: It was an impactful project with a great vision of where it needed to go. I really wanted to get stuck in and help if I could.

Mia: We loved the short videos you made that crystallised the issues that users encounter with the UV but find hard to describe. Could you share one?

Scott: The most important one is something that happens to many projects that evolve over time: a patchwork of metaphors that accrue. In this case the current UV has at least 4 different ways to page through a document, 3 of which are horizontal and 1 vertical. This just creates a mishmash of conflicting visual prompts for users and simplifying that will go a long way to improve usability.

Screenshot of the Viewer with target areas marked up
A screenshot from Scott's video showing multiple navigation areas on the UV

How can you help improve the usability of the Universal Viewer?

We shared Scott's first impressions with the UV Steering Group in September, when he noted that the UV screen had 32 'targets' and 8 areas where functionality had been sprinkled over time, making it hard for users to know where to focus. We'd now like to get wider feedback on future directions.

Scott's made a short video that sets out some of the usability issues in the current layout of the Universal Viewer, and some possible solutions. We think it's a great provocation for discussion by the community! To join in and help with our next steps, you can post on the Universal Viewer Slack (request to join here) or GitHub.