Digital scholarship blog

Enabling innovative research with British Library digital collections

Introduction

Tracking exciting developments at the intersection of libraries, scholarship and technology.

08 January 2025

2024 Year in Review - Digital Scholarship Training Programme

Nora McGregor, Digital Curator and manager of the Digital Scholarship Training Programme, reflects on a year of delivering digital upskilling training to colleagues at the British Library, part of the Digital Research Team's focus on embedding Digital Humanities in the British Library.

2024 was a strange and difficult year, to say the least, for us and all our lovely colleagues across the whole of the British Library as we contended daily with the ongoing effects of a cyber-attack disrupting just about every aspect of our work. Not to be cowed by criminality however, the Digital Research Team dug in and ensured the Digital Scholarship Training Programme (DSTP) continued without fail.

From our experience during the pandemic, we knew that in times of major disruption, British Library staff do not stand still. They focus on what they can do, including prioritising their upskilling, and they have come to count on the DSTP as a kind of refuge whilst temporarily separated from their collections and normal workload.

So it’s with gratitude to my colleagues in the Digital Research Team, and to BL staff for their engagement, that I reflect proudly on a challenging year in which we managed to deliver a whopping 39 individual training events with nearly 900 attendees!

What we learned in 2024

Our training programme this year covered these topic priorities through a variety of talks, hands-on sessions, reading groups and formal workshops & courses: 

  • State-of-the-art Automatic Text Recognition (ATR) technologies
  • Useful data science, machine learning and AI applications for analysing and enhancing GLAM digital collections and data​
  • The intersection of climate change + Digital Humanities
  • Digital tools and methods to support the Library's Race Equality Action Plan
  • WikiData, WikiSource, Wikimedia Commons
  • OpenRefine for data-wrangling 
  • Collections as Data
  • Making the most of the IIIF standard

We’re especially thankful to all the academics and professionals who contributed to our learning throughout the year by sharing their projects, experience and expertise with us! If you’d like to be part of our programme in 2025, get in touch with us at [email protected] with your idea; we’d love to hear from you.

2024 Year in Review - External Infographic by Nora McGregor

My Personal Highlights 

In the coming months I will be interviewing my fellow Digital Curators to get their views on highlights from the 2024 Digital Scholarship Training Programme, either favourite events they attended or programmed in 2024 and topic areas they’re excited about this year. No easy ask actually, as I know they, like me, will have found every event spectacularly interesting and useful, but to highlight just a few for you...

21st Century Talks

Our 21st Century Curatorship talk series is looked after by Digital Curators Stella Wisdom and Adi Keinan-Schoonbaert. These are one-hour invited guest lectures, held once or twice a month, where we learn about exciting, innovative projects and research at the intersection of cultural heritage collections and new technologies. These talks are pitched at complete beginners – we try not to assume knowledge, so that anyone from any department can come along! A few of my favourite talks were about these projects:

  • DE-BIAS - Detecting and cur(at)ing harmful language in cultural heritage collections | Europeana PRO
    Kerstin Herlt and Kerstin Arnold introduced us to the DE-BIAS project which aims to detect and contextualise potentially harmful language in cultural heritage collections. Working with themes like migration and colonial past, gender and sexual identity, ethnicity and ethno-religious identity, the project collaborates with minority communities to better understand the stories behind the language used - or behind the gaps apparent. We learned about the development of the vocabulary and the tools the project has created.

  • The Print and Probability Project: From Restoration Era Printing to an Interim English Short Title Catalogue
    Nikolai Vogler gave us an entertaining view of a selection of findings from the University of California’s Print & Probability project, an interdisciplinary research group at the intersection of book history, computer vision, and machine learning that seeks to discover Restoration-era letterpress printers whose identities have eluded scholars for several hundred years. He also presented his work on creating an interim English Short Title Catalogue (ESTC) in response to the cyber-attack on the Library in 2023, a pursuit for which colleagues were incredibly grateful!

  • “Dark Matter: X%” - how many early modern Hungarian books disappeared without any trace?
    This was such a fascinating talk by Péter Király, software developer and digital humanities researcher at the Göttingen computation centre, Germany. Estimating the unknown is always an interesting endeavour. There is a registry of surviving books, and we have collective knowledge about lost books, but how many early Hungarian printings have been lost without any historical trace? His research group transformed the analytical bibliography "Régi Magyarországi Nyomtatványok" (Early Hungarian Printings) into a database, and mathematical models from the biologists' toolbox were employed to help estimate the answer. Analysis of the database also highlights unknown or less investigated areas, and enables them to extend previous research that focused on a particular time range to the whole period (such as religious trends during the Reformation and Counter-Reformation, or changes of genres over time).
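The "biologists' toolbox" here refers to unseen-species estimation: if many editions survive in only one or two copies, that pattern lets you estimate how many editions left no copies at all. As an illustrative sketch (the project's actual models aren't specified in the talk summary), the classic Chao1 estimator works like this:

```python
from collections import Counter

def chao1_estimate(sightings):
    """Chao1 lower-bound estimate of total richness.

    sightings: one entry per surviving copy; an edition known from
    k copies is a 'species' with abundance k.
    """
    counts = Counter(sightings)
    s_obs = len(counts)                              # editions with >= 1 surviving copy
    f1 = sum(1 for c in counts.values() if c == 1)   # editions known from a single copy
    f2 = sum(1 for c in counts.values() if c == 2)   # editions known from two copies
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2             # bias-corrected variant
    return s_obs + f1 * f1 / (2 * f2)

# Toy data: 6 editions observed, four of them from a single copy -
# the abundance of singletons implies further editions lost without trace.
copies = ["A", "A", "A", "B", "B", "C", "D", "E", "F"]
print(chao1_estimate(copies))  # estimates 14.0 total editions
```

The intuition: the more singletons relative to doubletons, the more of the underlying population you have probably never seen.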

Hack & Yacks

I have the privilege of programming and leading this particular series of events, and they are my favourite days in the calendar! These are our casual, two-hour monthly meet-ups where we all take some time for a hands-on exploration of new tools, techniques and applications. No previous experience is ever needed – these are aimed at complete beginners (we’re usually learning something new too!) – and we welcome colleagues from across the Library to come have a play! Some sessions are more "yack" than "hack", while others are quieter hacking, depending on the topic, but whatever the balance they're always illuminating.

  • Introduction to AI and Machine learning was great fun for me personally as I had the chance to give staff an interactive and hands-on introduction to concepts around AI and ML, as it relates to library work, and play around with some open machine learning tools. The session was based on much of the text and activities offered in this topic guide AI & ML in Libraries Literacies – Digital Scholarship & Data Science Essentials for Library Professionals and it was a useful way for me to test the content directly with its intended audience!

  • Catalogues as Data was a session run by Harry Lloyd, our Research Software Engineer extraordinaire, and Rossitza Atanassova, Digital Curator, as a two-part guided exploration of printed catalogues as data, working with OCR output and corpus linguistic analysis. In the first half we followed steps in a Jupyter Notebook to extract catalogue entries from OCR text, troubleshoot errors in the algorithm, and investigate Named Entity Recognition techniques. In the second half we explored catalogue entries using corpus linguistic techniques in AntConc, gaining a sense of how cataloguing practice and the importance of different terms change over time.
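The first step of that workflow – segmenting a stream of OCR text into individual catalogue entries – can be sketched very simply. This is a hypothetical miniature, not the session's actual notebook: the entry text and the shelfmark-like pattern used to split it are both assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical OCR output: each catalogue entry begins with a
# shelfmark-like token such as "1234.a.5." (the pattern is an assumption).
ocr_text = """
1234.a.5. SMITH (John) A Treatise on Botany. London, 1845.
5678.b.2. JONES (Mary) Flora of Bengal. Calcutta, 1852.
"""

# Split the stream into entries wherever a shelfmark pattern starts a line.
entry_pattern = re.compile(r"^\d+\.[a-z]\.\d+\.", re.MULTILINE)
starts = [m.start() for m in entry_pattern.finditer(ocr_text)]
entries = [ocr_text[s:e].strip()
           for s, e in zip(starts, starts[1:] + [len(ocr_text)])]

# A first corpus-linguistic step: term frequencies across entries,
# the kind of counting that a concordancer like AntConc automates.
tokens = re.findall(r"[A-Za-z]+", " ".join(entries).lower())
print(len(entries), Counter(tokens).most_common(3))
```

In practice the troubleshooting the session described is exactly this: OCR noise means the pattern misses or over-matches, and the rules need iterative refinement.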

Digital Scholarship Reading Group

These monthly discussions led by Digital Curators Mia Ridge and Rossitza Atanassova, are always open to any of our BL colleagues & students, regardless of job title or department. Discussions are regularly attended by colleagues from a range of departments including curators, reference specialists, technology, and research services.

My favourite session of the year by far was “No stupid questions, AI in Libraries”, a lovely meandering session we held in December and a great way to wrap up the year. Instead of discussing any particular reading, we all shared bits of what we had read or learned independently on the topic of AI in libraries, and had some good-natured debate about where we believe it’s all headed for us on personal and professional levels. Though no readings were required, some were offered in case folks wanted to swot up.

Formal Workshops

We also programme formal courses as needed, and this year we focussed very much on building our knowledge of the Wikimedia universe. I thoroughly enjoyed the lessons we got from Lucy Hinnie and Stuart Prior, which covered nearly every aspect of Wikimedia, and we’ll be doing much more with this new knowledge, particularly WikiData, in 2025!

 

23 December 2024

AI (and machine learning, etc) with British Library collections

Machine learning (ML) is a hot topic, especially when it’s hyped as ‘AI’. How might libraries use machine learning / AI to enrich collections, making them more findable and usable in computational research? Digital Curator Mia Ridge lists some examples of external collaborations, internal experiments and staff training with AI / ML and digitised and born-digital collections.

Background

The trust that the public places in libraries is hugely important to us - all our 'AI' should be 'responsible' and ethical AI. The British Library was a partner in Sheffield University's FRAIM: Framing Responsible AI Implementation & Management project (2024). We've also used lessons from the projects described here to draft our AI Strategy and Ethical Guide.

Many of the projects below have contributed to our Digital Scholarship Training Programme and our Reading Group has been discussing deep learning, big data and AI for many years. It's important that libraries are part of conversations about AI, supporting AI and data literacy and helping users understand how ML models and datasets were created.

If you're interested in AI and machine learning in libraries, museums and archives, keep an eye out for news about the AI4LAM community's Fantastic Futures 2025 conference at the British Library, 3-5 December 2025. If you can't wait that long, join us for the 'AI Debates' at the British Library.

Using ML / AI tools to enrich collections

Generative AI tends to get the headlines, but at the time of writing, tools that use non-generative machine learning to automate specific parts of a workflow have more practical applications for cultural heritage collections. That is, 'AI' is currently more process than product.

Text transcription is a foundational task that makes digitised books and manuscripts more accessible to search, analysis and other computational methods. For example, oral history staff have experimented with speech transcription tools, raising important questions, and theoretical and practical issues for automatic speech recognition (ASR) tools and chatbots.

We've used Transkribus and eScriptorium to transcribe handwritten and printed text in a range of scripts and alphabets.

Creating tools and demonstrators through external collaborations

Mining the UK Web Archive for Semantic Change Detection (2021)

This project used word vectors with web archives to track words whose meanings changed over time. Resources: DUKweb (Diachronic UK web) and blog post ‘Clouds and blackberries: how web archives can help us to track the changing meaning of words’.
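The core move in this kind of semantic change detection is to train word vectors on snapshots from different periods and compare a word's own vectors across time: low similarity flags a meaning shift. A minimal sketch with toy three-dimensional vectors (real diachronic embeddings, like those in DUKweb, have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for embeddings trained on two snapshots
# of the web archive (the values are illustrative, not from DUKweb).
blackberry_2000 = [0.9, 0.1, 0.0]   # close to 'fruit'-like dimensions
blackberry_2010 = [0.2, 0.1, 0.9]   # drifted toward 'device'-like dimensions

# Low cross-period self-similarity marks a semantic-change candidate.
drift = 1 - cosine(blackberry_2000, blackberry_2010)
print(round(drift, 2))
```

One real-world wrinkle this sketch hides: embeddings trained separately are not in the same coordinate space, so the vectors must first be aligned (e.g. with an orthogonal Procrustes rotation) before the comparison is meaningful.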

Graphs showing how words associated with the words blackberry, cloud, eta and follow changed over time.
From blackberries to clouds... word associations change over time

Living with Machines (2018-2023)

Our Living With Machines project with The Alan Turing Institute pioneered new AI, data science and ML methods to analyse masses of newspapers, books and maps to understand the impact of the industrial revolution on ordinary people. Resources: short video case studies, our project website, final report and over 100 outputs in the British Library's Research Repository.

Outputs that used AI / machine learning / data science methods such as lexicon expansion, computer vision, classification and word embeddings included:

Tools and demonstrators created via internal pilots and experiments

Many of these examples were enabled by the skills and enthusiasm for ML experiments of our in-house Research Software Engineers and the Living with Machines (LwM) team at the British Library, in combination with long-serving Library staff's knowledge of collections records and processes:

British Library resources for re-use in ML / AI

Our Research Repository includes datasets suitable for ground truth training, including 'Ground truth transcriptions of 18th & 19th century English language documents relating to botany from the India Office Records'.

Our ‘1 million images’ on Flickr Commons have inspired many ML experiments, including:

The Library has also shared models and datasets for re-use on the machine learning platform Hugging Face.

18 December 2024

The challenges of AI for oral history: theoretical and practical issues

Oral History Archivist Charlie Morgan provides examples of how AI-based tools integrated into workflows might affect oral historians' consideration of orality and silence, in the second of two posts on a talk he gave with Digital Curator Mia Ridge at the 7th World Conference of the International Federation for Public History in Belval, Luxembourg. His first post proposed some key questions for oral historians thinking about AI, and shared an example of automatic speech recognition (ASR) tools in practice.

While speech to text once seemed at the cutting edge of AI, software designers are now eager to include additional functions. Many incorporate their own chatbots or other AI ‘helpers’ and the same is true of ‘standard’ software. Below you can see what happened when I asked the chatbots in Otter and Adobe Acrobat some questions about other transcribed clips from the ‘Lives in Steel’ CD:

Screenshot of search and chatbot interactions with transcribed text
A composite image of chatbot responses to questions about transcribed clips

In Otter, the chatbot does well at answering a question on sign language but fails to identify the accent or dialect of the speaker. This is a good reminder of the limits of these models and how, without any contextual information, they cannot understand the interview beyond textual analysis. Oral historians in the UK have long understood interviews as fundamentally oral sources and current AI models risk taking us away from this.

In Adobe I tried asking a much more subjective question around emotion in the interview. While the chatbot does answer, it is again worth remembering the limits of this textual analysis, which, for example, could not identify crying, laughter or pitch change as emotion. It would also not understand the significance of any periods of silence. On our panel at the IFPH2024 conference in Luxembourg Dr Julianne Nyhan noted how periods of silence tend to lead speech-to-text models to ‘hallucinate’ so the advice is to take them out; the problem is that oral history has long theorised the meaning and importance of silence.

Alongside the chatbot, Adobe also includes a degree of semantic searching where a search for steel brings up related words. This in itself might be the biggest gift new technologies offer to catalogue searching (shown expertly in Placing the Holocaust) – helping us to move away from what Mia Ridge calls ‘the tyranny of the keyword’.

However, the important thing is perhaps not how well these tools perform but the fact they exist in the first place. Oral historians and archivists who, for good reasons, are hesitant about integrating AI into their work might soon find it has happened anyway. For example, Zencastr, the podcasting software we have used since 2020 for remote recordings, now has an in-built AI tool. Robust principles on the use of AI are essential then not just for new projects or software, but also for work we are already doing and software we are already using.

The rise of AI in oral history raises theoretical questions around orality and silence, but must also be considered in terms of practical workflows: Do participation and recording agreements need to be amended? How do we label AI-generated metadata in catalogue records, and should we be labelling human-generated metadata too? Do AI tools change the risks and rewards of making oral histories available online? We can only answer these questions through critical engagement with the tools themselves.