Digital scholarship blog

Introduction

Tracking exciting developments at the intersection of libraries, scholarship and technology.

13 March 2025

Fantastic Futures 2025 (FF2025) Call for Proposals

Fantastic Futures 2025: AI Everywhere, All at Once 

AI4LAM’s annual conference, December 3 – 5, 2025, British Library, London 

The British Library and the Programme Committee are delighted to invite proposals for presentations and workshops for the Fantastic Futures 2025 (FF2025) conference.  

Fantastic Futures is the annual conference of the AI4LAM (Artificial Intelligence for Libraries, Archives and Museums) community. Submissions are invited from colleagues around the world about their organisations, collections, interests and experience with Artificial Intelligence (AI) and Machine Learning (ML) technologies applied to or developed with cultural, research and heritage collections. This includes practitioners in the GLAM (Galleries, Libraries, Archives, Museums) sector as well as Digital Humanities, Arts and Social Sciences, Data, Information and Computer Science researchers in Higher Education. 

Key dates 

  • Call for proposals shared: Thursday 13 March 2025 
  • Conference submission form opens: TBC 
  • Proposal submission deadline: midnight anywhere, Sunday 18 May 2025 
  • Notification of acceptance: 14 July 2025 
  • Conference dates: December 3 – 5, 2025 
  • Location: British Library, London, onsite – with some livestreams and post-event videos 

FF2025 Theme: AI Everywhere, All at Once 

We invite presentations on the theme of 'AI Everywhere, All at Once'. While AI has a long history in academia and practice, the release of public language models like ChatGPT propelled AI into public consciousness. The sudden appearance of AI 'tools' in the software we use every day, government consultations on AI and copyright, and the hype around Artificial Intelligence mean that libraries, museums and archives must understand what AI means for them. Should they embrace it, resist it or fear it? How does it relate to existing practices and services, how can it help or undermine staff, and how do we keep up with rapid changes in the field? 

There are many opportunities and many challenges in delivering AI that creates rich, delightful and immersive experiences of GLAM collections and spaces for the public, and meets the needs of researchers for relevant, reliable and timely information. Challenges range from the huge – environmental and economic sustainability, ensuring alignment with our missions, ethical and responsible AI, human-centred AI, ensuring value for money – to the practical – evaluation, scalability, cyber security, multimodal collections – and throughout it all, managing the pace of change. 

Our aim is to promote interdisciplinary conversations that foster broader understandings of AI methods, practices and technologies and enable critical reflections about collaborative approaches to research and practice. 

Themes   

We’re particularly interested in proposals that cover these themes:   

  • Ethical and Responsible AI 
  • Human-Centred AI / the UX of AI 
  • Trust, AI literacy and society 
  • Building AI systems for and with staff and users 
  • Cyber-security and resilience 
  • Interoperability and standards  
  • Benchmarking AI / machine learning 
  • Regional, national, international approaches to AI 
  • Environmental sustainability  

Formats for presentations (Thursday and Friday, December 4-5) 

  • Lightning talk: 5 mins. These might pitch an idea, call for collaborators, throw out a provocation or just provide a short update 
  • Poster: perfect for project updates – what went well, what would you do differently, what lessons can others take? 
  • Short presentation: 15 mins   
  • Long presentation: 30 mins 
  • Panel: 45 mins, multiple presenters with short position statements then discussion 

Formats for workshops or working group sessions (Wednesday December 3) 

  • Formal, instructor-led sessions, including working groups, tutorials, hands-on workshops – 1 or 2 hours 
  • Informal, unstructured sessions, including unconferences, meetups, hacking – 1 or 2 hours 
  • Digital showcase (demo): 30 mins 

We value the interactions that an in-person event enables, so the default mode for this event is in-person presentations. However, if your proposal is accepted for inclusion in the conference but you are not able to travel to London, we can consider arrangements for making a virtual presentation on a case-by-case basis. Please contact the Programme Committee at [email protected] to discuss. 

The conference will be held over three days: one day of workshops and other events, and two days of formal sessions. The social programme will include opportunities for informal networking.  

Plenary sessions on Thursday and Friday will be livestreamed, recorded and published. 

Find out more and get updates 

  • Organisers: Rossitza Atanassova, Neil Fitzgerald and Mia Ridge, British Library

Further details about the conference submission process and registration will be supplied soon. 

24 January 2025

Universal Viewer v4.1.0 is here!

We’re excited to announce the release of Universal Viewer (UV) version 4.1.0, packed with new updates and features. 

Universal Viewer image controls
New image manipulation controls in UV 4.1.0

 

This version builds on the momentum from our community accessibility sprint, where the wider UV community came together to address key usability challenges. Highlights of the new release include: 

Accessibility Improvements:

  • Easier navigation for keyboard-only users.
  • Better support for assistive technologies such as screen readers.
  • Improved contrast and visibility of page elements.

New Features:

  • Image Controls: Adjust brightness, contrast, and saturation directly within the viewer.
  • Index Panel Configuration: A new setting allows the index panel to open by default when viewing collections. 

Bug Fixes & Security Updates:

  • Several bugs resolved to enhance stability and performance.
  • Dependency updates to ensure the Universal Viewer remains secure and up to date.

For the full details of what’s new, check out the release notes on GitHub.

Interested in joining the Universal Viewer community? To get involved join us on Slack, or follow UV on Bluesky or Mastodon to stay connected.

08 January 2025

2024 Year in Review - Digital Scholarship Training Programme

Nora McGregor, Digital Curator and manager of the Digital Scholarship Training Programme, reflects on a year of delivering digital upskilling training to colleagues at the British Library, part of the Digital Research Team's focus on Embedding Digital Humanities in the British Library.

2024 was a strange and difficult year, to say the least, for us and all our lovely colleagues across the whole of the British Library as we contended daily with the ongoing effects of a cyber-attack disrupting just about every aspect of our work. Not to be cowed by criminality however, the Digital Research Team dug in and ensured the Digital Scholarship Training Programme (DSTP) continued without fail.

From our experience during the pandemic, we knew that in times of major disruption, British Library staff do not stand still. They focus on what they can do, including prioritising their upskilling, and have come to count on the DSTP as a kind of refuge whilst temporarily separated from their collections and normal workload.

So it’s with gratefulness to my colleagues in the Digital Research Team, and to BL staff for their engagement, that I reflect proudly on a challenging year where we managed to deliver a whopping 39 individual training events with nearly 900 attendees!   

What we learned in 2024

Our training programme this year covered these topic priorities through a variety of talks, hands-on sessions, reading groups and formal workshops & courses: 

  • State-of-the-art Automatic Text Recognition (ATR) technologies
  • Useful data science, machine learning and AI applications for analysing and enhancing GLAM digital collections and data​
  • The intersection of climate change + Digital Humanities
  • Digital tools and methods to support the Library's Race Equality Action Plan
  • WikiData, WikiSource, Wikimedia Commons
  • OpenRefine for data-wrangling 
  • Collections as Data
  • Making the most of the IIIF standard

We’re especially thankful for all the academics & professionals who contributed to our learning throughout the year by sharing their projects, experience and expertise with us! If you’d like to be part of our programme in 2025 get in touch with us at [email protected] with your idea, we’d love to hear from you.

2024 Year in Review - External Infographic by Nora McGregor

My Personal Highlights 

In the coming months I will be interviewing my fellow Digital Curators to get their views on highlights from the 2024 Digital Scholarship Training Programme, either favourite events they attended or programmed in 2024 and topic areas they’re excited about this year. No easy ask actually, as I know they, like me, will have found every event spectacularly interesting and useful, but to highlight just a few for you...

21st Century Talks

Our 21st Century Curatorship talk series is looked after by Digital Curators Stella Wisdom and Adi Keinan-Schoonbaert. These are one-hour invited guest lectures, held once or twice a month, where we learn about exciting, innovative projects and research at the intersection of cultural heritage collections and new technologies. These talks are pitched at complete beginners – we try not to assume knowledge so that anyone from any department can come along! A few of my favourite talks in particular were from these projects:

  • DE-BIAS - Detecting and cur(at)ing harmful language in cultural heritage collections | Europeana PRO
    Kerstin Herlt and Kerstin Arnold introduced us to the DE-BIAS project which aims to detect and contextualise potentially harmful language in cultural heritage collections. Working with themes like migration and colonial past, gender and sexual identity, ethnicity and ethno-religious identity, the project collaborates with minority communities to better understand the stories behind the language used - or behind the gaps apparent. We learned about the development of the vocabulary and the tools the project has created.

  • The Print and Probability Project: From Restoration Era Printing to an Interim English Short Title Catalogue
    Nikolai Vogler gave us an entertaining view of a selection of findings from the University of California’s Print & Probability project, an interdisciplinary research group at the intersection of book history, computer vision, and machine learning that seeks to discover Restoration-era letterpress printers whose identities have eluded scholars for several hundred years. He also presented his work on creating an interim English Short Title Catalogue (ESTC) in response to the cyber-attack on the Library in 2023, a pursuit for which colleagues were incredibly grateful!

  • “Dark Matter: X%” - how many early modern Hungarian books disappeared without any trace?
    This was such a fascinating talk by Péter Király, software developer and digital humanities researcher at the Göttingen computation centre, Germany. Estimating the unknown is always an interesting endeavour. There is a registry of surviving books, and we have collective knowledge about lost books, but how many early Hungarian printings have been lost without any historical trace? Their research group transformed the analytical bibliography "Régi Magyarországi Nyomtatványok" (Early Hungarian Printings) into a database and employed mathematical models from the toolbox of biologists to help estimate this. The analysis of the database also highlights unknown or less investigated areas and enables them to extend research previously focused on a particular time range to the whole period (such as religious trends during the Reformation and Counter-Reformation, or the changes of genres over time).

Hack & Yacks

I have the privilege of programming and leading this particular series of events, and they are my favourite days in the calendar! These are our casual, two-hour monthly meet-ups where we all take some time to have a hands-on exploration of new tools, techniques and applications. No previous experience is ever needed; these are aimed at complete beginners (we’re usually learning something new too!) and we welcome colleagues from across the Library to come have a play! Some sessions are more "yack" than "hack", while others are more quiet hacking, depending on the topic, but no matter the balance they're always illuminating.

  • Introduction to AI and Machine Learning was great fun for me personally, as I had the chance to give staff an interactive and hands-on introduction to concepts around AI and ML as they relate to library work, and to play around with some open machine learning tools. The session was based on much of the text and activities offered in the topic guide AI & ML in Libraries – Digital Scholarship & Data Science Topic Guides for Library Professionals, and it was a useful way for me to test the content directly with its intended audience!

  • Catalogues as Data was a session run by Harry Lloyd, our Research Software Engineer Extraordinaire, and Rossitza Atanassova, Digital Curator, as a two-part guided exploration of printed catalogues as data, working with OCR output and corpus linguistic analysis. In the first half we followed steps in a Jupyter Notebook to extract catalogue entries from OCR text, troubleshoot errors in the algorithm, and investigate Named Entity Recognition (NER) techniques (see the sketch below). In the second half we explored catalogue entries using corpus linguistic techniques in AntConc, gaining a sense of how cataloguing practice and the importance of different terms change over time.
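For readers curious what the NER step looks like in practice, here is a minimal, illustrative sketch – not the session's actual notebook. The use of spaCy and the catalogue entry text are assumptions made purely for demonstration:

```python
# Illustrative sketch of Named Entity Recognition over an OCR'd catalogue entry.
# Not the Catalogues as Data notebook; the entry text below is invented.
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

catalogue_entry = (
    "SMITH (John) A Voyage to the Levant ... London: printed for T. Cadell, 1791."
)

doc = nlp(catalogue_entry)
for ent in doc.ents:
    # Entity labels include PERSON, GPE (places), ORG and DATE.
    print(ent.text, ent.label_)
```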

Digital Scholarship Reading Group

These monthly discussions led by Digital Curators Mia Ridge and Rossitza Atanassova, are always open to any of our BL colleagues & students, regardless of job title or department. Discussions are regularly attended by colleagues from a range of departments including curators, reference specialists, technology, and research services.

My favourite session of the year by far was “No stupid questions, AI in Libraries”, a lovely meandering session we held in December and a great way to wrap up the year. Instead of discussing any particular reading, we all shared bits about what we had read or learned about independently on the topic of AI in Libraries and had some good-natured debate about where we believe it’s all headed for us on personal and professional levels. Though no readings were required, these were offered in case folks wanted to swot up:

Formal Workshops

We also programme formal courses as needed, and this year we focussed very much on building our knowledge of the Wikimedia universe. I thoroughly enjoyed the lessons we got from Lucy Hinnie and Stuart Prior, which covered nearly every aspect of Wikimedia, and we'll be doing much more with this new knowledge, particularly WikiData, in 2025!

 

23 December 2024

AI (and machine learning, etc) with British Library collections

Machine learning (ML) is a hot topic, especially when it’s hyped as ‘AI’. How might libraries use machine learning / AI to enrich collections, making them more findable and usable in computational research? Digital Curator Mia Ridge lists some examples of external collaborations, internal experiments and staff training with AI / ML and digitised and born-digital collections.

Background

The trust that the public places in libraries is hugely important to us - all our 'AI' should be 'responsible' and ethical AI. The British Library was a partner in Sheffield University's FRAIM: Framing Responsible AI Implementation & Management project (2024). We've also used lessons from the projects described here to draft our AI Strategy and Ethical Guide.

Many of the projects below have contributed to our Digital Scholarship Training Programme and our Reading Group has been discussing deep learning, big data and AI for many years. It's important that libraries are part of conversations about AI, supporting AI and data literacy and helping users understand how ML models and datasets were created.

If you're interested in AI and machine learning in libraries, museums and archives, keep an eye out for news about the AI4LAM community's Fantastic Futures 2025 conference at the British Library, 3-5 December 2025. The conference themes have been published and the Call for Proposals will be open soon.

You can also watch public events we've previously hosted on AI in libraries (January 2025) and Safeguarding Tomorrow: The Impact of AI on Media and Information Industries (February 2024).

Using ML / AI tools to enrich collections

Generative AI tends to get the headlines, but at the time of writing, tools that use non-generative machine learning to automate specific parts of a workflow have more practical applications for cultural heritage collections. That is, 'AI' is currently more process than product.

Text transcription is a foundational task that makes digitised books and manuscripts more accessible to search, analysis and other computational methods. For example, oral history staff have experimented with speech transcription tools, raising important questions, and theoretical and practical issues for automatic speech recognition (ASR) tools and chatbots.

We've used Transkribus and eScriptorium to transcribe handwritten and printed text in a range of scripts and alphabets. For example:

Creating tools and demonstrators through external collaborations

Mining the UK Web Archive for Semantic Change Detection (2021)

This project used word vectors with web archives to track words whose meanings changed over time. Resources: DUKweb (Diachronic UK web) and blog post ‘Clouds and blackberries: how web archives can help us to track the changing meaning of words’.
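The DUKweb resources and blog post describe the project's actual approach; purely as an illustration of the underlying idea, the sketch below (using gensim, with invented toy sentences rather than web archive data) shows how word vectors trained on text from different periods can be compared via a word's nearest neighbours:

```python
# Illustrative sketch only: tracking a shift in a word's neighbours over time.
# The two "corpora" are toy sentences, so the resulting vectors are meaningless;
# the real project trained on the UK Web Archive and used more robust methods.
from gensim.models import Word2Vec

corpus_2000 = [
    ["picked", "a", "blackberry", "from", "the", "hedge"],
    ["the", "blackberry", "bush", "was", "full", "of", "fruit"],
]
corpus_2010 = [
    ["checked", "email", "on", "my", "blackberry", "phone"],
    ["the", "blackberry", "handset", "buzzed", "with", "messages"],
]

model_2000 = Word2Vec(corpus_2000, vector_size=50, min_count=1, seed=1)
model_2010 = Word2Vec(corpus_2010, vector_size=50, min_count=1, seed=1)

# A large change in a word's nearest neighbours between periods hints at
# a change in meaning.
print(model_2000.wv.most_similar("blackberry", topn=3))
print(model_2010.wv.most_similar("blackberry", topn=3))
```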

Graphs showing how words associated with the words blackberry, cloud, eta and follow changed over time.
From blackberries to clouds... word associations change over time

Living with Machines (2018-2023)

Our Living With Machines project with The Alan Turing Institute pioneered new AI, data science and ML methods to analyse masses of newspapers, books and maps to understand the impact of the industrial revolution on ordinary people. Resources: short video case studies, our project website, final report and over 100 outputs in the British Library's Research Repository.

Outputs that used AI / machine learning / data science methods such as lexicon expansion, computer vision, classification and word embeddings included:

Tools and demonstrators created via internal pilots and experiments

Many of these examples were enabled by the skills and enthusiasm for ML experiments of the British Library's in-house Research Software Engineers and the Living with Machines (LwM) team, in combination with long-standing Library staff knowledge of collections, records and processes:

British Library resources for re-use in ML / AI

Our Research Repository includes datasets suitable for ground truth training, including 'Ground truth transcriptions of 18th & 19th century English language documents relating to botany from the India Office Records'. 

Our ‘1 million images’ on Flickr Commons have inspired many ML experiments, including:

The Library has also shared models and datasets for re-use on the machine learning platform Hugging Face.

18 December 2024

The challenges of AI for oral history: theoretical and practical issues

Oral History Archivist Charlie Morgan provides examples of how AI-based tools integrated into workflows might affect oral historians' consideration of orality and silence, in the second of two posts on a talk he gave with Digital Curator Mia Ridge at the 7th World Conference of the International Federation for Public History in Belval, Luxembourg. His first post proposed some key questions for oral historians thinking about AI, and shared examples of automatic speech recognition (ASR) tools in practice. 

While speech to text once seemed at the cutting edge of AI, software designers are now eager to include additional functions. Many incorporate their own chatbots or other AI ‘helpers’ and the same is true of ‘standard’ software. Below you can see what happened when I asked the chatbots in Otter and Adobe Acrobat some questions about other transcribed clips from the ‘Lives in Steel’ CD:

Screenshot of search and chatbot interactions with transcribed text
A composite image of chatbot responses to questions about transcribed clips

In Otter, the chatbot does well at answering a question on sign language but fails to identify the accent or dialect of the speaker. This is a good reminder of the limits of these models and how, without any contextual information, they cannot understand the interview beyond textual analysis. Oral historians in the UK have long understood interviews as fundamentally oral sources and current AI models risk taking us away from this.

In Adobe I tried asking a much more subjective question around emotion in the interview. While the chatbot does answer, it is again worth remembering the limits of this textual analysis, which, for example, could not identify crying, laughter or pitch change as emotion. It would also not understand the significance of any periods of silence. On our panel at the IFPH2024 conference in Luxembourg Dr Julianne Nyhan noted how periods of silence tend to lead speech-to-text models to ‘hallucinate’ so the advice is to take them out; the problem is that oral history has long theorised the meaning and importance of silence.

Alongside the chatbot, Adobe also includes a degree of semantic searching where a search for steel brings up related words. This in itself might be the biggest gift new technologies offer to catalogue searching (shown expertly in Placing the Holocaust) – helping us to move away from what Mia Ridge calls ‘the tyranny of the keyword’.

However, the important thing is perhaps not how well these tools perform but the fact they exist in the first place. Oral historians and archivists who, for good reasons, are hesitant about integrating AI into their work might soon find it has happened anyway. For example, Zencastr, the podcasting software we have used since 2020 for remote recordings, now has an in-built AI tool. Robust principles on the use of AI are essential then not just for new projects or software, but also for work we are already doing and software we are already using.

The rise of AI in oral history raises theoretical questions around orality and silence, but must also be considered in terms of practical workflows: Do participation and recording agreements need to be amended? How do we label AI-generated metadata in catalogue records, and should we be labelling human-generated metadata too? Do AI tools change the risks and rewards of making oral histories available online? We can only answer these questions through critical engagement with the tools themselves.

The challenges of AI for oral history: key questions

Oral History Archivist Charlie Morgan sets out some key questions for oral historians thinking about AI, and shares some examples of automatic speech recognition (ASR) tools in practice, in the first of two posts...

Oral history has always been a technologically mediated discipline and so has not been immune to the current wave of AI hype. Some have felt under pressure to ‘do some AI’, while others have gone ahead and done it. In the British Library oral history department, we have been adamant that any use of AI must align practically, legally and ethically with the Library’s AI principles (currently in draft form). While the ongoing effects of the 2023 cyber-attack have also stymied any integration of new technologies into archival workflows, we have begun to experiment with some tools. In September, I was pleased to present on this topic with Digital Curator Mia Ridge at the 7th World Conference of the International Federation for Public History in Belval, Luxembourg. Below is a summary of what I spoke about in our presentation, ‘Listening with machines? The challenges of AI for oral history and digital public history in libraries’.

The ‘boom’ in AI and oral history has mostly focussed on speech recognition and transcription, driven by the release of Trint (2014) and Otter (2016), but especially Whisper (2022). There have also been investigations into indexing, summarising and visualisation, notably from the Congruence Engine project. Oral historians are interested in how AI tools could help with documentation and analysis but many also have concerns. Concerns include, but are not limited to, ownership, data protection/harvesting, labour conditions, environmental costs, loss of human involvement, unreliable outputs and inbuilt biases.

For those of us working with archived collections there are specific considerations: How do we manage AI generated metadata? Should we integrate new technologies into catalogue searching? What are the ethics of working at scale and do we have the experience to do so? How do we factor in interviewee consent, especially since speakers in older collections are now likely dead or uncontactable?

With speech recognition, we are now at a point where we can compare different automated transcripts created at different times. While our work on this topic at the British Library has been minimal, future trials might help us build up enough research data to address the above questions.

Robert Gladders was interviewed by Alan Dein for the National Life Stories oral history project ‘Lives in Steel’ in 1991 and the extract below was featured on the 1993 published CD ‘Lives in Steel’.

The full transcripts for this audio clip are at the end of this post.

Sign Language

We can compare the human transcript of the first line with three automatic speech recognition (ASR) transcripts:

  • Human: Sign language was for telling the sample to the first hand, what carbon the- when you took the sample up into the lab, you run with the sample to the lab​
  • Otter 2020: Santa Lucia Chelan, the sound pachala fest and what cabin the when he took the sunlight into the lab, you know they run with a sample to the lab​
  • Otter 2024: Sign languages for selling the sample, pass or the festa and what cabin the and he took the samples into the lab. Yet they run with a sample to the lab.
  • Whisper 2024: The sand was just for telling the sand that they were fed down. What cabin, when he took the sand up into the lab, you know, at the run with the sand up into the lab

Gladders speaks with a heavy Middlesbrough accent and in all cases the ASR models struggle, but the improvements between 2020 and 2024 are clear. In this case, Otter in 2024 seems to outperform Whisper (‘The sand’ is an improvement on ‘Santa Lucia Chelan’ but it isn’t ‘Sign languages’), but this was a ‘small’ version of Whisper and larger models might well perform better.

One interesting point of comparison is how the models handle ‘sample passer’, mentioned twice in the short extract:

  • Otter 2020: Sentinel pastor / sound the password​
  • Otter 2024: Salmon passer / Saturn passes​
  • Whisper 2024: Santland pass / satin pass

While in all cases the models fail, this would be easy to fix. The aforementioned CD came with its own glossary, which we could feed into a large language model working on these transcriptions. Practically this is not difficult, but it raises some larger questions. Do we need to produce tailored lexicons for every collection? This is time-consuming work, so who is going to do it? Would we label an automated transcript in 2024 that makes use of a human glossary written in 1993 as machine generated, human generated, or both? Moreover, what level of accuracy are we willing to accept, and how do we define accuracy itself?
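One common, if crude, way to define accuracy for ASR output is word error rate (WER): the number of word-level substitutions, insertions and deletions needed to turn the automated transcript into the human one, divided by the length of the human transcript. This is not how we evaluate transcripts at the Library; the minimal sketch below (using the 'Lives in Steel' lines quoted above) simply shows the idea, and how punctuation and spelling choices muddy the comparison:

```python
# Illustrative sketch: word error rate via a word-level edit distance.
# Note: only lowercasing is applied, so punctuation differences count as errors.
def wer(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                   dp[i - 1][j],      # deletion
                                   dp[i][j - 1])      # insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

human = ("Sign language was for telling the sample to the first hand, what carbon the- "
         "when you took the sample up into the lab, you run with the sample to the lab")
whisper_2024 = ("The sand was just for telling the sand that they were fed down. What cabin, "
                "when he took the sand up into the lab, you know, at the run with the sand up into the lab")

print(f"WER (Whisper 2024 vs human): {wer(human, whisper_2024):.2f}")
```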

 

  • Samplepasser: The top man on the melting shop with responsibility for the steel being refined.
  • Sampling: The act of taking a sample of steel from a steel furnace, using a long-handled spoon which is inserted into the furnace and withdrawn.
  • Sintering: The process of heating crushed iron-ore dust and particles (fines) with coke breeze in an oxidising atmosphere to reduce sulphur content and produce a more effective and consistent charge for the blast furnaces. This process superseded the earlier method of charging the furnaces with iron-ore and coke, and led to greatly increased tonnages of iron being produced.
Sample glossary terms


17 December 2024

Open cultural data - an open GLAM perspective at the British Library

Drawing on work at and prior to the British Library, Digital Curator Mia Ridge shares a personal perspective on open cultural data for galleries, libraries, archives and museums (GLAMs) based on a recent lecture for students in Archives and Records Management…

Cultural heritage institutions face both exciting opportunities and complex challenges when sharing their collections online. This post gives common reasons why GLAMs share collections as open cultural data, and explores some strategic considerations behind making collections accessible.

What is Open Cultural Data?

Open cultural data includes a wide range of digital materials, from individual digitised or born-digital items – images, text, audiovisual records, 3D objects, etc. – to datasets of catalogue metadata, images or text, machine learning models and data derived from collections.

Open data must be clearly licensed for reuse, available for commercial and non-commercial use, and ideally provided in non-proprietary formats and standards (e.g. CSV, XML, JSON, RDF, IIIF).

Why Share Open Data?

The British Library shares open data for multiple compelling reasons.

Broadening Access and Engagement: by releasing over a million images on platforms like Flickr Commons, the Library has achieved an incredible 1.5 billion views. Open data allows people worldwide to experience wonder and delight with collections they might never physically access in the UK.

Deepening Access and Engagement: crowdsourcing and online volunteering provide opportunities for enthusiasts to spend time with individual items while helping enrich collections information. For instance, volunteers have helped transcribe complex materials like Victorian playbills, adding valuable contextual information.

Supporting Research and Scholarship: in addition to ‘traditional’ research, open collections support the development of reproducible computational methods including text and data mining, computer vision and image analysis. Institutions also learn more about their collections through formal and informal collaborations.

Creative Reuse: open data encourages artists to use collections, leading to remarkable creative projects including:

Animation featuring an octopus holding letters and parcels on a seabed with seaweed
Screenshot from Hey There Young Sailor (Official Video) - The Impatient Sisters

 

16 illustrations of girls in sad postures
'16 Very Sad Girls' by Mario Klingemann

 

A building with large-scale projection
The BookBinder, by Illuminos, with British Library collections

 

Some Lessons for Effective Data Sharing

Make it as easy as possible for people to find and use your open collections:

  • Tell people about your open data
  • Celebrate and highlight creative reuses
  • Use existing licences for usage rights where possible
  • Provide data in accessible, sustainable formats
  • Offer multiple access methods (e.g. individual items, datasets, APIs)
  • Invest effort in meeting the FAIR, and where appropriate, CARE principles

Navigating Challenges

Open data isn't without tensions. Institutions must balance potential revenue, copyright restrictions, custodianship and ethical considerations with the benefits of publishing specific collections.

Managing expectations can also be a challenge. The number of digitised or born-digital items available may be tiny in comparison to the overall size of collections. The quality of digitised records – especially items digitised from microfiche and/or decades ago – might be less than ideal. Automatic text transcription and layout detection errors can limit the re-usability of some collections.

Some collections might not be available for re-use because they are still in copyright (or are orphan works, where the creator is not known), were digitised by a commercial partner, or are culturally sensitive.

The increase in the number of AI companies scraping collections sites to train machine learning models has also given some institutions cause to re-consider their open data policies. Historical collections are more likely to be out of copyright and published for re-use, but they also contain structural prejudices and inequalities that could be embedded into machine learning models and generative AI outputs.

Conclusion

Open cultural data is more than just making collections available—it's about creating dynamic, collaborative spaces of knowledge exchange. By thoughtfully sharing our shared intellectual heritage, we enable new forms of research, inspiration and enjoyment.

 

AI use transparency statement: I recorded my recent lecture on my phone, then generated a loooong transcription on my phone. I then supplied the transcription and my key points to Claude, with a request to turn it into a blog post, then manually edited the results.

16 December 2024

Closing the language gap: automated language identification in British Library catalogue records

What do you do when you have millions of books and no record of the language they were written in? Collection Metadata Analyst Victoria Morris looks back to describe how she worked on this in 2020...

Context

In an age of online library catalogues, recording the language in which a book (or any other textual resource) is written is vital to library curators and users alike, as it allows them to search for resources in a particular language, and to filter search results by language.

As the graph below illustrates, although language information is routinely added to British Library catalogue records created as part of ongoing annual production, fewer than 30% of legacy records (from the British Library’s foundation catalogues) contain language information. As of October 2018, nearly 4.7 million records lacked any explicit language information. Of these, 78% were also lacking information about the country of publication, so it would not be possible to infer language from the place of publication.

Chart showing language of content records barely increasing over time

The question is: what can be done about this? In most cases, the language of the resource described can be immediately identified by a person viewing the book (or indeed the catalogue record for the book). With such a large number of books to deal with, though, it would be infeasible to start working through them one at a time ... an automated language identification process is required.

Language identification

Language identification (or language detection) refers to the process of determining the natural language in which a given piece of text is written. The texts analysed are commonly referred to as documents.

There are two possible avenues of approach: using either linguistic models or statistical models. Whilst linguistic models have the potential to be more realistic, they are also more complex, relying on detailed linguistic knowledge. For example, some linguistic models involve analysis of the grammatical structure of a document, and therefore require knowledge of the morphological properties of nouns, verbs, adjectives, etc. within all the languages of interest.

Statistical models are based on the analysis of certain features present within a training corpus of documents. These features might be words, character n-grams (sequences of n adjacent characters) or word n-grams (sequences of n adjacent words). These features are examined in a purely statistical, ‘linguistic-agnostic’ manner; words are understood as sequences of letter-like characters bounded by non-letter-like characters, not as words in any linguistic sense. When a document in an unknown language is encountered, its features can be compared to those of the training corpus, and a prediction can thereby be made about the language of the document.

Our project was limited to an investigation of statistical models, since these could be more readily implemented using generic processing rules.
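As a rough illustration of the kind of 'linguistic-agnostic' features described above (not the project's actual code; the example title is invented), words and character trigrams can be extracted from a title string like this:

```python
# Illustrative sketch: extracting simple statistical features (words and
# character trigrams) from a catalogue title, with no linguistic analysis.
import re

def word_features(text: str) -> list[str]:
    # "Words" are just runs of letter-like characters bounded by non-letters.
    return re.findall(r"\w+", text.lower())

def char_ngrams(text: str, n: int = 3) -> list[str]:
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

title = "Szerelem és más történetek"   # invented example title
print(word_features(title))            # ['szerelem', 'és', 'más', 'történetek']
print(char_ngrams(title)[:5])          # ['sze', 'zer', 'ere', 'rel', 'ele']
```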

What can be analysed?

Since the vast majority of the books lacking language information have not been digitised, the language identification had to be based solely on the catalogue record. The title, edition statement and series title were extracted from catalogue records, and formed the test documents for analysis.

Although there are examples of catalogue records where these metadata elements are in a language different to that of the resource being described (as in, for example, The Four Gospels in Fanti, below), it was felt that this assumption was reasonable for the majority of catalogue records.

A screenshot of the catalogue record for a book listed as 'The Four Gospels in Fanti'

Measures of success

The effectiveness of a language identification model can be quantified by the measures precision and recall; precision measures the ability of the model not to make incorrect language predictions, whilst recall measures the ability of the model to find all instances of documents in a particular language. In this context, high precision is of greater value than high recall, since it is preferable to provide no information about the language of content of a resource than to provide incorrect information.
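In the usual terms of true positives (TP), false positives (FP) and false negatives (FN) for a given language, these measures are defined as:

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
\]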

Various statistical models were investigated, with only a Bayesian statistical model based on analysis of words providing anything approaching satisfactory precision. This model was therefore selected for further development.

The Bayesian idea

Bayesian methods are based on a calculation of the probabilities that a book is written in each language under consideration. An assumption is made that the words present within the book title are statistically independent; this is obviously a false assumption (since, for example, adjacent words are likely to belong to the same language), but it allows application of the following proportionality:

\[
P(D \text{ is in language } l \mid D \text{ has features } f_1, \ldots, f_n) \;\propto\; P(D \text{ is in language } l) \prod_{i=1}^{n} P(\text{feature } f_i \text{ arises in language } l)
\]

The right-hand side of this proportionality can be calculated based on an analysis of the training corpus. The language of the test document is then predicted to be the language which maximises the above probability.

Because of the assumption of word-independence, this method is often referred to as naïve Bayesian classification.

What that means in practice is this: we notice that whenever the word ‘szerelem’ appears in a book title for which we have language information, the language is Hungarian. Therefore, if we find a book title which contains the word ‘szerelem’, but we don’t have language information for that book, we can predict that the book is probably in Hungarian.

Screenshot of catalogue entry with the word 'szerelem' in the title of a book
Szerelem: definitely a Hungarian word => probably a Hungarian title

If we repeat this for every word appearing in every title of each of the 12 million resources where we do have language information, then we can build up a model, which we can use to make predictions about the language(s) of the 4.7 million records that we’re interested in. Simple!
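To make the 'szerelem' example concrete, here is a minimal sketch of naïve Bayesian prediction over the words of a title. It is illustrative only: the priors and per-language word probabilities are invented, whereas in the real project they were estimated from the 12 million training records.

```python
# Illustrative sketch of naive Bayesian language prediction from title words.
# Priors and likelihoods below are invented for demonstration purposes.
import math

priors = {"English": 0.78, "Hungarian": 0.01}

# P(word appears in a title | language of the title)
likelihoods = {
    "English":   {"the": 5e-2, "szerelem": 1e-9, "és": 1e-9},
    "Hungarian": {"the": 1e-6, "szerelem": 1e-4, "és": 2e-3},
}

def predict_language(title_words):
    scores = {}
    for lang, prior in priors.items():
        # Work in log space to avoid underflow when multiplying many small probabilities.
        score = math.log(prior)
        for word in title_words:
            # Words never seen in this language's bucket get a tiny smoothing value.
            score += math.log(likelihoods[lang].get(word, 1e-12))
        scores[lang] = score
    return max(scores, key=scores.get), scores

print(predict_language(["szerelem", "és"]))  # -> ('Hungarian', {...})
```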

Training corpus

The training corpus was built from British Library catalogue records which contain language information. Records recorded as being in ‘Miscellaneous languages’, ‘Multiple languages’, ‘Sign languages’, ‘Undetermined’ and ‘No linguistic content’ were excluded. This yielded 12,254,341 records, of which 9,578,175 were for English-language resources. Words were extracted from the title, edition statement, and series title, and stored in a ‘language bucket’.

Words in English, Hungarian and Volapuk shown above the appropriate language 'bucket'

Language buckets were analysed in order to create a matrix of probabilities, whereby a number was assigned to each word-language pair (for all words encountered within the catalogue, and all languages listed in a controlled list) to represent the probability that that word belongs to that language. Selected examples are listed in the table below; the final row in the table illustrates the fact that shorter words tend to be common to many languages, and are therefore of less use than longer words in language identification.

  • …: {Telugu: 0.750, Somali: 0.250}
  • aaaarrgghh: {English: 1.000}
  • aaavfleeße: {German: 1.000}
  • aafjezatsd: {German: 0.333, Low German: 0.333, Limburgish: 0.333}
  • aanbidding: {Germanic (Other): 0.048, Afrikaans: 0.810, Low German: 0.048, Dutch: 0.095}
  • نبوغ: {Persian: 0.067, Arabic: 0.200, Pushto: 0.333, Iranian (Other): 0.333, Azerbaijani: 0.067}
  • metodicheskiĭ: {Russian: 0.981, Kazakh: 0.019}
  • nuannersuujuaannannginneranik: {Kalâtdlisut: 1.000}
  • karga: {Faroese: 0.020, Papiamento: 0.461, Guarani: 0.010, Zaza: 0.010, Esperanto: 0.010, Estonian: 0.010, Iloko: 0.176, Maltese: 0.010, Pampanga: 0.010, Tagalog: 0.078, Ladino: 0.137, Basque: 0.029, English: 0.010, Turkish: 0.029}
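For a sense of how such a matrix of word-language probabilities could be derived from the language buckets, here is a minimal, illustrative sketch (not the project's code; the training records are invented stand-ins):

```python
# Illustrative sketch: turning "language buckets" of title words into a
# per-word probability table like the examples above.
from collections import Counter, defaultdict

# (title words, language) pairs stand in for the 12 million training records.
training_records = [
    (["szerelem", "és", "más"], "Hungarian"),
    (["love", "and", "other"], "English"),
    (["aanbidding"], "Afrikaans"),
    (["aanbidding"], "Dutch"),
]

# Count how often each word appears under each language...
counts = defaultdict(Counter)
for words, language in training_records:
    for word in words:
        counts[word][language] += 1

# ...then normalise per word, so each row sums to 1 across languages.
probability_matrix = {
    word: {lang: n / sum(lang_counts.values()) for lang, n in lang_counts.items()}
    for word, lang_counts in counts.items()
}

print(probability_matrix["aanbidding"])  # {'Afrikaans': 0.5, 'Dutch': 0.5}
```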

Results

Precision and recall varied enormously between languages. Zulu, for instance, had 100% precision but only 20% recall; this indicates that all records detected as being in Zulu had been correctly classified, but that the majority of Zulu records had either been mis-classified, or no language prediction had been made. In practical terms, this meant that a prediction “this book is in Zulu” was a prediction that we could trust, but we couldn’t assume that we had found all of the Zulu books. Looking at our results across all languages, we could generate a picture (formally termed a ‘confusion matrix’) to indicate how different languages were performing (see below). The shaded cells on the diagonal represent resources where the language has been correctly identified, whilst the other shaded cells show us where things have gone wrong.

Language confusion matrix

The best-performing languages were Hawaiian, Malay, Zulu, Icelandic, English, Samoan, Finnish, Welsh, Latin and French, whilst the worst-performing languages were Shona, Turkish, Pushto, Slovenian, Azerbaijani, Javanese, Vietnamese, Bosnian, Thai and Somali.

Where possible, predictions were checked by language experts from the British Library’s curatorial teams. Such validation facilitated the identification of off-diagonal shaded areas (i.e. languages for which predictions should be treated with caution), and enabled acceptance thresholds to be set. For example, the model tends to over-predict English, in part due to the predominance of English-language material in the training corpus; thus the acceptance threshold for English was set at 100%: predictions of English would only be accepted if the model claimed that it was 100% certain that the language was English. For other languages, the acceptance threshold was generally between 95% and 99%.
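A minimal sketch of how per-language acceptance thresholds like these might be applied (illustrative only; the threshold values simply follow the ranges described above):

```python
# Illustrative sketch: only accept a language prediction if the model's
# confidence clears a per-language acceptance threshold.
DEFAULT_THRESHOLD = 0.95
THRESHOLDS = {"English": 1.00}   # English is over-predicted, so require certainty

def accept_prediction(language: str, confidence: float) -> bool:
    """Return True if the predicted language code should be written to the record."""
    return confidence >= THRESHOLDS.get(language, DEFAULT_THRESHOLD)

print(accept_prediction("English", 0.98))    # False: below the 100% bar for English
print(accept_prediction("Hungarian", 0.98))  # True: clears the default threshold
```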

Outcomes

Two batches of records have been completed to date. In the first batch, language codes were assigned to 1.15 million records with 99.7% confidence; in the second batch, a further 1 million language codes were assigned with 99.4% confidence. Work on a third batch is currently underway, and it is hoped to achieve at least a further million language code assignments. The graph below shows the impact that this project is having on the British Library catalogue.

Graph showing improvement in the number of 'foundation catalogue' records with languages recorded

The project has already been well-received by Library colleagues, who have been able to use the additional language coding to assist them in identifying curatorial responsibilities and better understanding the collection.

Further reading

For a more in-depth, mathematical write-up of this project, please see a paper written for Cataloging & Classification Quarterly, which is available at: https://doi.org/10.1080/01639374.2019.1700201, and is also in the BL research repository at https://bl.iro.bl.uk/work/6c99ffcb-0003-477d-8a58-64cf8c45ecf5.