Digital scholarship blog


09 July 2025

A Geographer’s Initiation Into Digital Humanities: Part 1

A post by Dr Huw Rowlands on his Coleridge Fellowship 2025, 'Cross-cultural Encounters in the Survey of India in the Mid-nineteenth Century'.

To begin at the beginning. 

My 2021 doctoral thesis focused on cross-cultural encounters in Aotearoa – New Zealand. I started with an overview of the 18th century voyage of the Endeavour, led by James Cook, to Te Moana nui a Kiwa – the Pacific Ocean. I went on to examine the histories that continue to be created about them in official reports, academic research, museum exhibitions, and documentary film. 

Since then, I have been working with the many thousands of maps produced by the Survey of India held in the India Office Records (IOR) Map Collection. I soon became aware of the virtual invisibility of work by the Indian, Burmese and other staff on the maps themselves. With this tucked away at the back of my mind, I have followed my curiosity about digital humanities through British Library and other seminars and workshops, and actively followed the Library’s work on its Race Equality Action Plan. When I came across three series of printed annual reports produced by Survey of India Survey Parties, which listed all survey staff, including those they called ‘Native Surveyors’, these strands quickly came together in my mind and eventually led to my Coleridge Fellowship proposal. The Coleridge Fellowship offers British Library staff the opportunity to pursue a piece of original research and further understanding of the Library’s collections. It was established in 2017 through the generosity of Professor Heather Jackson and her late husband Professor J.R. de J. Jackson, and is named after Samuel Taylor Coleridge (1772-1834).

My aims with the Fellowship are to show the opportunities in the IOR Map Collection to identify a range of individuals involved in mapping what is called in the reports ‘British India’, to learn and demonstrate how data can be extracted and managed, and to reveal its potential in understanding cross-cultural relationships in this context. 

Black Boxes 

With great support from the Library’s Digital Research and Heritage Made Digital teams among others, particularly Harry Lloyd, Mia Ridge, and Valentina Vavassori, I drew up a plan for the project. The first step was to evaluate the series of reports and choose one set. The next stages are focused on digital methods: firstly to acquire and verify digital images of the chosen reports, use OCR (Optical Character Recognition) to create text files, extract and structure the information I need from them, and lastly visualise the information to create a foundation to help answer my research questions. Each of these stages looked to me like a black box – something clear and present but whose internal workings are a bit of a mystery. At an early planning meeting with the team, we started to explore each black box stage. Black boxes were unpacked onto three white boards: Inputs/Sources, Process, and Results. These initial sketches have become the foundations of my detailed research plan for the digital stages of the project. 

photo of a whiteboard with text and sketches of information needed from the source documents
One of the whiteboards from our first digital planning meeting

Potentially hidden away in or between each black box were what Mia called ‘magic elves’, imaginary creatures who undertake essential but unresourced tasks such as converting information from one form to another. We unpacked the boxes and set out a series of smaller steps, banishing numerous phantom elves.  

My work is currently focused on learning the skills needed to achieve each smaller step. I have been getting to grips with the OCR application Transkribus, ably guided by Valentina. Crucial to making the most of such tools is referring forwards to the next digital stage and its own tools, as well as backwards to my research questions. In doing so, the image of a series of discrete black boxes has now given way to a relay race, passing a baton of information on from one stage to the next. The way I use one tool can make the transition onto the next easier or harder. So, while I am firmly focused on Transkribus, Harry has been guiding me through the stage that follows, so that the data baton can be passed on as smoothly as possible. 

Digitised page of a survey report showing numbered paragraphs and an inset list of members of the topographical party
Digital image before uploading to Transkribus

As well as relying on some unsophisticated metaphors, my vocabulary has been changing, with some new words and some old words taking on different, or more specific, meanings. Regions and tags are two examples from Transkribus. Regions are a way of segregating areas of the original image so that Transkribus organises the text into separate sections. I have been using the pre-existing Heading and Marginalia, for example, and have added a new Region, Credit, where staff are credited with work undertaken during the year. Using regions should help the data extraction stage by enabling me to focus on areas of text where the data most useful for my research questions is to be found. Tags label individual words or phrases as entities such as People, Places and Organisations. ‘Tag’ is a short word, but using tags involves a careful examination of what I need to tag and why, as well as consideration of each tag’s attributes. Transkribus’ default Person tag, for example, includes the attributes First Name, Last Name and dates of Birth and Death. To track promotion over time, I have added a new attribute – Title. Tagging is an intriguing, interpretive process and I expect to have more to say about it later in the project. 
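To give a concrete (and entirely invented) picture of where this is heading, the sketch below shows how one tagged credit might look once the transcription has been exported and parsed into a structured record; the field names and values are illustrative assumptions, not Transkribus's actual export schema.

```python
# A hypothetical structured record for one tagged credit, as it might look
# after exporting the Transkribus transcription and parsing it.
# All field names and values here are illustrative, not the real export schema.
credit_entry = {
    "region": "Credit",                      # the custom region added for staff credits
    "person": {
        "title": "Sub-Assistant Surveyor",   # the new Title attribute on the Person tag
        "first_name": "",                    # often absent in the printed reports
        "last_name": "Example",              # placeholder surname
    },
    "places": ["Example district"],          # Place tags found in the same credit
    "organisation": "Survey of India",
    "report_year": 1860,                     # illustrative year
}

# Grouping such records by person would make it possible to follow titles,
# and so promotions, across successive annual reports.
print(credit_entry["person"]["title"])
```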

Screenshot of a printed page with sections outlined, and names from the page set out in the Transkribus tool
Transkribus screenshot showing regions applied to the digital image on the left, and the tagged transcription on the right.

As I move onto the data extraction stage, I will no doubt be acquiring and understanding more vocabulary. I have so far spotted entities, triples, NLP, Python, LLM, and NER, to name a few. I also expect to need a new metaphor or two. 
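As a small taste of what one of those terms (NER, named entity recognition) involves, here is a minimal sketch using the spaCy library on an invented sentence in the style of a report credit; the model, the sentence and the resulting labels are assumptions rather than project outputs.

```python
# Minimal named entity recognition (NER) sketch with spaCy.
# Setup (assumed): pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # a small, general-purpose English model

# An invented sentence in the style of a survey report credit.
text = "The triangulation of the Example district was completed by Sub-Assistant Surveyor A. Example."

doc = nlp(text)
for ent in doc.ents:
    # Each entity carries the matched text and a label such as PERSON or GPE (place).
    print(ent.text, ent.label_)
```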

Dr Huw Rowlands 

British Library Coleridge Fellow 2025 

Processing Coordinator and Cataloguer 

India Office Records Map Project 

17 June 2025

The Digital Research team at DH2025

Several of the Digital Research team had proposals accepted and will be attending the Digital Humanities 2025 conference in Lisbon. To help get conversations started, we’ve compiled some information about the work we’ll be discussing below.

We’ll be on social media – Mastodon (@[email protected], @[email protected], @[email protected],  @[email protected]) and BlueSky (@bldigischol.bsky.social, @adi-keinan.bsky.social, @miaout.bsky.social, @universalviewer.io) – and we’re looking forward to talking to people there!

In order of appearance…

On July 16, Digital Curator Adi Keinan-Schoonbaert is presenting on ‘Digital Humanities and Environmental Sustainability at the British Library’:

‘In this paper, Adi will explore a heritage organisation’s journey into digital sustainability, looking at the British Library as a case study. She’ll discuss initiatives aimed at increasing literacy and capacity building, both within the Library but also externally, fostering personal agency, and encouraging action using both bottom-up and top-down approaches. Framing this within the context of the Library’s Sustainability and Climate Change Strategy, Adi will examine the role of internal capacity-building efforts—including staff-led networks, targeted training, and collaborative workshops such as those with the Digital Humanities Climate Coalition—in promoting sustainable digital literacy and embedding environmentally conscious decision-making across the organisation.’

The 'Future of Digital Sustainability' workshop, as part of the 'Discover Digital Sustainability' training series
The 'Future of Digital Sustainability' workshop, as part of the 'Discover Digital Sustainability' training series

On July 17, our Universal Viewer team - Lanie Okorodudu, Saira Akhter, James Misson and Erin Burnand – and Digital Curator Mia Ridge are sharing their work in ‘Radically inclusive software development for digital cultural heritage’. The Universal Viewer is a community-developed open source project on a mission to help share digital collections. Fresh from community sprints focused on improving the developer experience, the team will share:

‘Sustaining open source software can be challenging. We discuss collaboration on the Universal Viewer (UV), software designed to display cultural heritage collections. We highlight methods including innovative, inclusive and multi-institution sprints. We showcase UV’s evolution, including accessibility and user experience enhancements, future plans and ways for others to contribute.’

We might also attend the 'Decade of IIIF' panel on Friday.

Sally Chambers contributed to a group poster, ‘Computational Literary Studies Infrastructure (CLS INFRA): Leveraging Literary Methods for FAIR(er) Science’, shown on July 18.

In the final session of the conference on July 18, Mia is part of a panel, Openness in GLAM: Analysing, Reflecting, and Discussing Global Case Studies, with Nadezhda Povroznik, Paul L. Arthur, T. Leo Cao, Samantha Callaghan and Luis Ramos Pinto:

‘This panel explores diverse dimensions of openness within the galleries, libraries, archives and museums (GLAM) sector globally, shaping discussions about accessibility, inclusivity, participation, and knowledge democratisation. Cultural heritage institutions are responsible “to all citizens”. Yet there are gaps relating to collections, knowledge, policy, technology, engagement, IP, ethics, infrastructure and AI.’

Mia is particularly interested in ‘the Paradoxes of Open Data in Libraries, Archives and Museums’, including:

  • The lack of robust, sector-wide shared infrastructure providing long-term access to GLAM collections, despite decades of evidence for its value and the difficulties many institutions have in maintaining individual repositories
  • The tension between making data open for exploration and re-use (including scraping by generative AI companies), while respecting copyright and the right of creators to receive income from their writing, art, music, etc.
  • Balancing the FAIR principles - making open collections Findable, Accessible, Interoperable and Reusable - with the CARE principles for Indigenous Data Governance, to support Indigenous people in “asserting greater control over the application and use of Indigenous data and Indigenous Knowledge for collective benefit” (Global Indigenous Data Alliance, 2018). Operationalizing the CARE principles might require an investment of time in building relationships and trust with Indigenous communities before releasing open data - or perhaps choosing to keep data closed in some ways - that counters the urge for speed. What changes are required for organisations to meaningfully address the CARE principles, and what can individual staff do if resources to invest in community relationships aren’t available?
  • The need for financial models to fund collections digitisation that don't rely on individual users paying for access to collections and the overhead required to provide evidence for the use and impact of open data

13 March 2025

Fantastic Futures 2025 (FF2025) Call for Proposals

Fantastic Futures 2025: AI Everywhere, All at Once 

AI4LAM’s annual conference, December 3 – 5, 2025, British Library, London

The British Library and the Programme Committee for the Fantastic Futures 2025 conference are delighted to invite proposals for presentations and workshops for the Fantastic Futures 2025 conference.  

Fantastic Futures is the annual conference for the AI4LAM (Artificial Intelligence, for Libraries, Archives, Museums) community. Submissions are invited from colleagues around the world about organisations, collections, interest and experience with Artificial Intelligence (AI) and Machine Learning (ML) technologies applied to or developed with cultural, research and heritage collections. This includes practitioners in the GLAM (Galleries, Libraries, Archives, Museums) sector and Digital Humanities, Arts and Social Sciences, Data, Information and Computer Science researchers in Higher Education. 

Key information

  • Call for proposals shared: Thursday 13 March 2025 
  • Conference submission form opens: May 2025
  • Proposal submission deadline: midnight anywhere, Sunday 1 June 2025 
  • Notification of acceptance: 25 July 2025 
  • Conference dates: December 3 – 5, 2025 
  • Location: British Library, London, onsite – with some livestreams and post-event videos 

FF2025 Theme: AI Everywhere, All at Once 

We invite presentations on the theme of ‘AI Everywhere, All at Once’. While AI has a long history in academia and practice, the release of public language models like ChatGPT propelled AI into public consciousness. The sudden appearance of AI ‘tools’ in the software we use every day, government consultations on AI and copyright, and the hype about Artificial Intelligence mean that libraries, museums and archives must understand what AI means for them. Should they embrace it, resist it or fear it? How does it relate to existing practices and services, how can it help or undermine staff, and how do we keep up with rapid changes in the field? 

There are many opportunities and many challenges in delivering AI that creates rich, delightful and immersive experiences of GLAM collections and spaces for the public, and meets the needs of researchers for relevant, reliable and timely information. Challenges range from the huge – environmental and economic sustainability, ensuring alignment with our missions, ethical and responsible AI, human-centred AI, ensuring value for money – to the practical – evaluation, scalability, cyber security, multimodal collections – and throughout it all, managing the pace of change. 

Our aim is to promote interdisciplinary conversations that foster broader understandings of AI methods, practices and technologies and enable critical reflections about collaborative approaches to research and practice. 

Themes   

We’re particularly interested in proposals that cover these themes:   

  • Ethical and Responsible AI 
  • Human-Centred AI / the UX of AI 
  • Trust, AI literacy and society 
  • Building AI systems for and with staff and users 
  • Cyber-security and resilience 
  • Interoperability and standards
  • FAIR, CARE, rights and copyright
  • Benchmarking AI / machine learning 
  • Regional, national, international approaches to AI 
  • Environmental sustainability  

Formats for presentations (Thursday, Friday December 4-5) 

  • Lightning talk: 5 mins. These might pitch an idea, call for collaborators, throw out a provocation or just provide a short update 
  • Poster: perfect for project updates – what went well, what would you do differently, what lessons can others take? 
  • Short presentation: 15 mins   
  • Long presentation: 30 mins 
  • Panel: 45 mins, multiple presenters with short position statements then discussion 

Formats for workshops or working group sessions (Wednesday December 3) 

  • Formal, instructor-led sessions, including working groups, tutorials, hands-on workshops – 1 or 2 hours 
  • Informal, unstructured sessions, including unconferences, meetups, hacking – 1 or 2 hours 
  • Digital showcase (demo): 30 mins 

We value the interactions that an in-person event enables, so the default mode for this event is in-person presentations. However, if your proposal is accepted for inclusion in the conference but you are not able to travel to London, we can consider arrangements for making a virtual presentation on a case-by-case basis. Please contact the Programme Committee at [email protected] to discuss. 

The conference will be held over three days: one day of workshops and other events, and two days of formal sessions. The social programme will include opportunities for informal networking.  

Plenary sessions on Thursday and Friday will be livestreamed, recorded and published. 

Find out more and get updates 

  • Check the AI4LAM FF2025 page for updates, including Frequently Asked Questions and information on our review criteria
  • Organisers: Rossitza Atanassova, Neil Fitzgerald and Mia Ridge, British Library

Further details about the conference submission process and registration will be supplied soon. 

This post was last updated 16 May 2025.

23 December 2024

AI (and machine learning, etc) with British Library collections

Machine learning (ML) is a hot topic, especially when it’s hyped as ‘AI’. How might libraries use machine learning / AI to enrich collections, making them more findable and usable in computational research? Digital Curator Mia Ridge lists some examples of external collaborations, internal experiments and staff training with AI / ML and digitised and born-digital collections.

Background

The trust that the public places in libraries is hugely important to us - all our 'AI' should be 'responsible' and ethical AI. The British Library was a partner in Sheffield University's FRAIM: Framing Responsible AI Implementation & Management project (2024). We've also used lessons from the projects described here to draft our AI Strategy and Ethical Guide.

Many of the projects below have contributed to our Digital Scholarship Training Programme and our Reading Group has been discussing deep learning, big data and AI for many years. It's important that libraries are part of conversations about AI, supporting AI and data literacy and helping users understand how ML models and datasets were created.

If you're interested in AI and machine learning in libraries, museums and archives, keep an eye out for news about the AI4LAM community's Fantastic Futures 2025 conference at the British Library, 3-5 December 2025. The conference themes have been published and the Call for Proposals will be open soon.

You can also watch public events we've previously hosted on AI in libraries (January 2025) and Safeguarding Tomorrow: The Impact of AI on Media and Information Industries (February 2024).

Using ML / AI tools to enrich collections

Generative AI tends to get the headlines, but at the time of writing, tools that use non-generative machine learning to automate specific parts of a workflow have more practical applications for cultural heritage collections. That is, 'AI' is currently more process than product.

Text transcription is a foundational task that makes digitised books and manuscripts more accessible to search, analysis and other computational methods. For example, oral history staff have experimented with speech transcription tools, raising important questions and surfacing theoretical and practical issues around automatic speech recognition (ASR) tools and chatbots.

We've used Transkribus and eScriptorium to transcribe handwritten and printed text in a range of scripts and alphabets.

Creating tools and demonstrators through external collaborations

Mining the UK Web Archive for Semantic Change Detection (2021)

This project used word vectors with web archives to track words whose meanings changed over time. Resources: DUKweb (Diachronic UK web) and blog post ‘Clouds and blackberries: how web archives can help us to track the changing meaning of words’.

Graphs showing how words associated with the words blackberry, cloud, eta and follow changed over time.
From blackberries to clouds... word associations change over time

Living with Machines (2018-2023)

Our Living With Machines project with The Alan Turing Institute pioneered new AI, data science and ML methods to analyse masses of newspapers, books and maps to understand the impact of the industrial revolution on ordinary people. Resources: short video case studies, our project website, final report and over 100 outputs in the British Library's Research Repository.

Project outputs used AI / machine learning / data science methods such as lexicon expansion, computer vision, classification and word embeddings.

Tools and demonstrators created via internal pilots and experiments

Many of these examples were enabled by the skills and enthusiasm for ML experiments of the Library's Research Software Engineers and the Living with Machines (LwM) team, combined with long-term Library staff knowledge of collection records and processes.

British Library resources for re-use in ML / AI

Our Research Repository includes datasets suitable for ground truth training, including 'Ground truth transcriptions of 18th & 19th century English language documents relating to botany from the India Office Records'. 

Our ‘1 million images’ on Flickr Commons have inspired many ML experiments.

The Library has also shared models and datasets for re-use on the machine learning platform Hugging Face.

18 December 2024

The challenges of AI for oral history: theoretical and practical issues

Oral History Archivist Charlie Morgan provides examples of how AI-based tools integrated into workflows might affect oral historians' consideration of orality and silence in the second of two posts on a talk he gave with Digital Curator Mia Ridge at the 7th World Conference of the International Federation for Public History in Belval, Luxembourg. His first post proposed some key questions for oral historians thinking about AI, and shared examples of automatic speech recognition (ASR) tools in practice. 

While speech to text once seemed at the cutting edge of AI, software designers are now eager to include additional functions. Many incorporate their own chatbots or other AI ‘helpers’ and the same is true of ‘standard’ software. Below you can see what happened when I asked the chatbots in Otter and Adobe Acrobat some questions about other transcribed clips from the ‘Lives in Steel’ CD:

Screenshot of search and chatbot interactions with transcribed text
A composite image of chatbot responses to questions about transcribed clips

In Otter, the chatbot does well at answering a question on sign language but fails to identify the accent or dialect of the speaker. This is a good reminder of the limits of these models and how, without any contextual information, they cannot understand the interview beyond textual analysis. Oral historians in the UK have long understood interviews as fundamentally oral sources and current AI models risk taking us away from this.

In Adobe I tried asking a much more subjective question around emotion in the interview. While the chatbot does answer, it is again worth remembering the limits of this textual analysis, which, for example, could not identify crying, laughter or pitch change as emotion. It would also not understand the significance of any periods of silence. On our panel at the IFPH2024 conference in Luxembourg Dr Julianne Nyhan noted how periods of silence tend to lead speech-to-text models to ‘hallucinate’ so the advice is to take them out; the problem is that oral history has long theorised the meaning and importance of silence.

Alongside the chatbot, Adobe also includes a degree of semantic searching where a search for steel brings up related words. This in itself might be the biggest gift new technologies offer to catalogue searching (shown expertly in Placing the Holocaust) – helping us to move away from what Mia Ridge calls ‘the tyranny of the keyword’.

However, the important thing is perhaps not how well these tools perform but the fact they exist in the first place. Oral historians and archivists who, for good reasons, are hesitant about integrating AI into their work might soon find it has happened anyway. For example, Zencastr, the podcasting software we have used since 2020 for remote recordings, now has an in-built AI tool. Robust principles on the use of AI are essential then not just for new projects or software, but also for work we are already doing and software we are already using.

The rise of AI in oral history raises theoretical questions around orality and silence, but must also be considered in terms of practical workflows: Do participation and recording agreements need to be amended? How do we label AI-generated metadata in catalogue records, and should we be labelling human-generated metadata too? Do AI tools change the risks and rewards of making oral histories available online? We can only answer these questions through critical engagement with the tools themselves.

The challenges of AI for oral history: key questions

In the first of two posts, Oral History Archivist Charlie Morgan sets out some key questions for oral historians thinking about AI and shares some examples of automatic speech recognition (ASR) tools in practice...

Oral history has always been a technologically mediated discipline and so has not been immune to the current wave of AI hype. Some have felt under pressure to ‘do some AI’, while others have gone ahead and done it. In the British Library oral history department, we have been adamant that any use of AI must align practically, legally and ethically with the Library’s AI principles (currently in draft form). While the ongoing effects of the 2023 cyber-attack have also stymied any integration of new technologies into archival workflows, we have begun to experiment with some tools. In September, I was pleased to present on this topic with Digital Curator Mia Ridge at the 7th World Conference of the International Federation for Public History in Belval, Luxembourg. Below is a summary of what I spoke about in our presentation, ‘Listening with machines? The challenges of AI for oral history and digital public history in libraries’.

The ‘boom’ in AI and oral history has mostly focussed on speech recognition and transcription, driven by the release of Trint (2014) and Otter (2016), but especially Whisper (2022). There have also been investigations into indexing, summarising and visualisation, notably from the Congruence Engine project. Oral historians are interested in how AI tools could help with documentation and analysis but many also have concerns. Concerns include, but are not limited to, ownership, data protection/harvesting, labour conditions, environmental costs, loss of human involvement, unreliable outputs and inbuilt biases.

For those of us working with archived collections there are specific considerations: How do we manage AI generated metadata? Should we integrate new technologies into catalogue searching? What are the ethics of working at scale and do we have the experience to do so? How do we factor in interviewee consent, especially since speakers in older collections are now likely dead or uncontactable?

With speech recognition, we are now at a point where we can compare different automated transcripts created at different times. While our work on this topic at the British Library has been minimal, future trials might help us build up enough research data to address the above questions.

Robert Gladders was interviewed by Alan Dein for the National Life Stories oral history project ‘Lives in Steel’ in 1991 and the extract below was featured on the 1993 published CD ‘Lives in Steel’.

The full transcripts for this audio clip are at the end of this post.

Sign Language

We can compare the human transcript of the first line with three automatic speech recognition (ASR) transcripts:

  • Human: Sign language was for telling the sample to the first hand, what carbon the- when you took the sample up into the lab, you run with the sample to the lab​
  • Otter 2020: Santa Lucia Chelan, the sound pachala fest and what cabin the when he took the sunlight into the lab, you know they run with a sample to the lab​
  • Otter 2024: Sign languages for selling the sample, pass or the festa and what cabin the and he took the samples into the lab. Yet they run with a sample to the lab.
  • Whisper 2024: The sand was just for telling the sand that they were fed down. What cabin, when he took the sand up into the lab, you know, at the run with the sand up into the lab

Gladders speaks with a heavy Middlesbrough accent and in all cases the ASR models struggle, but the improvements between 2020 and 2024 are clear. In this case, Otter in 2024 seems to outperform Whisper (‘The sand’ is an improvement on ‘Santa Lucia Chelan’ but it isn’t ‘Sign languages’), but this was a ‘small’ version of Whisper and larger models might well perform better.
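One way to put a number on improvements like these is word error rate (WER): the minimum number of word insertions, deletions and substitutions needed to turn the automated transcript into the human reference, divided by the length of the reference. Below is a self-contained sketch applied to shortened versions of the transcripts above; treat the exact figures as illustrative.

```python
# Word error rate (WER): word-level edit distance divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref = reference.lower().replace(",", "").split()
    hyp = hypothesis.lower().replace(",", "").split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

human = "Sign language was for telling the sample to the first hand"
otter_2024 = "Sign languages for selling the sample pass or the festa"
whisper_2024 = "The sand was just for telling the sand that they were fed down"

print(f"Otter 2024 WER:   {wer(human, otter_2024):.2f}")
print(f"Whisper 2024 WER: {wer(human, whisper_2024):.2f}")
```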

One interesting point of comparison is how the models handle ‘sample passer’, mentioned twice in the short extract:

  • Otter 2020: Sentinel pastor / sound the password​
  • Otter 2024: Salmon passer / Saturn passes​
  • Whisper 2024: Santland pass / satin pass

While in all cases the models fail, this would be easy to fix. The aforementioned CD came with its own glossary, which we could feed into a large language model working on these transcriptions. Practically this is not difficult, but it raises some larger questions. Do we need to produce tailored lexicons for every collection? This is time-consuming work, so who is going to do it? Would we label an automated transcript in 2024 that makes use of a human glossary written in 1993 as machine generated, human generated, or both? Moreover, what level of accuracy are we willing to accept, and how do we define accuracy itself?
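One low-tech alternative to a large language model, sketched below, is simply to flag transcript words that are close matches for glossary terms using Python's standard difflib; the glossary entries come from the CD's glossary shown below, while the matching threshold and transcript snippet are assumptions.

```python
# Flag ASR output words that look like mangled versions of glossary terms.
# A deliberately simple alternative to feeding the glossary to an LLM.
import difflib

# Terms from the 1993 CD glossary (see below).
glossary = ["samplepasser", "sampling", "sintering"]

# The Otter 2024 renderings of 'sample passer' from the extract above.
asr_output = "Salmon passer and Saturn passes"

for word in asr_output.lower().split():
    # Suggest the closest glossary term if it is similar enough.
    matches = difflib.get_close_matches(word, glossary, n=1, cutoff=0.6)
    if matches and matches[0] != word:
        print(f"'{word}' may be the glossary term '{matches[0]}'")
```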

 

  • Samplepasser: The top man on the melting shop with responsibility for the steel being refined.
  • Sampling: The act of taking a sample of steel from a steel furnace, using a long-handled spoon which is inserted into the furnace and withdrawn.
  • Sintering: The process of heating crushed iron-ore dust and particles (fines) with coke breeze in an oxidising atmosphere to reduce sulphur content and produce a more effective and consistent charge for the blast furnaces. This process superseded the earlier method of charging the furnaces with iron-ore and coke, and led to greatly increased tonnages of iron being produced.
Sample glossary terms


17 December 2024

Open cultural data - an open GLAM perspective at the British Library

Drawing on work at and prior to the British Library, Digital Curator Mia Ridge shares a personal perspective on open cultural data for galleries, libraries, archives and museums (GLAMs) based on a recent lecture for students in Archives and Records Management…

Cultural heritage institutions face both exciting opportunities and complex challenges when sharing their collections online. This post gives common reasons why GLAMs share collections as open cultural data, and explores some strategic considerations behind making collections accessible.

What is Open Cultural Data?

Open cultural data includes a wide range of digital materials, from individual digitised or born-digital items – images, text, audiovisual records, 3D objects, etc. – to datasets of catalogue metadata, images or text, machine learning models and data derived from collections.

Open data must be clearly licensed for reuse, available for commercial and non-commercial use, and ideally provided in non-proprietary formats and standards (e.g. CSV, XML, JSON, RDF, IIIF).
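As a small illustration of what 'non-proprietary formats' means in practice, the sketch below writes the same invented record as JSON and CSV using only the Python standard library; the field names are assumptions rather than any particular catalogue schema.

```python
# Serialise one invented collection record in two open, non-proprietary formats.
import csv
import json

record = {
    "identifier": "example-0001",                  # hypothetical identifier
    "title": "An Example of Open Cultural Data",   # invented title
    "language": "English",
    "licence": "CC0 1.0",                          # an explicit, machine-readable licence
}

# JSON: convenient for APIs and nested data.
with open("record.json", "w", encoding="utf-8") as f:
    json.dump(record, f, indent=2, ensure_ascii=False)

# CSV: convenient for spreadsheets and bulk tabular datasets.
with open("record.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(record))
    writer.writeheader()
    writer.writerow(record)
```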

Why Share Open Data?

The British Library shares open data for multiple compelling reasons.

Broadening Access and Engagement: by releasing over a million images on platforms like Flickr Commons, the Library has achieved an incredible 1.5 billion views. Open data allows people worldwide to experience wonder and delight with collections they might never physically access in the UK.

Deepening Access and Engagement: crowdsourcing and online volunteering provide opportunities for enthusiasts to spend time with individual items while helping enrich collections information. For instance, volunteers have helped transcribe complex materials like Victorian playbills, adding valuable contextual information.

Supporting Research and Scholarship: in addition to ‘traditional’ research, open collections support the development of reproducible computational methods including text and data mining, computer vision and image analysis. Institutions also learn more about their collections through formal and informal collaborations.

Creative Reuse: open data encourages artists to use collections, leading to remarkable creative projects including:

Animation featuring an octopus holding letters and parcels on a seabed with seaweed
Screenshot from Hey There Young Sailor (Official Video) - The Impatient Sisters

 

16 illustrations of girls in sad postures
'16 Very Sad Girls' by Mario Klingemann

 

A building with large-scale projection
The BookBinder, by Illuminos, with British Library collections

 

Some Lessons for Effective Data Sharing

Make it as easy as possible for people to find and use your open collections:

  • Tell people about your open data
  • Celebrate and highlight creative reuses
  • Use existing licences for usage rights where possible
  • Provide data in accessible, sustainable formats
  • Offer multiple access methods (e.g. individual items, datasets, APIs)
  • Invest effort in meeting the FAIR and, where appropriate, CARE principles

Navigating Challenges

Open data isn't without tensions. Institutions must balance potential revenue, copyright restrictions, custodianship and ethical considerations with the benefits of publishing specific collections.

Managing expectations can also be a challenge. The number of digitised or born-digital items available may be tiny in comparison to the overall size of collections. The quality of digitised records – especially items digitised from microfiche and/or decades ago – might be less than ideal. Automatic text transcription and layout detection errors can limit the re-usability of some collections.

Some collections might not be available for re-use because they are still in copyright (or are orphan works, where the creator is not known), were digitised by a commercial partner, or are culturally sensitive.

The increase in the number of AI companies scraping collections sites to train machine learning models has also given some institutions cause to re-consider their open data policies. Historical collections are more likely to be out of copyright and published for re-use, but they also contain structural prejudices and inequalities that could be embedded into machine learning models and generative AI outputs.

Conclusion

Open cultural data is more than just making collections available—it's about creating dynamic, collaborative spaces of knowledge exchange. By thoughtfully sharing our shared intellectual heritage, we enable new forms of research, inspiration and enjoyment.

 

AI use transparency statement: I recorded my recent lecture on my phone, then generated a loooong transcription on my phone. I then supplied the transcription and my key points to Claude, with a request to turn it into a blog post, then manually edited the results.

16 December 2024

Closing the language gap: automated language identification in British Library catalogue records

What do you do when you have millions of books and no record of the language they were written in? Collection Metadata Analyst Victoria Morris looks back to describe how she worked on this in 2020...

Context

In an age of online library catalogues, recording the language in which a book (or any other textual resource) is written is vital to library curators and users alike, as it allows them to search for resources in a particular language, and to filter search results by language.

As the graph below illustrates, although language information is routinely added to British Library catalogue records created as part of ongoing annual production, fewer than 30% of legacy records (from the British Library’s foundation catalogues) contain language information. As of October 2018, nearly 4.7 million records lacked any explicit language information. Of these, 78% also lacked information about the country of publication, so it would not be possible to infer language from the place of publication.

Chart showing language of content records barely increasing over time

The question is: what can be done about this? In most cases, the language of the resource described can be immediately identified by a person viewing the book (or indeed the catalogue record for the book). With such a large number of books to deal with, though, it would be infeasible to start working through them one at a time ... an automated language identification process is required.

Language identification

Language identification (or language detection) refers to the process of determining the natural language in which a given piece of text is written. The texts analysed are commonly referred to as documents.

There are two possible avenues of approach: using either linguistic models or statistical models. Whilst linguistic models have the potential to be more realistic, they are also more complex, relying on detailed linguistic knowledge. For example, some linguistic models involve analysis of the grammatical structure of a document, and therefore require knowledge of the morphological properties of nouns, verbs, adjectives, etc. within all the languages of interest.

Statistical models are based on the analysis of certain features present within a training corpus of documents. These features might be words, character n-grams (sequences of n adjacent characters) or word n-grams (sequences of n adjacent words). These features are examined in a purely statistical, ‘linguistic-agnostic’ manner; words are understood as sequences of letter-like characters bounded by non-letter-like characters, not as words in any linguistic sense. When a document in an unknown language is encountered, its features can be compared to those of the training corpus, and a prediction can thereby be made about the language of the document.
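For example, a minimal sketch of extracting the two kinds of feature mentioned above, words and character n-grams, from a single title (the title itself is invented):

```python
# Extract simple 'linguistic-agnostic' features from a title.
import re

def words(text: str) -> list[str]:
    # 'Words' are just runs of letter-like characters bounded by
    # non-letter-like characters; no linguistic meaning is attached.
    return re.findall(r"\w+", text.lower())

def char_ngrams(text: str, n: int = 3) -> list[str]:
    # Overlapping character n-grams, e.g. 'sze', 'zer', 'ere', ...
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

title = "Szerelem és más történetek"   # an invented Hungarian-looking title
print(words(title))
print(char_ngrams("szerelem"))
```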

Our project was limited to an investigation of statistical models, since these could be more readily implemented using generic processing rules.

What can be analysed?

Since the vast majority of the books lacking language information have not been digitised, the language identification had to be based solely on the catalogue record. The title, edition statement and series title were extracted from catalogue records, and formed the test documents for analysis.

Although there are examples of catalogue records where these metadata elements are in a language different to that of the resource being described (as in, for example, The Four Gospels in Fanti, below), it was felt that the assumption that they share the resource’s language was reasonable for the majority of catalogue records.

A screenshot of the catalogue record for a book listed as 'The Four Gospels in Fanti'

Measures of success

The effectiveness of a language identification model can be quantified by the measures precision and recall; precision measures the ability of the model not to make incorrect language predictions, whilst recall measures the ability of the model to find all instances of documents in a particular language. In this context, high precision is of greater value than high recall, since it is preferable to provide no information about the language of content of a resource than to provide incorrect information.
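In standard terms, for a given language, writing TP, FP and FN for true positives, false positives and false negatives:

\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}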

Various statistical models were investigated, with only a Bayesian statistical model based on analysis of words providing anything approaching satisfactory precision. This model was therefore selected for further development.

The Bayesian idea

Bayesian methods are based on a calculation of the probabilities that a book is written in each language under consideration. An assumption is made that the words present within the book title are statistically independent; this is obviously a false assumption (since, for example, adjacent words are likely to belong to the same language), but it allows application of the following proportionality:

P(D \text{ is in language } l \mid D \text{ has features } f_1, \dots, f_n) \propto P(D \text{ is in language } l) \prod_{i=1}^{n} P(\text{feature } f_i \text{ arises in language } l)

The right-hand side of this proportionality can be calculated based on an analysis of the training corpus. The language of the test document is then predicted to be the language which maximises the above probability.

Because of the assumption of word-independence, this method is often referred to as naïve Bayesian classification.

What that means in practice is this: we notice that whenever the word ‘szerelem’ appears in a book title for which we have language information, the language is Hungarian. Therefore, if we find a book title which contains the word ‘szerelem’, but we don’t have language information for that book, we can predict that the book is probably in Hungarian.

Screenshot of catalogue entry with the word 'szerelem' in the title of a book
Szerelem: definitely a Hungarian word => probably a Hungarian title

If we repeat this for every word appearing in every title of each of the 12 million resources where we do have language information, then we can build up a model, which we can use to make predictions about the language(s) of the 4.7 million records that we’re interested in. Simple!
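The sketch below works through the whole idea on a toy scale, with a handful of invented titles standing in for the 12 million catalogue records; the smoothing and the exact scoring are assumptions chosen to keep the example short, not a description of the production model.

```python
# Toy naive Bayes language prediction from the words in a title.
from collections import Counter, defaultdict
from math import log

# Invented (title, language) pairs standing in for catalogue records.
training = [
    ("a history of english poetry", "English"),
    ("the english novel", "English"),
    ("szerelem versek", "Hungarian"),
    ("magyar szerelem", "Hungarian"),
]

# Count how often each word appears in titles of each language.
word_counts = defaultdict(Counter)
language_counts = Counter()
for title, language in training:
    language_counts[language] += 1
    word_counts[language].update(title.split())

vocabulary = {w for counts in word_counts.values() for w in counts}

def predict(title: str) -> str:
    scores = {}
    for language, n_titles in language_counts.items():
        total_words = sum(word_counts[language].values())
        # Log prior plus a log likelihood for each word, with add-one
        # smoothing so unseen words do not rule a language out entirely.
        score = log(n_titles / sum(language_counts.values()))
        for word in title.lower().split():
            count = word_counts[language][word] + 1
            score += log(count / (total_words + len(vocabulary)))
        scores[language] = score
    return max(scores, key=scores.get)

# 'szerelem' has only ever been seen in Hungarian titles,
# so it pulls the prediction towards Hungarian.
print(predict("szerelem and other stories"))
```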

Training corpus

The training corpus was built from British Library catalogue records which contain language information. Records recorded as being in ‘Miscellaneous languages’, ‘Multiple languages’, ‘Sign languages’, ‘Undetermined’ and ‘No linguistic content’ were excluded. This yielded 12,254,341 records, of which 9,578,175 were for English-language resources. Words were extracted from the title, edition statement, and series title, and stored in a ‘language bucket’.

Words in English, Hungarian and Volapuk shown above the appropriate language 'bucket'

Language buckets were analysed in order to create a matrix of probabilities, whereby a number was assigned to each word-language pair (for all words encountered within the catalogue, and all languages listed in a controlled list) to represent the probability that that word belongs to that language. Selected examples are listed in the table below; the final row in the table illustrates the fact that shorter words tend to be common to many languages, and are therefore of less use than longer words in language identification.

  • … : {Telugu: 0.750, Somali: 0.250}
  • aaaarrgghh: {English: 1.000}
  • aaavfleeße: {German: 1.000}
  • aafjezatsd: {German: 0.333, Low German: 0.333, Limburgish: 0.333}
  • aanbidding: {Germanic (Other): 0.048, Afrikaans: 0.810, Low German: 0.048, Dutch: 0.095}
  • نبوغ: {Persian: 0.067, Arabic: 0.200, Pushto: 0.333, Iranian (Other): 0.333, Azerbaijani: 0.067}
  • metodicheskiĭ: {Russian: 0.981, Kazakh: 0.019}
  • nuannersuujuaannannginneranik: {Kalâtdlisut: 1.000}
  • karga: {Faroese: 0.020, Papiamento: 0.461, Guarani: 0.010, Zaza: 0.010, Esperanto: 0.010, Estonian: 0.010, Iloko: 0.176, Maltese: 0.010, Pampanga: 0.010, Tagalog: 0.078, Ladino: 0.137, Basque: 0.029, English: 0.010, Turkish: 0.029}

Results

Precision and recall varied enormously between languages. Zulu, for instance, had 100% precision but only 20% recall; this indicates that all records detected as being in Zulu had been correctly classified, but that the majority of Zulu records had either been mis-classified, or no language prediction had been made. In practical terms, this meant that a prediction “this book is in Zulu” was a prediction that we could trust, but we couldn’t assume that we had found all of the Zulu books. Looking at our results across all languages, we could generate a picture (formally termed a ‘confusion matrix’) to indicate how different languages were performing (see below). The shaded cells on the diagonal represent resources where the language has been correctly identified, whilst the other shaded cells show us where things have gone wrong.

Language confusion matrix

The best-performing languages were Hawaiian, Malay, Zulu, Icelandic, English, Samoan, Finnish, Welsh, Latin and French, whilst the worst-performing languages were Shona, Turkish, Pushto, Slovenian, Azerbaijani, Javanese, Vietnamese, Bosnian, Thai and Somali.

Where possible, predictions were checked by language experts from the British Library’s curatorial teams. Such validation facilitated the identification of off-diagonal shaded areas (i.e. languages for which predictions should be treated with caution), and enabled acceptance thresholds to be set. For example, the model tends to over-predict English, in part due to the predominance of English-language material in the training corpus; thus the acceptance threshold for English was set at 100%: predictions of English would only be accepted if the model claimed that it was 100% certain that the language was English. For other languages, the acceptance threshold was generally between 95% and 99%.
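For readers who want to reproduce this kind of evaluation, here is a minimal sketch using scikit-learn on a handful of invented true and predicted labels; the records and figures are made up purely to show the mechanics of a confusion matrix and per-language precision and recall.

```python
# Per-language precision and recall, plus a confusion matrix, from invented labels.
# Setup (assumed): pip install scikit-learn
from sklearn.metrics import classification_report, confusion_matrix

# Invented ground-truth and predicted languages for a handful of records.
true_languages = ["English", "Zulu", "Zulu", "Welsh", "English", "Zulu"]
predicted = ["English", "Zulu", "English", "Welsh", "English", None]

# Treat 'no prediction made' as a label of its own so it appears in the matrix.
predicted = [p if p is not None else "No prediction" for p in predicted]

labels = ["English", "Welsh", "Zulu", "No prediction"]
print(confusion_matrix(true_languages, predicted, labels=labels))
print(classification_report(true_languages, predicted, labels=labels, zero_division=0))
```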

Outcomes

Two batches of records have been completed to date. In the first batch, language codes were assigned to 1.15 million records with 99.7% confidence; in the second batch, a further 1 million language codes were assigned with 99.4% confidence. Work on a third batch is currently underway, and it is hoped to achieve at least a further million language code assignments. The graph below shows the impact that this project is having on the British Library catalogue.

Graph showing improvement in the number of 'foundation catalogue' records with languages recorded

The project has already been well-received by Library colleagues, who have been able to use the additional language coding to assist them in identifying curatorial responsibilities and better understanding the collection.

Further reading

For a more in-depth, mathematical write-up of this project, please see a paper written for Cataloging & Classification Quarterly, which is available at: https://doi.org/10.1080/01639374.2019.1700201, and is also in the BL research repository at https://bl.iro.bl.uk/work/6c99ffcb-0003-477d-8a58-64cf8c45ecf5.
