Digital scholarship blog

Introduction

Tracking exciting developments at the intersection of libraries, scholarship and technology.

11 July 2025

Automatic Text Recognition in Cultural Heritage Institutions survey: a brief analysis and a published dataset

This post is by Dr Valentina Vavassori, Digital Curator for Automatic Text Recognition.

Introduction

A few months ago, we circulated a brief survey to understand how other institutions use Automatic Text Recognition and to discuss the creation of a working group on the subject (see the original blog post, ‘Help us explore Automatic Text Recognition in cultural heritage institutions!’). This survey was part of my ongoing research on ATR processes and workflows and how we can approach our own ATR workflow.
I am happy to report that the anonymised data are available on our repository and freely downloadable and reusable. We thank everyone who took the time to answer and gave such amazing insights!
A data paper is also in the pipeline and will be published soon to detail the context of the data, including its limitations such as language used (English), channels used to publicise it (Libraries mailing lists, groups such as Transkribus, AI4LAM and IIIF), and the data cleaning process.
I took some time to do a quick data analysis and here are some of the highlights. First of all, the survey had a good uptake: 68 answers were submitted, one of which was a double entry, leaving 67 answers after cleaning the data.

General Questions

The survey started with some general questions on respondents’ organisations and their collections. Most of the respondents worked in a library or archive (47), in medium-sized institutions with 51-500 employees (38). Their collections are in multiple languages (50) and scripts (44). The answers were mostly from the United States and Europe.

One of the survey questions asked about languages in the collections and it was interesting to see the variety of languages reported. The most mentioned languages were English, French, German, Spanish, Dutch, Arabic, Chinese, Italian, and Russian.

I did a similar analysis with scripts. The most mentioned scripts were Latin, Arabic, Cyrillic, Hebrew, Chinese, Japanese, Devanagari, Fraktur, and Korean.
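For readers who would like to try a similar count on the published dataset, here is a minimal sketch of how mentions in free-text, multi-value answers could be tallied. It is not the method used for the analysis above: the function name `count_mentions`, the separators it splits on and the sample answers are all assumptions made for illustration.

```python
import re
from collections import Counter


def count_mentions(answers):
    """Count how many answers mention each item (hypothetical helper).

    Each answer is a free-text string listing several items. Items are
    normalised to title case and counted once per answer.
    """
    counts = Counter()
    for answer in answers:
        # Assumed separators: commas, semicolons, slashes or the word "and".
        items = re.split(r"[,;/]|\band\b", answer)
        # A set ensures a respondent who repeats an item is counted once.
        normalised = {item.strip().title() for item in items if item.strip()}
        counts.update(normalised)
    return counts


# Invented sample answers, not taken from the survey dataset.
sample_answers = [
    "English, French and German",
    "english; Arabic",
    "French / English",
]

language_counts = count_mentions(sample_answers)
print(language_counts.most_common())
```

Real answers would need more careful cleaning (spelling variants, abbreviations, language families), which is why a manual check of the normalised values is worth doing before reading too much into the totals.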

ATR workflows in cultural heritage institutions

The majority of the survey was dedicated to identifying what kind of ATR workflows are currently adopted in cultural heritage institutions. Overall, 46 respondents reported that ATR was part of their workflow. Motivations for performing ATR were varied. Among the options offered in the survey, respondents chose implementing content search (54), creating datasets for future use (35) and performing research (30). However, other respondents mentioned additional motivations such as improving accessibility (4), helping with cataloguing (4) and helping with the creation of AI models (2). These motivations were particularly interesting as they touched on key points that we need to keep in mind for the ATR workflow that we are in the process of creating at the British Library. For example:

“accessibility/disability support”
“Create sources for NER and Wikification and other NLP activity”

Most of the institutions do quality control of a sample (41) and do not correct the output texts (33). Copyright is part of their workflow (39), even though one of the respondents made an interesting distinction: “Copyright assessment is a standard part of our digitisation workflow - so not tied to ATR, but to all digitisation”.

Another question asked which tools respondents commonly used for ATR. The answers were particularly interesting as they mentioned some of the most well-known solutions (commercial and open source) such as Transkribus, ABBYY and Tesseract OCR, but also a variety of new solutions such as custom pipelines, Loghi and Docwizz, showcasing how the field is rapidly evolving. Another interesting observation is the variety of local and online testing of LLMs (e.g. ChatGPT, Gemini, Llama, Qwen and Gemma, to cite just a few).
The most used text formats are .TXT (32), ALTO (25), PDF (24) and JSON (17), and text is usually displayed using searchable PDF(s) (36) or IIIF viewers (32). When asked why they use certain formats, the respondents mentioned how these formats were helpful for different audiences as well as being international standards.
Metadata is a particularly interesting topic as the metadata standards mentioned were quite varied such as METS (15), IIIF (11), and PAGE XML (6). There was also an interesting mention of MLFlow to document and manage ML processes. Despite the use of different standards, some fields were identified as common ground, such as Software, Date of Creation, Accuracy and Model, therefore demonstrating the need to start a broader discussion on how to standardise and document the various ATR processes in a transparent and open way.
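To make the idea of common ground more concrete, here is a minimal sketch of a record holding the four shared fields mentioned above (Software, Date of Creation, Accuracy and Model). It is only an illustration, not a proposed standard: the class name `AtrProvenance`, the field types, the JSON layout and the example values are all assumptions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional


@dataclass
class AtrProvenance:
    """Hypothetical record of how one ATR output was produced."""

    software: str                     # tool that produced the text
    model: str                        # recognition model name or version
    date_of_creation: date            # when the text was generated
    accuracy: Optional[float] = None  # measured accuracy between 0 and 1, if known

    def to_json(self):
        record = asdict(self)
        # Dates are not JSON-serialisable, so store them as ISO 8601 strings.
        record["date_of_creation"] = self.date_of_creation.isoformat()
        return json.dumps(record, sort_keys=True)


# Invented example values for illustration only.
example = AtrProvenance(
    software="ExampleOCR 1.0",
    model="example-model-v2",
    date_of_creation=date(2025, 7, 11),
    accuracy=0.95,
)
print(example.to_json())
```

Keeping accuracy optional reflects the fact that not every output is evaluated, and an explicit empty value is easier to interpret later than a missing field.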

Future implementations

Finally, the survey asked a few questions on future expansion of ATR and on what measures have been adopted to create sustainable and ethical ATR workflows. Most institutions do not currently implement Optical Music Recognition or Automatic Speech Recognition: some would like to implement them (19) but others are not interested (18). Most institutions have not evaluated the environmental sustainability (42) of their ATR workflow or its ethical impact (42) but expressed interest in doing so. Some of the most interesting answers to these challenges were:

“Only create new versions of HTR when results are accepted to be significantly better. not using llm when you can use simple AI”
“we place a disclaimer that the HTR is result of AI”
“Evaluated tools on a variety of different scripts and languages; compiled report on the potential for HTR to perpetuate existing curatorial biases in our collections.”

These answers reflect a commitment to transparency, an awareness of the risks connected to the use of AI, attention to its environmental impact and how it can be minimised, and a concern for its responsible usage. They echo similar considerations that are part of our internal conversations.
This survey has helped us in scoping the current landscape and in re-designing and standardising our ATR workflow. To further the collaboration that started with this survey, I will also begin the process of creating a working group on ATR and will share updates soon!

Posted by Digital Research Team at 11:10 AM

Technorati Tags: ATR, HTR, OCR, survey

09 July 2025

A Geographer’s Initiation Into Digital Humanities: Part 1

A post by Dr Huw Rowlands on his Coleridge Fellowship 2025, 'Cross-cultural Encounters in the Survey of India in the Mid-nineteenth Century'.

To begin at the beginning.

My 2021 doctoral thesis focused on cross-cultural encounters in Aotearoa – New Zealand. I started with an overview of the 18th century voyage of the Endeavour, led by James Cook, to Te Moana nui a Kiwa – the Pacific Ocean. I went on to examine the histories that continue to be created about these encounters in official reports, academic research, museum exhibitions, and documentary film.

Since then, I have been working with the many thousands of maps produced by the Survey of India held in the India Office Records (IOR) Map Collection. I soon became aware of the virtual invisibility of work by the Indian, Burmese and other staff on the maps themselves. With this tucked away at the back of my mind, I have followed my curiosity about digital humanities in British Library and other seminars and workshops, and actively followed the Library’s work on its Race Equality Action Plan. When I came across three series of printed annual reports produced by Survey of India Survey Parties, which listed all survey staff, including those they called ‘Native Surveyors’, these strands quickly came together in my mind and eventually led to my Coleridge Fellowship proposal. The Coleridge Fellowship offers British Library staff the opportunity to pursue a piece of original research and further understanding of the Library’s collections. It was established in 2017 through the generosity of Professor Heather Jackson and her late husband Professor J.R. de J. Jackson, and is named after Samuel Taylor Coleridge (1772-1834).

My aims with the Fellowship are to show the opportunities in the IOR Map Collection to identify a range of individuals involved in mapping what is called in the reports ‘British India’, to learn and demonstrate how data can be extracted and managed, and to reveal its potential in understanding cross-cultural relationships in this context.

Black Boxes

With great support from the Library’s Digital Research and Heritage Made Digital teams among others, particularly Harry Lloyd, Mia Ridge, and Valentina Vavassori, I drew up a plan for the project. The first step was to evaluate the series of reports and choose one set. The next stages are focused on digital methods: firstly to acquire and verify digital images of the chosen reports, use OCR (Optical Character Recognition) to create text files, extract and structure the information I need from them, and lastly visualise the information to create a foundation to help answer my research questions. Each of these stages looked to me like a black box – something clear and present but whose internal workings are a bit of a mystery. At an early planning meeting with the team, we started to explore each black box stage. Black boxes were unpacked onto three white boards: Inputs/Sources, Process, and Results. These initial sketches have become the foundations of my detailed research plan for the digital stages of the project.

photo of a whiteboard with text and sketches of information needed from the source documents

One of the whiteboards from our first digital planning meeting

Potentially hidden away in or between each black box were what Mia called ‘magic elves’, imaginary creatures who undertake essential but unresourced tasks such as converting information from one form to another. We unpacked the boxes and set out a series of smaller steps, banishing numerous phantom elves.

My work is currently focused on learning the skills needed to achieve each smaller step. I have been getting to grips with the OCR application Transkribus, ably guided by Valentina. Crucial to making the most of such tools is referring forwards to the next digital stage and its own tools, as well as backwards to my research questions. In doing so, the image of a series of discrete black boxes has now given way to a relay race, passing a baton of information on from one stage to the next. The way I use one tool can make the transition onto the next easier or harder. So, while I remain firmly focused on Transkribus, Harry has been guiding me through the stage that follows, so that the data baton can be passed on as smoothly as possible.

Digitised page of a survey report showing numbered paragraphs and an inset list of members of the topographical party

Digital image before uploading to Transkribus

As well as relying on some unsophisticated metaphors, my vocabulary has been changing, with both some new words, and some old words with different, or more specific meanings. Regions and tags are two from Transkribus. Regions are a way of separating areas of the original image so that Transkribus organises the text into separate sections. I have been using the pre-existing Heading and Marginalia, for example, and have added a new Region, Credit, where staff are credited with work undertaken during the year. Using regions should help the data extraction stage by enabling me to focus on areas of text where the data most useful for my research questions is to be found. Tags label individual words or phrases as entities such as People, Places and Organisations. ‘Tag’ is a short word but using tags involves a careful examination of what I need to tag and why, as well as consideration of each tag’s attributes. Transkribus’ default Person tag, for example, includes the Attributes First Name, Last Name and dates of Birth and Death. To track promotion over time, I have added a new attribute – Title. Tagging is an intriguing, interpretive process and I expect to have more to say about it later in the project.

Screenshot of a printed page with sections outlined, and names from the page set out in the Transkribus tool

Transkribus screenshot showing regions applied to the digital image on the left, and the tagged transcription on the right.
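As a rough illustration of how tagged names might later be passed on to the data extraction stage, here is a minimal sketch that reads tags from an exported line of text. It assumes, and this is only an assumption, that each tag is stored as a string such as `person {offset:4; length:10; title:Surveyor;}`, where offset and length locate the tagged words within the line. The function name `parse_tags` and the sample line are invented for this example.

```python
import re

# Matches "tagname {key:value; key:value;}" groups in the assumed tag string.
TAG_PATTERN = re.compile(r"(\w+)\s*\{([^}]*)\}")


def parse_tags(custom, line_text):
    """Return tagged spans found in an assumed tag string (hypothetical helper)."""
    tags = []
    for name, body in TAG_PATTERN.findall(custom):
        attributes = {}
        for pair in body.split(";"):
            if ":" in pair:
                key, value = pair.split(":", 1)
                attributes[key.strip()] = value.strip()
        # Only entries with a position point at words in the line;
        # others (such as a reading order entry) are skipped.
        if "offset" in attributes and "length" in attributes:
            start = int(attributes.pop("offset"))
            length = int(attributes.pop("length"))
            tags.append(
                {
                    "tag": name,
                    "text": line_text[start:start + length],
                    "attributes": attributes,
                }
            )
    return tags


# Invented sample line and tag string, not taken from the survey reports.
line_text = "Mr. A. Example, Surveyor, was in charge."
custom = "readingOrder {index:0;} person {offset:4; length:10; title:Surveyor;}"

tags = parse_tags(custom, line_text)
print(tags)
```

Keeping the remaining attributes in a dictionary means that an added attribute such as Title travels with the name without any change to the code.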

As I move onto the data extraction stage, I will no doubt be acquiring and understanding more vocabulary. I have so far spotted entities, triples, NLP, Python, LLM, and NER, to name a few. I also expect to need a new metaphor or two.

Dr Huw Rowlands

British Library Coleridge Fellow 2025

Processing Coordinator and Cataloguer

India Office Records Map Project

Posted by Digital Research Team at 10:51 AM

Tags

Decolonising, Digital scholarship, LIS research, Maps, South Asia

17 June 2025

The Digital Research team at DH2025

Several members of the Digital Research team had proposals accepted and will be attending the Digital Humanities 2025 conference in Lisbon. To help get conversations started, we’ve compiled some information below about the work we’ll be discussing.

We’ll be on social media, on Mastodon and BlueSky (@bldigischol.bsky.social, @adi-keinan.bsky.social, @miaout.bsky.social, @universalviewer.io), and we’re looking forward to talking to people there!

In order of appearance…

On July 16, Digital Curator Adi Keinan-Schoonbaert is presenting on ‘Digital Humanities and Environmental Sustainability at the British Library’:

‘In this paper, Adi will explore a heritage organisation’s journey into digital sustainability, looking at the British Library as a case study. She’ll discuss initiatives aimed at increasing literacy and capacity building, both within the Library but also externally, fostering personal agency, and encouraging action using both bottom-up and top-down approaches. Framing this within the context of the Library’s Sustainability and Climate Change Strategy, Adi will examine the role of internal capacity-building efforts—including staff-led networks, targeted training, and collaborative workshops such as those with the Digital Humanities Climate Coalition—in promoting sustainable digital literacy and embedding environmentally conscious decision-making across the organisation.’

The 'Future of Digital Sustainability' workshop, as part of the 'Discover Digital Sustainability' training series

On July 17, our Universal Viewer team (Lanie Okorodudu, Saira Akhter, James Misson and Erin Burnand) and Digital Curator Mia Ridge are sharing their work in ‘Radically inclusive software development for digital cultural heritage’. The Universal Viewer is a community-developed open source project on a mission to help share digital collections. Fresh from community sprints focused on improving the developer experience, the team will share:

‘Sustaining open source software can be challenging. We discuss collaboration on the Universal Viewer (UV), software designed to display cultural heritage collections. We highlight methods including innovative, inclusive and multi-institution sprints. We showcase UV’s evolution, including accessibility and user experience enhancements, future plans and ways for others to contribute.’

We might also attend the 'Decade of IIIF' panel on Friday.

Sally Chambers contributed to a group poster on Computational Literary Studies Infrastructure (CLS INFRA): Leveraging Literary Methods for FAIR(er) Science shown on July 18.

In the final session of the conference on July 18, Mia is part of a panel, Openness in GLAM: Analysing, Reflecting, and Discussing Global Case Studies, with Nadezhda Povroznik, Paul L. Arthur, T. Leo Cao, Samantha Callaghan and Luis Ramos Pinto:

‘This panel explores diverse dimensions of openness within the galleries, libraries, archives and museums (GLAM) sector globally, shaping discussions about accessibility, inclusivity, participation, and knowledge democratisation. Cultural heritage institutions are responsible “to all citizens”. Yet there are gaps relating to collections, knowledge, policy, technology, engagement, IP, ethics, infrastructure and AI.’

Mia is particularly interested in ‘the Paradoxes of Open Data in Libraries, Archives and Museums’, including:

The lack of robust, sector-wide shared infrastructure providing long-term access to GLAM collections, despite decades of evidence for its value and the difficulties many institutions have in maintaining individual repositories
The tension between making data open for exploration and re-use (including scraping by generative AI companies), while respecting copyright and the right of creators to receive income from their writing, art, music, etc.
Balancing the FAIR principles - making open collections Findable, Accessible, Interoperable and Reusable - with the CARE principles for Indigenous Data Governance, to support Indigenous people in “asserting greater control over the application and use of Indigenous data and Indigenous Knowledge for collective benefit” (Global Indigenous Data Alliance, 2018). Operationalizing the CARE principles might require an investment of time in building relationships and trust with Indigenous communities before releasing open data - or perhaps choosing to keep data closed in some ways - that counters the urge for speed. What changes are required for organisations to meaningfully address the CARE principles, and what can individual staff do if resources to invest in community relationships aren’t available?
The need for financial models to fund collections digitisation that don't rely on individual users paying for access to collections and the overhead required to provide evidence for the use and impact of open data

Posted by Digital Research Team at 12:18 PM

Tags

Digital scholarship, Events, LIS research