11 July 2025
Automatic Text Recognition in Cultural Heritage Institutions survey: a brief analysis and a published dataset
This post is by Dr Valentina Vavassori, Digital Curator for Automatic Text Recognition.
Introduction
A few months ago, we circulated a brief survey to understand how other institutions use Automatic Text Recognition and to discuss the creation of a working group on the subject (see the original blog post here: Help us explore Automatic Text Recognition in cultural heritage institutions! - Digital scholarship blog). This survey was part of my ongoing research on ATR processes and workflows and how we can approach our own ATR workflow.
I am happy to report that the anonymised data are available on our repository and freely downloadable and reusable. We thank everyone who took the time to answer and gave such amazing insights!
A data paper is also in the pipeline and will be published soon. It will detail the context of the data, including its limitations, such as the language used (English), the channels used to publicise the survey (library mailing lists and groups such as Transkribus, AI4LAM and IIIF), and the data cleaning process.
I took some time to do a quick data analysis, and here are some of the highlights. First of all, the survey had a good uptake, receiving 68 answers (one was a double entry, so the final count was 67 after cleaning the data).
The survey started with some general questions on respondents’ organisations and their collections. Most of the respondents worked in a library or archive (47), in medium-sized institutions with 51-500 employees (38). Their collections are in multiple languages (50) and scripts (44). The answers were mostly from the United States and Europe.
One of the survey questions asked about languages in the collections and it was interesting to see the variety of languages reported. The most mentioned languages were English, French, German, Spanish, Dutch, Arabic, Chinese, Italian, and Russian.
I did a similar analysis with scripts. The most mentioned scripts were Latin, Arabic, Cyrillic, Hebrew, Chinese, Japanese, Devanagari, Fraktur, and Korean.
ATR workflows in cultural heritage institutions
The majority of the survey was dedicated to identifying what kind of ATR workflows are currently adopted in cultural heritage institutions. Overall, 46 respondents reported that ATR was part of their workflow. Among the options offered in the survey, the motivations for performing ATR were implementing content search (54), creating datasets for future use (35) and performing research (30). However, other respondents mentioned additional motivations such as improving accessibility (4), helping with cataloguing (4) and helping with the creation of AI models (2). These motivations were particularly interesting as they touched on key points that we need to keep in mind for the ATR workflow that we are in the process of creating at the British Library. For example:
“accessibility/disability support”
“Create sources for NER and Wikification and other NLP activity”
Most of the institutions do quality control of a sample (41) and do not correct the output texts (33). Copyright is part of their workflow (39), although one of the respondents made an interesting distinction: “Copyright assessment is a standard part of our digitisation workflow - so not tied to ATR, but to all digitisation”.
Another question asked what tools they commonly used for ATR. The answers were particularly interesting as they mentioned some of the most well-known solutions (commercial and open source) such as Transkribus, ABBYY and Tesseract OCR, but also a variety of newer solutions such as custom pipelines, Loghi and Docwizz, showcasing how rapidly the field is evolving. Another interesting observation is the variety of local and online testing of LLMs (e.g. ChatGPT, Gemini, Llama, Qwen and Gemma, to name just a few).
The most used text formats are .TXT (32), ALTO (25), PDF (24) and JSON (17), and text is usually displayed using searchable PDFs (36) or IIIF viewers (32). When asked why they use certain formats, the respondents mentioned that these formats are helpful for different audiences and are international standards.
Metadata is a particularly interesting topic as the metadata standards mentioned were quite varied, including METS (15), IIIF (11), and PAGE XML (6). There was also an interesting mention of MLflow to document and manage ML processes. Despite the use of different standards, some fields were identified as common ground, such as Software, Date of Creation, Accuracy and Model. This demonstrates the need to start a broader discussion on how to standardise and document the various ATR processes in a transparent and open way.
Future implementations
Finally, the survey asked a few questions on the future expansion of ATR and on what measures have been adopted to create sustainable and ethical ATR workflows. Most institutions do not currently implement Optical Music Recognition or Automatic Speech Recognition: some would like to implement them (19) but others are not interested (18). Most institutions have not evaluated the environmental sustainability (42) of their ATR workflow or its ethical impact (42) but expressed interest in doing so. Some of the most interesting answers to these challenges were:
“Only create new versions of HTR when results are accepted to be significantly better. not using llm when you can use simple AI”
“we place a disclaimer that the HTR is result of AI”
“Evaluated tools on a variety of different scripts and languages; compiled report on the potential for HTR to perpetuate existing curatorial biases in our collections.”
These answers reflect a commitment to transparency, an awareness of the risks connected to the use of AI, attention to its environmental impact and how it can be minimised, and a concern for its responsible use. Similar considerations are part of our internal conversations.
This survey has helped us to scope the current landscape and to redesign and standardise our ATR workflow. To further the collaboration that started with this survey, I will also begin the process of creating a working group on ATR and will share updates soon!