Digital scholarship blog

Enabling innovative research with British Library digital collections


13 March 2024

Rethinking Web Maps to present Hans Sloane’s Collections

A post by Dr Gethin Rees, Lead Curator, Digital Mapping...

I have recently started a community fellowship working with geographical data from the Sloane Lab project. The project, titled A Generous Approach to Web Mapping Sloane’s Collections, deals with the collection of Hans Sloane, amassed in the eighteenth century and a foundation collection for the British Museum and, subsequently, the Natural History Museum and the British Library. The aim of the fellowship is to create interactive maps that enable users to view the global breadth of Sloane’s collections, to discover collection items and to click through to their web pages. The Sloane Lab project, funded by the UK’s Arts and Humanities Research Council as part of the Towards a National Collection programme, has created the Sloane Lab knowledge base (SLKB), a rich and interconnected knowledge graph of this vast collection. My fellowship seeks to link and visualise digital representations of British Museum and British Library objects in the SLKB, and I will be guided by project researchers Andreas Vlachidis and Daniele Metilli from University College London.

Photo of a bust sculpture of a man in a curled wig on a red brick wall
Figure 1. Bust of Hans Sloane in the British Library.

The first stage of the fellowship is to use data science methods to extract place names from the records of Sloane’s collections that exist in the catalogues today. These records will then be aligned with a gazetteer, a list of places and associated data, such as the World Historical Gazetteer (https://whgazetteer.org/). Alignment yields coordinates in the form of latitude and longitude, meaning the places can be displayed on a map; the fellowship will draw on the Peripleo web map software to do this (https://github.com/britishlibrary/peripleo).
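To make the pipeline concrete, here is a minimal sketch of the extract-and-align step in Python. It assumes spaCy’s small English model for the named-entity step (the fellowship does not prescribe a specific toolkit), and it leaves the gazetteer lookup as a placeholder, since the exact alignment workflow against the World Historical Gazetteer is not described here.

```python
# A minimal sketch: pull candidate place names from catalogue record text.
# spaCy and its "en_core_web_sm" model are assumptions for illustration
# (install with: python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_place_names(record_text: str) -> list[str]:
    """Return candidate place names (GPE/LOC entities) from a record."""
    doc = nlp(record_text)
    return [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]

record = "Collected in Jamaica and shipped to London, 1689."  # toy example
for place in extract_place_names(record):
    # Alignment step (placeholder): match each name against a gazetteer
    # entry to obtain latitude and longitude for display in Peripleo.
    print(place)  # -> "Jamaica", "London"
```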

Image of a rectangular map with circles overlaid on locations
Figure 2. Web map using the Web Mercator projection, from the Georeferencer.


The fellowship also aims to critically evaluate the use of mapping technologies (e.g. Google Maps Embed API, Mapbox GL, Leaflet) to present cultural heritage collections on the web. One area that I will examine is the use of the Web Mercator projection as the standard option for presenting humanities data using web maps. A map projection is a method of representing part of the surface of the earth on a plane (flat) surface. The transformation from a sphere or similar to a flat representation always introduces distortion. There are innumerable projections, or ways to make this transformation, each suited to different purposes, with its own strengths and weaknesses. Web maps are predominantly used for navigation, and the Web Mercator projection is well suited to this purpose as it preserves angles.

Image of a rectangular map with circles illustrating that countries nearer the equator are shown as relatively smaller
Figure 3. Map of the world based on the Mercator projection, including indicatrices to visualise local distortions to area. By Justin Kunimune. Source: https://commons.wikimedia.org/wiki/File:Mercator_with_Tissot%27s_Indicatrices_of_Distortion.svg Used under CC-BY-SA-4.0 license.

However, this does not necessarily mean it is the right projection for presenting humanities data. Indeed, it is unsuitable for the aims and scope of the Sloane Lab: first, because of well-documented visual compromises, such as the inflation of landmasses like Europe at the expense of, for example, Africa and the Caribbean, which not only hamper visual analysis but also recreate and reinforce global inequities and injustices. Second, the Mercator projection has a history entangled with processes like colonialism, empire and slavery that also shaped Hans Sloane’s collections. The fellowship therefore examines the use of other projections, such as those that preserve distance and area, to represent contested collections and collecting practices in interactive mapping libraries like Leaflet or OpenLayers. Geography is intimately connected with identity, and digital maps thus offer powerful opportunities for presenting cultural heritage collections. The fellowship examines how reinvention of a commonly used visualisation form can foster thought-provoking engagement with Sloane’s collections, and how this might be applied to visualise the geography of heritage more widely.
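The scale of the distortion is easy to demonstrate. The snippet below (a sketch, assuming the pyproj library, which is not part of the fellowship’s stated toolchain) projects the same one-degree cell at the equator and at 60°N into Web Mercator and into an Albers equal-area projection, and compares their footprints.

```python
# Compare how two projections treat a 1° x 1° cell at the equator vs 60°N.
from pyproj import Transformer

def cell_footprint(transformer, lon, lat):
    """Approximate projected area (m^2) of a 1-degree cell at (lon, lat)."""
    x0, y0 = transformer.transform(lon, lat)
    x1, y1 = transformer.transform(lon + 1, lat + 1)
    return abs(x1 - x0) * abs(y1 - y0)

mercator = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
albers = Transformer.from_crs(
    "EPSG:4326", "+proj=aea +lat_1=20 +lat_2=50 +datum=WGS84", always_xy=True
)

for name, t in [("Web Mercator", mercator), ("Albers equal-area", albers)]:
    ratio = cell_footprint(t, 0, 60) / cell_footprint(t, 0, 0)
    print(f"{name}: 60°N cell is {ratio:.2f}x the area of the equatorial cell")
# Web Mercator draws the 60°N cell roughly twice as large as the equatorial
# one, when in truth it is about half the size (the equal-area answer):
# a fourfold relative inflation, growing worse towards the poles.
```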

Image of a curved map that represents the relative size of countries more accurately
Figure 4. Map of the world based on the Albers equal-area projection, including indicatrices to visualise local distortions to area. By Justin Kunimune. Source: https://commons.wikimedia.org/wiki/File:Albers_with_Tissot%27s_Indicatrices_of_Distortion.svg Used under CC-BY-SA-4.0 license.

21 September 2023

Convert-a-Card: Helping Cataloguers Derive Records with OCLC APIs and Python

This blog post is by Harry Lloyd, Research Software Engineer in the Digital Research team, British Library. You can sometimes find him at the Rose and Crown in Kentish Town.

Last week Dr Adi Keinan-Schoonbaert delved into the invaluable work that she and others have done on the Convert-a-Card project since 2015. In this post, I’m going to pick up where she left off, and describe how we’ve been automating parts of the workflow. When I joined the British Library in February, Victoria Morris and former colleague Giorgia Tolfo had prototyped programmatically extracting entities from transcribed catalogue cards and searching by title and author in the OCLC WorldCat database for any close matches. I have been building on this work, addressing the last yellow rectangle below, “Curator disambiguation and resolution”: namely, how curators choose between OCLC results and develop a MARC record fit for ingest into British Library systems.

A flow chart of the Convert-a-card workflow. Digital catalogue cards to Transkribus to bespoke language model to OCR output (shelfmark, title, author, other text) to OCLC search and retrieval and shelfmark correction to spreadsheet with results to curator disambiguation and resolution to collection metadata ingest
The Convert-a-Card workflow at the start of 2023

 

Entity Extraction

We’re currently working with the digitised images from two drawers of cards, one Urdu and one Chinese. Adi and Giorgia used a layout model on Transkribus to successfully tag different entities on the Urdu cards. The transcribed XML output then had ‘title’, ‘shelfmark’ and ‘author’ tags for the relevant text, making them easy to extract.

On the left an image of an Urdu catalogue card, on the right XML describing the transcribed text, including a "title" tag for the title line
Card with layout model and resulting XML for an Urdu card, showing the `structure {type:title;}` parameter on line one

The same method didn’t work for the Chinese cards, possibly because the cards are less consistently structured. There is, however, consistency in the vertical order of entities on the card: shelfmark comes above title comes above author. This meant I could reuse some code we developed for Rossitza Atanassova’s Incunabula project, which reliably retrieved title and author (and occasionally an ISBN).

Two Chinese cards side-by-side, with different layouts.
Chinese cards. Although the layouts are variable, shelfmark is reliably the first line, with title and author following.
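The vertical-order heuristic is simple enough to sketch. The snippet below is an illustrative reconstruction rather than the project’s actual code: it reads the transcribed lines from a Transkribus PAGE XML export in document order and assigns shelfmark, title and author by position.

```python
# Assign entities by vertical position: shelfmark, then title, then author.
import xml.etree.ElementTree as ET

def lines_from_page_xml(path: str) -> list[str]:
    """Return the line-level Unicode transcriptions, in document order."""
    root = ET.parse(path).getroot()
    lines = []
    for el in root.iter():
        if el.tag.endswith("}TextLine"):
            for child in el:  # the line-level TextEquiv holds the text
                if child.tag.endswith("}TextEquiv"):
                    for unicode_el in child:
                        if unicode_el.tag.endswith("}Unicode") and unicode_el.text:
                            lines.append(unicode_el.text.strip())
    return lines

def card_entities(path: str) -> dict:
    lines = lines_from_page_xml(path)
    # Vertical order on the Chinese cards: shelfmark above title above author.
    return dict(zip(("shelfmark", "title", "author"), lines))
```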

 

Querying OCLC WorldCat

With the title and author for each card, we were set up to query WorldCat, but how to do this when there are over two thousand cards in these two drawers alone? Victoria and Giorgia made impressive progress combining Python wrappers for the Z39.50 protocol (PyZ3950) and the MARC format (pymarc). With their prototype, a lot of googling of ASN.1, BER and Z39.50, and a couple of quiet weeks drifting through the web of references between the two packages, I built something that could turn a table of titles and authors for the Chinese cards into a list of MARC records. I had also brushed up on enough UTF-8 to work out why none of the Chinese characters were encoding correctly, and fix it.

For all that I enjoyed trawling through it, Z39.50 is, in the words of a 1999 tutorial, “rather hard to penetrate” and nearly 35 years old. PyZ3950, the Python wrapper, hasn’t been maintained for two years, and making any changes to the code is a painstaking process. While Z39.50 remains widely used for transferring information between libraries, that doesn’t mean there aren’t better ways of doing things, and in the name of modernity OCLC offer a suite of APIs for their services. Crucially, there are endpoints on their Metadata API that allow search and retrieval of records in MARCXML format. As the British Library maintains a cataloguing subscription to OCLC, we have access to the APIs, so all that’s needed is a call to the OCLC OAuth server, a search on the Metadata API using title and author, then retrieval of the MARCXML for any results. This is very straightforward in Python, and with the Requests package and about ten lines of code we can have our MARCXML matches.
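Those ten-ish lines look roughly like the sketch below. The OAuth client-credentials flow and the search-then-retrieve pattern are the point; the endpoint paths, parameter names and response fields are illustrative placeholders rather than a verbatim copy of OCLC’s API documentation.

```python
# Sketch of the OAuth + Metadata API flow with Requests. URLs, parameters
# and response fields below are placeholders, not OCLC's documented API.
import requests

TOKEN_URL = "https://oauth.oclc.org/token"           # OCLC OAuth server
SEARCH_URL = "https://metadata.api.oclc.org/search"  # placeholder path

def worldcat_marcxml(title, author, key, secret):
    # 1. Client-credentials grant gets us a bearer token.
    token = requests.post(
        TOKEN_URL, auth=(key, secret),
        data={"grant_type": "client_credentials", "scope": "WorldCatMetadataAPI"},
    ).json()["access_token"]
    headers = {"Authorization": f"Bearer {token}"}
    # 2. Search the Metadata API by title and author...
    results = requests.get(
        SEARCH_URL, headers=headers,
        params={"q": f"ti:{title} AND au:{author}"},
    ).json()
    # 3. ...then retrieve the MARCXML for each candidate match.
    return [
        requests.get(record["recordUrl"], headers=headers).text
        for record in results.get("records", [])
    ]
```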

Selecting Matches

At all stages of the project we’ve needed someone to select the best match for a card from WorldCat search results. This responsibility currently lies with curators and cataloguers from the relevant collection area. With that audience in mind, I needed a way to present MARC data from WorldCat so curators could compare the MARC fields for different matches. The solution needed to let a cataloguer choose a card, show the card and a table with the MARC fields for each WorldCat result, and ideally provide filters so curators could use domain knowledge to filter out bad results. I put out a call on the cross-government data science network, and a colleague in the 10DS data science team suggested Streamlit.

Streamlit is a Python package that allows fast development of web apps without needing to be a web app developer (which is handy as I’m not one). Adding Streamlit commands to the script that processes WorldCat MARC records into a dataframe quickly turned it into a functioning web app. The app reads in a dataframe of the cards in one drawer and their potential WorldCat matches, and presents it as a table of cards to choose from. You then see the image of the card you’re working on and a MARC field table for the relevant WorldCat matches. This side-by-side view makes it easy to scan across a particular MARC field and exclude matches that have, for example, the wrong physical dimensions. There’s a filter for cataloguing language, sort options for things like the number of subject access fields and the total number of fields, and the ability to remove bad matches from view. Once the cataloguer has chosen a match they can save it to the original dataframe, or note that there were no good matches, or only a partial match.

Screenshot from the Streamlit web app, with an image of a Chinese catalogue card above a table containing MARC data for different WorldCat matches relating to the card.
Screenshot from the Streamlit Convert-a-Card web app, showing the card and the MARC table curators use to choose between matches. As the cataloguers are familiar with MARC, providing the raw fields is the easiest way to choose between matches.
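For a flavour of how little code a Streamlit prototype needs, here is a stripped-down sketch of the selection interface. The column names and CSV layout are invented for illustration; the real app works from the dataframe of cards and matches described above.

```python
# streamlit_app.py - run with: streamlit run streamlit_app.py
# Column names and the CSV are illustrative, not the project's real schema.
import pandas as pd
import streamlit as st

cards = pd.read_csv("drawer_with_matches.csv")  # one row per card/match pair

card_id = st.selectbox("Choose a card", cards["card_id"].unique())
matches = cards[cards["card_id"] == card_id]

st.image(matches["card_image_path"].iloc[0])  # the digitised card itself

# Optional filter, so curators can use domain knowledge to cut bad results.
langs = st.multiselect("Cataloguing language",
                       matches["cataloguing_language"].unique())
if langs:
    matches = matches[matches["cataloguing_language"].isin(langs)]

# Side-by-side MARC fields, one row per WorldCat match, for easy scanning.
st.dataframe(matches.drop(columns=["card_image_path"]))

chosen = st.radio("Best match", ["No good match", *matches["oclc_number"]])
if st.button("Save decision"):
    st.write(f"Saved: {chosen}")  # the real app writes back to the dataframe
```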

After some very positive initial feedback, we sat down with the Chinese curators and had them test the app out. That led to a fun, interactive, user-experience-focussed feedback session, and a whole host of GitHub issues on the repository for bugs and design suggestions. Behind-the-scenes discussions on where to host the app and data are ongoing and not straightforward, but this has been a remarkably easy product to prototype, and I’m optimistic it will provide a lightweight, gentle-learning-curve complement to full deriving software like Aleph (the Library’s main cataloguing system).

Next Steps

The project currently uses a range of technologies in Transkribus, the OCLC APIs, and Streamlit, and tying these together has in itself been a success. Going forward, we can look forward to extracting non-English text from the cards, and to the richer list of entities this would make available. Working with the OCLC APIs has been a learning curve, and they’re not working perfectly yet, but they represent a relatively accessible option compared to Z39.50. My hope for the Streamlit app is that it will be a useful tool beyond the project, wherever someone wants to use WorldCat to help derive records from minimal information. We still have challenges to overcome in terms of design, data storage, and hosting, but these discussions should have their own benefits in making future development easier. The goal for the automation part of the project is a smooth flow of data from Transkribus, through OCLC, and on to the curators, and while it’s not perfect, we’re definitely getting there.

14 September 2023

What's the future of crowdsourcing in cultural heritage?

The short version: crowdsourcing in cultural heritage is an exciting field, rich in opportunities for collaborative, interdisciplinary research and practice. It includes online volunteering, citizen science, citizen history, digital public participation, community co-production, and, increasingly, human computation and other systems that will change how participants relate to digital cultural heritage. New technologies like image labelling, text transcription and natural language processing, plus trends in organisations and societies at large mean constantly changing challenges (and potential). Our white paper is an attempt to make recommendations for funders, organisations and practitioners in the near and distant future. You can let us know what we got right, and what we could improve by commenting on Recommendations, Challenges and Opportunities for the Future of Crowdsourcing in Cultural Heritage: a White Paper.

The longer version: The Collective Wisdom project was funded by an AHRC networking grant to bring experts from the UK and the US together to document the state of the art in designing, managing and integrating crowdsourcing activities, and to look ahead to future challenges and unresolved issues that could be addressed by larger, longer-term collaboration on methods for digitally-enabled participation.

Our open access Collective Wisdom Handbook: perspectives on crowdsourcing in cultural heritage is the first outcome of the project, our expert workshops were a second.

Mia (me) and Sam Blickhan launched our White Paper for comment on PubPub at the Digital Humanities 2023 conference in Graz, Austria, in July this year, with Meghan Ferriter attending remotely. Our short paper abstract and DH2023 slides are online at Zenodo.

So - what's the future of crowdsourcing in cultural heritage? Head on over to Recommendations, Challenges and Opportunities for the Future of Crowdsourcing in Cultural Heritage: a White Paper and let us know what you think! You've got until the end of September…

You can also read our earlier post on 'community review' for a sense of the feedback we're after - in short, what resonates, what needs tweaking, what examples could we include?

To whet your appetite, here's a preview of our five recommendations (to find out why we make those recommendations, you'll have to read the White Paper):

  • Infrastructure: Platforms need sustainability. Funding should not always be tied to novelty, but should also support the maintenance, uptake and reuse of well-used tools.
  • Evidencing and Evaluation: Help create an evaluation toolkit for cultural heritage crowdsourcing projects; provide ‘recipes’ for measuring different kinds of success. Shift thinking about value from output/scale/product to include impact on participants' and community well-being.
  • Skills and Competencies: Help create a self-guided skills inventory assessment resource, tool, or worksheet to support skills assessment, and develop workshops to support their integrity and adoption.
  • Communities of Practice: Fund informal meetups, low-cost conferences, peer review panels, and other opportunities for creating and extending community. They should have an international reach, e.g. beyond the UK-US limitations of the initial Collective Wisdom project funding.
  • Incorporating Emergent Technologies and Methods: Fund educational resources and workshops to help the field understand opportunities, and anticipate the consequences of proposed technologies.

What have we missed? Which points do you want to boost? (For example, we discovered how many of our points apply to digital scholarship projects in general). You can '+1' on points that resonate with you, suggest changes to wording, ask questions, provide examples and references, or (constructively, please) challenge our arguments. Our funding only supported participants from the UK and US, so we're very keen to hear from folk from the rest of the world.

12 September 2023

Convert-a-Card: Past, Present and Future of Catalogue Cards Retroconversion

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected].

 

It has been more than eight years since the British Library launched its crowdsourcing platform, LibCrowds, in June 2015, with the aim of enhancing access to our collections. The first project series on LibCrowds was called Convert-a-Card, followed by the ever-so-popular In the Spotlight project. The aim of Convert-a-Card was to convert print card catalogues from the Library’s Asian and African Collections into electronic records, for inclusion in our online catalogue Explore.

A significant portion of the Library's extensive historical collections was acquired well before the advent of standard computer-based cataloguing. Consequently, even though the Library's online catalogue offers public access to tens of millions of records, numerous crucial research materials remain discoverable solely through searching the traditional physical card catalogues. The physical cards provide essential information for each book, such as title, author, physical description (dimensions, number of pages, images, etc.), subject and a “shelfmark” – a reference to the item’s location. This information still constitutes the basic set of data to produce e-records in libraries and archives.

Card Catalogue Cabinets in the British Library’s Asian & African Studies Reading Room © Jon Ellis

 

The initial focus of Convert-a-Card was the Library’s card catalogues for Chinese, Indonesian and Urdu books – you can read more about this here and here. Scanned catalogue cards were uploaded to Flickr (and later to our Research Repository), grouped by the physical drawer in which they were originally located. Several of these digitised drawers became projects on LibCrowds.

 

Crowdsourcing Retroconversion

Convert-a-Card on LibCrowds included two tasks:

  1. Task 1 – Search for a WorldCat record match: contributors were asked to look at a digitised card and search the OCLC WorldCat database based on some of the metadata elements printed on it (e.g. title, author, publication date), to see if a record for the book already existed in some form online. If found, they selected the matching record.
  2. Task 2 – Transcribe the shelfmark: if a match was found, contributors then transcribed the Library's unique shelfmark as printed on the card.

Online volunteers worked on Pinyin (Chinese), Indonesian and Urdu records, mainly between 2015 and 2019. Their valuable contributions resulted in lists of new records which were then ingested into the Library's Explore catalogue – making these items so much more discoverable to our users. For cards only partially matched with online records, curators and cataloguers had a special area on the LibCrowds platform through which they could address some of the discrepancies in partial matches and resolve them.

An example of an Urdu catalogue card

 

After much consideration, we decided to sunset LibCrowds. However, you can see a good snapshot of it thanks to the UK Web Archive (with thanks to Mia Ridge and Filipe Bento for archiving it), or access its GitHub pages – originally set up and maintained by LibCrowds creator Alex Mendes. We have mainly been using Zooniverse for crowdsourcing projects (see for example the Living with Machines projects), and you can see here some references to these and other crowdsourcing initiatives. Sunsetting LibCrowds provided us with the opportunity to rethink Convert-a-Card and consider alternative, innovative ways to automate or semi-automate the retroconversion of these valuable catalogue cards.

 

Text Recognition

As a first step, we were looking to automate the retrieval of text from the digitised cards using OCR/machine learning. As mentioned, this text includes shelfmark, title, author, place and date of publication, and other information. If extracted accurately enough, this text could be used for WorldCat lookup, as well as for enhancement of existing records. In most cases, the text was typewritten in English, often with additional information, or translation, handwritten in other languages. To start with, we decided to focus only on the typewritten English – with the aspiration to address other scripts and languages in the future.

Last year, we ran some comparative testing with ABBYY FineReader Server (the software generally used for in-house OCR) and Transkribus, to see how accurately they perform this task. We trialled a set of cards with two different versions of ABBYY, and three different models for typewritten Latin scripts in Transkribus (Model IDs 29418, 36202, and 25849). Assessment was done by visually comparing the original text with the OCRed text, examining mainly the key areas of text which are important for this initiative, i.e. the shelfmark, author’s name and book title. For the purpose of automatically recognising the typewritten English on the catalogue cards, Transkribus Model 29418 performed better than the others – and more accurately than ABBYY’s recognition.

An example of a Pinyin card in Transkribus, showing segmentation and transcription

 

Using that as a base model, we incrementally trained a bespoke model to recognise the text on our Pinyin cards. We’ve also normalised the resulting text, for example removing spaces in the shelfmark, or excluding unnecessary bits of data. This model currently extracts the English text only, with a Character Error Rate (CER) of 1.8%. With more training data, we plan on extending this model to other types of catalogue cards – but for now we are testing this workflow with our Chinese cards.

 

Entities Extraction

Extracting meaningful entities from the OCRed text is our next step, and there are different ways to do that. One such method – if already using Transkribus for text extraction – is training and applying a bespoke P2PaLA layout analysis model. Such a model could identify text regions, improve automated segmentation of the cards, and help retrieve specific regions for further tasks. Former colleague Giorgia Tolfo tested this with our Urdu cards, with good results. Trying to replicate this for our Chinese cards was not as successful – perhaps because they are less consistent in structure.

Another possible method is by using regular expressions in a programming language. Research Software Engineer (RSE) Harry Lloyd created a Jupyter notebook with Python code to do just that: take the PAGE XML files produced by Transkribus, parse the XML, and extract the title, author and shelfmark from the text. This works exceptionally well, and in the future we’ll expand entity recognition and extraction to other types of data appearing on the cards. But for now, this information suffices to query OCLC WorldCat and see if a matching record exists.
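As a rough illustration of the regular-expression step (the shelfmark pattern here is hypothetical; Harry’s notebook works from the PAGE XML itself), folding in the kind of normalisation mentioned earlier, such as removing spaces from the shelfmark:

```python
# Illustrative entity extraction from a card's OCR text lines.
import re

def extract_entities(lines: list[str]) -> dict:
    """Pull shelfmark, title and author from a card's text lines."""
    entities = {"shelfmark": None, "title": None, "author": None}
    if lines:
        # Hypothetical shelfmark shape, e.g. "15298.a.12"; spaces introduced
        # by the OCR are normalised away before matching.
        match = re.match(r"[A-Za-z0-9]+(\.[A-Za-z0-9]+)+",
                         lines[0].replace(" ", ""))
        if match:
            entities["shelfmark"] = match.group(0)
    # Vertical order on the cards: title then author follow the shelfmark.
    if len(lines) > 1:
        entities["title"] = lines[1].strip()
    if len(lines) > 2:
        entities["author"] = lines[2].strip()
    return entities
```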

One of the 26 drawers of Chinese (Pinyin) card catalogues © Jon Ellis

 

Matching Cards to WorldCat Records

Entities extracted from the catalogue cards can now be used to search and retrieve potentially matching records from the OCLC WorldCat database. Pulling out WorldCat records matched with our card records would help us create new records to go into our cataloguing system Aleph, as well as enrich existing Aleph records with additional information. Previously done by volunteers, we aim to automate this process as much as possible.

Querying WorldCat was initially done using the Z39.50 protocol – the same one originally used in LibCrowds. This is a client-server communications protocol designed to support the search and retrieval of information in a distributed network environment. Building on an excellent start by Victoria Morris and Giorgia Tolfo, who developed a prototype that uses PyZ3950 and pymarc to query WorldCat, Harry refined the code and tested it successfully for data search and retrieval. Moving forward, we are likely to use the OCLC API for this – which should be a lot more straightforward!

 

Curator/Cataloguer Disambiguation

Getting potential matches from WorldCat is brilliant, but we would like to have an easy way for curators and cataloguers to make the final decision on the ideal match – which WorldCat record would be the best one as a basis to create a new catalogue record on our system. For this purpose, Harry is currently working on a web application based on Streamlit – an open source Python library that enables the building and sharing of web apps. Staff members will be able to use this app by viewing suggested matches, and selecting the most suitable ones.

I’ll leave it up to Harry to tell you about this work – so stay tuned for a follow-up blog post very soon!

 

11 September 2023

Join the British Library's Universal Viewer Product Team

The British Library has been a leading contributor to IIIF, the International Image Interoperability Framework, and the Universal Viewer for many years. We're about to take the next step in this work - and you can join us! We are recruiting for a Product Owner, a Research Software Engineer and a Senior Test Engineer (deadline 03 January 2024). 

In this post, Dr Mia Ridge, product owner for the Universal Viewer (UV) 2015-18, and Dr Rossitza Atanassova, UV business owner 2019-2023, share some background information on how new posts advertised for a UV product team will help shape the future of the Viewer at the Library and contribute to international work on the UV, IIIF standards and activities.

A lavishly decorated page from a fourteenth-century manuscript, 'The Sherborne Missal', showing an illuminated capital with the Virgin Mary holding the baby Jesus, surrounded by the three Kings, with other illuminations in the margins and the text.
Detail from Add MS 74236 'The Sherborne Missal' displayed in the Universal Viewer

 The creation of a Universal Viewer product team is part of wider infrastructure changes at the British Library, and marks a shift from contributing via specific UV development projects to thinking of the Viewer as a product. We'll continue to work with the Open Collective while focusing on Library-specific issues to support other activities across the organisation. 

Staff across the Library have contributed to the development of the Universal Viewer, including curators, digitisation teams and technology staff. Staff engage through bespoke training delivered by the IIIF Consortium, participation at IIIF workshops and conferences, and experimentation with new tools, such as the digital storytelling tool Exhibit, to engage wide audiences. Other Library work with IIIF includes a collaboration with Zooniverse to enable items to be imported to Zooniverse via IIIF manifests, making crowdsourcing more accessible to organisations with IIIF items. Most recently, with funding from the Andrew W. Mellon Foundation, we updated the UV to play audio from the British Library sound collections.

Over half a million items from the British Library's collections are already available via the Universal Viewer, and that number grows all the time. Work on the UV has already let us retire around 35 other image viewers, significantly reducing maintenance overheads and creating a more consistent experience for our readers.

However, there's a lot more to do! User expectations change as people use other document and media viewers, whether that's other IIIF tools like Mirador or the latest commercial streaming video platforms. We also need to work on some technical debt, ensure accessibility standards are met, improve infrastructure, and consolidate services for the benefit of users. Future challenges include enhancing UV capabilities to display annotations, formats such as newspapers, and complex objects such as 3D.

A view of the Library's image viewer, showing an early nineteenth century Javanese palm-leaf manuscript inside its decorated wooden covers. To the left of the image there is a list with the thumbnails of the manuscript leaves and to the right the panel displays bibliographic information about the item.
British Library Universal Viewer displaying Add MS 12278

 If you'd like to work in collaboration with an international open source community on a viewer that will reach millions of users around the world, one of these jobs may be for you!

Product Owner (job reference R00000196)

Ensure the strategic vision, development, and success of the project. Your primary goal will be to understand user needs, prioritise features and enhancements, and collaborate with the development team and community to deliver a high-quality open source product. 

Research Software Engineer (job reference R00000197)

Help identify requirements, and design and implement online interfaces to showcase our collections, help answer research questions, and support application of novel methods across team activities.

Senior Test Engineer (job reference R00000198)

Help devise requirements, develop high quality test cases, and support application of novel methods across team activities

To apply please visit the British Library recruitment site. Applications close on 3 January 2024. Interview dates are listed in the job ads.

Please ensure you answer all application questions (CVs cannot be submitted). At the BL we can only shortlist with information that applicants provide in response to questions on the application. Any questions about the roles or the process? Drop us a line at [email protected].

03 August 2023

My AHRC-RLUK Professional Practice Fellowship: A year on

A year ago I started work on my RLUK Professional Practice Fellowship project to analyse computationally the descriptions in the Library’s incunabula printed catalogue. As the project comes to a close this week, I would like to update on the work from the last few months leading to the publication of the incunabula printed catalogue data, a featured collection on the British Library’s Research Repository. In a separate blogpost I will discuss the findings from the text analysis and next steps, as well as share my reflections on the fellowship experience.

Since Isaac’s blogpost about the automated detection of the catalogue entries in the OCR files, a lot of effort has gone into improving the code and outputting the descriptions in the format required for the text analysis and as open datasets. With the invaluable help of Harry Lloyd, who had joined the Library’s Digital Research team as a Research Software Engineer, we verified the results and identified new rules for detecting sub-entries signalled by ‘Another Copy’ rather than a main entry heading. We also reassembled and parsed the XML files, originally split into two sets per volume for the purpose of generating the OCR, so that the entries are listed in the order in which they appear in the printed volume. We prepared new text files containing all the entries from each volume, with each entry represented as a single line of text, which I could use for the corpus linguistics analysis with AntConc. In consultation with the curator, Karen Limper-Herz, and colleagues in Collection Metadata, we agreed how best to store the data for evaluation and in preparation for updating the Library’s online catalogue.

Two women looking at the poster illustrating the text analysis with the incunabula catalogue data
Poster session at Digital Humanities Conference 2023

Whilst all this work was taking place, I started the computational analysis of the English text from the descriptions. The reason for using these partial descriptions was to separate what was merely transcribed from the incunabula from the language used by the cataloguer in their own ‘voice’. I have recorded my initial observations in the poster I presented at the Digital Humanities Conference 2023. Discussing my fellowship project with the conference attendees was extremely rewarding; there was much interest in the way I had used Transkribus to derive the OCR data, some questions about how the project methodology applies to other data, and agreement on the need to contextualise collections descriptions and reflect on any bias in the transmission of knowledge. In the poster I also highlight the importance of the cross-disciplinary collaboration required for this type of work, which resonated well with the conference theme of Collaboration as Opportunity.

I have started sharing the knowledge gained from the project with members of the GLAM community. At the British Library, Harry, Karen and I ran an informal ‘Hack & Yack’ training session showcasing the project aims and methodology through the use of Jupyter notebooks. I also enjoyed the opportunity to discuss my research at a recent Research Libraries UK Digital Scholarship Network workshop and look forward to further conversations on this topic with colleagues in the wider GLAM community.

We intend to continue to enrich the datasets to enable better access to the collection, the development of new resources for incunabula research and digital scholarship projects. I would like to end by adding my thanks to Graham Jevon, for assisting with the timely publication of the project datasets, and above all to James, Karen and Harry for supporting me throughout this project.

This blogpost is by Dr Rossitza Atanassova, Digital Curator, British Library. She is on Twitter @RossiAtanassova and Mastodon @[email protected]

 

02 August 2023

Writing tools for Interactive Fiction - an updated list

In the spring of 2020, during the first UK lockdown, I wrote an article for the British Library English and Drama blog, titled ‘Writing tools for Interactive Fiction’. Quite a few things have changed since then and as the Library launched its first exhibition on Digital Storytelling this June, it seemed like the perfect time to update this list with a few additions.

Interactive fiction (IF), or interactive narrative/narration, is defined as “software simulating environments in which players use text commands to control characters and influence the environment.”

The British Library has been collecting examples of UK interactive fiction as part of the Emerging Formats Project, which is a collaborative effort from all six UK Legal Deposit Libraries to look at the collection management requirements of complex digital publications. Lynda Clark, the British Library Innovation Fellow for Interactive Fiction, built the Interactive Narratives collection on the UK Web Archive (UKWA) during her placement. Because of Legal Deposit Regulations, most of the items in the Interactive Narratives collection can only be accessed on Library premises – which also extends to other collections in the UK Web Archive, such as the New Media Writing Prize collection.

Lynda also conducted analysis on genres, interaction patterns and tools used to build these narratives.

 

Many of these tools are free to use and don’t require any previous knowledge of programming languages. This is not meant to be an exhaustive list, but it might be a useful overview of some of the tools currently available, if you’d like to start experimenting with writing your own interactive narrative. We are also very excited to be able to offer a week-long Interactive Fiction Summer School this August at the Library, running alongside the Digital Storytelling exhibition.

For easier navigation, these are the tools included in this article:

  • Twine
  • ink/inky & inklewriter
  • Bitsy
  • Inform 7
  • ChoiceScript
  • Downpour

 

Twine

Twine is an open-source tool for writing text-based, non-linear narratives. Created by Chris Klimas in 2009, Twine is perfect for writing Choose Your Own Adventure-like stories without knowing how to code. The output is an HTML file, which facilitates publishing and distribution, as it can be run on any computer with an Internet connection and a web browser. If you have any knowledge of CSS or JavaScript it’s possible to add extra features and specific designs to your Twine story, but the standard Twine structure only requires you to type text and put brackets around the phrases that will become links in the story (linking to another passage or branching into different directions), as in the sketch below. There is an online version or a downloadable version that runs on Windows, MacOS and Linux. Twine has multiple story formats, with different features and ways to write the interactive bits of your story. The Twine Reference is a good place to start, but there is also a Twine Cookbook (containing ‘recipes’, instructions and examples to do a variety of things).
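For instance, a passage in Twine’s default Harlowe story format might look something like this small sketch (the passage names are invented):

```
You stand in the entrance hall of the library.

[[Enter the reading room]]
[[Browse the catalogue->Catalogue]]
```

The first link takes its passage name from the displayed text; the arrow form lets the displayed text and the passage name differ.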

Example of text from Cat Simulator 3000. 'You dream of mice. You dream of trout. You dream of balls of yarn. You dream of world domination. You dream of opening your own bank account. You dream of the nature of sentience.' Followed by the prompt 'Wake up'.
Some quality cat dreams.
(from Emma Winston’s Cat Simulator 3000)

 

As the most used tool in the UKWA collection, there are many examples of IF written in Twine, from cat and teatime simulators (Emma Winston’s Cat Simulator 3000 and Damon L. Wakes’ Lovely Pleasant Teatime Simulator), to stories that include a mix of video, images and audio (Chris Godber’s Glitch), and horror games made for Gothic Novel Jam using the British Library’s Flickr collection of images (Freya Campbell’s The Tower – NB some content warnings apply). Lynda Clark also authored an original story as a conclusion to her placement: The Memory Archivist incorporates many of the themes that emerged during her research and won the BL Labs Artistic Award 2019.

 

ink/inky & inklewriter

Cambridge-based video game studio inkle is behind another IF tool – or two. Ink is the scripting language used to author many of inkle’s videogames – the idea behind it is to mark up “pure-text with flow in order to produce interactive scripts”. It doesn’t require any programming knowledge and the resulting scripts are relatively easy to read. Inky is the editor to write ink scripts in – it’s free to download and lets you test your narrative as you write it. Once you’re happy with your story, you can export it for the web, as well as a JSON file. There’s a quick tutorial to walk you through the basics, as well as a full manual on how to write in ink. ink was also used to write 80 Days, another work collected by the British Library as part of the emerging formats project and currently exhibited as part of the Digital Storytelling exhibition.

A side by side showing the back end and front end of what writing in ink looks like.
A page from 80 Days, written using ink. To read in full detail, please click on the image.

 

inklewriter is an open-source, ready-to-use, browser-based IF “sketch-pad”. It is meant to be used to sketch out narratives more than to author fully-developed stories. There is no download required and the fact that it is a simple and straightforward tool to experiment with IF makes it a good fit for educators. Tutorials are included within the platform itself so that you can learn while you write.

This year’s Interactive Fiction Summer School at the British Library will teach attendees how to write interactive fiction using ink, with a focus on dialogue and writing with the player in mind. Dr. Florencia Minuzzi will lead the 5-day course, together with a number of guest speakers whose work is featured in the Digital Storytelling exhibition – including Corey Brotherson, Destina Connor, Dan Hett and Meghna Jayanth. The school runs from Monday 21st to Friday 25th August – no previous coding experience necessary!

A screenshot from 80 Days Ⓒ inkle. Two men facing each other with the prompt 'begin conversation'.
A screenshot from 80 Days Ⓒ inkle.

 

Bitsy

Bitsy is a browser-based editor for mini games developed by Adam Le Doux in 2016. It operates within clear constraints (8x8 pixel tiles, a 3-colour palette, etc.), which is actually one of the reasons why it is so beloved. You can draw and animate your own characters within your pixel grid, write the dialogue and define how your avatar (your playable character) will interact with the surrounding scenery and with other non-playable characters. Again, no programming knowledge is necessary. Bitsy is especially good for short narratives and vignette games. After completing your game, you can download it as an HTML file and then share it however you prefer. There is Bitsy Docs, as well as some comprehensive tutorials and even a one-page pamphlet covering the basics.

GIF animation from the Bitsy game 'British Library Simulator'
Shout-out to the Emerging Formats Project
(from Giulia Carla Rossi’s The British Library Simulator)

 

To play (and read) a Bitsy work you use your keyboard to move the avatar around and interact with the ‘sprites’ (interactive items, characters and scenery – usually recognisable as sporting a different colour from the non-interactive background). You can wander around a Zen garden reflecting on your impending wedding (Ben Bruce’s Zen Garden, Portland, The Day Before My Wedding), light the village fires to welcome the midwinter spirits (Ash Green’s Midwinter Spirits), experience a love story through mixtapes (David Mowatt’s She Made Me A Mix Tape), or if you’re still craving a nice cuppa you can review some imaginary tea shops (Ben Bruce’s Five Great Places to Get a Nice Cup of Tea When You Are Asleep). You can even visit a pixelated version of the British Library and discover more about our contemporary and digital collections with The British Library Simulator.

 

Inform 7

While Twine allows you to write hypertext narratives (where readers can progress through the story by clicking on a link), Inform 7 lets you write parser-based interactive fiction. Parser-based IF requires the reader to type commands (sometimes full sentences) in order to interact with the story.

A how to guide showing what text options are available for playing text based explorer games in Inform. Helpful tips like 'Try the commands that make sense! Doors are for opening; buttons are for pushing; pie is for eating!'
How to Play Interactive Fiction (An entire strategy guide on a single postcard)
Written by Andrew Plotkin – design by Lea Albaugh. This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License

 

Inform 7 is a free-to-use, open-source (as of April 2022) tool for writing interactive fiction. Originally created as Inform by Graham Nelson in 1993, the current Inform 7 was released in 2006 and uses natural language (based on English) to describe situations and interactions. The learning curve is a bit steeper than with Twine, but the natural language approach allows users with no programming experience to write code in a simplified language that reads like English text. Inform 7 also has a Recipe Book and a series of well-documented tutorials. Inform runs on Windows, MacOS and Linux and lets you output your game as HTML files.

While the current version of Inform is Inform 7, narratives using previous versions of the system are still available – Emily Short’s Galatea is always a good place to start. You could also explore mysterious ruins with your romantic interest (C.E.J. Pacian’s Love, Hate and the Mysterious Ocean Tower), play a gentleman thief (J.J. Guest’s Alias, the Magpie) or make more tea (Joey Jones’ Strained Tea).

 

ChoiceScript

ChoiceScript is a JavaScript-based scripting language developed by Adam Strong-Morse and Dan Fabulich of Choice of Games. It can be used to write choice-based interactive narratives, in which the reader has to select among multiple choices to determine how the story will unfold. The simplicity of the language makes it possible to create Choose-Your-Own-Adventure-style stories without any prior coding knowledge. The ChoiceScript source is available to download for free on the Choice of Games website (it also requires writers to have Node.js installed on their machine). Once your story is complete, you can publish it for free online. Otherwise, Choice of Games offer the possibility of publishing your work with them (they publish to various platforms, including iOS, Android, Kindle and Steam) and earning royalties from it. There is a tutorial that covers the basics, including a Glossary of ChoiceScript terms. The Choice of Games blog also includes some articles with tips on how to design and write interactive stories, especially long ones.

Genres of works built using ChoiceScript are again quite varied – from sci-fi stories exploring the relationships between writers and readers (Lynda Clark’s Writers Are Not Strangers), to crime/romantic dramas (Toni Owen-Blue’s Double/Cross) and fantasy adventures (Thom Baylay’s Evertree Inn).

 

Downpour

Downpour is a game-making tool for phones, currently in development. Created by v buckenham, it will allow users to make interactive games in minutes, using only their phone’s camera and linking images together. There is no expectation of previous programming knowledge, and by removing the need to access a computer, Downpour promises to be a very approachable tool. Release is currently planned for 2023 on iOS and Android – if you want to be notified when it launches you can sign up here.

Downpour banner (purple writing over pink background)
Downpour banner.

 

More resources

As I mentioned before, this is in no way a comprehensive list – there are a lot of other tools and platforms for writing IF, both mainstream and slightly more obscure (Ren’Py, Quest, StoryNexus, Raconteur, Genarrator, just to mention a few). Try different tools, find the one that works best for you, or use a mix of them if you prefer! Experiment as much as you like.

If you’d like to discover even more tools to build your interactive project, Everest Pipkin has an excellent list of Open source, experimental, and tiny tools.

Emily Short’s Interactive Storytelling blog also offers a round-up of very interesting links about interactive narratives.

If you want to be inspired by more independent games and interactive stories, Indiepocalypse offers a curated selection of video and/or physical games in the form of a monthly anthology.

To conclude, I’ll leave you with a quote by Anna Anthropy from her book Rise of the Videogame Zinesters:

“Every game that you and I make right now [...] makes the boundaries of our art form (and it is ours) larger. Every new game is a voice in the darkness. And new voices are important in an art form that has been dominated for so long by a single perspective. [...]

There’s nothing to stop us from making our voices heard now. And there will be plenty of voices. Among those voices, there will be plenty of mediocrity, and plenty of games that have no meaning to anyone outside the author and maybe her friends. But [...] imagine what we’ll gain: real diversity, a plethora of voices and experiences, and a new avenue for human beings to tell their stories and connect with other human beings.”

This post is by Giulia Carla Rossi, Curator for Digital Publications

02 May 2023

Detecting Catalogue Entries in Printed Catalogue Data

This is a guest blog post by Isaac Dunford, MEng Computer Science student at the University of Southampton. Isaac reports on his Digital Humanities internship project supervised by Dr James Baker.

Introduction

The purpose of this project has been to investigate and implement different methods for detecting catalogue entries within printed catalogues. For whilst printed catalogues are easy enough to digitise and convert into machine-readable data, dividing that data by catalogue entry requires converting the visual signifiers of divisions between entries - gaps in the printed page, large or upper-case headers, catalogue references - into machine-readable information. The first part of this project involved experimenting with XML-formatted data derived from the 13-volume Catalogue of books printed in the 15th century now at the British Museum (described by Rossitza Atanassova in a post announcing her AHRC-RLUK Professional Practice Fellowship project), trying to find the best ways to detect individual entries and reassemble them as data (given that the text for a single catalogue entry may be spread across multiple pages of a printed catalogue). The next part of the project involved building a complete system based on this approach to take the large volume of XML files for a volume and output all of the catalogue entries in a series of desired formats. This post describes our initial experiments with that data, the approach we settled on, and key features of our approach that you should be able to reapply to your catalogue data. All data and code can be found on the project GitHub repo.

Experimentation

The catalogue data was exported from Transkribus in two different formats: an ALTO XML schema and a PAGE XML schema. The ALTO layout encodes positional information about each element of the text (that is, where each word occurs relative to the top left corner of the page), which makes spatial analysis - such as looking for gaps between lines - possible. However, it also creates data files that are heavily encoded, meaning that it can be difficult to extract the text elements from them. The PAGE schema, by contrast, makes it easier to access the text elements in the files.

 

An image of a digitised page from volume 8 of the Incunabula Catalogue and the corresponding Optical Character Recognition file encoded in the PAGE XML Schema
Raw PAGE XML for a page from volume 8 of the Incunabula Catalogue

 

An image of a digitised page from volume 8 of the Incunabula Catalogue and the corresponding Optical Character Recognition file encoded in the ALTO XML Schema
Raw ALTO XML for a page from volume 8 of the Incunabula Catalogue

 

Spacing and positioning

One of the first approaches tried in this project was to use size and spacing to find entries. The intuition behind this is that there is generally a larger amount of white space around the headings in the text than there is between regular lines. And in the ALTO schema, there is information about the size of the text within each line as well as about the coordinates of the line within the page.

However, we found that using the size of the text line and/or the positioning of the lines was not effective for three reasons. First, blank space between catalogue entries inconsistently contributed to the size of some lines. Second, whenever there were tables within the text, there would be large gaps in spacing compared to the normal text, that in turn caused those tables to be read as divisions between catalogue entries. And third, even though entry headings were visually further to the left on the page than regular text, and therefore should have had the smallest x coordinates, the materiality of the printed page was inconsistently represented as digital data, and so presented regular lines with small x coordinates that could be read - using this approach - as headings.

Final Approach

Entry Detection

Our chosen approach uses the data in the PAGE XML schema, and is bespoke to the data for the Catalogue of books printed in the 15th century now at the British Museum as produced by Transkribus (and indeed to the version of Transkribus: having built our code around some initial exports, running it over the later volumes - which had been digitised last - threw an error due to some slight changes to the exported XML schema).

The code takes the XML input and finds entries using a content-based approach that looks for features at the start and end of each catalogue entry. After experimenting with different approaches, the most consistent way to detect the catalogue entries was to:

  1. Find the “reference number” (e.g. IB. 39624) which is always present at the end of an entry.
  2. Find a date that is always present after an entry heading.

This gave us the ability to contextually infer the presence of a split between two catalogue entries, the main limitation of which is the quality of the Optical Character Recognition (OCR) at the point where the references and dates occur in the printed volumes.
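In code, the two signals reduce to a pair of regular expressions and a single pass over the lines. The patterns below are illustrative approximations of the printed catalogue's conventions rather than the project's exact expressions (the full code is on the project GitHub repo):

```python
# Condensed sketch of entry detection from the two signals.
import re

# End-of-entry signal: a reference number such as "IB. 39624".
REFERENCE = re.compile(r"\bI[ABC]\.\s*\d{4,6}\b")
# Start-of-entry signal: a fifteenth-century date soon after the heading.
DATE = re.compile(r"\b1[45]\d{2}\b")

def split_entries(lines: list[str]) -> list[list[str]]:
    """Close an entry at each reference number; the date check guards
    against splits where the OCR has produced a spurious reference."""
    entries, current = [], []
    for i, line in enumerate(lines):
        current.append(line)
        following = " ".join(lines[i + 1 : i + 4])
        if REFERENCE.search(line) and (not following or DATE.search(following)):
            entries.append(current)
            current = []
    if current:  # trailing lines, e.g. an entry continuing on the next page
        entries.append(current)
    return entries
```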

 

An image of a digitised page with a catalogue entry and the corresponding text output in XML format
XML of a detected entry

 

Language Detection

The reason for dividing catalogue entries in this way was to facilitate analysis of the catalogue data, specifically analysis that sought to define the linguistic character of descriptions in the Catalogue of books printed in the 15th century now at the British Museum and how those descriptions changed and evolved across the thirteen volumes. As segments of each catalogue entry contain text transcribed from the incunabula that was not written by a cataloguer (and is therefore not part of their cataloguing ‘voice’), and as those transcribed sections are in French, Dutch, Old English, and other languages that a machine could detect as not being modern English, one of the extensions we implemented to further facilitate research use of the final data was to label sections of each catalogue entry by language. This was achieved using a Python library for language detection and then - for a particular output type - replacing non-English sections of text with a placeholder (e.g. NON-ENGLISH SECTION). And whilst the language detection model does not detect Old English, and as a result varies between assigning those sections labels for different languages, it was still able to break blocks of text in each catalogue entry into English and non-English sections.
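The labelling step itself is short. The post does not name the library used; langdetect is one common choice and is enough to show the shape of the logic, including the placeholder substitution for the English-only output:

```python
# Label text sections by language; swap non-English ones for a placeholder.
from langdetect import detect

def english_only(sections: list[str]) -> str:
    out = []
    for section in sections:
        try:
            label = detect(section)
        except Exception:  # very short or noisy segments can fail to classify
            label = "unknown"
        out.append(section if label == "en" else "NON-ENGLISH SECTION")
    return "\n".join(out)
```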

 

Text files for catalogue entry number IB39624 showing the full text and the detected English-only sections.
Text outputs of the full and English-only sections of the catalogue entry

 

Poorly Scanned Pages

Another extension for this system was to use the input data to try and determine whether a page had been poorly scanned: for example, where the lines in the XML input read from one column straight into another as a single line (rather than the XML reading order following the visual signifiers of column breaks). This system detects poorly scanned pages by looking at the lengths of all lines in the PAGE XML schema, establishing which lines deviate substantially from the mean line length, and, if sufficient outliers are found, marking the page as poorly scanned.
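That test comes down to a few lines of Python (the thresholds below are illustrative; the project code may choose them differently):

```python
# Flag a page as poorly scanned when too many line lengths are outliers.
from statistics import mean, stdev

def poorly_scanned(line_lengths: list[int],
                   deviations: float = 2.0, max_outliers: int = 5) -> bool:
    if len(line_lengths) < 2:
        return False  # not enough lines to judge
    mu, sigma = mean(line_lengths), stdev(line_lengths)
    outliers = [n for n in line_lengths if abs(n - mu) > deviations * sigma]
    return len(outliers) > max_outliers
```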

Key Features

The key part of this system that can be taken and applied to a different problem is the method for detecting entries. We expect that the fundamental method of looking for marks in the page content to identify the start and end of catalogue entries in the XML files would be applicable to other data derived from printed catalogues. The only parts of the algorithm that would need changing for a new system are the regular expressions used to find the start and end of the catalogue entry headings. And as long as the XML input comes in the same schema, the code should be able to consistently divide up the volumes into the individual catalogue entries.
