Digital scholarship blog

Introduction

Tracking exciting developments at the intersection of libraries, scholarship and technology.

28 August 2024

Open and Engaged 2024: Empowering Communities to Thrive in Open Scholarship

The British Library is delighted to host its annual Open and Engaged Conference on Monday 21 October, in person and online, as part of International Open Access Week. The Conference is supported by the Arts and Humanities Research Council (AHRC) and Research Libraries UK (RLUK).

Save the Date flyer for Open & Engaged 2024 on 21 October, in person and online, with logos for sponsors UKRI, Arts and Humanities Research Council and RLUK

Open and Engaged 2024: Empowering Communities to Thrive in Open Scholarship will centre on leveraging the power of communities across open scholarship, open infrastructure, emerging technologies, collections as data, equity and integrity, skills development and sustainable models, to elevate research of all kinds for the public good. We take a cross-sectoral approach to the conference programme, unifying around shared values for openness, by reflecting on practices within research libraries in both the higher education and GLAM (Galleries, Libraries, Archives, Museums) sectors, as well as national and public libraries.

Open and Engaged 2024 is supported by the Arts and Humanities Research Council (AHRC) and Research Libraries UK (RLUK). Everyone interested in the conference topics is welcome to join us on Monday, 21 October! 

This will be a hybrid event taking place at the British Library’s Knowledge Centre in St. Pancras, London, and streamed online for those unable to attend in person.

The event will be recorded and recordings made available in the British Library’s Research Repository.

Registration

Registration is closed for in-person and online attendance. Registrants have been contacted with details. Any questions, please contact [email protected].  

Programme 

Slides and recordings of the talks are available as a collection in the British Library’s Research Repository.

09:30  Registration

10:00  Welcome remarks

10:10  Opening keynote panel: Cross disciplinary approach to open scholarship

Chaired by Sally Chambers, Head of Research Infrastructure Services at the British Library.

10:50    Empowering communities through equity, inclusivity, and ethics

Chaired by Beth Montague-Hellen, Head of Library and Information Services at the Francis Crick Institute.

This session addresses the role of communities in governance, explores the ethical implications of AI for citizens, highlights the value of public engagement, and discusses the central importance of equity, inclusivity, and integrity in scholarly communications.

11:40  Break

12:10    Deepening partnership in skills development through shared values

Chaired by Kirsty Wallis, Head of Research Liaison at UCL.

This session explores initiatives that foster skills development in libraries with a cross-sectoral approach and dives into the role of libraries in supporting communities to build resilience.

13:00  Lunch

13:45   Open repositories for research of all kinds

This session addresses the role of infrastructure in carrying out open scholarship practices, explores practice as research in relation to diverse outputs and infrastructure, and discusses institutional resilience in digital strategies.

Chaired by William J Nixon, Deputy Executive Director at Research Libraries UK (RLUK).

14:45  Break

15:15   Enabling collections as data: from policy to practice  

Chaired by Jez Cope, Data Services Lead at the British Library.

This session dives into digital collections as data by exploring policies and practices across different sectors, public-private partnerships in making collections publicly available, and the dynamics of preservation-versus-access approaches in national libraries, whilst underlining the public good.

16:15   Closing keynote: Stories Change Lives

Chaired by Liz White, Director of Library Partnerships at the British Library.

16:45 Closing remarks

17:00 Networking session

19:00  End

The hashtag for the event is #OpenEngaged on the social media platform of your choice. If you have any questions, please contact us at [email protected].

26 July 2024

Charting the European D-SEA Conference at the Stabi

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 

 

Earlier this month, I had the pleasure of attending the “Charting the European D-SEA: Digital Scholarship in East Asian Studies” conference held at the Berlin State Library (Staatsbibliothek zu Berlin), also known as the Stabi. The conference, held on 11-12 July 2024, aimed to fill a gap in the European digital scholarship landscape by creating a research community and a space for knowledge exchange on digital scholarship issues across humanities disciplines concerned with East Asian regions and languages.

The event was a dynamic fusion of workshops, presentations and panel discussions. Over three days of workshops (8-10 July), participants were introduced to key digital methods, resources, and databases. These sessions aimed to transmit practical knowledge in digital scholarship, focusing on East Asian collections and data. The subsequent two days were dedicated to the conference proper, featuring a broad range of presentations on various themes.

The reading room in the Berlin State Library, Haus Potsdamer Straße

 

DH and East Asian Studies in Europe and Beyond

Conference organisers Jing Hu and Brent Ho from the Stabi, and Shih-Pei Chen and Dagmar Schäfer from the Max Planck Institute for the History of Science (MPIWG), set the stage for an enriching exchange of ideas and knowledge. The diversity of topics covered was impressive – from the more established digital resources and research tools to AI applications in historical research – the sessions provided a comprehensive overview of the current state and future directions of the field.

There were so many excellent presentations – and I often wished I could clone myself to attend parallel sessions! As expected, there was much focus on working with AI – machine learning and generative AI – and their potential in historical and humanities research. AI technologies offer powerful tools for data analysis and pattern recognition, and can significantly enhance research capabilities.

Damian Mandzunowski (Heidelberg University) talked about using AI to extract and analyse information from Chinese Comics
 
Shaojian Li (Renmin University of China) looked into automating the classification of pattern images using deep learning

One notable session was "Reflections on Deep Learning & Generative AI," chaired by Brent Ho and discussed by Clemens Neudecker. The roundtable highlighted the evolving role of AI in humanities research. Calvin Yeh from MPIWG discussed AI's potential to augment, rather than just automate, research processes. He shared intriguing examples of using AI tools like ChatGPT to simulate group discussions and suggest research actions. Hongsu Wang from Harvard University presented on the use of Large Language Models and traditional Transformers in the China Biographical Database (CBDB) project, demonstrating the effectiveness of these models in data extraction and standardisation.

Calvin Yeh (MPIWG) discussed AI for “Augmentation, not only Automation” and experimented with ChatGPT discussing a research approach, designing a research process and simulating a group discussion
 
Hongsu Wang (Harvard University) talked about extracting and standardising data using LLMs and traditional Transformers in the CBDB project – here showcasing Jeffrey Tharsen’s research to create a network graph using a prompt in ChatGPT

 

Exploring the Stabi

Our group tour of the Stabi was a personal highlight for me. This historic library, part of the Prussian Cultural Heritage Foundation, is renowned for its extensive collections and commitment to making digitised materials publicly accessible. The library operates from two major public sites – Haus Unter den Linden and Haus Potsdamer Straße. Tours of both locations were available, but I chose to explore the more recent building, designed by Hans Scharoun and located in the Kulturforum on Potsdamer Straße in West Berlin – the history and architecture of which are fascinating.

A group of the conference delegates enjoying the tour of SBB’s Haus Potsdamer Straße

I really enjoyed catching up with old colleagues and making new connections with fellow scholars passionate about East Asian digital humanities!

To conclude

In conclusion, the Charting the European D-SEA conference at the Stabi was an enriching experience. It provided valuable insights into the integration of digital methods in East Asian studies and into recent advancements in digital scholarship, and allowed me to connect with a global community of scholars. The combination of traditional and more recent digital practices, coupled with the forward-looking discussions on AI and deep learning, made this conference a significant milestone in the field. I look forward to seeing how these conversations evolve and contribute to the broader landscape of digital humanities.

 

16 July 2024

'AI and the Digital Humanities' session at CILIP's 2024 conference

Digital Curator Mia Ridge writes... I was invited to chair a session on 'AI and the digital humanities' at CILIP's 2024 conference with Ciaran Talbot (Associate Director AI & Ideas Adoption, University of Manchester Library) and Glen Robson (IIIF Technical Co-ordinator, International Image Interoperability Framework Consortium). Here's a quick post with some reflections on themes in the presentations and the audience Q&A.

A woman stands on stage in front of slides; two men sit at a panel table on the stage
CILIP's photo of our session

I presented a brief overview of some of the natural language processing (NLP) and computer vision methods in the Living with Machines project. That project and other work at the British Library showed that researchers can create innovative Digital Humanities methods and improve collections data with current AI / machine learning tools. But is there a gap between 'utilities' and 'cutting edge research' that AI can't (yet) fill for libraries?

AI (machine learning) makes library, museum and archive collections more accessible in two key ways. Firstly, more and better metadata and links across collections can make individual items more discoverable (e.g. identifying places mentioned in text; visual search to find similar images). Secondly, thinking of 'collections as data' and sharing datasets for research lets others find insights and inspiration.

Some of the value in AI might lie in the marketing power of the term - we've had the technical ability to view collections across silos for some time, but the institutional will might have lagged behind. Identifying the real gaps that AI can meet is hard, cross-institutional work - you need to understand what time-consuming work could be automated with ML/AI. Ciaran's talk gave a sense of the collaborative, co-creative effort required to understand actual processes and real problems and devise ways to optimise them. An 'anarchy' phase might be part of that process, and a roadmap can help set a shared vision as you work out where AI tools will actually save time or just create more but different work.

Glen gave some great examples of how IIIF can help organisations and researchers, and how AI tools might work with IIIF collections. He highlighted the intellectual property questions raised when 'open access' collections are mined for AI models, and pointed people to HaveIBeenTrained to see if their collections have been scraped.

I was struck by the delicate balance between maintaining trust and secure provenance while also supporting creative and playful uses of AI in collections. Labelling generative AI images and texts is vital. Detecting subtle errors and structural biases requires effort and expertise. As a sector, we need to keep learning, talking and collaborating to understand what generative AI means for users and collection holders.

The first question from the audience was about the environmental impact of AI. I was able to say that our work-in-progress principles for AI at the British Library ask people to consider the environmental impact of AI (not just its carbon footprint, but also water usage and rare minerals mining) in balance with other questions of public value for proposed experiments and projects. Ciaran said that Manchester have appointed a sustainability manager, which is probably something we'll see more of in future. There were also questions about what employers are looking for in library and informatics students; about where to go for information and inspiration about AI in libraries (AI4LAM is a good start); and about how to update people's perceptions of libraries and the skills of library professionals.

Thanks to everyone at CILIP for all the work they put into the conference, and the fantastic AV team working in the keynote room at the Birmingham Hilton Metropole.

 

08 July 2024

Embracing Sustainability at the British Library: Insights from the Digital Humanities Climate Coalition Workshop

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 

 

Sustainability has become a core value at the British Library, driven by our staff-led Sustainability Group and bolstered by the addition of a dedicated Sustainability Manager nearly a year ago. As part of our ongoing commitment to environmental responsibility, we have been exploring various initiatives to reduce our environmental footprint. One such initiative is our engagement with the Digital Humanities Climate Coalition (DHCC), a collaborative and cross-institutional effort focused on understanding and minimising the environmental impact of digital humanities research.

Screenshot from the Digital Humanities Climate Coalition website
 

Discovering the DHCC and its toolkit

The Digital Humanities Climate Coalition (DHCC) has been on my radar for some time, primarily due to their exemplary work in promoting sustainable digital practices. The DHCC toolkit, in particular, has proven to be an invaluable resource. Designed to help individuals and organisations make more environmentally conscious digital choices, the toolkit offers practical guidance for building sustainable digital humanities projects. It encourages researchers to adopt climate-responsible practices and supports those who may lack the practical knowledge to devise greener initiatives.

The toolkit is comprehensive, providing tips on the planning and management of research infrastructure and data. It aims to empower researchers to make climate-friendly technological decisions, thereby fostering a culture of sustainability within the digital humanities community.

My primary goal in leveraging the DHCC toolkit is to raise awareness about the environmental impact of digital work and technology use. By doing so, I hope to empower Library staff to make informed decisions that contribute to our sustainability goals. The toolkit’s insights are crucial for anyone involved in digital research, offering both strategic guidance and practical tips for minimising ecological footprints.

Planning a workshop at the British Library

With the support of our Research Development team, I organised a one-day workshop at the British Library, inviting Professor James Baker, Director of Digital Humanities at the University of Southampton and a member of the DHCC, to lead the event. The workshop was designed to introduce the DHCC toolkit and provide guidance on implementing best practices in research projects. The in-person, full-day workshop was held on 5 February 2024.

Workshop highlights

The workshop featured four key sessions:

Session 1: Introductions and Framing: We began with an overview of the DHCC and its work within the GLAM sector, followed by an introduction to sustainability at the British Library, the roles that libraries play in reducing carbon footprint and awareness raising, the Green Libraries Campaign (of which the British Library was a founding partner), and perspectives on digital humanities and the use of computational methods.

CILIP’s Green Libraries Campaign banner

Session 2: Toolkit Overview: Prof Baker introduced the DHCC toolkit, highlighting its main components and practical applications, focusing on grant writing (e.g. recommendations on designing research projects, including Data Management Plans), and working practices (guidance on reducing energy consumption in day-to-day working life, e.g. communication and shared working, travel, and publishing and preserving data). The session included responses from relevant Library teams, on topics such as research project design, data management and our shared research repository.

DHCC publication cover: A Researcher Guide to Writing a Climate Justice Oriented Data Management Plan
DHCC Information, Measurement and Practice Action Group. (2022). A Researcher Guide to Writing a Climate Justice Oriented Data Management Plan (v0.6). Zenodo. https://doi.org/10.5281/zenodo.6451499

Session 3: Advocacy and Influencing: This session focused on strategies for advocating for sustainable practices within one's organisation and influencing others to adopt these practices. We covered the Library’s staff-led Sustainability Group and its activities, after which participants were asked to consider the actions that could be taken at the Library and beyond, taking into account the types of people that might be influenced (senior leaders, colleagues, peers in wider networks/community).

Session 4: Feedback and Next Steps: Participants discussed their takeaways from the workshop and identified actionable steps they could implement in their work. This session included conversations on ways to translate workshop learnings into concrete next steps, and generated light ‘commitments’ for the next week, month and year. One fun way to set oneself a yearly reminder is to schedule an eco-friendly e-card to send to yourself in a year!

Post-workshop follow-up

Three months after the workshop had taken place, we conducted a follow-up survey to gauge its impact. The survey included a mix of agree/disagree statements (see chart below) and optional long-form questions to capture more detailed feedback. While we had only a few responses, survey results were constructive and positive. Participants appreciated the practical insights and reported better awareness of sustainable practices in their digital work.

Participants’ agree/disagree ratings for a series of statements about the DHCC workshop’s impact

Judging from responses to the set of statements above, at least several participants have embedded toolkit recommendations, made specific changes in their work, shared knowledge and influenced their wider networks. We got additional details on these actions in responses to the open-ended questions that followed.

What did staff members say?

Here are some comments made in relation to making changes and embedding the DHCC toolkit’s recommendations:

“Changes made to working policy and practice to order vegetarian options as standard for events.”

“I have referenced the toolkit in a chapter submitted for a monograph, in relation to my BL/university research.”

“I have discussed the toolkit's recommendations with colleagues re the projects I am currently working on. We agreed which parts of the projects were most carbon intensive and discussed ways to mitigate that.”

“I recommended a workshop on the toolkit to my [research] funding body.”

“Have engaged more with small impacts - less email traffic, fewer attachments, fewer images.”

A couple of comments were made with regard to challenges or barriers to change making. One was about colleagues being reluctant to decrease flying, or travel in general, as a way to reduce one’s carbon footprint. The second point referred to an uncertainty on how to influence internal discussions on software development infrastructure – highlighting the challenge of finding the right path to the right people.

An interesting comment was made in relation to raising environmental concerns and advocating the Toolkit:

“Shared the toolkit with wider professional network at an event at which environmentally conscious and sustainable practices were raised without prompting. Toolkit was well received with expressions of relief that others are thinking along these lines and taking practical steps to help progress the agenda.”

And finally, an excellent point about the energy-intensive use of ChatGPT (or other LLMs), which was covered at the workshop:

“The thing that has stayed with me is what was said about water consumption needed to cool the supercomputers - how every time you run one of those Chat GPT (or equivalent) queries it is the equivalent of throwing a litre of water out the window, and that Microsoft's water use has gone up 30%. I've now been saying this every time someone tells me to use one of these GPT searches. To be honest it has put me off using them completely.”

In summary

The DHCC workshop at the British Library was a great success, underscoring the importance of sustainability in digital humanities, digital projects and digital working. By leveraging the DHCC toolkit, we have taken important steps toward making our digital practices more environmentally responsible, and spreading the word across internal and external networks. Moving forward, we will continue to build on this momentum, fostering a culture of sustainability and empowering our staff to make informed, climate-friendly decisions.

Thank you to workshop contributors, organisers and helpers:

James Baker, Joely Fake, Maja Maricevic, Catherine Ross, Andy Rackley, Jez Cope, Jenny Basford, Graeme Bentley, Stephen White, Bianca Miranda Cardoso, Sarah Kirk-Browne, Andrea Deri, and Deirdre Sullivan.

 

04 July 2024

DHNB 2024 - Digital Humanities in the Nordic and Baltic Countries Conference Report

This is a joint blog post by Helena Byrne, Curator of Web Archives, Harry Lloyd, Research Software Engineer, and Rossitza Atanassova, Digital Curator.

Conference banner showing Icelandic landscape with mountains
This year’s Digital Humanities in the Nordic and Baltic countries conference took place at the University of Iceland School of Education in Reykjavik. It was the eighth conference in the series, which was established in 2016, but the first to be held in Iceland. The theme for the conference was “From Experimentation to Experience: Lessons Learned from the Intersections between Digital Humanities and Cultural Heritage”. There were pre-conference workshops from May 27-29, with the main conference starting on the afternoon of May 29 and finishing on May 31. In her excellent opening keynote Sally Chambers, Head of Research Infrastructure Services at the British Library, discussed the complex research and innovation data space for cultural heritage. Three British Library colleagues report highlights of their conference experience in this blog post.

Helena Byrne, Curator of Web Archives, Contemporary British & Irish Publications.

I presented in the Born Digital session held on May 28. There were four presentations in this session and three were related to web archiving and one related to Twitter (X) data. I co-presented ‘Understanding the Challenges for the Use of Web Archives in Academic Research’. This presentation examined the challenges for the use of web archives in academic research through a synthesis of the findings from two research studies that were published through the WARCnet research network. There was lots of discussion after the presentation on how web archives could be used as a research data management tool to help manage online citations in academic publications.

Helena presenting to an audience during the conference session on born-digital archives
Helena presenting in the born-digital archives session

The conference programme was very strong and there were many takeaways that relate to my role. One strong theme was ‘collections as data’. At the UK Web Archive we have just started to publish some of our inactive curated collections as data, so these discussions were very useful. One highlight was the ‘Panel: Publication and reuse of digital collections: A GLAM Labs approach’. What stood out for me in this session was the checklist for publishing collections as data. It was very reassuring to see that we had pretty much everything covered for the release of the UK Web Archive datasets.

Rossitza and I were kindly offered a tour of the National and University Library of Iceland by Kristinn Sigurðsson, Head of Digital Projects and Development. We enjoyed meeting curatorial staff from the Special Collections who showed us some of the historical maps of Iceland that have been digitised. We also visited the digitisation studio to see how they process periodicals, and spoke to staff involved with web archiving. Thank you to Kristinn and his colleagues for this opportunity to learn about the library’s collections and digital services.

Rossitza and Helena standing by the moat outside the National Library of Iceland building
Rossitza and Helena outside the National and University Library of Iceland

 

Inscription in Icelandic reading National and University Library of Iceland outside the Library building
The National and University Library of Iceland

Harry Lloyd, Research Software Engineer, Digital Research.

DHNB2024 was a rich conference from my perspective as a research software engineer. Sally Chambers’ opening keynote on Wednesday afternoon demonstrated an extraordinary grasp of the landscape of digital cultural heritage across the EU. By this point there had already been a day and a half of workshops, including a session Rossitza and I presented on Catalogues as Data.

I spent the first half using a Jupyter notebook to explain how we extracted entries from an OCR’d version of the catalogue of the British Library’s collection of 15th century books. We used an explainable algorithm rather than a ‘black-box’ machine learning one, so we walked through the steps involved and discussed where it worked well and where it could be improved. You can follow along by clicking the ‘launch notebook’ button in the ReadMe here.
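To give a flavour of what an explainable, rule-based approach can look like, here is a minimal Python sketch. It is not the workshop notebook: it simply assumes that each catalogue entry in the OCR text begins with a line matching a shelfmark-like pattern, and groups the following lines under that entry. The pattern, the function name `split_entries` and the sample text are all invented for illustration.

```python
import re

# Hypothetical rule: an entry starts at a line beginning with a shelfmark-like
# token such as "IA. 101" or "IB. 2002". The real catalogue's rules may differ.
ENTRY_START = re.compile(r"^I[ABC]\.\s*\d+")


def split_entries(ocr_text: str) -> list[str]:
    """Group OCR lines into entries, opening a new entry at each matching line."""
    entries: list[list[str]] = []
    for line in ocr_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from the page layout
        if ENTRY_START.match(line) or not entries:
            entries.append([line])  # rule fired: start a new entry
        else:
            entries[-1].append(line)  # otherwise continue the current entry
    return [" ".join(parts) for parts in entries]


# Invented sample standing in for a few lines of OCR output
sample = """IA. 101 First title line
continued description

IB. 2002 Second title
more text"""
entries = split_entries(sample)
```

Because every decision is a visible rule, it is easy to see why a given line ended up in a given entry, and to spot where a rule fails (for example, when OCR errors corrupt the start of an entry).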

Harry pointing to an image from the catalogue of printed books on a screen for the workshop audience
Harry explaining text recognition results during the workshop

Handing over to Rossitza in the second half to discuss her corpus linguistic analysis worked really well, giving attendees a feel for the complete workflow. This really showed in some great conversations we had with attendees over the following days about tricky problems like where to store the ‘true’ results of OCR.

A few highlights from the rest of the conference included Clelia LaMonica’s work using a Latin large language model to analyse kinship in texts from Medieval Burgundy. Large language models trained on historic texts are important as the majority are trained on modern material and struggle with historical language. Jørgen Burchardt presented some refreshingly quantitative work on bias across a digitised newspaper collection, very reminiscent of work by Kaspar Beelen. Overall it was a productive few days, and I very much enjoyed my time in Reykjavik.

Rossitza Atanassova, Digital Curator, Digital Research.

This was my second DHNB conference and I was looking forward to reconnecting with the community of researchers and cultural heritage practitioners, some of whom I had met at DHNB2019 in Copenhagen. Apart from the informal discussions with attendees, I contributed to DHNB2024 in two main ways.

As already mentioned, Harry and I delivered a pre-conference workshop showcasing some processes and methodology we use for working with printed catalogues as data. In the session we used the corpus tool AntConc to perform computational analysis of the descriptions for the British Library’s collection of books published in the 15th century. You can find out more about the project here and reuse the workshop materials published on Zenodo here.
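For readers unfamiliar with concordancing, the keyword-in-context (KWIC) view that corpus tools such as AntConc offer can be approximated in a few lines of Python. This is a simplified sketch, not AntConc's implementation and not the workshop material; the function name `kwic`, the window size and the sample descriptions are invented for illustration.

```python
import re


def kwic(texts, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for text in texts:
        # Very simple tokenisation: lowercase runs of word characters
        tokens = re.findall(r"\w+", text.lower())
        for i, tok in enumerate(tokens):
            if tok == keyword.lower():
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append((left, tok, right))
    return hits


# Invented catalogue-style descriptions
descriptions = [
    "Woodcut initial on the first leaf.",
    "Rubricated throughout; woodcut border.",
]
hits = kwic(descriptions, "woodcut", window=2)
```

Lining up the same word with its neighbours in this way makes recurring descriptive formulas in catalogue entries easier to see.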

I also joined the pre-conference meeting of the international GLAM Labs Community held at the National and University Library of Iceland. This was the first in-person meeting of the community in five years and was a productive session during which we brainstormed ‘100 ideas for the GLAM Labs Community’. Afterwards we had a sneak peek of the archive of the National Theatre of Iceland, which is being catalogued and digitised.

The main hall of the Library with a chessboard on a table with two chairs, a statue of a man, holding spectacles and a stained glass screen.
The main hall of the Library.

The DHNB community is so welcoming and supportive, and attracts many early career digital humanists. I was particularly interested to hear from doctoral students researching the use of AI with digitised archives, and using NLP methods with historical collections. One of the projects that stood out for me was Johannes Widegren’s PhD research into the ethical use of AI to enable access and discovery of Sami cultural heritage, and to develop library and archival practice. 

I was also interested in presentations that discussed workflows for creating Named Entity Recognition resources for historical archives, and I plan to try out the open-source Label Studio tool that I learned about. And of course, the poster session is always a highlight and I enjoyed finding out about a range of projects, including computational analysis of Scandinavian runic texts, digital reconstruction of Gothenburg’s 1923 Jubilee exhibition, and training large language models to track semantic variation in climate change vocabulary in Danish news articles.

A line up of people standing in front of a screen advertising the venue for DHNB25 in Estonia
The poster presentations session chaired by Olga Holownia

We are grateful to all DHNB24 organisers for the warm welcome and a great conference experience, with special thanks to the inspirational and indefatigable Olga Holownia.

28 June 2024

IIIF Annual Conference 2024: A Journey of Innovation and Inspiration

The British Library Universal Viewer team were delighted to attend the IIIF conference and showcase 2024 at UCLA in Los Angeles, California. This was our first official event since the team formed earlier in the year, and we felt incredibly fortunate to be travelling across numerous time zones to join over 70 members of the IIIF community for four days of innovation, learning and inspiration.

The Universal Viewer team outside the De Neve Plaza at UCLA

The first two days of the conference were held at the De Neve Plaza and took the form of lightning talks from delegates from a variety of different industries, and on many different topics. This format meant there was something to interest everyone, regardless of experience, and was great for keeping concentration levels high despite the jet lag! 

Birds of a feather sessions were held on the third day of the conference, with a last-minute entry from the Universal Viewer team – although lack of space meant that this was an impromptu meeting in the Kerckhoff Coffee House. However, this meant we were able to plan future work, specifically on annotations, in the sunshine on the terrace. 


Attendees of the UV Birds of a Feather session at the Kerckhoff Coffee House

Here were the exciting takeaways! 

Lanie Okorodudu: I was interested in how IIIF resources and IIIF-related tools could be used as part of curriculums in online learning platforms to create meaningful learning experiences for students. I was also intrigued by “Tropiiify”, a plug-in for exporting IIIF collections that is designed for non-technical users. 

Erin Burnand: I loved hearing about how IIIF can provide innovative solutions for incredible (but complex) collections such as the Judy Chicago Research Portal (Pennsylvania State University Library) and the work on Eastern Silk Road collections for the International Dunhuang Programme (presented by the BL’s Anastasia Pineschi). 

James Misson: The conference was an amazing opportunity to connect with fellow IIIF users, from newcomers to those who helped define the original specifications. I enjoyed hearing about work on the carbon footprint of OCR, and the transformation of historical textiles into sound to make an exhibition more accessible to visually impaired people. It was inspiring to see the range of uses IIIF has, and I was especially excited by Allmaps (allmaps.org), a toolbox for working with IIIF maps. The conference was a testament to how open the IIIF community is, and everyone generously shared their knowledge with our new team – conversations that continued in the bars of Westwood and In-N-Out Burger. 

Saira Akhter: I found the discussions on the use of AI within IIIF interesting, such as for facial recognition within historic photographs and future integration with OCR/HTR tools and outputs. The showcase at the Getty was great for learning more about IIIF itself, and it was cool to see how the idea for IIIF was first written on a napkin at a restaurant. I also enjoyed seeing more novel uses of IIIF, such as for importing paintings into Animal Crossing. 

Recordings of the conference are now available on YouTube.   

26 June 2024

Join the British Library as a Digital Curator, OCR/HTR

This is a repeated and updated blog post by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections. She shares some background information on how a new post advertised for a Digital Curator for OCR/HTR will help the Library streamline post-digitisation work to make its collections even more accessible to users. Our previous run of this recruitment was curtailed due to the cyber-attack on the Library - but we are now ready to restart the process!

 

We’ve been digitising our collections for about three decades, opening up access to incredibly diverse and rich collections, for our users to study and enjoy. However, it is important that we further support discovery and digital research by unlocking the huge potential in automatically transcribing our collections.

We’ve done some work over the years towards making our collection items available in machine-readable format, in order to enable full-text search and analysis. Optical Character Recognition (OCR) technology has been around for a while, and there are several large-scale projects that produced OCRed text alongside digitised images – such as the Microsoft Books project. Until recently, Western-language print collections have been the main focus, especially newspaper collections. A flagship collaboration with the Alan Turing Institute, the Living with Machines project, applied OCR technology to UK newspapers, designing and implementing new methods in data science and artificial intelligence, and analysing these materials at scale.

OCR of Bengali books using Transkribus, Two Centuries of Indian Print Project

Machine Learning technologies have been dealing increasingly well with both modern and historical collections, whether printed, typewritten or handwritten. Taking a broader perspective on Library collections, we have been exploring opportunities with non-Western collections too. Library staff have been engaging closely with the exploration of OCR and Handwritten Text Recognition (HTR) systems for English, Bangla, Arabic, Urdu and Chinese. Digital Curators Tom Derrick, Nora McGregor and Adi Keinan-Schoonbaert have teamed up with PRImA Research Lab and the Alan Turing Institute to run four competitions in 2017-2019, inviting providers of text recognition methods to try them out on our historical material.

We have been working with Transkribus as well – for example, Alex Hailey, Curator for Modern Archives and Manuscripts, used the software to automatically transcribe 19th century botanical records from the India Office Records. A digital humanities work strand led by former colleague Tom Derrick saw the OCR of most of our digitised collection of Bengali printed texts, digitised as part of the Two Centuries of Indian Print project. More recently Transkribus has been used to extract text from catalogue cards in a project called Convert-a-Card, as well as from Incunabula print catalogues.

An example of a catalogue card in Transkribus, showing segmentation and transcription

We've also collaborated with Colin Brisson from the READ_Chinese project on Chinese HTR, working with eScriptorium to enhance binarisation, segmentation and transcription models using manuscripts that were digitised as part of the International Dunhuang Programme. You can read more about this work in this brilliant blog post by Peter Smith, who did a PhD placement with us last year.

The British Library is now looking for someone to join us to further improve the access and usability of our digital collections, by integrating a standardised OCR and HTR production process into our existing workflows, in line with industry best practice.

For more information and to apply please visit the ad for Digital Curator for OCR/HTR on the British Library recruitment site. Applications close on Sunday 21 July 2024. Please pay close attention to questions asked in the application process. Any questions? Drop us a line at [email protected].

Good luck!

24 June 2024

China trip report – IDP, DH, and everything in between

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 

 

Last April I was part of a British Library delegation to China, which was a wholesome and fulfilling experience. It aimed to refresh collaborations and partnerships with the National Library of China and the Dunhuang Academy, explore new connections and strengthen existing ones with many other institutions and individuals. I will explore this trip through a digital scholarship lens, but you can read all about the trip and its larger aims and accomplishments in a post on the IDP blog by International Dunhuang Programme Project Manager, Anastasia Pineschi. 

The Mogao Caves in Dunhuang

My primary objective was to attend and present at the IDP conference (19-20 April 2024), co-organised by the British Library and the Dunhuang Academy and synchronised with IDP’s 30th anniversary and the launch of a new, fresh and accomplished IDP website. Sharing our work and learning from others during this conference and the IDP workshop that took place the following day was one of my objectives. But I was also looking to reconnect with peers and get to know new colleagues working in the fields of DH and the interchange of AI, cultural heritage and historical digital collections; explore opportunities for collaboration in the field of OCR/HTR (Optical Character Recognition, Handwritten Text Recognition); and get ideas for DH opportunities for IDP. 

British Library and Dunhuang Academy colleagues in front of Mogao Cave 96 (Nine Story Temple) 

Colleagues from the Dunhuang Academy showed us such outstanding hospitality, with our Dunhuang trip including many behind-the-scenes visits and unique experiences. These included, naturally, the extraordinary Mogao Grottoes, but also another cave site called the Western Thousand Buddha Caves, and stunning natural spots such as the Singing Sand Dune (Mingsha Mountain) and the Crescent Moon Spring. We also visited places such as the Digital Exhibition and Visitor Center, the Multi-field lab at the Dunhuang Studies Information Center, the Grottoes Monitoring Center and Conservation Lab, and the Dunhuang City Museum. All have left long-lasting impressions. 

One of the dashboards managing the Mogao Grottoes at the Grottoes Monitoring Center

But let’s get back to the main purpose of this post, which is to report on some of the outstanding work happening out there at the intersection of Chinese historical collections and DH.

 

Conference (DH) Highlights  

I’ll start with one of the earliest platforms to enable and encourage DH research in the context of Chinese works, the Chinese Text Project. Dr Donald Sturgeon (Durham University) presented about this well-known digital library of pre-20th century Chinese texts, which started in 2005 and is still impressively active at present, being one of the largest and most widely used digital libraries of premodern Chinese texts. Crowdsourcing and AI are now used to enhance the texts available via this platform. Machine Learning OCR is used to automate transcriptions, automated punctuation is added through deep learning, and OCR corrections are done via a crowdsourcing interface. This sees quite a high volume of engagement, typically ca. 1,000 edits per day! Sturgeon also talked about the automated annotation of named historical entities in transcribed texts, as well as using deep learning to assert periods and dates, being able to transition between Chinese and Western calendars. These annotations can then turn into structured data – enabling linking up to other data. 

Dr Donald Sturgeon presents about extracting structured data from annotations

While on the topic of state-of-the-art platforms, Prof Kiyonori Nagasaki (International Institute for Digital Humanities, Tokyo) talked about the SAT Daizokyo Text Database, a digital editing system for Buddhist canons and manuscripts using AI-OCR developed and recently released by the National Diet Library of Japan. In the IIIF-compliant database of Buddhist icons, over 20,000 items have been annotated, enabling search by various attributes. Nagasaki gave us a website demo, displaying an illustration with 400 annotations. One can search annotated parts of this image and compare images in the search results. Like the Chinese Text Project, the SAT platform also incorporates crowdsourcing ‘editing’ with clever Machine Learning techniques. It was good to hear that there is an intention for SAT to gradually include Dunhuang manuscripts in the future. 

Prof Kiyonori Nagasaki demonstrated how the interface interaction is facilitated by IIIF: clicking on the text brings up the right area in the IIIF image

Another well-established, IIIF-based system, presented by Dr Hongxing Zhang (V&A Museum), is the Chinese Iconography Thesaurus (CIT). CIT has been an ongoing project since 2016, developed at the V&A and aiming to work towards a subject indexing standard for Chinese art. A system of controlled vocabulary is crucial for improving access to collections and linking up multiple collections. CIT focuses on Chinese iconography – motifs, themes, and subject matters of cultural objects, with almost 15,000 concepts and entities. It is also IIIF-supported – images and annotations can be viewed in the IIIF Mirador lightbox. 

Not just Chinese

While much of the work around Dunhuang or Silk Road manuscripts has to do with Chinese language, several scholars emphasised the importance of addressing other languages as well. Dunhuang manuscripts were written in languages such as Sogdian, Middle Persian, Parthian, Bactrian, Tocharian, Khotanese, Sanskrit, Tibetan, Old Uighur, and Tangut. Prof Xinjiang Rong (Peking University) emphasised the importance of providing transcriptions, transliterations and translations alongside digitised images. These languages require special language expertise; therefore, cooperation between institutions and scholars is crucial. Prof Tieshan Zhang (Minzu University of China) also urged researchers to address and publish non-Chinese Dunhuang manuscripts. He especially highlighted the importance of making better use of text recognition technologies for languages other than Chinese. Last year, the Computer Science department of Minzu University of China applied for a research project to do just that. They started with non-Chinese languages and aim to increase recognition accuracy to over 90%. 

The talk by Prof Hannes Fellner (University of Vienna) came as a perfect example of how one could address the study of material in other languages, using computational methods. He introduced a project aiming to trace the development of Tarim Brahmi – one of the major writing systems of the Eastern Silk Road during the 1st millennium CE, which includes Khotanese, Sanskrit, Tocharian, and Saka. The project compiles a database of characters in Tarim Brahmi languages (currently primarily Tocharian), with palaeographic and linguistic annotations, presented as a web application. With the aim of creating a research tool for texts in this writing system, such a platform could facilitate the study of palaeographic variation, which in turn could help explore scribal identification, language development stages, and correlations between palaeographic and linguistic variations. Fellner works with Transkribus and IIIF to retrieve the coordinates of characters and words, returning the relevant ‘cut-outs’ of the photos to the web application. These can then be visualised, displaying character or word variations alongside their transliteration. 
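For readers curious about how such ‘cut-outs’ can work in practice, here is a minimal sketch, and not the project’s actual code. It assumes that a text recognition export gives a pixel bounding box for each character or word, and it uses the region parameter of the IIIF Image API to request just that rectangle from an image server. The server address, image identifier, box coordinates and function names are all hypothetical, and the "max" size keyword assumes a server supporting version 3 of the Image API.

```python
from urllib.parse import quote


def iiif_region_url(base_url, identifier, x, y, w, h,
                    size="max", rotation=0, quality="default", fmt="jpg"):
    """Build a IIIF Image API URL that returns one rectangular region."""
    if w <= 0 or h <= 0:
        raise ValueError("region width and height must be positive")
    region = f"{x},{y},{w},{h}"  # pixel coordinates: left, top, width, height
    # The identifier is percent-encoded so that any slashes stay inside it.
    encoded_id = quote(identifier, safe="")
    return f"{base_url.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


def cutouts_for(query, boxes, base_url, identifier):
    """Return cut-out URLs for every transcribed unit matching the query string."""
    return [
        iiif_region_url(base_url, identifier, b["x"], b["y"], b["w"], b["h"])
        for b in boxes
        if b["text"] == query
    ]


# Hypothetical data: in practice the boxes would come from a text recognition export.
BASE_URL = "https://example.org/iiif"
IMAGE_ID = "manuscripts/fragment-001"
BOXES = [
    {"text": "ka", "x": 120, "y": 340, "w": 80, "h": 42},
    {"text": "ta", "x": 210, "y": 338, "w": 76, "h": 44},
    {"text": "ka", "x": 515, "y": 702, "w": 82, "h": 40},
]

for url in cutouts_for("ka", BOXES, BASE_URL, IMAGE_ID):
    print(url)
```

Because the cropping is done by the image server, a web application built this way only needs to store coordinates and transliterations rather than thousands of small image files.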

Prof Hannes Fellner shows how working with Transkribus and IIIF makes it possible to retrieve ‘cut-outs’ from photographs corresponding to the query string

Coming back to Chinese OCR/HTR, there’s quite a lot of activity in this area. I presented about work at the British Library aiming to advance Chinese HTR methods, in the wider context of the Library’s OCR/HTR work. We’ve focused on using the eScriptorium platform by collaborating with Colin Brisson (École Pratique des Hautes Études) and the French consortium Numerica Sinologica (now working on the READ_Chinese project). I talked about the work of our PhD Placement student, Peter Smith (University of Oxford), contributing to processes such as binarisation, segmentation and text recognition. I recently presented about this work at Ryukoku University in Kyoto, and you can read more about it in Peter’s excellent blog post. 

Dr Adi Keinan-Schoonbaert talking about OCR/HTR activities at the British Library

 

Dunhuang online platforms

It is crucial to embed such technologies and software into user-friendly platforms, where different functionalities are available for different types of needs and audiences. Dr Peter Zhou (University of California, Berkeley) talked about the importance of building a sustainable platform that can support the complete digital lifecycle, including data curation and management, long-term preservation, and dissemination. Zhou’s objectives for the Digital Dunhuang platform are to connect resources that are otherwise isolated, featuring uniform standards for data exchanges. Such a platform must enable different kinds of data formats, including raw images, historical photos, videos, cave QTVRs, digitised texts and artifacts, reproductions, microfilm, interactive visuals, conservation data, spatial info, 3D modelling data, and immersive media. This Digital Dunhuang platform should be flexible, able to scale up and deal with mass content in different formats, have Machine Learning capabilities, and aggregate knowledge content through linking.  

We can see many of these elements in a platform developed by the Dunhuang Academy. Xiaogang Zhang and Tianxiu Yu of the Dunhuang Academy introduced the Digital Library Cave platform (Digital Dunhuang), built in collaboration with Tencent, and its plans. The platform presents both a database of Dunhuang materials and murals and a playable game focused on the narrative of the Library Cave. This platform displays an engaging, immersive mixture of 3D environments and artifacts, in addition to 2D items. The aim for the Digital Dunhuang platform is to present digital resources relating to the Mogao Grottoes in one integrated and comprehensive resource for Dunhuang studies. (Side note: access to the database requires a login and input of personal data). 

Tianxiu Yu showing a Knowledge Graph connecting different types of data resources

The richness and variety of data available now and in future on this platform is remarkable. The entire cliff of the Mogao Grottoes and some of the large-scale cultural relics are available in 3D, and this is complemented by other data used in conservation and research. And there’s an impressive array of AI technologies applied to both images and texts. For images, annotated mural datasets and automatic object detection would allow for search and retrieval; AI is used to enhance old photos; line drawings are extracted from art scenes; and image stitching is automated. For texts, functionalities will include, at a later stage, character text recognition, providing full text retrieval at a 90% precision rate; Traditional to Simplified Chinese conversion; automatic punctuation; entity extraction; and the creation of knowledge graphs. When completed, this platform will be open and share all resources available online. 

With a solid focus on text retrieval and analysis, Dr Xiaoxing Zhao (Dunhuang Academy) presented about the Dunhuang Documents Database, collating digitised manuscripts and prints dating from the 4th to the 11th centuries discovered in the Library Cave at Mogao, Dunhuang. Providing full-text retrieval for Chinese, Tibetan, and Uighur (and a plan to add Tangut), it includes search functionality using keywords, and features transliteration in Traditional Chinese, which can be conveniently viewed alongside the image. It’s great to see how far AI text recognition has come! 

Dr Xiaoxing Zhao demonstrating the Dunhuang Documents Database’s transliteration in Traditional Chinese, which can be seen side by side with the image

However, technological advances are not just restricted to AI and Machine Learning. Prof Simon Mahony (Emeritus Professor, UCL) gave a fascinating, image-rich talk about non-invasive and non-destructive computational imaging of ancient texts. Mahony introduced different techniques to address research questions arising from textual manuscripts. These methods allow, for example, reading illegible texts and seeing artworks, determining the composition of pigments, or detecting characteristics of ink. One of the projects that he was involved with was the Great Parchment Book project. Damaged in a fire, the book’s content became inaccessible for researchers – but a series of steps taken to digitally straighten, flatten and stretch the book, turned it back to a readable state. This and other computational methods applied to images are indeed very inspirational! 

Prof Simon Mahony talking about how computational methods were used to enable the reading of the text in the Great Parchment Book project

 

Back to Beijing 

Coming back to Beijing, we had several visits such as the National Library of China and the Palace Museum’s Conservation Department. But I’ll focus here on two visits which are directly related to DH and computational methods – the first at the Chinese Academy of Sciences (CAS), and the second at the National Key Laboratory of General Artificial Intelligence, Peking University. 

We were kindly hosted by Prof Cheng-Lin Liu from the State Key Laboratory of Multimodal AI Systems (MAIS), Institute of Automation, CAS, and joined by Drs Fei Yin, Heng Zhang, and Xiao-Hui Li. Prof Liu gave an excellent keynote talk at the Machine Learning workshop at the ICDAR2023 conference, which I attended in August 2023. It was about “Plane Geometry, Diagram Parsing and Problem Solving,” which well exemplifies MAIS’ areas of work. It is a national platform specialising in document analysis, computer vision, robotics, Machine Learning, Natural Language Processing (NLP), and medical AI research – the first to start Pattern Recognition research in China, and one of its main AI research centres. We enjoyed an excellent exchange – and a fruitful discussion.  

MAIS and British Library colleagues at the CAS offices in the Haidian District, Beijing

 

From there, we travelled to Peking University for another stimulating knowledge exchange meeting with Prof Jun Wang, Director of the Research Center for Digital Humanities (PKUDH) and Vice Dean, Artificial Intelligence Institute, joined by Dr Qi Su, Dr Pengyi Zhang, Dr Hao Yang, Honglei San, Kairan Liu, and Siyu Duan. We watched videos of two Shidian platforms – open access web platforms for reading, editing and analysing ancient Chinese books, developed through a partnership between PKUDH and the Douyin Group. One platform is the Open Access Ancient Book Reading Platform, and the second is the AI-powered Ancient Book Collation Platform. The AI-empowered editing and compiling system includes an impressive array of functionalities. 

Screenshot from the YouTube video, showing features of the Shidian reading platform

Our session also included presentations and discussions around topics such as AI character reconstruction, cultural heritage curation and crowdsourcing, automatic text annotation and linked data. For example, PhD student Siyu Duan (supervised by Prof Su Qi) presented about dealing with ancient ideograph restoration, including a little experiment on Dunhuang data that showed suggested restoration of damaged or illegible characters. The whole session was an absolute delight!  

I am so grateful for everyone’s generosity and hospitality – I have learned so much, so thank you. Until next time! 

Dr Adi Keinan-Schoonbaert enjoying the dunes and the Crescent Moon Spring, Dunhuang