Digital scholarship blog

Enabling innovative research with British Library digital collections

Introduction

Tracking exciting developments at the intersection of libraries, scholarship and technology.

31 October 2024

Welcome to the British Library’s new Digital Curator OCR/HTR!

Hello everyone! I am Dr Valentina Vavassori, the new Digital Curator for Optical Character Recognition/Handwritten Text Recognition at the British Library.

I am part of the Heritage Made Digital Team, which is responsible for developing and overseeing the digitisation workflow at the Library. I am also an unofficial member of the Digital Research Team, where I promote access to and reuse of the Library’s collections.

My role has both an operational component (integrating and developing OCR and HTR in the digitisation workflow) and a research and engagement component (supporting OCR/HTR projects in the Library). I really enjoy these two sides of my role, as I have a background as a researcher and as a cultural heritage professional.

I joined the British Library from The National Archives, London, where I worked as a Digital Scholarship Researcher in the Digital Research Team. I worked on projects involving data visualisation, OCR/HTR, data modelling, and user experience.

Before that, I completed a PhD in Digital Humanities at King’s College London, focusing on chatbots and augmented reality in museums and their impact on users and museum narratives. Part of my thesis explored these narratives using spatial humanities methods such as GIS. During my PhD, I also collaborated on various digital research projects with institutions such as The National Gallery, London, and the Museum of London.

However, I originally trained as an art historian. I studied art history in Italy and worked for a few years in museums. While working there, I realised the potential of developing digital experiences for visitors and the significant impact digitisation can have on research and enjoyment in cultural heritage. I was so interested in these opportunities that I co-founded a start-up which developed a heritage geolocation app for tourists.

Joining the Library has been an amazing opportunity. I am really looking forward to learning from my colleagues and exploring all the potential collaborations within and outside the Library.

29 October 2024

Happy Twelfth Birthday Wikidata!

Today the global Wikidata community is celebrating its 12th birthday! Wikidata originally went live on 29 October 2012, when Andrew Gray was the British Library’s first Wikipedian in Residence, and it has expanded massively since then.

Wikidata is a free and open knowledge base that can be read and edited by both humans and machines, which acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia and Wikisource. Wikidata content is available under a free license (CC0), exported using standard formats (JSON & RDF), and can be interlinked to other open data sets on the linked data web.
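To illustrate the ‘read by machines’ part, the structured data can be queried directly from the public Wikidata Query Service SPARQL endpoint. The following is a minimal sketch in Python; the item and property used (Q23308 for the British Library and P571 for inception) are just examples:

```python
# Minimal sketch: querying Wikidata's public SPARQL endpoint with Python.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?item ?itemLabel ?inception WHERE {
  VALUES ?item { wd:Q23308 }                 # Q23308 = British Library
  OPTIONAL { ?item wdt:P571 ?inception . }   # P571 = inception
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "wikidata-birthday-example/0.1"},
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"], row.get("inception", {}).get("value", "no value"))
```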

Drawing of four people around a birthday cake

Over the past year Wikidata passed the incredible milestone of 2 billion edits, making it the most edited Wikimedia project of all time. However, this growth has created stability and scaling challenges for the Wikidata Query Service. To address these, the development team have been working on several projects, including splitting the data in the Query Service and releasing the multiple-languages code, so that the current size of Wikidata can be handled better.

Heat Map of Wikidata’s geographic coverage as of October 2024

Another major focus during the past year has been promoting the reuse of Wikidata’s data. To make the data easier to access there is a new REST API, and developers now have access to a Wikidata developer portal, which holds important information and provides inspiration about what is possible with the data.
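For developers, reading an item’s data is now a plain HTTP call. Below is a minimal sketch; the base path used (/w/rest.php/wikibase/v1/) is my assumption about the current stable route, so check the developer portal for the exact version and endpoints:

```python
# Minimal sketch: fetching one item's English label via Wikidata's REST API.
# Assumption: the stable base path is /w/rest.php/wikibase/v1/ - verify on the developer portal.
import requests

BASE = "https://www.wikidata.org/w/rest.php/wikibase/v1"
item_id = "Q2013"  # Q2013 = Wikidata itself

resp = requests.get(
    f"{BASE}/entities/items/{item_id}/labels/en",
    headers={"User-Agent": "wikidata-rest-example/0.1"},
)
resp.raise_for_status()
print(resp.json())  # the English label as a JSON string, e.g. "Wikidata"
```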

The international library community actively engages with Wikidata. In 2019 the IFLA Wikidata Working Group was formed to explore the integration of Wikidata and Wikibase with library systems, and alignment of the Wikidata ontology with library metadata formats such as BIBFRAME, RDA, and MARC. There is also the LD4 Wikidata Affinity Group, who hold Affinity Group Calls and Wikidata Working Hours throughout the year.

If you are new to Wikidata and want to learn more, there are many resources available, including this Zine about Wikidata, created by our recent Wikimedian in Residence Dr Lucy Hinnie, and these videos:

You may also want to check out the online Bibliography of Wikidata, which lists books, academic conference presentations and peer-reviewed papers that focus on Wikidata as their subject.

This post is by Digital Curator Stella Wisdom.

24 October 2024

Southeast Asian Language and Script Conversion Using Aksharamukha

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 

 

The British Library’s vast Southeast Asian collection includes manuscripts, periodicals and printed books in the languages of the countries of maritime Southeast Asia, including Indonesia, Malaysia, Singapore, Brunei, the Philippines and East Timor, as well as on the mainland, from Thailand, Laos, Cambodia, Myanmar (Burma) and Vietnam.

The display of literary manuscripts from Southeast Asia outside of the Asian and African Studies Reading Room in St Pancras (photo by Adi Keinan-Schoonbaert)

 

Several languages and scripts from the mainland were the focus of recent development work commissioned by the Library and done on the script conversion platform Aksharamukha. These include Shan, Khmer, Khuen, and northern Thai and Lao Dhamma (Dhamma, or Tham, meaning ‘scripture’, is the script that several languages are written in).

These and other Southeast Asian languages and scripts pose multiple challenges to us and our users. Collection items in languages using non-romanised scripts are mainly catalogued (and therefore searched by users) using romanised text. For some language groups, users need to search the catalogue by typing in the transliteration of title and/or author using the Library of Congress (LoC) romanisation rules.

Items’ metadata text converted using the LoC romanisation scheme is often unintuitive, and therefore poses a barrier for users, hindering discovery and access to our collections via the online catalogues. In addition, curatorial and acquisition staff spend a significant amount of time manually converting scripts, a slow process which is prone to errors. Other libraries worldwide holding Southeast Asian collections and using the LoC romanisation scheme face the same issues.

Excerpt from the Library of Congress romanisation scheme for Khmer

 

Having faced these issues with the Burmese language, last year we commissioned development work on the open-access platform Aksharamukha, which enables conversion between various scripts, supporting 121 scripts and 21 romanisation methods. Vinodh Rajan, Aksharamukha’s developer, perfectly combines knowledge of languages and writing systems with computer science and coding skills. He added the LoC romanisation system to the platform’s Burmese script transliteration functionality (read about this in my previous post).

The results were outstanding – readers could copy and paste transliterated text into the Library's catalogue search box to check if we have items of interest. This has also greatly enhanced cataloguing and acquisition processes by enabling the creation of acquisition records and minimal records. In addition, our Metadata team updated all of our Burmese catalogue records (ca. 20,000) to include Burmese script, alongside transliteration (side note: these updated records are still unavailable to our readers due to the cyber-attack on the Library last year, but they will become accessible in the future).

The time was ripe to expand our collaboration with Vinodh and Aksharamukha. Maria Kekki, Curator for Burmese Collections, has been hosting a Chevening Fellow from Myanmar, Myo Thant Linn, this past year. Myo was tasked with cataloguing manuscripts and printed books in Shan and Khuen – but found the romanisation aspect of this work very challenging to do manually. In order to facilitate Myo’s work and maximise the benefit of his fellowship, we needed to have LoC romanisation functionality available. Aksharamukha was the right place for this: the free, open-source, online tool is available for our curators, cataloguers, acquisition staff, and metadata team to use.
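For batch work, the same conversions can also be scripted: Aksharamukha is published as a Python package alongside the web tool. Below is a minimal sketch using the package’s transliterate.process() call with an existing romanisation target (IAST); the new LoC targets are selected in the same way, and their exact identifiers should be checked against Aksharamukha’s documentation:

```python
# Minimal sketch of scripting Aksharamukha (pip install aksharamukha).
# 'Burmese' and 'IAST' are existing script/romanisation names; the LoC romanisation targets
# added by this work are chosen the same way - check the docs for their exact names.
from aksharamukha import transliterate

burmese_title = "မြန်မာ"  # illustrative Burmese text ('Myanmar')
print(transliterate.process("Burmese", "IAST", burmese_title))
```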

Former Chevening Fellow Myo Thant Linn reciting from a Shan manuscript in the Asian and African Studies Reading Room, September 2024 (photo by Jana Igunma)

 

In addition to Maria and Myo’s requirements, Jana Igunma, Ginsburg Curator for Thai, Lao and Cambodian Collections, noted that adding Khmer to Aksharamukha would be immensely helpful for cataloguing our Khmer backlog and would assist with new acquisitions. Northern Thai and Lao Dhamma scripts would be most useful for cataloguing newly acquired print material and for adding original scripts to manuscript records. Automating LoC transliteration could be very cost-effective, saving many hours for the cataloguing, acquisitions and metadata teams. Khmer is a great example – it has the most extensive alphabet in the world (74 letters), and its romanisation is extremely complicated and time-consuming!

First three leaves with text in a long format palm leaf bundle (សាស្ត្រាស្លឹករឹត/sāstrā slẏk rẏt) containing part of the Buddhist cosmology (សាស្ត្រាត្រៃភូមិ/Sāstrā Traibhūmi) in Khmer script, 18th or 19th century. Acquired by the British Museum from Edwards Goutier, Paris, on 6 December 1895. British Library, Or 5003, ff. 9-11

 

We therefore needed to enhance Aksharamukha’s script conversion functionality with these additional scripts. In principle, this could be done by referring to the existing LoC conversion tables, while taking into account any permutations of diacritics or character variations. In practice, however, it has not been that simple!

For example, the presence of diacritics prompted a discussion between internal and external colleagues on the use of precomposed versus decomposed Unicode forms when romanising the original script. LoC systems use two coding schemata, MARC 21 and MARC-8: the former allows precomposed diacritic characters, while the latter only allows the decomposed form. To support both schemata, Vinodh included MARC-8 and MARC 21 as input and output formats in the conversion functionality.
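As a small, generic illustration of the precomposed versus decomposed distinction (not Aksharamukha’s own code), Unicode normalisation turns one form into the other:

```python
# Minimal sketch: precomposed (NFC) vs decomposed (NFD) forms of a romanised character.
import unicodedata

# 'ā' (a with macron), common in LoC romanisation, can be stored either way.
precomposed = "\u0101"   # single code point, as MARC 21 allows
decomposed = "a\u0304"   # 'a' + U+0304 COMBINING MACRON, the decomposed form used with MARC-8

print(precomposed == decomposed)                                # False: different code points
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True after normalising
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True after normalising
```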

Another component, implemented for Burmese in the previous development round, but also needed for Khmer and Shan transliterations, is word spacing. Vinodh implemented word separation in this round as well – although this would always remain something that the cataloguer would need to check and adjust. Note that this is not enabled by default – you would have to select it (under ‘input’ – see image below).

Screenshot from Aksharamukha, showcasing Khmer word segmentation option

 

It is heartening to know that enhancing Aksharamukha has been making a difference. Internally, Myo was a keen user of the Shan romanisation functionality (though Khuen romanisation is still work in progress), and Jana has been using the Khmer transliteration too. Jana has found Aksharamukha’s option to upload a photo of a title page, which is then automatically OCRed and romanised, particularly useful. This saves precious time otherwise spent typing Khmer!

It should be mentioned that, when it comes to cataloguing Khmer language books at the British Library, both original Khmer script and romanised metadata are being included in catalogue records. Aksharamukha helps to speed up the process of cataloguing and eliminates typing errors. However, capitalisation and in some instances word separation and final consonants need to be adjusted manually by the cataloguer. Therefore, it is necessary that the cataloguer has a good knowledge of the language.

On the left: photo of a title page of a Khmer language textbook for Grade 9, recently acquired by the British Library; on the right: conversion of original Khmer text from the title page into LoC romanisation standard using Aksharamukha

 

The conversion tool for Tham (Lanna) and Tham (Lao) works best for texts in Pali language, according to its LoC romanisation table. If Aksharamukha is used for works in northern Thai language in Tham (Lanna) script, or Lao language in Tham (Lao) script, cataloguer intervention is always required as there is no LoC romanisation standard for northern Thai and Lao languages in Tham scripts. Such publications are rare, and an interim solution that has been adopted by various libraries is to convert Tham scripts to modern Thai or Lao scripts, and then to romanise them according to the LoC romanisation standards for these languages.

Other libraries have been enjoying the benefits of the new developments to Aksharamukha. Conversations with colleagues from the Library of Congress revealed that present and past commissioned developments on Aksharamukha have had a positive impact on their operations. LoC has been developing a transliteration tool called ScriptShifter. Aksharamukha’s Burmese and Khmer functionalities are already integrated into this tool, which can convert over ninety non-Latin scripts into Latin script following the LoC/ALA guidelines. The British Library’s funding of Aksharamukha to make several Southeast Asian languages and scripts available in LoC romanisation has already proved useful!

If you have feedback or encounter any bugs, please feel free to raise an issue on GitHub. And, if you’re interested in other scripts romanised using LoC schemas, Aksharamukha has a complete list of the ones that it supports. Happy conversions!

 

14 October 2024

Research and Development activities in the Qatar Programme Imaging Team

This blog post is by members of the Imaging Team at the British Library/Qatar Foundation Partnership (BLQFP) Programme: Eugenio Falcioni (Imaging and Digital Product Manager), Dominique Russell, Armando Ribeiro and Alexander Nguyen (Senior Imaging Technicians), Selene Marotta (Quality Management Officer), and Matthew Lee and Virginia Mazzocato (Senior Imaging Support Technicians).

The Imaging Team has played a pivotal role in the British Library/Qatar Foundation Partnership (BLQFP) Programme since its launch in 2012. However, the journey has not been without hurdles. In October 2023, the infamous cyber-attack on the British Library severely disrupted operations across the organisation, impacting the Imaging Team profoundly. Inspired by the Library's Rebuild & Renew Programme, we used this challenging period to focus on research and development, refining our processes and deepening our understanding of the studio’s work practices. 

At the time of the attack, we were in the process of recruiting new members to the team, who brought fresh energy, expertise, and enthusiasm. This also coincided with the appointment of a new Studio Manager. The formation of this almost entirely new team presented challenges as we adapted to the Library's disrupted environment. Yet our synergy and commitment led us to find innovative ways of working. Although the absence of IT infrastructure, and therefore of imaging hardware and software, posed significant difficulties for day-to-day photography and digitisation, we had time to focus on continuous improvement without the usual pressure of deadlines. We enhanced our digitisation processes and expertise through a combination of quality improvements, strategic collaborations, and the development of innovative tools. Through teamwork and perseverance, we transformed adversity into an opportunity for growth.

As an Imaging Team, we aim to create the optimal digital surrogate of the items we capture. The BLQFP has defined imaging parameters which specify criteria such as colour and resolution accuracy, ensuring compliance with international imaging standards such as FADGI and ISO 19264.

During this unusual time, we focused on research and development into imaging standards and updated our guidelines, resulting in a 150-page document detailing our workflow. This has improved consistency between setups and photographers, and has been fundamental in training new staff. We engaged in skills-sharing workshops with Imaging Services, the Library’s core imaging department, and with Heritage Made Digital (HMD), the Library’s department that manages digitisation workflows.

Over the months, we have tested our images and setups, cameras, lighting, and colour targets, all while shooting directly to camera cards and using a laser measuring device to check resolution (PPI). As a result of this work, we feel more confident in producing images that conform to international imaging standards and that truly represent the collection items.

A camera stand with a bound volume with a colour target ruler on top and a laser device next to it.
Colour target on a bound volume
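As a simple illustration of the arithmetic behind such a resolution check (a sketch with hypothetical numbers, not our actual capture settings), the effective PPI is the pixel count across the captured field divided by the field width measured in inches:

```python
# Minimal sketch of an effective-resolution (PPI) check.
# All numbers are hypothetical examples, not actual BLQFP capture settings.

MM_PER_INCH = 25.4

def effective_ppi(pixels_across: int, measured_width_mm: float) -> float:
    """Pixels across the captured field divided by the field width in inches."""
    return pixels_across / (measured_width_mm / MM_PER_INCH)

# e.g. an 11,648-pixel-wide frame covering a field measured at 740 mm with the laser device
ppi = effective_ppi(11648, 740.0)
print(f"Effective resolution: {ppi:.0f} PPI")  # ~400 PPI

TARGET_PPI = 300  # example target only; the real figure depends on the standard and material
print("Meets target" if ppi >= TARGET_PPI else "Below target")
```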

Alongside our testing, we arranged visits to imaging studios at other institutions, where we shared our knowledge and learnt from the working processes of those who are digitising comparable collection material. During these visits, we gained a better understanding of the different imaging set-ups, the various international quality standards followed, and how the images produced are analysed. We also shared our approaches to capturing and stitching oversized items such as maps and foldouts. Lastly, we discussed quality assurance and workflow management tools. Overall, these visits across the sector have been a valuable exercise in making new connections, sharing ideas, and understanding that other institutions face similar problems when digitising collection items.

Without the use of dedicated digitisation software, the capture of items such as manuscripts and large bound volumes has been challenging, as we have been unable to check the images we were producing. For this reason, we prioritised the less demanding items in the collection and postponed quality assurance checks to a later date. We chose to capture 78 rpm records as they required only two shots (front and back), minimising possible mistakes. The imaging of audio collection items was our first achievement as a team since the cyber-attack: we digitised over 1,100 shellac discs, in collaboration with the BLQFP Audio Team, who had previously catalogued and digitised the sound recordings.

A record with a green label reading Columbia
Image of a shellac disc (9CS0024993_ColumbiaGA3) digitised by the BLQFP

 Through this capture task we gained the optimism and confidence to start capturing more material, starting with the bindings of all the available bound collection items. The binding capture process is time-consuming and requires a specific setup and position of the item to photograph the front, back, spine, edge, head, and tail of each volume. By capturing bindings now, we will be able to streamline the process when we resume the digitisation of entire volumes.

A camera stand with a red-bound volume supported by a frame over cardboard
Capturing the spine of a bound volume, using L-shaped card on a support frame

During this time, we were also involved in scoping work to locate and assess the most challenging items and plan a digitisation strategy accordingly. We focused particularly on identifying oversized maps and foldouts, which will be captured in sections and subsequently digitally stitched. This task required frequent visits to the Library’s basement storage areas and collaboration with the BLQFP Workflow Team to optimise and migrate data from the scoping process into existing workflow management systems. By gathering this data, we could determine the physical characteristics of each collection series and select the most suitable capture device. It was also crucial to collaborate with the BLQFP Conservation Team to develop new digitisation tools for capturing oversized foldouts more quickly and securely.

A volume with an insert, folded and unfolded, over two black foam supports
Using C-shaped Plastazote created by the BLQFP Conservation Team to support an oversized fold-out

The past nine months have presented many challenges for our Team. Nevertheless, in the spirit of Rebuild & Renew, we have been able to solve problems and develop creative ways of working, pulling together all our individual skills and experiences. As we expand, we have used this time productively to understand the intricacies of digitising fragile, complex, and oversized material while working to rigorous colour and quality standards. With the imminent return of imaging software, the next step for the BLQFP Imaging Team will be to apply our knowledge and understanding to a mass digitisation environment with the expectations of targets and monthly deliverables.

Team members standing around a stand on which a volume with a large foldout is prepared for photography, with lighting on both sides of the stand
Capturing a large foldout

 

16 September 2024

memoQfest 2024: A Journey of Innovation and Connection

Attending memoQfest 2024 as a translator was an enriching and insightful experience. Held from 13 to 14 June in Budapest, Hungary, the event stood out as a hub for language professionals and translation technology enthusiasts. 

Streetview 1 of Budapest, near the venue for memoQfest 2024. Captured by the author

Streetview 2 of Budapest, near the venue for memoQfest 2024. Captured by the author
Streetviews of Budapest, near the venue for memoQfest 2024. Captured by the author

 

A Well-Structured Agenda 

The conference had a well-structured agenda with over 50 speakers, including two keynote speakers, who brought valuable insights into the world of translation.

Jay Marciano, President of the Association for Machine Translation in the Americas (AMTA), delivered his highly anticipated presentation on understanding generative AI and large language models (LLMs). While he acknowledged their significant potential, Marciano expressed only cautious optimism about their future in the industry, stressing the need for a deeper understanding of their limitations. As he laid out, machines can translate faster, but the quality of their output depends greatly on the quality of the training data, especially in certain domains or for specific clients. He believes that translators’ roles will evolve so that they become more involved with data curation than with translation itself, in order to improve the quality of machine output.

Dr Mike Dillinger, the former Technical Lead for Knowledge Graphs in the AI Division at LinkedIn, and now a technical advisor and consultant, also delved into the challenges and opportunities presented by AI-generated content in his keynote speech, The Next 'New Normal' for Language Services. Dillinger holds a nuanced perspective on the intersection of AI, machine translation (MT), and knowledge graphs. As he explained, knowledge graphs can be designed to integrate, organise, and provide context for large volumes of data. They are particularly valuable because they go beyond simple data storage, embedding rich relationships and context. They can therefore make it easier for AI systems to process complex information, enhancing tasks like natural language processing, recommendation engines, and semantic search.

Dillinger therefore advocated for the integration of knowledge graphs with AI, arguing that high-quality, context-rich data is crucial for improving the reliability and effectiveness of AI systems. Knowledge graphs can significantly enhance the capabilities of LLMs by grounding language in concrete concepts and real-world knowledge, thereby addressing some of the current limitations of AI and LLMs. He concluded that, while LLMs have made significant strides, they often lack true understanding of the text and context. 

 

Enhancing Translation Technology for BLQFP 

The event also offered hands-on demonstrations of memoQ's latest features and updates such as significant improvements to the In-country Review tool (ICR), a new filter for Markdown files, and enhanced spellcheck.  

Interior of the Pesti Vigado, Budapest's second largest concert hall, and venue for the memoQfest Gala dinner

 

 

As a participant, I was keen to explore how some of these features could be used to enhance translation processes at the British Library. For example, could machine translation (MT) be used to translate catalogue records? Over the last twelve years, the translation team of the British Library/Qatar Foundation Partnership project has built up a massive translation memory (TM) – a bilingual repository of all our previous translations. A machine could be trained on our terminology and style, using this TM and our other bilingual resources, such as our vast and growing term base (TB). With appropriate data curation, MT could be a cost-effective and efficient way to maximise our translation operations. 

There are challenges, however. For example, before it can be used to train a machine, our TM would need to be edited and cleaned, removing repetitive and inappropriate content. We would need to choose the most appropriate translations, while maintaining proper alignment between segments. The same applies to our TB, which would need to be curated. Some of these data curation tasks cannot be pursued at this time, as we remain without access to much of our data following the cyberattack incident. Moreover, these careful preparatory steps would not suffice, as any machine output would still need to be post-edited by skilled human translators. As both the conference’s keynote speakers agreed, it is not yet a simple matter of letting the machines do the work. 

 This blog post is by Musa Alkhalifa Alsulaiman, Arabic Translator, British Library/Qatar Foundation Partnership. 

28 August 2024

Open and Engaged 2024: Empowering Communities to Thrive in Open Scholarship

The British Library is delighted to host its annual Open and Engaged Conference on Monday 21 October, in person and online, as part of International Open Access Week. The conference is supported by the Arts and Humanities Research Council (AHRC) and Research Libraries UK (RLUK).

Save the Date flyer for Open & Engaged 2024 on 21 October, in person and online, with logos for sponsors UKRI, Arts and Humanities Research Council and RLUK

 

Open and Engaged 2024: Empowering Communities to Thrive in Open Scholarship will centre on leveraging the power of communities across open scholarship, open infrastructure, emerging technologies, collections as data, equity and integrity, skills development and sustainable models, to elevate research of all kinds for the public good. We take a cross-sectoral approach to the conference programme – unifying around shared values for openness – by reflecting on practices within research libraries in both the higher education and GLAM (Galleries, Libraries, Archives, Museums) sectors, as well as in national and public libraries.

Everyone interested in the conference topics is welcome to join us on Monday 21 October!

This will be a hybrid event taking place at the British Library’s Knowledge Centre in St Pancras, London, and streamed online for those unable to attend in person.

The event will be recorded and recordings made available in the British Library’s Research Repository.

Registration

Registration is now closed for both in-person and online attendance, and registrants have been contacted with details. If you have any questions, please contact [email protected].

Programme 

09:30  Registration

10:00  Welcome remarks

10:10  Opening keynote panel: Cross disciplinary approach to open scholarship

Chaired by Sally Chambers, Head of Research Infrastructure Services at the British Library.

10:50    Empowering communities through equity, inclusivity, and ethics

Chaired by Beth Montague-Hellen, Head of Library and Information Services at the Francis Crick Institute.

This session addresses the role of communities in governance, explores the ethical implications of AI for citizens, highlights the value of public engagement, and discusses the central importance of equity, inclusivity, and integrity in scholarly communications.

11:40  Break

12:10    Deepening partnership in skills development through shared values

Chaired by Kirsty Wallis, Head of Research Liaison at UCL.

This session explores initiatives that foster skills development in libraries with a cross-sectoral approach, and dives into the role of libraries in supporting communities to build resilience.

13:00  Lunch

13:45   Open repositories for research of all kinds

This session addresses the role of infrastructure in carrying out open scholarship practices, explores practice as research in relation to diverse outputs and infrastructure, and discusses institutional resilience in digital strategies.

Chaired by William J Nixon, Deputy Executive Director at Research Libraries UK (RLUK).

14:45  Break

15:15   Enabling collections as data: from policy to practice  

Chaired by Jez Cope, Data Services Lead at the British Library.

This session dives into digital collections as data by exploring policies and practices across different sectors, public-private partnerships in making collections publicly available, and the dynamics of preservation versus access in national libraries, whilst underlining the public good.

16:15   Closing keynote: Stories Change Lives

Chaired by Liz White, Director of Library Partnerships at the British Library

16:45 Closing remarks

17:00 Networking session

19:00  End

The hashtag for the event is #OpenEngaged on the social media platform of your choice. If you have any questions, please contact us at [email protected].

26 July 2024

Charting the European D-SEA Conference at the Stabi

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected]. 

 

Earlier this month, I had the pleasure of attending the “Charting the European D-SEA: Digital Scholarship in East Asian Studies” conference held at the Berlin State Library (Staatsbibliothek zu Berlin), also known as the Stabi. The conference, held on 11-12 July 2024, aimed to fill a gap in the European digital scholarship landscape by creating a research community and a space for knowledge exchange on digital scholarship issues across humanities disciplines concerned with East Asian regions and languages.

The event was a dynamic fusion of workshops, presentations and panel discussions. Over three days of workshops (8-10 July), participants were introduced to key digital methods, resources, and databases. These sessions aimed to transmit practical knowledge in digital scholarship, focusing on East Asian collections and data. The subsequent two days were dedicated to the conference proper, featuring a broad range of presentations on various themes.

The reading room in the Berlin State Library, Haus Potsdamer Straße

 

DH and East Asian Studies in Europe and Beyond

Conference organisers Jing Hu and Brent Ho from the Stabi, and Shih-Pei Chen and Dagmar Schäfer from the Max Planck Institute for the History of Science (MPIWG), set the stage for an enriching exchange of ideas and knowledge. The diversity of topics covered was impressive: from the more established digital resources and research tools to AI applications in historical research, the sessions provided a comprehensive overview of the current state and future directions of the field.

There were so many excellent presentations – and I often wished I could clone myself to attend parallel sessions! As expected, there was much focus on working with AI – machine learning and generative AI – and their potential in historical and humanities research. AI technologies offer powerful tools for data analysis and pattern recognition, and can significantly enhance research capabilities.

Damian Mandzunowski (Heidelberg University) talked about using AI to extract and analyse information from Chinese comics

Shaojian Li (Renmin University of China) looked into automating the classification of pattern images using deep learning

One notable session was "Reflections on Deep Learning & Generative AI," chaired by Brent Ho and discussed by Clemens Neudecker. The roundtable highlighted the evolving role of AI in humanities research. Calvin Yeh from MPIWG discussed AI's potential to augment, rather than just automate, research processes. He shared intriguing examples of using AI tools like ChatGPT to simulate group discussions and suggest research actions. Hongsu Wang from Harvard University presented on the use of Large Language Models and traditional Transformers in the China Biographical Database (CBDB) project, demonstrating the effectiveness of these models in data extraction and standardisation.

Calvin Yeh (MPIWG) discussed AI for “Augmentation, not only Automation” and experimented with ChatGPT discussing a research approach, designing a research process and simulating a group discussion

Hongsu Wang (Harvard University) talked about extracting and standardising data using LLMs and traditional Transformers in the CBDB project – here showcasing Jeffrey Tharsen’s research to create a network graph using a prompt in ChatGPT

 

Exploring the Stabi

Our group tour of the Stabi was a personal highlight for me. This historic library, part of the Prussian Cultural Heritage Foundation, is renowned for its extensive collections and commitment to making digitised materials publicly accessible. The library operates from two major public sites – Haus Unter den Linden and Haus Potsdamer Straße. Tours of both locations were available, but I chose to explore the more recent building, designed by Hans Scharoun and located in the Kulturforum on Potsdamer Straße in West Berlin – the history and architecture of which are fascinating.

A group of the conference delegates enjoying the tour of SBB’s Haus Potsdamer Straße

I really enjoyed catching up with old colleagues and making new connections with fellow scholars passionate about East Asian digital humanities!

To conclude

The Charting the European D-SEA conference at the Stabi was an enriching experience, providing deep insights into advancements in digital scholarship and the integration of digital methods in East Asian studies, and allowing me to connect with a global community of scholars. The combination of traditional and more recent digital practices, coupled with forward-looking discussions on AI and deep learning, made this conference a significant milestone in the field. I look forward to seeing how these conversations evolve and contribute to the broader landscape of digital humanities.

 

16 July 2024

'AI and the Digital Humanities' session at CILIP's 2024 conference

Digital Curator Mia Ridge writes... I was invited to chair a session on 'AI and the digital humanities' at CILIP's 2024 conference with Ciaran Talbot (Associate Director AI & Ideas Adoption, University of Manchester Library) and Glen Robson (IIIF Technical Co-ordinator, International Image Interoperability Framework Consortium). Here's a quick post with some reflections on themes in the presentations and the audience Q&A.

A woman stands on stage in front of slides; two men sit at a panel table on the stage
CILIP's photo of our session

I presented a brief overview of some of the natural language processing (NLP) and computer vision methods in the Living with Machines project. That project and other work at the British Library showed that researchers can create innovative Digital Humanities methods and improve collections data with current AI / machine learning tools. But is there a gap between 'utilities' and 'cutting edge research' that AI can't (yet) fill for libraries?

AI (machine learning) makes library, museum and archive collections more accessible in two key ways. Firstly, more and better metadata and links across collections can make individual items more discoverable (e.g. identifying places mentioned in text; visual search to find similar images). Secondly, thinking of 'collections as data' and sharing datasets for research lets others find insights and inspiration.
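For instance, off-the-shelf named entity recognition can surface place names in item descriptions. Here's a minimal sketch using the spaCy library and its small English model (an illustration only, not the Living with Machines pipeline itself):

```python
# Minimal sketch: pulling place names from a catalogue description with spaCy's pretrained NER.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
description = "A plan of the town of Kingston upon Hull, printed in York in 1842."
doc = nlp(description)

places = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC", "FAC")]
print(places)  # e.g. ['Kingston upon Hull', 'York'] - exact output depends on the model
```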

Some of the value in AI might lie in the marketing power of the term - we've had the technical ability to view collections across silos for some time, but the institutional will might have lagged behind. Identifying the real gaps that AI can meet is hard, cross-institutional work - you need to understand what time-consuming work could be automated with ML/AI. Ciaran's talk gave a sense of the collaborative, co-creative effort required to understand actual processes and real problems and devise ways to optimise them. An 'anarchy' phase might be part of that process, and a roadmap can help set a shared vision as you work out where AI tools will actually save time or just create more but different work.

Glen gave some great examples of how IIIF can help organisations and researchers, and how AI tools might work with IIIF collections. He highlighted the intellectual property questions raised by 'open access' collections being mined for AI models, and pointed people to HaveIBeenTrained to see if their collections have been scraped.

I was struck by the delicate balance between maintaining trust and secure provenance while also supporting creative and playful uses of AI in collections. Labelling generative AI images and texts is vital. Detecting subtle errors and structural biases requires effort and expertise. As a sector, we need to keep learning, talking and collaborating to understand what generative AI means for users and collection holders.

The first question from the audience was about the environmental impact of AI. I was able to say that our work-in-progress principles for AI at the British Library ask people to consider the environmental impact of AI (not just its carbon footprint, but also water usage and rare minerals mining) in balance with other questions of public value for proposed experiments and projects. Ciaran said that Manchester have appointed a sustainability manager, which is probably something we'll see more of in future. There was a question about what employers are looking for in library and informatics students; about where to go for information and inspiration about AI in libraries (AI4LAM is a good start); and about how to update people's perceptions of libraries and the skills of library professionals.

Thanks to everyone at CILIP for all the work they put into the conference, and the fantastic AV team working in the keynote room at the Birmingham Hilton Metropole.