Digital scholarship blog

Enabling innovative research with British Library digital collections

18 March 2024

Handwritten Text Recognition of the Dunhuang manuscripts: the challenges of machine learning on ancient Chinese texts

This blog post is by Peter Smith, DPhil Student at the Faculty of Asian and Middle Eastern Studies, University of Oxford

 

Introduction

The study of writing and literature has been transformed by the mass transcription of printed materials, aided significantly by the use of Optical Character Recognition (OCR). This has enabled textual analysis through a growing array of digital techniques, ranging from simple word searches in a text to linguistic analysis of large corpora – the possibilities are yet to be fully explored. However, printed materials are only one expression of the written word and tend to be more representative of certain types of writing. These may be shaped by efforts to standardise spelling or character variants, they may use more formal or literary styles of language, and they are often edited and polished with great care. They will never reveal the great, messy diversity of features that occur in writings produced by the human hand. What of the personal letters and documents, poems and essays scribbled on paper with no intention of distribution; the unpublished drafts of a major literary work; or manuscript editions of various classics that, before the use of print, were the sole means of preserving ancient writings and handing them on to future generations? These are also a rich resource for exploring past lives and events or expressions of literary culture.

The study of handwritten materials is not new but, until recently, the possibilities for analysing them using digital tools have been quite limited. With the advent of Handwritten Text Recognition (HTR) the picture is starting to change. HTR applications such as Transkribus and eScriptorium are capable of learning to transcribe a broad range of scripts in multiple languages. As the potential of these platforms develops, large collections of manuscripts can be automatically transcribed and consequently explored using digital tools. Institutions such as the British Library are doing much to encourage this process and improve accessibility of the transcribed works for academic research and the general interest of the public. My recent role in an HTR project at the Library represents one small step in this process, and here I hope to provide a glimpse behind the scenes and a look at some of the challenges of developing HTR.

As a PhD student exploring classical Chinese texts, I was delighted to find a placement at the British Library working on HTR of historical Chinese manuscripts. This project proceeded under the guidance of my British Library supervisors Dr Adi Keinan-Schoonbaert and Mélodie Doumy. I was also provided with support and expertise from outside of the Library: Colin Brisson is part of a group working on Chinese Historical documents Automatic Transcription (CHAT). They have already gathered and developed preliminary models for processing handwritten Chinese with the open source HTR application eScriptorium. I worked with Colin to train the software further using materials from the British Library. These were drawn entirely from the fabulous collection of manuscripts from Dunhuang, China, which date back to the Tang dynasty (618–907 CE) and beyond. Examples of these can be seen below, along with reference numbers for each item, and the originals can be viewed on the new website of the International Dunhuang Programme. Some of these texts were written with great care in standard Chinese scripts and are very well preserved. Others are much more messy: cursive scripts, irregular layouts, character corrections, and margin notes are all common features of handwritten work. The writing materials themselves may be stained, torn, or eaten by animals, resulting in missing or illegible text. All these issues have the potential to mislead the ‘intelligence’ of a machine. To overcome such challenges the software requires data – multiple examples of the diverse elements it might encounter and instruction as to how they should be understood.

The challenges encountered in my work on HTR can be examined in three broad categories, reflecting three steps in the HTR process of eScriptorium: image binarisation, layout segmentation, and text recognition.

 

Image binarisation

The first task in processing an image is to reduce its complexity, to remove any information that is not relevant to the output required. One way of doing this is image binarisation, taking a colour image and using an algorithm to strip it of hue and brightness values so that only black and white pixels remain. This was achieved using a binarisation model developed by Colin Brisson and his partners. My role in this stage was to observe the results of the process and identify strengths and weaknesses in the current model. These break down into three different categories: capturing details, stained or discoloured paper, and colour and density of ink.

1. Capturing details

In the process of distinguishing the brushstrokes of characters from other random marks on the paper, it is perhaps inevitable that some thin or faint lines – occurring as a feature of the handwritten text or through deterioration over time – might be lost during binarisation. Typically, the binarisation model does very well in picking them out, as seen in figure 1:

Fig 1. Good retention of thin lines (S.3011, recto image 23)

While problems with faint strokes are understandable, it was surprising to find that loss of detail was also an issue in somewhat thicker lines. I wasn’t able to determine the cause of this but it occurred in more than one image. See figures 2 and 3:

Fig 2. Loss of detail in thick lines (S.3011, recto image 23)

Fig 3. Loss of detail in thick lines (S.3011, recto image 23)

2. Stained and discoloured paper

Where paper has darkened over time, the contrast between ink and background is diminished and during binarisation some writing may be entirely removed along with the dark colours of the paper. Although I encountered this occasionally, unless the background was really dark the binarisation model did well. One notable success is its ability to remove the dark colours of partially stained sections. This can be seen in figure 4, where a dark stain is removed while a good amount of detail is retained in the written characters.

Fig 4. Good retention of character detail on heavily stained paper (S.2200, recto image 6)

3. Colour and density of ink

The majority of manuscripts are written in black ink, ideal for creating good contrast with most background colourations. In some places however, text may be written with less concentrated ink, resulting in greyer tones that are not so easy to distinguish from the paper. The binarisation model can identify these correctly but sometimes it fails to distinguish them from the other random markings and colour variations that can be found in the paper of ancient manuscripts. Of particular interest is the use of red ink, which is often indicative of later annotations in the margins or between lines, or used for the addition of punctuation. The current binarisation model will sometimes ignore red ink if it is very faint but in most cases it identifies it very well. In one impressive example, shown in figure 5, it identified the red text while removing larger red marks used to highlight other characters written in black ink, demonstrating an ability to distinguish between semantic and less significant information.

Fig 5. Effective retention of red characters and removal of large red marks (S.2200, recto image 7)

In summary, the examples above show that the current binarisation model is already very effective at eliminating unwanted background colours and stains while preserving most of the important character detail. Its response to red ink illustrates a capacity for nuanced analysis. It does not treat every red pixel in the same way, but determines whether to keep it or remove it according to the context. There is clearly room for further training and refinement of the model but it already produces materials that are quite suitable for the next stages of the HTR process.

 

Layout segmentation

Segmentation defines the different regions of a digitised manuscript and the type of content they contain, either text or image. Lines are drawn around blocks of text to establish a text region and for many manuscripts there is just one per image. Anything outside of the marked regions will just be ignored by the software. On occasion, additional regions might be used to distinguish writings in the margins of the manuscript. Finally, within each text region the lines of text must also be clearly marked. Having established the location of the lines, they can be assigned a particular type. In this project the options include ‘default’, ‘double line’, and ‘other’ – the purpose of these will be explored below.

All of this work can be automated in eScriptorium using a segmentation model. However, when it comes to analysing Chinese manuscripts, this model was the least developed component in the eScriptorium HTR process and much of our work focused on developing its capabilities. My task was to run binarised images through the model and then manually correct any errors. Figure 6 shows the eScriptorium interface and the initial results produced by the segmentation model. Vertical sections of text are marked with a purple line and the endings of each section are indicated with a horizontal pink line.

Fig 6. Initial results of the segmentation model, showing multiple errors. The text is the Zhuangzi Commentary by Guo Xiang (S.1603)

This example shows that the segmentation model is very good at positioning a line in the centre of a vertical column of text. Frequently, however, single lines of text are marked as a sequence of separate lines while other lines of text are completely ignored. The correct output, achieved through manual segmentation, is shown in figure 7. Every line is marked from beginning to end with no omissions or inappropriate breaks.

Fig 7. Results of manual segmentation showing the text region (the blue rectangle) and the single and double lines of text (S.1603)

Once the lines of a text are marked, line masks can be generated automatically, defining the area of text around each line. Masks are needed to show the transcription model (discussed below) exactly where it should look when attempting to match images on the page to digital characters. The example in figure 8 shows that the results of the masking process are almost perfect, encompassing every Chinese character without overlapping other lines.

Fig 8. Line masks outline the area of text associated with each line (S.1603)

The main challenge with developing a good segmentation model is that manuscripts in the Dunhuang collection have so much variation in layout. Large and small characters mix together in different ways and the distribution of lines and characters can vary considerably. When selecting material for this project I picked a range of standard layouts. This provided some degree of variation but also contained enough repetition for the training to be effective. For example, the manuscript shown above in figures 6–8, the Zhuangzi Commentary by Guo Xiang, combines a classical text written in large characters with interspersed double lines of commentary in smaller writing. The large text is assigned the ‘default’ line type while the smaller lines of commentary are marked as ‘double line’ text. There is also an ‘other’ line type which can be applied to anything that isn’t part of the main text – margin notes are one example. Line types do not affect how characters are transcribed but they can be used to determine how different sections of text relate to each other and how they are assembled and formatted in the final output files.
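As a rough illustration of how line types might steer the final output, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Line` class, the `assemble` function, the choice to wrap commentary in square brackets and the decision to leave ‘other’ lines out are all hypothetical, and none of it reflects eScriptorium's actual export code.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Line:
    text: str
    line_type: str  # 'default', 'double line' or 'other'


def assemble(lines: List[Line]) -> str:
    """Join transcribed lines in reading order, formatted by line type.

    Hypothetical convention: main text is kept as-is, consecutive
    'double line' commentary is merged and wrapped in square brackets,
    and 'other' lines (e.g. margin notes) are left out of the main output.
    """
    parts = []
    for line_type, group in groupby(lines, key=lambda line: line.line_type):
        text = "".join(line.text for line in group)
        if line_type == "default":
            parts.append(text)
        elif line_type == "double line":
            parts.append("[" + text + "]")
        # any other type is skipped in this sketch
    return "".join(parts)


# Placeholder strings stand in for transcribed characters.
lines = [
    Line("AB", "default"),
    Line("cd", "double line"),
    Line("ef", "double line"),
    Line("x", "other"),
    Line("GH", "default"),
]
print(assemble(lines))
```

The point is simply that the transcription of each line stays the same whatever its type; the type only changes how the pieces are stitched together afterwards.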

Fig 9. A section from the Lotus Sūtra with a text region, lines of prose, and lines of verse clearly marked (Or8210/S.1338)

Figures 8 and 9, above, represent standard layouts used in the writing of a text but manuscripts contain many elements that are more random. Of these, inter-line annotations are a good example. They are typically added by a later hand, offering comments on a particular character or line of text. Annotations might be as short as a single character (figure 10) or could be a much longer comment squeezed in between the lines of text (figure 11). In such cases these additions can be distinguished from the main text by being labelled with the ‘other’ line type.

Fig 10. Single character annotation in S.3011, recto image 14 (left) and a longer annotation in S.5556, recto image 4 (right)

Fig 11. A comment in red ink inserted between two lines of text (S.2200, recto image 5)

Other occasional features include corrections to the text. These might be made by the original scribe or by a later hand. In such cases one character may be blotted out and a replacement added to the side, as seen in figure 12. For the reader, these should be understood as part of the text itself but for the segmentation model they appear similar or identical to annotations. For the purpose of segmentation training any irregular features like this are identified using the ‘other’ line type.

Fig 12. Character correction in S.3011, recto image 23.

As the examples above show, segmentation presents many challenges. Even the standard features of common layouts offer a degree of variation and in some manuscripts irregularities abound. However, work done on this project has now been used for further training of the segmentation model and reports are promising. The model appears capable of learning quickly, even from relatively small data sets. As the process improves, time spent using and training the model offers increasing returns. Even if some errors remain, manual correction is always possible and segmented images can pass through to the final stage of text recognition.

 

Text recognition

Although transcription is the ultimate aim of this process it consumed less of my time on the project so I will keep this section relatively brief. Fortunately, this is another stage where the available model works very well. It had previously been trained on other print and manuscript collections so a well-established vocabulary set was in place, capable of recognising many of the characters found in historical writings. Dealing with handwritten text is inevitably a greater challenge for a transcription model but my selection of manuscripts included several carefully written texts. I felt there was a good chance of success and was very keen to give it a go, hoping I might end up with some usable transcriptions of these works. Once the transcription model had been run I inspected the first page using eScriptorium’s correction interface as illustrated in figure 13.

Fig 13. Comparison of image and transcription in eScriptorium’s correction interface.

The interface presents a single line from the scanned image alongside the digitally transcribed text, allowing me to check each character and amend any errors. I quickly scanned the first few lines hoping I would find something other than random symbols – I was not disappointed! The results weren’t perfect, of course, but one or two lines actually came through with no errors at all and, in general, the character error rate seemed very low. After careful correction of the errors that remained and some additional work on the reading order of the lines, I was able to export one complete manuscript transcription, bringing the whole process to a satisfying conclusion.
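For anyone wondering how a character error rate is usually worked out, the sketch below shows one common definition: the edit (Levenshtein) distance between the model's output and a corrected reference, divided by the length of the reference. The function names are my own, and the sample line is just the well-known opening of the Zhuangzi with one character deliberately swapped, not data from this project.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Illustrative only: eight characters, one misread by the 'model'.
reference = "北冥有魚其名為鯤"
hypothesis = "北冥有魚其各為鯤"
print(character_error_rate(reference, hypothesis))  # 0.125
```

Because Chinese text is counted character by character, a single misread character in an eight-character line already gives a rate of 12.5%, so a low overall rate means long stretches of text are coming through untouched.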

 

Final thoughts

Naturally there is still some work to be done. All the models would benefit from further refinement and the segmentation model in particular will require training on a broader range of layouts before it can handle the great diversity of the Dunhuang collection. Hopefully future projects will allow more of these manuscripts to be used in the training of eScriptorium so that a robust HTR process can be established. I look forward to further developments and, for now, am very grateful for the chance I’ve had to work alongside my fabulous colleagues at the British Library and play some small role in this work.

 

15 March 2024

Call for proposals open for DigiCAM25: Born-Digital Collections, Archives and Memory conference

Digital research in the arts and humanities has traditionally tended to focus on digitised physical objects and archives. However, born-digital cultural materials that originate and circulate across a range of digital formats and platforms are rapidly expanding and increasing in complexity, which raises opportunities and issues for research and archiving communities. Collecting, preserving, accessing and sharing born-digital objects and data presents a range of technical, legal and ethical challenges that, if unaddressed, threaten the archival and research futures of these vital cultural materials and records of the 21st century. Moreover, the environments, contexts and formats through which born-digital records are mediated necessitate reconceptualising the materials and practices we associate with cultural heritage and memory. Research and practitioner communities working with born-digital materials are growing and their interests are varied, from digital cultures and intangible cultural heritage to web archives, electronic literature and social media.

To explore and discuss issues relating to born-digital cultural heritage, the Digital Humanities Research Hub at the School of Advanced Study, University of London, in collaboration with British Library curators, colleagues from Aarhus University and the Endangered Material Knowledge Programme at the British Museum, is currently inviting submissions for the inaugural Born-Digital Collections, Archives and Memory conference, which will be hosted at the University of London and online from 2-4 April 2025. The full call for proposals and the submission portal are available at https://easychair.org/cfp/borndigital2025.

Text on image says Born-Digital Collections, Archives and Memory, 2 - 4 April 2025, School of Advanced Study, University of London

This international conference seeks to further an interdisciplinary and cross-sectoral discussion on how the born-digital transforms what and how we research in the humanities. We welcome contributions from researchers and practitioners involved in any way in accessing or developing born-digital collections and archives, and interested in exploring the novel and transformative effects of born-digital cultural heritage. Areas of particular (but not exclusive) interest include:

  1. A broad range of born-digital objects and formats:
    • Web-based and networked heritage, including but not limited to websites, emails, social media platforms/content and other forms of personal communication
    • Software-based heritage, such as video games, mobile applications, computer-based artworks and installations, including approaches to archiving, preserving and understanding their source code
    • Born-digital narrative and artistic forms, such as electronic literature and born-digital art collections
    • Emerging formats and multimodal born-digital cultural heritage
    • Community-led and personal born-digital archives
    • Physical, intangible and digitised cultural heritage that has been remediated in a transformative way in born-digital formats and platforms
  2. Theoretical, methodological and creative approaches to engaging with born-digital collections and archives:
    • Approaches to researching the born-digital mediation of cultural memory
    • Histories and historiographies of born-digital technologies
    • Creative research uses and creative technologist approaches to born-digital materials
    • Experimental research approaches to engaging with born-digital objects, data and collections
    • Methodological reflections on using digital, quantitative and/or qualitative methods with born-digital objects, data and collections
    • Novel approaches to conceptualising born-digital and/or hybrid cultural heritage and archives
  3. Critical approaches to born-digital archiving, curation and preservation:
    • Critical archival studies and librarianship approaches to born-digital collections
    • Preserving and understanding obsolete media formats, including but not limited to CD-ROMs, floppy disks and other forms of optical and magnetic media
    • Preservation challenges associated with the platformisation of digital cultural production
    • Semantic technology, ontologies, metadata standards, markup languages and born-digital curation
    • Ethical approaches to collecting and accessing ‘difficult’ born-digital heritage, such as traumatic or offensive online materials
    • Risks and opportunities of generative AI in the context of born-digital archiving
  4. Access, training and frameworks for born-digital archiving and collecting:
    • Institutional, national and transnational approaches to born-digital archiving and collecting
    • Legal, trustworthy, ethical and environmentally sustainable frameworks for born-digital archiving and collecting, including attention to cybersecurity and safety concerns
    • Access, skills and training for born-digital research and archives
    • Inequalities of access to born-digital collecting and archiving infrastructures, including linguistic, geographic, economic, legal, cultural, technological and institutional barriers

Options for Submissions

A number of different submission types are welcomed and there will be an option for some presentations to be delivered online.

  • Conference papers (150-300 words)
    • Presentations lasting 20 minutes. Papers will be grouped with others on similar subjects or themes to form a complete session. There will be time for questions at the end of each session.
  • Panel sessions (100 word summary plus 150-200 words per paper)
    • Proposals should consist of three or four 20-minute papers. There will be time for questions at the end of each session.
  • Roundtables (200-300 word summary and 75-100 word bio for each speaker)
    • Proposals should include between three and five speakers, inclusive of a moderator, and each session will be no more than 90 minutes.
  • Posters, demos & showcases (100-200 words)
    • These can be traditional printed posters, digital-only posters, digital tool showcases, or software demonstrations. Please indicate the form your presentation will take in your submission.
    • If you propose a technical demonstration of some kind, please include details of technical equipment to be used and the nature of assistance (if any) required. Organisers will be able to provide a limited number of external monitors for digital posters and demonstrations, but participants will be expected to provide any specialist equipment required for their demonstration. Where appropriate, posters and demos may be made available online for virtual attendees to access.
  • Lightning talks (100-200 words)
    • Talks will be no more than 5 minutes and can be used to jump-start a conversation, pitch a new project, find potential collaborations, or try out a new idea. Reports on completed projects would be more appropriately given as 20-minute papers.
  • Workshops (150-300 words)
    • Please include details about the format, length, proposed topic, and intended audience.

Proposals will be reviewed by members of the programme committee. The peer review process will be double-blind, so no names or affiliations should appear on the submissions. The one exception is proposals for roundtable sessions, which should include the names of proposed participants. All authors and reviewers are required to adhere to the conference Code of Conduct.

The submission deadline for proposals is 15 May 2024, and notification of acceptance is scheduled for late July 2024. Organisers plan to make a number of bursaries available to presenters to cover the cost of attendance and details about these will be shared when notifications are sent. 

Key Information:

  • Dates: 2 - 4 April 2025
  • Venue: University of London, London, UK & online
  • Call for papers deadline: 15 May 2024
  • Notification of acceptance: late July 2024
  • Submission link: https://easychair.org/cfp/borndigital2025

Further details can be found on the conference website and the call for proposals submission portal at https://easychair.org/cfp/borndigital2025. If you have any questions about the conference, please contact the organising committee at [email protected].

28 February 2024

Safeguarding Tomorrow: The Impact of AI on Media and Information Industries

The British Library has joined forces with the Guardian to hold a summit on the complex policy impacts of AI on media and information industries. The summit, chaired by broadcaster and author Timandra Harkness, brings together politicians, policy makers, industry leaders, artists and academics to shed light on key issues facing the media, newspapers, broadcasting, library and publishing industries in the age of AI. The summit is on Monday 11 March 2024 14:00 - 17:20; networking reception 17:30 - 19:00 GMT; the ticket link is below.

Lucy Crompton-Reid, Chief Executive of Wikimedia UK; Sara Lloyd, Group Communications Director & Global AI Lead at Pan Macmillan and Matt Rogerson from the Guardian will tackle the issue of copyright in the age of algorithms.

Novelist Tahmima Anam; Greg Clark MP, Chair Science & Technology Committee; Chris Moran from the Guardian and Roly Keating, Chief Executive of the British Library will discuss the issue of AI generated misinformation and bias.

 

A conference panel on stage with an audience in raked seating
Data Debate event at the British Library chaired by Timandra Harkness

AI is rapidly changing the world as we know it, and the media and information industries are no exception. AI-powered technologies are already being used to automate tasks, create personalised content, and deliver targeted advertising. In the process AI is quickly becoming both a friend and a foe. People can use AI to flood the online environment with misinformation, creating significant worries, for example, around how deep fakes, and AI personalised and targeted content could influence democratic processes. At the same time, AI could become a key tool to combat misinformation by identifying fake news articles and social media posts.

Many creators of content - from the organisations creating and publishing content, to individual authors, artists and actors - are worried that their copyright has been infringed by AI and we have already seen a flurry of legal action, mostly in the United States. At the same time, many artists are embracing AI as a part of their creative process. The recent British Library exhibition on Digital Storytelling explored the ways technology provides new opportunities to transform and enhance the way writers write and readers engage, including interactive works that invite and respond to user input, and reading experiences influenced by data feeds.

And it is not only in the world of news that there is a danger of AI misinformation. In science, where AI is revolutionising many areas of research from helping us discover new drugs to aiding research on complexities of climate change, we are, at the same time, encountering the issue of fake, AI generated scientific articles. For libraries, AI holds the future promise of improving discovery and access to information, which would help library users to find relevant information quickly. Yet, AI is also introducing significant new challenges when it comes to understanding the provenance of information sources, especially in making the public aware if the information has been created or selected by algorithms rather than human beings.

How will we know - and will we care - if our future newspapers, television programmes and library enquiries are mediated and delivered by AI? Or if the content we are consuming is a machine rather than a human creation? We are used to making judgements about people and organisations that we trust on the basis of how we perceive their professional integrity, political leanings, their stance on the issues that we care about, or just likability and charisma of the individual in front of us. How will we make similar judgments about an algorithm and its inherent bias? And how will we govern and manage this new AI-powered environment?

Governmental regulation of AI is under development in the UK, the US, the EU and elsewhere. At the beginning of February 2024 the UK government released its response to the UK AI Regulation White Paper, signalling the continuation of ‘agile’ AI regulation in the UK, which attempts to balance innovation and the economic benefits of AI while also giving existing regulators greater responsibility for AI. The government’s response also reserves an option for more binding regulation in the future. For some, such as tech companies investing in AI products, this creates uncertainty for their future business models. For others, especially artists and many in the creative industries who are affected by AI, there is disappointment at the absence of regulation on the training of AI using content under copyright.

Inevitably, as AI further develops and becomes more prevalent, the issues of its regulation and adoption in society will continue to evolve. AI will continue to challenge the ways in which we understand creators’ rights, individual and corporate governance and management of information, and the ways in which we acquire knowledge, trust different information sources, and form our opinions on everything from what to buy to who to vote for.

Join us to discuss the challenges and opportunities ahead. You can book your place on Eventbrite: https://www.eventbrite.co.uk/e/safeguarding-tomorrow-the-impact-of-ai-in-media-information-industries-tickets-814482728767?aff=oddtdtcreator.

18 October 2023

Join the British Library as a Digital Curator, OCR/HTR

In this post, Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, shares some background information on how the new post advertised for a Digital Curator for OCR/HTR will help the Library streamline post-digitisation work to make its collections even more accessible to users.

 

We’ve been digitising our collections for almost three decades, opening up access to incredibly diverse and rich collections, for our users to study and enjoy. However, it is important that we further support discovery and digital research by unlocking the huge potential in automatically transcribing our collections!

We’ve done some work over the years towards making our collection items available in machine-readable format, in order to enable full-text search and analysis. Optical Character Recognition (OCR) technology has been around for a while, and there are several large-scale projects that produced OCRed text alongside digitised images – such as the Microsoft Books project. Until recently, print collections in Western languages have been the main focus, especially newspaper collections. A flagship collaboration with the Alan Turing Institute, the Living with Machines project, applied OCR technology to UK newspapers, designing and implementing new methods in data science and artificial intelligence, and analysing these materials at scale.

OCR of Bengali books using Transkribus, Two Centuries of Indian Print Project

 

Machine Learning technologies have been dealing increasingly well with both modern and historical collections, whether printed, typewritten or handwritten. Taking a broader perspective on Library collections, we have been exploring opportunities with non-Western collections too. Library staff have been engaging closely with the exploration of OCR and Handwritten Text Recognition (HTR) systems for English, Bangla and Arabic. Digital Curators Tom Derrick, Nora McGregor and Adi Keinan-Schoonbaert have teamed up with PRImA Research Lab and the Alan Turing Institute to run four competitions in 2017-2019, inviting providers of text recognition methods to try them out on our historical material.

We have been working with Transkribus as well – for example, Alex Hailey, Curator for Modern Archives and Manuscripts, used the software to automatically transcribe 19th century botanical records from the India Office Records. A digital humanities work strand led by former colleague Tom Derrick saw the OCR of most of our digitised collection of Bengali printed texts, digitised as part of the Two Centuries of Indian Print project. More recently Transkribus has been used to extract text from catalogue cards in a project called Convert-a-Card, as well as from Incunabula print catalogues.

An example of a catalogue card in Transkribus, showing segmentation and transcription

 

The British Library is now looking for someone to join us to further improve the access and usability of our digital collections, by integrating a standardised OCR and HTR production process into our existing workflows, in line with industry best practice.

For more information and to apply please visit the British Library recruitment site and look for the Digital Curator for OCR/HTR position. Applications close on Sunday 5 November 2023. Please pay close attention to questions asked in the application process. Any questions? Drop us a line at [email protected].

Good luck!

 

09 October 2023

Strike a Pose Steampunk style! For our Late event with Clockwork Watch on Friday 13th October

This Friday (13th October) the British Library invites you to join the world of Clockwork Watch by Yomi Ayeni, a participatory storytelling project, set in a fantastical retro-futurist vision of Victorian England, with floating cities and sky pirates, which is one of the showcased narratives in our Digital Storytelling exhibition.

Flyer with text saying Late at the Library, Digital Steampunk at the British Library, London. Friday 13 October, 19:30 – 22:30

We are delighted that Dark Box Images will be bringing their portable darkroom to the Late at the Library: Digital Steampunk event and taking portrait photographs. If this appeals to you, then please arrive early to have your picture taken. Photographer Gregg McNeill is an expert in the wet plate collodion process invented by Frederick Scott Archer in 1851. Gregg’s skill in using an authentic Victorian camera creates genuinely remarkable results that appear right in front of your eyes.

Black and white photograph of a woman wearing an elaborate outfit and a mask with her arms outstretched wide with fabric like wings
Wet plate collodion photograph of Jennifer Garside of Wyte Phantom corsetry, taken by Gregg McNeill of Dark Box Images

If you want to pose for the camera at our steampunk Late, or have a portrait drawn by artist Doctor Geof, please don’t be shy: this is an event where guests are encouraged to dress to impress! The aesthetic of steampunk fashion is inspired by Victoriana and 19th Century literature, including Jules Verne’s novels and the Sherlock Holmes stories by Sir Arthur Conan Doyle. Steampunk looks can include hats and goggles, tweed tailoring, waistcoats, corsets, fob watches and fans. Whatever your personal style, we encourage you to unleash your creativity when putting together an outfit for this event.

Furthermore, whether you are seeking a new look or some finishing touches, there will be an opportunity to browse a Night Market at this Late event, where you can purchase and admire a range of exquisite hand crafted items created by:

  • Jema Hewitt, a professional costumer and academic, will be bringing some of her unique, handmade jewellery and accessories to the Library Late event. She was one of the originators of the early artistic steampunk scene in the UK, subsequently exhibiting her costume work internationally, and having three how-to-make books published as her alter ego “Emilly Ladybird”. Jema currently specialises as a pattern cutter for film, theatre and TV, as well as lecturing and teaching workshops.
Photograph of jewellery, hats and clothing
Jewellery, hats and clothing created by Jema Hewitt/Emilly Ladybird
  • Doctor Geof, an artist, scientist, comics creator and maker of whimsical objects. His work is often satirical, usually with an historical twist, and features tea, goblins, krakens, steampunk, smut, nuns, bees, cats and more tea. Since 2004 you may have encountered him selling his comics, prints, cards, mugs, pins, and for some reason a lot of embroidered badges (including an Evil Librarian patch!) at various events. As one of the foremost Steampunk artists in the UK, Doctor Geof has worked with and exhibited at the Cutty Sark, Royal Museums Greenwich, and Discovery Museum Newcastle. He is a talented portrait artist, so please seek him out if you would like him to capture your likeness in ink and watercolour.
A round embroidered patch with a cartoon figure wearing goggles and carrying books. Text says "Evil Librarian"
Evil Librarian embroidered patch by Dr Geof

  • Jennifer Garside, a seamstress specialising in modern corsetry, which takes inspiration from historical styles. Her business, Wyte Phantom, opened in 2010, and she has made costumes for opera singers, performers and artists across the world.

  • Tracy Wells, a couture milliner based in the Lake District. She creates all kinds of hats and headpieces, often collaborating with other artists to explore new styles, concepts and genres.
Photograph of a woman wearing a steampunk hat with feathers
Millinery by Tracy Wells
  • Herr Döktor, a renowned inventor, gadgeteer, and contraptionist, who has been working in his Laboratory in the Surrey Hills for the last two decades, building a better future via the prism of history. He will be bringing a small selection of his inventions and scale models of his larger ideas. (His alter ego, Ian Crichton, is a professional model maker with thirty years’ experience as a toy prototype maker, museum and exhibition designer, and, most recently, builder of props and models for the film industry. He also lives in the Surrey Hills.)
Photograph of a man wearing a top hat and carrying a model submarine
Herr Döktor, inventor, gadgeteer, and contraptionist. Photograph by Adam Stait
  • Linette Withers established Anachronalia in 2012 to be a full-time bookbinder, producing historically-inspired books, miniature books, and quirky stationery. Her work has been shortlisted for display at the Bodleian Library at the University of Oxford as part of their ‘Redesigning the Medieval Book’ competition and exhibition in 2018 and one of her books is held in the permanent collection of The Lit & Phil in Newcastle after being part of an exhibition of bookbinding in 2021. She also teaches bookbinding in her studio in Leeds.

  • Heather Hayden of Diamante Queen Designs creates handmade vintage inspired, kitsch, macabre, noir accessories for everybody to wear and enjoy. Heather studied fashion and surface pattern design in the 80's near Leeds during the emergence of Gothic culture and has remained interested in the darker side of life ever since. She became fascinated with Steampunk after seeing Datamancer's Steampunk computer, loving the juxtaposition of new and old technology. This inspired her to make steampunk clothing and accessories using old and found items and upcycling as much as possible.
Photograph of a mannequin head wearing a headpiece with tassels, feathers, flowers and beads
Headpiece by Diamante Queen Designs
  • Matthew Chapman of Raphael's Workshop specialises in creating strange and sublime chainmail items, bringing ideas to life in metal that few would ever consider. From collars to corsets, serpents to squids, arms to armour and medals to masterpieces, you should visit his stall and see what creations spark the imagination.
Photograph of a table displaying a range of wearable items of chainmail jewellery and accessories
Chainmail jewellery and accessories created by Raphael's Workshop

We hope that this post has whetted your appetite for the delights available at the Late at the Library: Digital Steampunk event on Friday 13th October at the British Library. Tickets can be booked here.

02 October 2023

Last chance to see the Digital Storytelling exhibition

All good things must come to an end. No, I’m not talking about the collapse of a favourite high street chain store beginning with W, but the final few weeks of our Digital Storytelling exhibition, which closes on the 15th October 2023. If you haven’t seen it yet, then this is your last chance to book!

Digital Storytelling showcases eleven different born digital works, including interactive narratives that respond to user input, reading experiences personalised by data feeds, and immersive multimedia story worlds developed through audience participation. The works range from thought-provoking autobiographical hypertexts to data journalism, uncanny ghost stories to weather poetry, and steampunk literary adaptation to quirky Elizabethan medical comedy.

Digital Storytelling exhibition image with art from Astrologaster, Seed, 80 Days, and Zombies, Run!

If you want to hear more about this exhibition, Digital Curator Stella Wisdom will be giving two talks later this week. The first of these will be in-person on Thursday evening, 5th October, in Richmond Lending Library for the Richmond Reads season of events, celebrating the joys and benefits of reading. The second will be held online on Friday morning, 6th October, for the DARIAH-EU autumn 2023 Friday Frontiers series.

We are also delighted to share that there is a chapter about interactive digital books written by Giulia Carla Rossi, Curator for Digital Publications, in The Book by Design, which was recently launched by our colleagues in British Library Publishing. Giulia’s chapter discusses innovative Editions at Play publications, including Seed by Joanna Walsh and Breathe by Kate Pullinger, which are both currently displayed in Digital Storytelling.

Before the Digital Storytelling exhibition closes, we'd love you to join us for a party on the evening of Friday 13th October. For one night only, transmedia storyteller Yomi Ayeni will transform the British Library into the Clockwork Watch story world for an immersive steampunk late event.

Genre-bending DJ Sacha Dieu will be spinning the best in Balkan Gypsy, Electro Swing, and Global Beats. Professor Elemental will perform live for us, and we really hope he’ll sing I Love Libraries! You'll also be able to view the Digital Storytelling exhibition, and there will be quieter areas where you can explore 19th Century London in Minecraft, play board games including Great Scott! The Game of Mad Invention with games librarian Marion Tessier, and discover poetry with the Itinerant Poetry Librarian.

If you plan to party with us, book your ticket here.

27 September 2023

Late at the Library: Digital Steampunk

Summer may be over, but there is much to look forward to this autumn, including our Late at the Library: Digital Steampunk event on Friday 13th October 2023, where we invite you to immerse yourself in the Clockwork Watch story world, party with chap hop maestro Professor Elemental and explore 19th-century London in Minecraft. If these kinds of shenanigans sound right up your street, then book tickets here and join us!

Clockwork Watch by Yomi Ayeni is currently showcased in the British Library’s Digital Storytelling exhibition, which is open until 15 October 2023. Set in a retro-futurist steampunk Victorian England, Clockwork Watch is a participatory story that includes multiple voices and perspectives on themes relating to empire, colonialism, exploitation and resistance. It is told across a range of formats, including a series of graphic novels (there is an overview of these titles here), immersive theatre, role play, and an online newspaper, the London Gazette.

Drawing of a range of people in steampunk clothing, in front of a London skyline
Steampunk Illustration by Brett Walsh

For the evening of Friday 13th October, the British Library will transform into the story world of the next part of the Clockwork Watch narrative, featuring an auction of the last few remaining properties on Peak B, and the opening of bids for Peak C, new housing developments situated on floating islands hovering over the British Channel. Leggett and Scarper, the estate agents managing these properties, will also be inviting inventors, or anyone with a solution to the problems plaguing these floating islands, to submit their plans for a chance to win a Golden Ticket to one of the new homes on Peak C.

Illustration of Peak B property development on a floating island
© Clockwork Watch / Graham Leggett 2023

Attendees will be able to explore the streets of Sherlock Holmes’ London in Minecraft created by Blockworks and Lancaster University, visit the Night Market, have a photograph taken with authentic Victorian Dark Box photography, or a portrait drawn by artist Dr Geof, and that’s before the auction begins. But be warned, buying your way into this real estate dreamworld is not straightforward – this night is a golden opportunity for the Clockwork Watch underbelly of pickpockets, rogues and vagabonds.

Dressing up and joining in is heartily encouraged. To prepare for this event, we suggest reading the Clockwork Watch graphic novels; you can order these online, or purchase the first two omnibus editions from the British Library’s onsite shop. Also check out the London Gazette website and this special British Library edition of the newspaper. We hope to see you there!

Cover page of the London Gazette British Library edition
© Clockwork Watch

26 September 2023

Let’s learn together - Join us in the Cultural Heritage Open Scholarship Network

Are you working in Galleries-Libraries-Archives-Museums (GLAM) and cultural heritage organisations as research support and research-active staff? Are you interested in developing knowledge and skills in open scholarship? Would you like to establish good practices, share your experience with others and collaborate? If your answer is yes to one or more of these questions, we invite you to join the Cultural Heritage Open Scholarship Network (CHOSN).

CHOSN was initiated by the British Library’s Research Infrastructure Services, building on the experience of, and positive responses to, the open scholarship training programme that was run earlier this year. It is a community of practice for research support and research-active staff who work in GLAMs and are interested in developing and sharing open scholarship knowledge and skills, organising events, and supporting each other in this area.

GLAMs showcase a significant amount of research, but we may find ourselves with inadequate resources to make that research openly available, to gain the open scholarship skills needed to make it happen, or even to identify what counts as research in these environments. CHOSN aims to provide a platform to create synergy for those aiming for good practice in open scholarship.

CHOSN flyer image, text says: Cultural Heritage Open Scholarship Network (CHOSN). Are you working in Galleries-Libraries-Archives-Museums (GLAMs)? Join Us! To develop knowledge and skills in open scholarship, organise activities to learn and grow, and create a community of practise to collaborate and support each other.

This network may be of interest to anyone who facilitates, enables or supports research activities in GLAM organisations. This includes, but is not limited to, research support staff, research-active staff, librarians, curatorial teams, IT specialists and copyright officers. Anyone who is interested in open scholarship and works in a cultural heritage organisation is welcome.

Join us in the Cultural Heritage Open Scholarship Network (CHOSN) to:

  • explore research activities and roles in GLAMs and make them visible,
  • develop knowledge and skills in open scholarship,
  • carry out capacity development activities to learn and grow, and
  • create a community of practice to collaborate and support each other.

We have set up a JISC mailing list to start communication with the network; you can join by signing up here. We will shortly organise an online meeting to kick off the network plans, explore how to move forward and collectively discuss what we would like to do next. This will all be communicated via the CHOSN mailing list.

If you have any questions about CHOSN, we are happy to hear from you at [email protected].
