Digital scholarship blog

Enabling innovative research with British Library digital collections

Introduction

Tracking exciting developments at the intersection of libraries, scholarship and technology.

02 October 2019

The 2019 British Library Labs Staff Award - Nominations Open!

Looking for entries now!

A set of 4 light bulbs presented next to each other; the third light bulb is switched on. The image is a metaphor for an 'idea'.
Nominate a British Library staff member or a team that has done something exciting, innovative and cool with the British Library’s digital collections or data.

The 2019 British Library Labs Staff Award, now in its fourth year, gives recognition to current British Library staff who have created something brilliant using the Library’s digital collections or data.

Perhaps you know of a project that developed new forms of knowledge, or an activity that delivered commercial value to the library. Did the person or team create an artistic work that inspired, stimulated, amazed and provoked? Do you know of a project developed by the Library where quality learning experiences were generated using the Library’s digital content? 

You may nominate a current member of British Library staff, a team, or yourself (if you are a member of staff), for the Staff Award using this form.

The deadline for submission is 12:00 (BST), Tuesday 5 November 2019.

Nominees will be highlighted on Monday 11 November 2019 at the British Library Labs Annual Symposium where some (winners and runners-up) will also be asked to talk about their projects.

You can see the projects submitted by members of staff for the last two years' awards in our online archive, as well as blogs for last year's winners and runners-up.

The Staff Award complements the British Library Labs Awards, introduced in 2015, which recognise outstanding work that has been done in the broader community. Last year's winner focused on the brilliant work of the 'Polonsky Foundation England and France Project: Digitising and Presenting Manuscripts from the British Library and the Bibliothèque nationale de France, 700–1200'.

The runner up for the BL Labs Staff Award last year was the 'Digital Documents Harvesting and Processing Tool (DDHAPT)' which was designed to overcome the problem of finding individual known documents in the United Kingdom's Legal Deposit Web Archive.

In the public competition, last year's winners drew attention to artistic, research, teaching & learning, and commercial activities that used our digital collections.

British Library Labs is a project within the Digital Scholarship department at the British Library that supports and inspires the use of the Library's digital collections and data in exciting and innovative ways. It was previously funded by the Andrew W. Mellon Foundation and is now solely funded by the British Library.

If you have any questions, please contact us at labs@bl.uk.

 

20 September 2019

Labbers of the world unite to write a book in 1 week through a Book Sprint

Posted by Mahendra Mahey, Manager of BL Labs.

I can't believe it's been a year since people from national, state, regional, university libraries (as well as a few galleries, archives and museums) met in London to attend the first global 'Library Labs' event at the British Library on 13th and 14th of September 2018. These 'Labs' are increasingly found in cultural heritage and academic institutions around the world and offer a space for their users to experiment and innovate on-site and on-line with their own (and others') digitised and born digital collections and data.

We had over 70 people from 43 institutions and 20 countries attend the London event and it was really wonderful, with a very full programme. There was a palpable sense of excitement and a willingness to share experiences, build new professional relationships and witness the birth of a new international 'Labs' community. Through the event, we were able to understand more about the digital 'Labs' landscape around the world from the results of the Library Labs survey. For example, we learned that many institutions were in the process of planning a 'Lab', and many wanted to learn more about how to set one up, maintain and sustain it, and learn lessons from those that had already done so. About half of the attendees in London had already set up Labs in their organisations and wanted to share their experiences with other professionals, so that they could build better Labs and help others avoid reinventing the wheel, saving time and precious resources.

Growing an international Cultural Heritage Labs community
Some of the presenters from the first Building Library Labs event at the British Library, London, UK on 13-14 September 2018

The event was a mixture of presentations and lightning talks, stories of how labs are developing, parallel discussion groups and debates, many of which were videoed. By the end of the event, the collaborative document we had created contained over 60 edited pages of notes, together with a folder of other useful documents and presentations. We concluded that it would be wonderful to come together and convert these shared experiences into a useful book/guide, perhaps through a Book Sprint. A Book Sprint is where up to 15 people come together for a week and, with minimal distractions, work together to create a book. Each night, while the participants sleep, a remote team of illustrators and editors transforms their content ready for the next day. By the end of the week, a book has been created! A great idea for busy people! We felt it was a nice fit for the Labs community we work in or want to create, which is largely based on a 'mindset' of experimentation, taking risks and being prepared to learn from your mistakes. I started to research how it might be possible to hold such a Book Sprint by talking to the Book Sprint company, which has over 20 years' experience of organising and running these book creation events.

Collectively as a group we decided that we would continue to build the Labs community and establish a mailing list. Clemens Neudecker wrote an excellent blog post about the event.

A screen grab from a virtual Zoom meeting of the Building Library Labs community

Subsequently, we held various meetings from October 2018 through to February 2019 (some virtual and some face to face) and agreed to hold our next global Labs meeting at the Royal Danish Library in Copenhagen, Denmark on 4-5 March 2019, again with an action-packed programme, with the help of Katrine Gasser and her team at kbtechlab. Directly after that event, some of us participated in a pre-conference workshop as part of Digital Humanities Nordic 2019, DHN-Labs - Digital Humanities and the National and University Libraries and Archives (in the Nordic and Baltic Countries), on 6 March 2019.

Royal Danish Library, Copenhagen, Denmark, where the second Building Library Labs event was held on 4-5 March 2019

Over 50 people attended the 2-day event in Copenhagen. It was similar to the previous event in London, although this time we agreed to hold it under the Chatham House Rule (an idea from Kirsty Lingstadt from the University of Edinburgh), which many of us found very liberating.

Again, we managed to produce over 60 pages of notes and collect other relevant and helpful information. It was even more abundantly clear at the end of this event that we would definitely need to find a way for some of us to come together to write a book through the Book Sprint methodology previously proposed.

A very kind and generous offer to explore funding from her institution was made by Milena Dobreva-McPherson, Associate Professor of Library and Information Studies at University College London Qatar. Abigail Potter from the Library of Congress Labs also kindly suggested that she and her team may be able to hold the next global Labs meeting in Washington, USA, on 4-6 May 2020.

Milena and I met in May 2019 at the first Museums and Big Data conference in Qatar, organised by her colleague Georgios Papaioannou, Associate Professor of Museum Studies. We formulated a proposal to UCL Qatar (funded by the Qatar Foundation), which was successful. Milena also managed to obtain funding from the University of Qatar. There has also been support from British Library Labs, the Library of Congress Labs, Book Sprint Ltd (who agreed to donate half of the Book Sprint fee to run the event) and, finally, Qatar National Library.

What was important from the outset was that the digital version of the book should be made FREELY available on the web to reuse, in line with the spirit and ethos of the group.

Milena also managed to secure funding for research assistants Somia Salim and Fidelity Phiri to help create a global directory of organisations which are doing Labs style things or might want to. They have also helped out and are helping at various Labs style events including the Book Sprint.

From the first building library labs event in September 2018 to the present day there have been various events where the work of this community has been mentioned. Here is a small sample:

In July 2019, we released an open invitation to apply to be part of the Book Sprint and received some fantastic entries. We would like to thank everyone that sent an application and we would like to reassure everyone that they can still contribute to the community even if they were not chosen on this occasion.

We can now finally announce who will be attending the Book Sprint...drum roll...:

  1. Abigail Potter, Senior Innovation Specialist with the Library of Congress Digital Innovation Lab. She tweets at @opba.
  2. Aisha Al Abdulla, Section Head of the Digital Repository and Archives at Qatar University Library.
  3. Caleb Derven, Head of Technical and Digital Services at the University of Limerick with overall responsibility for strategy and operations related to collections, electronic resources and library systems. He tweets at @calebderven.
  4. Ditte Laursen, Head of Department, The Royal Library Denmark responsible for the acquisition of digitally born cultural heritage materials, long-term preservation of digital heritage collections, and access to digital cultural heritage collections. She tweets at @DitteDla.
  5. Gustavo Candela, Associate Professor at the University of Alicante and member of the Research and Development department at The Biblioteca Virtual Miguel de Cervantes. He tweets at @gus_candela.
  6. Katrine Gasser, Section Head of IT at The Royal Library Denmark managing a team of 40 IT experts in programming, networking and research. She tweets at @blackat_ and kbtechlab
  7. Kristy Kokegei, Director of Public Engagement at the History Trust of South Australia who oversees the organisation’s public programming, digital engagement, marketing, learning and education programs across 4 State Government funded museums and supporting and enabling 350 community museums and historical societies across South Australia. She tweets at @KristyKokegei and @SAGLAMLab.
  8. Lotte Wilms, Digital Scholarship advisor managing the KB Research Lab and Digital Humanities in libraries advocate, co-chair for the LIBER working group Digital Humanities and a board member of the IMPACT Centre of Competence. She tweets at @Lottewilms.
  9. Mahendra Mahey, Manager of British Library Labs (BL Labs), an Andrew W. Mellon foundation and British Library funded initiative supporting and inspiring the use of its data in innovative ways with scholars, artists, entrepreneurs, educators and innovators through competitions, awards and other engagement activities. He tweets at @BL_Labs and @mahendra_mahey.
  10. Milena Dobreva-McPherson, Associate Professor Library and Information Studies at UCL Qatar with international experience of working in Bulgaria, Scotland and Malta. She tweets at @Milena_Dobreva.
  11. Paula Bray, DX Lab Leader at the State Library of NSW and responsible for developing and promoting an innovation lab utilising emerging and existing web technologies to deliver new ways to explore the Library’s collections and its data. She tweets at @paulabray #dxlab @statelibrarynsw
  12. Sally Chambers, Digital Humanities Research Coordinator at Ghent Centre for Digital Humanities, Ghent University, Belgium and National Coordinator for DARIAH, the Digital Research Infrastructure for the Arts and Humanities in Belgium. She tweets at @schambers3, @GhentCDH and @KBRbe
  13. Sarah Ames, Digital Scholarship Librarian at the National Library of Scotland, responsible for developing a Digital Scholarship Service and launching the Data Foundry. She tweets at @semames1.
  14. Sophie-Carolin Wagner, Co-Founder of RIAT Research Institute for Art and Technology, Co-Editor of the Journal for Research Cultures and Project Manager of ONB Labs at the Austrian National Library.
  15. Stefan Karner, Technical Lead of the ONB Labs at the Austrian National Library, providing access to diverse data and metadata sources within the library, developing a platform for users of the digital library to create and share annotations and other user generated data with each other and the public.
  16. Armin Straube, Teaching Fellow in Library and Information Studies at UCL Qatar. He is an archivist with work experience in data curation, digital preservation and web archiving and tweets at @ArminStraube.

Laia Ros Gasch will be facilitating the Book Sprint and has 10 years of experience as a cultural producer working all over the world with all kinds of groups. Laia speaks English, French, Spanish and Catalan. 

More detailed biographies are available here.

WE NEED YOUR HELP!

We all realise how incredibly lucky and privileged we are to be chosen. However, we want to hear from those of you who are interested in this area. What do you think we should be writing about, who should it be for, what style of writing should we use? Please HELP us by completing this questionnaire by Monday 23 September at 0600 BST! We will consider your thoughts and opinions seriously when we sit down to write the book on Monday morning in Doha in Qatar.

We would also like to get your help when we will be disseminating information about how to get hold of the book on social media, and at various events around the world, especially to coincide with International Open Access week 2019 (21-27 October 2019). Planned activities in 2019-2020 include:

We plan to run a 'Read Sprint' in the near future to review the Book and perhaps create an improved version. We know what we will produce next week won't be perfect!

We have plans to ensure that the book is published on an interactive platform so that it becomes a 'living' book, to which others can add chapters, amendments, enhancements and new case studies. We will be making announcements about this soon after the book has been completed.

On a personal note, I feel incredibly grateful, lucky and privileged to have been involved at the very start of this journey. I also feel daunted to be part of the Book Sprint but excited too!

I really want us to create a useful handbook to help cultural heritage organisations, which are often strapped for resources and need help, build better innovation labs. I have a strong desire that our ‘Book’ will genuinely help and inspire galleries, libraries, archives, museums, universities and other cultural heritage organisations to learn and benefit from those of us who can talk honestly about and share our experiences. I want to share the risks we have taken and the mistakes we have made, provide realistic lessons and give sensible advice about what we have learned over the many years of setting up, maintaining and sustaining innovation labs. I believe this approach may save many institutions from having to re-invent the wheel, and save them time, money and resources too.

The people in this community have a passionate desire to create something useful and meaningful that will help all of us be better at our jobs and build better innovation labs for the benefit of all our users. Hopefully, we will be following the principles of kindness, generous sharing, understanding and empathy for the contexts in which we work. In short, we sincerely hope it makes a difference and proves that sharing and kindness really can change things.

Now that I have written this, I realise I have done it again: I have written too much! However, I am glad I have written the story of how we got here. What I realise is what a busy year it’s been for everyone, and particularly for people in this community. It’s amazing what we have achieved and I want to thank everyone who has played an active role, no matter how small. Let’s hope it continues to grow.

Monday morning, fifteen of us have got to write a book, gulp!

14 September 2019

BL Labs Awards 2019: enter before 2100 on Sunday 29th September! (deadline extended)

We have extended the deadline for our BL Labs Awards to 21:00 (BST) on Sunday 29th September; submit your entry here. If you have already entered, you don't have to resubmit; however, we are happy to receive updated entries too.

The BL Labs Awards formally recognise outstanding and innovative work that has been created using the British Library’s digital collections and data.

Submit your entry, and help us spread the word to all interested parties!

This year, BL Labs is commending work in four key areas:

  • Research - A project or activity that shows the development of new knowledge, research methods, or tools.
  • Commercial - An activity that delivers or develops commercial value in the context of new products, tools, or services that build on, incorporate, or enhance the Library's digital content.
  • Artistic - An artistic or creative endeavour that inspires, stimulates, amazes and provokes.
  • Teaching / Learning - Quality learning experiences created for learners of any age and ability that use the Library's digital content.

After the submission deadline of 21:00 (BST) on Sunday 29th September for entering the BL Labs Awards has passed, the entries will be shortlisted. Selected shortlisted entrants will be notified via email by midnight BST on Thursday 10th October 2019. 

A prize of £500 will be awarded to the winner and £100 to the runner up in each Awards category at the BL Labs Symposium on 11th November 2019 at the British Library, St Pancras, London.

The talent of the BL Labs Awards winners and runners up over the last four years has led to the production of a remarkable and varied collection of innovative projects. In 2018, the Awards commended work in four main categories – Research, Artistic, Commercial and Teaching & Learning:

Photo collage

  • Research category Award (2018) winner: The Delius Catalogue of Works: the production of a comprehensive catalogue of works by the composer Delius, based on research using (and integrated with) the BL’s Archives and Manuscripts Catalogue by Joanna Bullivant, Daniel Grimley, David Lewis and Kevin Page from Oxford University’s Music department.
  • Artistic Award (2018) winner: Another Intelligence Sings (AI Sings): an interactive, immersive sound-art installation, which uses AI to transform environmental sound recordings from the BL’s sound archive by Amanda Baum, Rose Leahy and Rob Walker independent artists and experience designers.
  • Commercial Award (2018) winner: Fashion presentation for London Fashion Week by Nabil Nayal: the Library collection - a fashion collection inspired by digitised Elizabethan-era manuscripts from the BL, culminating in several fashion shows/events/commissions including one at the BL in London.
  • Teaching and Learning (2018) winner: Pocket Miscellanies: ten online pocket-book ‘zines’ featuring images taken from the BL digitised medieval manuscripts collection by Jonah Coman, PhD student at Glasgow School of Art.

For further information about BL Labs or our Awards, please contact us at labs@bl.uk.

Posted by Mahendra Mahey, Manager of British Library Labs.

13 September 2019

Results of the RASM2019 Competition on Recognition of Historical Arabic Scientific Manuscripts

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Twitter as @BL_AdiKS.

 

Earlier this year, the British Library in collaboration with PRImA Research Lab and the Alan Turing Institute launched a competition on the Recognition of Historical Arabic Scientific Manuscripts, or in short, RASM2019. This competition was held in the context of the 15th International Conference on Document Analysis and Recognition (ICDAR2019). It was the second competition of this type, following RASM2018 which took place in 2018.

The Library has an extensive collection of Arabic manuscripts, comprising almost 15,000 works. We have been digitising several hundred manuscripts as part of the British Library/Qatar Foundation Partnership, making them available on Qatar Digital Library. A natural next step would be the creation of machine-readable content from scanned images, for enhanced search and whole new avenues of research.

Running a competition helps us identify software providers and tool developers, and introduces us to the specific challenges that pattern recognition systems face when dealing with historic, handwritten materials. For this year’s competition we provided a ground truth set of 120 images and associated XML files: 20 pages to be used to train text recognition systems to automatically identify Arabic script, and 100 pages to evaluate the training.

Aside from providing larger training and evaluation sets, for this year’s competition we’ve added an extra challenge – marginalia. Notes written in the margins are often less consistent and less coherent than main blocks of text, and can go in different directions. The competition set out three different challenges: page segmentation, text line detection and Optical Character Recognition (OCR). Tackling marginalia was a bonus challenge!

We had just one submission for this year’s competition – RDI Company, Cairo University, who previously participated in 2018 and did very well. RDI submitted three different methods, and participated in two challenges: text line segmentation and OCR. When evaluating the results, PRImA compared established systems used in industry and academia – Tesseract 4.0, ABBYY FineReader Engine 12 (FRE12), and Google Cloud Vision API – to RDI’s submitted methods. The evaluation approach was the same as last year’s, with PRImA evaluating page analysis and recognition methods using different evaluation metrics, in order to gain an insight into the algorithms.

 

Results

Challenge 1 - Page Layout Analysis

The first challenge set out to identify regions in a page and find where blocks of text are located on it. RDI did not participate in this challenge, so the analysis covers only the common industry software mentioned above. The results can be seen in the chart below:

Chart showing RASM2019 page segmentation results

 

Google did relatively well here, and the results are quite similar to last year’s. Despite dealing with the more challenging marginalia text, Google’s previous accuracy score (70.6%) has gone down only very slightly to a still impressive 69.3%.

Example image showing Google’s page segmentation

 

Tesseract 4 and FRE12 scored very similarly, with Tesseract decreasing from last year’s 54.5%. Interestingly, FRE12’s performance on text blocks including marginalia (42.5%) was better than last year’s FRE11 performance without marginalia, scoring at 40.9%. Analysis showed that Tesseract and FRE often misclassified text areas as illustrations, with FRE doing better than Tesseract in this regard.

 

Challenge 2 - Text Line Segmentation

The second challenge looked into segmenting text into distinct text lines. RDI submitted three methods for this challenge, all of which returned the text lines of the main text block (as they did not wish to participate in the marginalia challenge). Results were then compared with Tesseract and FineReader, and are reflected below:

Chart showing RASM2019 text line segmentation results

 

RDI did very well with its three methods, with an accuracy level ranging between 76.6% and 77.6%. However, despite not attempting to segment marginalia text lines, their methods did not perform as well as last year’s method (with 81.6% accuracy). Their methods did seem to detect some marginalia, though very little overall, as seen in the screenshot below.

Example image showing RDI’s text line segmentation results

 

Tesseract and FineReader again scored lower than RDI, both with decreasing accuracy compared to RASM2018’s results (Tesseract 4 with 44.2%, FRE11 with 43.2%). This is due to the additional marginalia challenge. The Google method does not detect text lines, therefore the Text Line chart above does not include their results.

 

Challenge 3 - OCR Accuracy

The third and last challenge was all about text recognition, tackling the correct identification of characters and words in the text. Evaluation for this challenge was conducted four times: 1) on the whole page, including marginalia, 2) only on main blocks of text, excluding marginalia, 3) using the original texts, and 4) using normalised texts. Text normalisation was performed for both ground truth and OCR results, due to the historic nature of the material, occasional unusual spelling, and use/lack of diacritics. All methods performed slightly better when not tested on marginalia; accuracy rates are demonstrated in the charts below:

Chart showing OCR accuracy results, for main text body only (normalised, no marginalia)
 
Chart showing OCR accuracy results for all text regions (normalised, with marginalia)

 

It is evident that there are minor differences in the character accuracies for the three RDI methods, with RDI2 performing slightly better than the others. When comparing the OCR accuracy between texts with and without marginalia, there are slightly higher success rates for the latter, though the difference is not significant. This means that tested methods performed on the marginalia almost as well as they did on the main text, which is encouraging.
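To make the difference between the 'original' and 'normalised' evaluations more concrete, here is a minimal sketch in Python. It is not PRImA's evaluation code: the normalisation rule (dropping combining marks such as Arabic short-vowel diacritics, plus the tatweel elongation character) and the accuracy formula (one minus edit distance divided by ground truth length) are assumptions chosen for illustration, and the function names and sample strings are hypothetical.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character; assumed safe to drop


def normalise(text: str) -> str:
    """Strip combining marks (e.g. Arabic diacritics) and tatweel (assumed rule)."""
    kept = [
        ch for ch in text
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    ]
    # Collapse runs of whitespace so spacing differences are not penalised.
    return " ".join("".join(kept).split())


def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_output: str,
                       normalised: bool = True) -> float:
    """Character accuracy as 1 - (edit distance / ground truth length)."""
    if normalised:
        ground_truth = normalise(ground_truth)
        ocr_output = normalise(ocr_output)
    if not ground_truth:
        raise ValueError("ground truth is empty")
    distance = levenshtein(ground_truth, ocr_output)
    return max(0.0, 1.0 - distance / len(ground_truth))


# Hypothetical example: the ground truth carries diacritics, the OCR output does not.
gt = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"   # 'kitab' with three diacritics
ocr = "\u0643\u062A\u0627\u0628"                     # 'kitab' without diacritics

print(character_accuracy(gt, ocr, normalised=False))
print(character_accuracy(gt, ocr, normalised=True))
```

In this toy case the OCR output is penalised for the three missing diacritics when the original texts are compared, but scores a perfect match once both sides are normalised. This is the kind of effect the normalised evaluation is meant to remove.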

Compared with RASM2018’s results, RDI’s results are good but not as good as last year’s (85.44% accuracy), which is likely to be a result of adding marginalia to the recognition challenge. Google performed very well too, considering they did not specifically train or optimise for this competition. Tesseract’s results went down from 30.45% to 25.13%, and FineReader Engine 12 performed better than its previous version FRE11, going up from 12.23% to 17.53% accuracy. However, that is still very low, as handwritten texts are not part of its target material.

 

Further Thoughts

RDI-Corporation has its own historical Arabic handwritten and typewritten OCR system, which has been built using different historical manuscripts. Its methods have done well, given the very challenging nature of the documents. Neither Tesseract nor ABBYY FineReader produces usable results, but that’s not surprising since they are both optimised for printed texts, and target contemporary material rather than historical manuscripts.

As next steps, we would like to test these materials with Transkribus, which produced promising results for early printed Indian texts (see e.g. Tom Derrick’s blog post – stay tuned for some even more impressive results!), and potentially Kraken as well. All ground truth will be released through the Library’s future Open Access repository (now in testing phase), as well as through the website of the IMPACT Centre of Competence. Watch this space for any developments!

 

30 August 2019

Using Transkribus for automated text recognition of historical Bengali Books

In this post Tom Derrick, Digital Curator, Two Centuries of Indian Print, explains the Library's recent use of Transkribus for automated text recognition of Bengali printed books.

Are you working with digitised printed collections that you want to 'unlock' for keyword search and text mining? Maybe you have already heard about Transkribus but thought it could only be used for automated recognition of handwritten texts. If so you might be surprised to hear it also does a pretty good job with printed texts too. You might be even more surprised to hear it does an impressive job with printed texts in Indian scripts! At least that is what we have found from recent testing with a batch of 19th century printed books written in Bengali script that have been digitised through the British Library’s Two Centuries of Indian Print project.

Transkribus is a READ project and is available as a free tool for users who want to automate recognition of historical documents. The British Library has already had some success using Transkribus on manuscripts from our India Office collection, and it was that which inspired me to see how it would perform on the Bengali texts, which provide an altogether different type of challenge.

For a start, most text recognition solutions either do not support Indian scripts, or do not reach close to the same level of recognition as they do with documents written in English or other Latin scripts. In part this is down to supply and demand. Mainstream providers of tools have prioritised Western customers, yet there is also the relative lack of digitised Indian texts that can be used to train text recognition engines.

These text recognition engines have also been well trained on modern dictionaries, whereas a collection of historical texts like the Bengali books will often contain words which are no longer in use. Their aged physicality also brings with it the delights of faded print, blotchy paper and other paper-based gremlins that keep conservationists in work yet disrupt automated text recognition. Throw in an extensive alphabet that contains more diverse and complicated character forms than English and you can start to piece together how difficult it can be to train recognition engines to achieve comparable results with Bengali texts.

So it was more with hope than expectation that I approached Transkribus. We began by selecting 50 pages from the Bengali books, representing as much as possible the variety of typographical and layout styles within the wider collection of c. 500,000 pages. Not an easy task! We uploaded these to Transkribus, manually segmenting paragraphs into text regions and automating line recognition. We then manually transcribed the texts to create a ground truth which, together with the scanned page images, was used to train the recurrent neural network within Transkribus to create a model for the 5,700 transcribed words.

Screenshot of a page from one of the British Library's Bengali books within the Transkribus viewer showing segmentation of the page by green bounding boxes around paragraphs and underlined text lines. Typed transcriptions of the text are shown below the page image.

The model was tested on a few pages from the wider collection and the results are shown in the graph below. The model achieved an average character error rate (CER) of 21.9%, which is comparable to the best results we have seen from other text recognition services. Word accuracy of 61% was based on the number of words that were misspelled in the automated transcription compared to the ground truth. Eventually we would like to use automated transcriptions to support keyword searching of the Bengali books online, and the higher the word accuracy, the greater the chances of users pulling back all relevant hits from their keyword search. We noticed the results often missed the upper zone of certain Bengali characters, i.e. the part of the character or glyph which resides above the matra line that connects characters in Bengali words. Further training focused on recognition of these characters may improve the results.

Screenshot of a graph showing the learning curve of the Bengali model using the Transkribus HTR tool which achieved 21.91% character error rate.
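For readers who want to check figures like these against their own transcriptions, here is a minimal sketch of how a character error rate and a word accuracy can be computed from a ground truth and an automated transcription. It is not how Transkribus itself calculates them; the function names are ours, and we assume the common definition of CER as edit distance divided by the length of the ground truth, with word accuracy derived the same way at word level. Note that Python compares Unicode code points, so a Bengali vowel sign counts as its own character here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the reference
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Assumed definition: character edits needed / characters in the ground truth."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_accuracy(reference, hypothesis):
    """Assumed definition: 1 - (word-level edits / words in the ground truth)."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return max(0.0, 1 - edit_distance(ref_words, hyp_words) / len(ref_words))


# Toy example: one wrong character in one of three words.
ground_truth = "the cat sat"
automated = "the bat sat"
print(f"CER: {character_error_rate(ground_truth, automated):.1%}")
print(f"Word accuracy: {word_accuracy(ground_truth, automated):.1%}")
```

The toy example also shows why word accuracy matters for keyword searching: a single wrong character barely moves the CER, but it makes the whole word unfindable.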

Our training set of 50 pages is very small compared to other projects using Transkribus and so we think the accuracy could be vastly improved by creating more transcriptions and re-training the model. However, we're happy with these initial results and would encourage others in a similar position to give Transkribus a try.

 

 

21 August 2019

Chevening British Library Fellowship working with Chinese historical texts

Chevening is the UK government’s international awards programme aimed at developing global leaders. Since 2015, the Foreign and Commonwealth Office (FCO) has partnered with the British Library to offer professionals two new fellowships every year. These fellowships are unique opportunities for one-year placements at the Library, working with exceptional collections under the Library’s custodianship. Past and present Chevening Fellows at the Library have focused on geographically diverse collections, from Latin America through Africa to South Asia, with different themes such as 'Nationalism, Independence, and Partition in South Asia, 1900-1950' and 'Big Data and Libraries'.

We are thrilled to announce that one of the two placements available for the 2020/2021 academic year will focus on automating the recognition of historical Chinese handwritten texts. This is a special opportunity to work in the Library’s Digital Scholarship Department, and engage with unique historical collections digitised as part of the International Dunhuang Project and the Lotus Sutra Manuscripts Digitisation Project. Focusing on material from Dunhuang (China), part of the Stein collection, this Fellowship will engage with new digital tools and techniques in order to explore possible solutions to automate the transcription of these handwritten texts.

Chinese Lotus Sutra scroll with Tibetan divination texts on the back (Shelfmark: Or.8210/S.155). Digitised as part of the Lotus Sutra Manuscripts Digitisation Project. © The British Library

 

The context for this fellowship is the Library’s efforts towards making its collection items available in machine-readable format, to enable full-text search and analysis. The Library has been digitising its collections at scale for over two decades, with digitisation opening up access to diversely rich collections. However, it’s important for us to further support discovery and digital research by unlocking the huge potential in automatically transcribing our collections. Until recently, Western language print collections have been the main focus, especially newspaper collections. A flagship collaboration with the Alan Turing Institute, a project called “Living with Machines,” is underway to apply Optical Character Recognition (OCR) to UK newspapers, design and implement new methods in data science and artificial intelligence, and analyse these materials at scale.

Taking a broader perspective on Library collections, we have started to explore opportunities with non-Latin collections too. Members of the Digital Scholarship team are engaging closely with the exploration of OCR and Handwritten Text Recognition (HTR) systems for Bangla and Arabic. Digital Curators Tom Derrick, Nora McGregor and Adi Keinan-Schoonbaert have teamed up with PRImA Research Lab and the Alan Turing Institute to run four competitions in 2017-2019, inviting providers of text recognition methods to try them out on our historical material. Another initiative which Tom is engaged with is exploring Transkribus for Bengali printed texts. He trained Transkribus’ HTR+ recognition engine, which ended up transcribing this material at 94% character accuracy! Tom and Adi’s recent blog post in EuropeanaTech Insight (issue on OCR) summarises these initiatives.

Regions and text lines demarcated as ground truth for RASM2019 ICDAR2019 Competition on Recognition of Historical Arabic Scientific Manuscripts (Shelfmark: Add MS 7474). Digitised and available on Qatar Digital Library.

 

The Chevening Fellow will contribute to our efforts to identify OCR/HTR systems that can tackle digitised historical collections. They will explore the current landscape of Chinese handwritten text recognition, look into methods, challenges, tools and software, use them to test our material, and demonstrate digital research opportunities arising from the availability of these texts in machine-readable format.

This fellowship programme will start in September 2020 for a 12-month period of project-based activity at the British Library. The successful candidate will receive support and supervision from Library staff, and will benefit from professional development opportunities, networking and stakeholder engagement, gaining access to a range of organisational training and development opportunities (such as the Digital Scholarship Training Programme), as well as staff-level access to unique British Library collections and research resources.

For more information and to apply, please visit the Chevening British Library Fellowship page: https://www.chevening.org/fellowship/british-library/, and the “Automating the recognition of historical Chinese handwritten texts” Fellow page: https://www.chevening.org/fellowship/british-library-chinese-handwritten-texts/.

Applications close at 12pm (GMT), 5 November 2019. Good luck!

 

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Twitter as @BL_AdiKS.

20 August 2019

Innovation Labs and the digital divide

Guest posting by Milena Dobreva-McPherson, Associate Professor Library and Information Studies UCL Qatar with contributions from Tuesday Bwalya, Lecturer, Library and Information Science Department, The University of Zambia (UNZA) and Fidelity Phiri, Visiting Researcher, UCL Qatar.

Can you recall seeing an interesting digital cultural heritage object from Zambia lately? If you search the Europeana Collections portal, you will find some 2500 digital objects coming from European heritage institutions. Alongside these items, you can enjoy the sound recording of a grunting and splashing hippopotamus captured on 2 July 1985 on the Luangwa river in Zambia. This object was aggregated from the British Library’s sound collection.

Digitisation efforts of various Zambian institutions date back to 2002, for example at the National Archives of Zambia (which does not have its own website at the time of writing this post). Yet finding digital content originating from Zambian institutions is currently a challenge, unless you are visiting these institutions in person. One possible reason is that institutions in Zambia, like many other organisations, digitise for the purposes of internal collection management, preservation, and on-site use. A rare exception is the digitised collection of the records of the United National Independence Party (UNIP) of Zambia, which was created in 2007 in collaboration with the Endangered Archives Programme of the British Library. While it cannot be accessed on any Zambian digital platform, it is available on the website of the British Library.

Is this situation (of very little accessible digital material online in the archives) common for all cultural sectors? Let us have a look at museums. In this domain, the Livingstone Museum was the first to carry out digitisation activities in 2009. The National Museums Board of Zambia, an umbrella organisation for 5 national and 2 community museums, also has an online presence with digitised images. However, trying to explore the Photo gallery or Audio/video files in the Multimedia section on the website returns the ominous "404 Page not found" error, although the Board definitely has plenty of objects to share.

Certainly, one could argue that the poor institutional online digital presence is to be expected in a country within the Global South where a digital divide still exists. After all, even finding data to assess the scale of this digital divide is a challenge, and the body of publications on the digital divide in Africa has been quite limited, with some 100 identified works over a 12-year period (2000-2012). There is also a lack of recent estimates on the state of technological use in museums. Back in 2002, Lorna Abungu suggested that "[a]t present, out of 357 known museums throughout the African continent (including the Indian Ocean islands), only seventy-five have – on an institutional level – at least basic Internet access for e-mail."

And, while tackling the digital divide is one of the big challenges of the Global South, when we look at it specifically from the digital cultural heritage perspective it has a global effect. Those within the divide are not able to use modern information and communication technologies to their full advantage. This is one of the reasons digitisation is either delayed or caters only for on-site use in Zambia, for example. But for those on the other side of the divide it results in impaired access to the digital heritage currently being accumulated in the regions affected by the digital divide. This is why the users searching for the sounds of hippopotamus splashing will have a chance to discover them only if they are deposited in a collection on the other side of the divide. 

To help change this lack of access to the digital cultural heritage of Zambia, UCL Qatar joined forces with the National Museums Board of Zambia to deliver a day-long workshop on Innovation Labs in Cultural Heritage Institutions, which was hosted on 1 August 2019 by the Livingstone Museum. You can read more about this event in the 'Reflections from the First Sub-Saharan African Workshop on Digital Innovation Labs in Cultural Heritage Institutions' blog post.

Fig. 1. After discussing how to overcome some of the disadvantages of the digital divide: participants in the Innovation Labs in Cultural Heritage Institutions workshop, hosted on 1 August 2019 by the Livingstone Museum

There was a clear message from Mahendra Mahey of British Library Labs that innovation in user engagement can start small, with the use of open source tools and popular web platforms. This event provided useful insights on the questions newcomers to the Innovation Lab community have to ask. In September, a Book Sprint to develop the first guide for setting up, running and maintaining Digital Cultural Heritage Innovation Labs will be held in Doha, Qatar.

Here are some of these interesting questions for the wider labs community:

  • Keeping in mind how the level of technological innovation differs on the two sides of the divide, what should an innovation lab within the divide offer? Incremental innovation relative to the state of technology around it, or advanced innovation to match the global leaders?
  • How much can open platforms support innovation for these labs?
  • Can the route of using predominantly open tools and platforms for innovation labs be used also as a way to enhance open science in the Global South? 

Until a shift in the digital access happens, we will continue browsing some digital content on Zambian heritage coming from other cultural heritage organisations outside Zambia, beyond the digital divide.

Dr Milena Dobreva-McPherson is Associate Professor Library and Information Studies at UCL Qatar with international experience of working in Bulgaria, Scotland and Malta. Since graduating M.Sc. (Hons) in Informatics in 1991, Milena specialised in digital humanities and digital cultural heritage in the Bulgarian Academy of Sciences, where she earned her PhD in 1999 in Informatics and Applied Mathematics and served as the Founding Head of the first Digitisation Centre in Bulgaria (2004); she was also a member of the Executive Board of the National Commission of UNESCO. Milena’s research interests are in the areas of innovation diffusion in the cultural heritage sector; citizen science; and users of digital libraries. Milena is a member of the editorial board of the IFLA Journal (Sage) and of the International Journal on Digital Libraries (IJDL, Springer), and a member of the steering committees of the three biggest conference series in digital libraries, JCDL, TPDL and ICADL. She is also a consultant to the Europeana Task Force on Research Requirements.

Mr Tuesday Bwalya is a Lecturer in the Library and Information Science Department, The University of Zambia (UNZA). He holds a Master’s Degree in Information Science from China. In addition, Mr. Bwalya has received training in India and Belgium in Library Automation with Free and Open Source Library Management Systems such as Koha and ABCD. His research interests include free and open source library management systems; open access publishing; database systems; web development; records management; cataloguing and classification.

Fidelity Phiri is currently employed as Librarian at Moto Moto Museum and is a visiting researcher at UCL Qatar. He has worked for the National Museums Board of Zambia since 2001. He holds a Bachelor's degree in Library and Information Science from the University of Zambia. Fidelity also graduated in April 2019 from UCL Qatar and holds a Master’s degree in Library and Information Studies. His research interests are in bibliometrics studies and digital humanities/units that provide access to digital collections.

Acknowledgements: We would like to thank Fred Nyambe for the photos and Dania Jalees for the editing.

Reflections from the First Sub-Saharan African Workshop on Digital Innovation Labs in Cultural Heritage Institutions

Guest posting by Milena Dobreva-McPherson, Associate Professor Library and Information Studies UCL Qatar with contributions from Tuesday Bwalya, Lecturer, Library and Information Science Department, The University of Zambia (UNZA) and Fidelity Phiri, Visiting Researcher, UCL Qatar.

Recently UCL Qatar joined forces with the National Museums Board of Zambia to deliver a day-long workshop on Innovation Labs in Cultural Heritage Institutions, which was hosted on 1 August 2019 by the Livingstone Museum, Zambia. This workshop was the first of its kind in Sub-Saharan Africa and was made possible with the support of the Africa and the Middle East Teaching Fund of the UCL Global Engagement Office. Initially planned for 15 professionals from the cultural heritage sector, it attracted 27 participants (see Fig. 1) coming from six towns located in four out of the ten provinces in Zambia (see map).

Fig. 1. Participants by sector and gender in the First Sub-Saharan Workshop on Innovation Labs in Cultural Heritage Institutions in Zambia, 1 August 2019

After two vibrant events about Digital Innovation Labs in Cultural Heritage organisations, this was the first event bringing together a higher proportion of participants from museums and archives in addition to the libraries represented. The Building Library Labs event, held at the British Library in September 2018, was the first of its kind; it was followed by a second workshop in Copenhagen (March 2019). Both attracted mostly library professionals, though there were a few attendees from archives, galleries and museums.

Innovation Labs emerged in the mid 2000s as specialised library units supporting a variety of users in experimenting with digital content. However, engaging users with digital content is equally important for museums, archives and galleries. And the exchange of institutional experience across the digital cultural heritage sector is essential for professionals who work there, especially when the number of Innovation Labs around the world is growing steadily. The presenters at the event in Zambia included Milena Dobreva-McPherson (UCL Qatar), Fidelity Phiri, Mr Tuesday Bwalya (University of Zambia), Mr Fred Nyambe (Registrar of Collections, Livingstone Museum) and Mr Brian Mwale (Chief Librarian, National Archives of Zambia). Fiona Clancy (Digitisation Workflow Manager, British Library), Mahendra Mahey (BL Labs Manager, British Library), and Somia Salim, who is an MA student in Library and Information Studies at UCL Qatar, also contributed online (see full programme with links to some of the presentations).

The call for innovation in the heritage sector was clearly communicated in the welcome address delivered on behalf of the Livingstone district acting commissioner Harriet Kawina; this was duly reported in several publications in Zambian national newspapers (see, for example, Fig. 2).

Fig. 2. Article on the event in the MAST independent newspaper, 5 August 2019

The mixture of presentations, discussing both current trends in user engagement with digital content and local examples of how digitisation projects work in reality, created a great opportunity to discuss the stumbling blocks in opening content for wider access and use. For some Zambian institutions, the main issue is a lack of coherent and systematic digitisation efforts, and there was a shared feeling amongst attendees that they need more guidance and clear policies about digitisation to follow, which are not currently in place. Other institutions have accumulated digital content but keep it available only internally, without systematically considering access and use by external audiences through online platforms.

The workshop discussions were lively and engaged; they identified that there is definitely a larger scope to learn from each other locally. In addition, there was a growing realisation amongst organisations that opening their digital content for use by an external audience is now the next step on the agenda of those who have already accumulated it. The feedback of one of the participants, which perhaps summarised this most clearly, suggested what needs to happen after this workshop in three steps:

  • Put the knowledge acquired in the workshop to use ASAP.
  • Conduct a follow up workshop to determine progress in the innovation labs created.
  • Organise a massive awareness campaign to introduce potential users to the innovation labs created.

The workshop participants also experienced the traditional scheduled power outage for the day, which explains why the photo illustrating the presentation of certificates is a bit dark (but hey, in the digital world we can easily fix such glitches!).

Fig. 3. Participant receiving a certificate from Associate Professor Milena Dobreva

Bringing knowledge about innovation labs to the Sub-Saharan region for the first time, fostering dialogue between representatives of different cultural heritage institutions, and discussing the issue of improving access to digital content is just a humble first step in what we hope will help local institutions to improve user engagement and overcome the current digital divide, which keeps available digital content hidden from the world. Read more about Innovation Labs and the digital divide.

Dr Milena Dobreva-McPherson is Associate Professor Library and Information Studies at UCL Qatar with international experience of working in Bulgaria, Scotland and Malta. Since graduating M.Sc. (Hons) in Informatics in 1991, Milena specialised in digital humanities and digital cultural heritage in the Bulgarian Academy of Sciences, where she earned her PhD in 1999 in Informatics and Applied Mathematics and served as the Founding Head of the first Digitisation Centre in Bulgaria (2004); she was also a member of the Executive Board of the National Commission of UNESCO. Milena’s research interests are in the areas of innovation diffusion in the cultural heritage sector; citizen science; and users of digital libraries. Milena is a member of the editorial board of the IFLA Journal (Sage) and of the International Journal on Digital Libraries (IJDL, Springer), and a member of the steering committees of the three biggest conference series in digital libraries, JCDL, TPDL and ICADL. She is also a consultant to the Europeana Task Force on Research Requirements.

 

Mr Tuesday Bwalya is a Lecturer in the Library and Information Science Department, The University of Zambia (UNZA). He holds a Master’s Degree in Information Science from China. In addition, Mr. Bwalya has received training in India and Belgium in Library Automation with Free and Open Source Library Management Systems such as Koha and ABCD. His research interests include free and open source library management systems; open access publishing; database systems; web development; records management; cataloguing and classification.

 

Fidelity Phiri is currently employed as Librarian at Moto Moto Museum and is a visiting researcher at UCL Qatar. He has worked for the National Museums Board of Zambia since 2001. He holds a Bachelor's degree in Library and Information Science from the University of Zambia. Fidelity also graduated in April 2019 from UCL Qatar and holds a Master’s degree in Library and Information Studies. His research interests are in bibliometrics studies and digital humanities/units that provide access to digital collections.

Acknowledgements: We would like to thank Fred Nyambe for the photos and Dania Jalees for the infographic and the editing.