Digital scholarship blog

Enabling innovative research with British Library digital collections


25 September 2020

Making Data Into Sound

This is a guest post by Anne Courtney, Gulf History Cataloguer with the Qatar Digital Library (https://www.qdl.qa/en).

Sonification

Over the summer, I’ve been investigating the sonification of data. On the Qatar Digital Library (QDL) project, we generate a large amount of data, and I wanted to experiment with different methods of representing it. Sonification was a new technique for me, which I learnt about through this article: https://programminghistorian.org/en/lessons/sonification.

 

What is sonification?

Sonification is a method of representing data in an aural format rather than a visual one, such as a graph. It is particularly useful for showing changes in data over time. Different trends are highlighted depending on the choices made during the process, in the same way as they would be when drawing a graph.

 

How does it work?

First, all the data must be put in the right format:

Figure 1: Excel data of longitude points where the Palsgrave anchored

Then, the data is used to generate a MIDI file. The Programming Historian provides an example Python script for this, and by changing parts of it, it is possible to change the tempo, note length, scale, and other features.

Figure 2: Python script ready to output a midi file of occurrences of Anjouan over time
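As a rough illustration of the same idea (a minimal sketch, not the script in Figure 2, assuming the MIDIUtil library and made-up counts), the snippet below turns a list of yearly counts into one note per year, with the count determining the pitch:

```python
# A minimal sketch (not the script in Figure 2): turn a list of yearly counts
# into a MIDI file, one note per year, using the MIDIUtil library.
from midiutil import MIDIFile

counts = [0, 2, 5, 3, 8, 1]             # hypothetical number of mentions per year

midi = MIDIFile(1)                      # a single track
midi.addTempo(0, 0, 120)                # track 0, from beat 0, 120 bpm

for beat, count in enumerate(counts):
    if count == 0:
        continue                        # years with no data stay silent
    pitch = 60 + min(count, 24)         # map the count onto pitches above middle C
    midi.addNote(0, 0, pitch, beat, 1, 100)   # track, channel, pitch, time, duration, volume

with open("mentions.mid", "wb") as f:
    midi.writeFile(f)
```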

Finally, to overlay the different MIDI files, edit them, and change the instruments, I used MuseScore, freely downloadable music notation software. Alternatives include LMMS and GarageBand:

Figure 3: The score of the voyages of the Discovery, Palsgrave, and Mary, labelled to show the different places where they anchored.

 

The sound of authorities

Each item which the Qatar project catalogues has authority terms linked to it, which list the main subjects and places connected to the item. As each item is dated, it is possible to trace trends in subjects and places over time by assigning the dates of the items to the authority terms. Each authority term ends up with a list of dates when it was mentioned. By assigning different instruments to the different authorities, it is possible to hear how they are connected to each other.

This sound file contains the sounds of places connected with the trade in enslaved people, and how they intersect with the authority term ‘slave trade’. The file begins in 1700 and finishes in 1900. One of the advantages of sonification is that the silence is as eloquent as the data. The authority terms are mentioned more at the end of the time period than the start, and so the piece becomes noisier as the British increasingly concern themselves with these areas. The pitch of the instruments is determined, in this instance, by the months of the records in which they are mentioned.

Authorities

The authority terms are represented by these instruments:

Anjouan: piccolo

Madagascar: cello

Zanzibar: horn

Mauritius: piano

Slave Trade: tubular bell
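For anyone curious how such a piece might be assembled programmatically, here is a minimal sketch (assumed, simplified data and the MIDIUtil library, not the MuseScore workflow described above) that gives each authority term its own track and instrument, places a note for each dated record, and derives the pitch from the record's month. The General MIDI programme numbers are approximate stand-ins for the instruments listed above:

```python
# A minimal sketch (assumed data, MIDIUtil library): one track per authority
# term, one note per dated record, pitch derived from the record's month.
from midiutil import MIDIFile

# Hypothetical term -> list of (year, month) pairs taken from catalogue records.
mentions = {
    "Anjouan":     [(1760, 3), (1761, 7)],
    "Slave Trade": [(1811, 1), (1843, 10), (1873, 6)],
}
# Approximate General MIDI programme numbers standing in for the instruments.
programs = {"Anjouan": 72, "Slave Trade": 14}   # piccolo, tubular bells

start_year = 1700
midi = MIDIFile(len(mentions))

for track, (term, dates) in enumerate(mentions.items()):
    midi.addTempo(track, 0, 120)
    midi.addProgramChange(track, 0, 0, programs[term])
    for year, month in dates:
        beat = year - start_year        # one beat per year of the 1700-1900 timeline
        pitch = 59 + month              # months 1-12 become pitches 60-71
        midi.addNote(track, 0, pitch, beat, 1, 100)

with open("authorities.mid", "wb") as f:
    midi.writeFile(f)
```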

 

Listening for ships

Ships

This piece follows the journeys of three ships from March 1633 to January 1637. In this example, the pitch is important because it represents longitude; the further east the ships travel, the higher the pitch. The Discovery and the Palsgrave mostly travelled together from Gravesend to India, and they both made frequent trips between the Gulf and India. The Mary set out from England in April 1636 to begin her own journey to India. The notes represent the time the ships spent in harbour, and the silence is the time spent at sea. The Discovery is represented by the flute, the Palsgrave by the violin, and the Mary by the horn.
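The longitude-to-pitch mapping can be a simple linear rescaling. The sketch below is a hypothetical illustration (the longitude range, pitch range and example coordinates are assumptions, not the project's actual values):

```python
# A hypothetical sketch of the longitude-to-pitch mapping: the further east a
# ship anchored, the higher the note. Ranges and coordinates are assumptions.
def longitude_to_pitch(lon, lon_min=0.0, lon_max=100.0,
                       pitch_min=48, pitch_max=84):
    """Rescale a longitude (degrees east) linearly onto a MIDI pitch range."""
    fraction = (lon - lon_min) / (lon_max - lon_min)
    return round(pitch_min + fraction * (pitch_max - pitch_min))

print(longitude_to_pitch(0.4))    # near Gravesend: a low note (48)
print(longitude_to_pitch(72.8))   # near Surat: a much higher note (74)
```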

14 September 2020

Digital geographical narratives with Knight Lab’s StoryMap

Visualising the journey of a manuscript’s creation

Working for the Qatar Digital Library (QDL), I recently catalogued British Library Oriental manuscript 2361 (Or. 2361), a musical compendium copied in Mughal India during the reign of Aurangzeb (1618-1707; ruled from 1658). The QDL is a British Library-Qatar Foundation collaborative project to digitise and share Gulf-related archival records, maps and audio recordings as well as Arabic scientific manuscripts.

Figure 1: Equestrian portrait of Aurangzeb. Mughal, c. 1660-70. British Library, Johnson Album, 3.4. Public domain.

The colophons to Or. 2361's fourteen texts contain an unusually large – but jumbled-up – quantity of information about the places and dates at which it was copied and checked, revealing that it was largely created during a journey taken by the imperial court in 1663.

Figure 2: Colophon to the copy of Kitāb al-madkhal fī al-mūsīqī by al-Fārābī, transcribed in Delhi, 3 Jumādá I, 1073 hijrī/14 December 1662 CE, and checked in Lahore, 22 Rajab 1073/2 March 1663. Or. 2361, f. 240r.

Seeking to make sense of the mass of bibliographic information and unpick the narrative of the manuscript’s creation, I recorded all this data in a spreadsheet. This helped to clarify some patterns, but wasn’t fun to look at! To accompany an Asian and African Studies blog post, I wanted to find an interactive digital tool to develop the visual and spatial aspects of the story and convey the landscapes and distances experienced by the manuscript’s scribes and patron during its mobile production.

Figure 3: Dull but useful spreadsheet of copy data for Or. 2361.

Many fascinating digital tools can present large datasets, including map co-ordinates. However, I needed to retell a linear, progressive narrative with fewer data points. Inspired by a QNF-BL colleague’s work on Geoffrey Prior’s trip to Muscat, I settled on StoryMap, one of an expanding suite of open-source reporting, data management, research, and storytelling tools developed by Knight Lab at Northwestern University, USA.

 

StoryMap: Easy but fiddly

Requiring no coding ability, the back-end of this free, easy-to-use tool resembles PowerPoint. The user creates a series of slides to which text, images, captions and copyright information can be added. Links to further online media, such as the millions of images published on the QDL, can easily be added.

Figure 4: Back-end view of StoryMap's authoring tool.

The basic incarnation of StoryMap is accessed via an author interface which is intuitive and clear, but has its quirks. Slide layouts can’t be varied, and image manipulation must be completed before upload, which can get fiddly. Text appeared faint unless set entirely in bold, especially against a backdrop image. A bug randomly rendered bits of uploaded text as hyperlinks, while intentional hyperlinks were not obvious.

 

The mapping function

StoryMap’s most interesting feature is an interactive map that uses OpenStreetMap data. Locations are entered as co-ordinates, or manually by searching for a place-name or dropping a pin. This geographical data links together to produce an overview map summarised on the opening slide, with subsequent views zooming to successive locations in the journey.

Figure 5: StoryMap summary preview showing all location points plotted.
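Behind the interface, StoryMap stores each slide's text, media and location as JSON. The sketch below builds a similar structure in Python for a couple of the journey's stops; the field names follow my understanding of StoryMap JS's published JSON format and should be read as assumptions rather than as data exported from this project:

```python
# A rough sketch of the per-slide data StoryMap works with. Field names follow
# my reading of StoryMap JS's published JSON format and are assumptions, not
# output from this project; coordinates are approximate.
import json

slides = [
    {"type": "overview",
     "text": {"headline": "The journey of Or. 2361",
              "text": "A musical compendium copied and checked in 1662-63."}},
    {"location": {"lat": 28.65, "lon": 77.23, "zoom": 6},   # Delhi (approx.)
     "text": {"headline": "Delhi, December 1662",
              "text": "Kitab al-madkhal fi al-musiqi transcribed."}},
    {"location": {"lat": 31.55, "lon": 74.34, "zoom": 6},   # Lahore (approx.)
     "text": {"headline": "Lahore, March 1663",
              "text": "The copy checked as the court travelled."}},
]

with open("storymap.json", "w", encoding="utf-8") as f:
    json.dump({"storymap": {"slides": slides}}, f, ensure_ascii=False, indent=2)
```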

I had to add location data manually as the co-ordinates input function didn’t work. Only one of the various map styles suited the historical subject matter; however, its modern street layout felt contradictory. The ‘ideal’ map – structured with global co-ordinates but correct for a specific historical moment – probably doesn’t exist (one for the next project?).

Figure 6: StoryMap's modern street layout implies New Delhi existed in 1663...

With clearly signposted advanced guidance, support forum, and a link to a GitHub repository, more technically-minded users could take StoryMap to the next level, not least in importing custom maps via Mapbox. Alternative platforms such as Esri’s Classic Story Maps can of course also be explored.

However, for many users, Knight Lab StoryMap’s appeal will lie in its ease of usage and accessibility; it produces polished, engaging outputs quickly with a bare minimum of technical input and is easy to embed in web-text or social media. Thanks to Knight Lab for producing this free tool!

See the finished StoryMap, A Mughal musical miscellany: The journey of Or. 2361.

 

This is a guest post by Jenny Norton-Wright, Arabic Scientific Manuscripts Curator from the British Library Qatar Foundation Partnership. You can follow the British Library Qatar Foundation Partnership on Twitter at @BLQatar.

12 June 2020

Making Watermarks Visible: A Collaborative Project between Conservation and Imaging

Some of the earliest documents being digitised by the British Library Qatar Foundation Partnership are a series of ships’ journals dating from 1605 to 1705, relating to the East India Company’s voyages. Whilst working with these documents, conservators Heather Murphy and Camille Dekeyser-Thuet noticed within the papers a series of interesting examples of early watermark design. Curious about the potential information these could give regarding the journals, Camille and Heather began undertaking research, hoping to learn more about the date and provenance of the papers, the trade and production patterns involved in the paper industry of the time, and the practice of watermarking paper. There is a wealth of valuable and interesting information to be gained from the study of watermarks, especially within a project such as the BLQFP, which provides the opportunity for study within both IOR and Arabic manuscript material. We hope to publish more information relating to this on the Qatar Digital Library in the form of expert articles and visual content.

The first step in this project involved tracing the watermark designs with the help of a light sheet, in order to begin gathering a collection of images to form the basis of further research. It was clear that, to make the best possible use of the visual information contained within these watermarks, they would need to be imaged in a way that made them available to audiences in a form that is both visually appealing and academically beneficial, beyond the capabilities of simply hand tracing the designs.

Hand tracings of the watermark designs

 

This began a collaboration with two members of the BLQFP imaging team, Senior Imaging Technician Jordi Clopes-Masjuan and Senior Imaging Support Technician Matt Lee, who, together with Heather and Camille, were able to devise and facilitate a method of imaging and subsequent editing which enabled new access to the designs. The next step involved the construction of a bespoke support made from Vivak (commonly used for exhibition mounts and stands). This inert plastic is both pliable and transparent, which allowed the simultaneous backlighting and support of the journal pages required to successfully capture the watermarks.

Creation of the Vivak support
Imaging of pages using backlighting
Studio setup for capturing the watermarks

 

Before capturing, Jordi suggested we create two comparison images of the watermarks. This involved capturing the watermarks as they normally appear on the digitised image (almost or completely invisible), and how they appear illuminated when the page is backlit. The theory behind this was quite simple: “to obtain two consecutive images from the same folio, in the exact same position, but using a specific light set-up for each image”.

By doing so, the idea was for the first image to appear in the same way as the standard, searchable images on the QDL portal. To create these standard image captures, the studio lights were placed near the camera with incident light towards the document.

The second image was taken immediately after, but this time only backlight was used (light behind the document). In using these two different lighting techniques, the first image allowed us to see the content of the document, but the second image revealed the texture and character of the paper, including conservation marks, possible corrections to the writing, as well as the watermarks.

One unexpected complication during imaging was that, due to the varying texture and thickness of the papers, the power of the backlight had to be re-adjusted for each watermark.

First image taken under normal lighting conditions
Second image of the same page taken using backlighting

https://www.qdl.qa/en/archive/81055/vdc_100000001273.0x000342

 

Before settling on this approach, other imaging techniques were also investigated:

  • Multispectral photography: by capturing the same folio under different lights (from UV to IR) the watermarks, along with other types of hidden content such as faded ink, would appear. However, it was decided that this process would take too long for the number of watermarks we were aiming to capture.
  • Light sheet: Although these types of light sheets are extremely slim and slightly flexible, we experienced some issues when trying the double capture, as on many occasions the light sheet was not flexible enough and moved the page when trying to reach the gutter (for a successful final presentation of the images it was essential that the folio was still in both captures).

Once we had successfully captured the images, Photoshop proved vital in allowing us to increase the contrast of the watermarks and make them more visible. Because every image captured was different, the approach to editing the images was also different. This required varying adjustments of levels, curves, saturation or brightness, and combining these with different fusion modes to attain the best result. In the end, the tools used were not as important as the final image. The last stage within Photoshop was for both images of the same folio to be cropped and exported with exactly the same settings, allowing the comparative images to match as precisely as possible.
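Photoshop was the tool used for this project; purely as an illustration of a scriptable alternative (a much cruder sketch, with placeholder file names, and not the workflow described above), a similar contrast boost can be applied with the Pillow library in Python:

```python
# A crude, scriptable analogue of the Photoshop adjustments described above,
# using Pillow. File names are placeholders; this is not the project's workflow.
from PIL import Image, ImageEnhance, ImageOps

img = Image.open("backlit_capture.tif").convert("L")   # work in greyscale

img = ImageOps.autocontrast(img, cutoff=1)             # stretch the levels
img = ImageEnhance.Contrast(img).enhance(1.5)          # push the contrast further
img = ImageEnhance.Brightness(img).enhance(1.1)        # lift the midtones slightly

img.save("backlit_capture_enhanced.png")
```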

The next step involved creating a digital line drawing of each watermark. Matt Lee, a Senior Imaging Support Technician, imported the high-resolution image captures onto an iPad and used the Procreate drawing app to trace the watermarks with a stylus pen. To develop an approach that provided accurate and consistent results, Matt first tested brushes and experimented with line qualities and thicknesses. Selecting the Dry Ink brush, he traced the light outlines of each watermark on a separate transparent layer. The tracings were initially drawn in white to highlight the designs on paper and these were later inverted to create black line drawings that were edited and refined.

Tracing the watermarks directly from the screen of an iPad provided a level of accuracy and efficiency that would be difficult to achieve on a computer with a graphics tablet, trackpad or computer mouse. There were several challenges in tracing the watermarks from the image captures. For example, the technique employed by Jordi was very effective in highlighting the watermarks, but it also made the laid and chain lines in the paper more prominent and these would merge or overlap with the light outline of the design.

Some of the watermarks also appeared distorted, incomplete or had handwritten text on the paper which obscured the details of the design. It was important that the tracings were accurate and some gaps had to be left. However, through the drawing process, the eye began to pick out more detail and the most exciting moment was when a vague outline of a horse revealed itself to be a unicorn with inset lettering.

Vector image of unicorn watermark

 

In total 78 drawings of varying complexity and design were made for this project. To preserve the transparent backgrounds of the drawings, they were exported first as PNG files. These were then imported into Adobe Illustrator and converted to vector drawings that can be viewed at a larger size without loss of image quality.

Vector image of watermark featuring heraldic designs

 

Once the drawings were complete, we had three images: the ‘traditional view’ (the page as it would normally appear), the ‘translucid view’ (the same page backlit and showing the watermark) and the ‘translucid + white view’ (the translucid view plus an additional overlay of the digitally traced watermark in place on the page).

Traditional view
Translucid view
Translucid view with watermark highlighted by digital tracing (translucid + white view)

 

Jordi then took these images and, using a multiple-slider tool, displayed them on an offline website. This enabled us to demonstrate the tool to our team and present the watermarks in the way we had envisaged from the beginning, allowing people to both study and appreciate the designs.

Watermarks Project Animated GIF

 

This is a guest post by Heather Murphy, Conservator, Jordi Clopes-Masjuan, Senior Imaging Technician and Matt Lee, Senior Imaging Support Technician from the British Library Qatar Foundation Partnership. You can follow the British Library Qatar Foundation Partnership on Twitter at @BLQatar.

 

20 January 2020

Using Transkribus for Arabic Handwritten Text Recognition

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Twitter as @BL_AdiKS.

 

In the last couple of years we’ve teamed up with PRImA Research Lab in Salford to run competitions for automating the transcription of Arabic manuscripts (RASM2018 and RASM2019), in an ongoing effort to identify good solutions for Arabic Handwritten Text Recognition (HTR).

I’ve been curious to test our Arabic materials with Transkribus – one of the leading tools for automating the recognition of historical documents. We’ve already tried it out on items from the Library’s India Office collection as well as early Bengali printed books, and we were pleased with the results. Several months ago the British Library joined the READ-COOP – the cooperative taking up the development of Transkribus – as a founding member.

As with other HTR tools, Transkribus’ HTR+ engine cannot start automatic transcription straight away, but first needs to be trained on a specific type of script and handwriting. This is achieved by creating a training dataset – a transcription of the text on each page, as accurate as possible, and a segmentation of the page into text areas and lines, demarcating the exact location of the text. Training sets are therefore comprised of a set of images and an equivalent set of XML files containing the location and transcription of the text.

A screenshot from Transkribus, showing the segmentation and transcription of a page from Add MS 7474.
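To give a sense of what those XML files contain, here is a minimal sketch (with a placeholder file name) that uses Python's standard library to pull the line coordinates and transcriptions out of a PAGE XML ground-truth file of the kind produced by Aletheia:

```python
# A minimal sketch: pull the text line coordinates and transcriptions out of a
# PAGE XML ground-truth file (placeholder file name). The namespace is read
# from the file because it varies between PAGE format versions.
import xml.etree.ElementTree as ET

tree = ET.parse("ground_truth_page.xml")
root = tree.getroot()
ns = {"p": root.tag.split("}")[0].strip("{")}    # whichever PAGE namespace is in use

lines = []
for text_line in root.iter(f"{{{ns['p']}}}TextLine"):
    coords = text_line.find("p:Coords", ns)
    unicode_el = text_line.find("p:TextEquiv/p:Unicode", ns)
    if unicode_el is not None and unicode_el.text:
        lines.append((coords.get("points") if coords is not None else "",
                      unicode_el.text))

print(f"{len(lines)} transcribed lines found")
```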

 

This process can be done in Transkribus, but in this case I already had a training set created using PRImA’s software Aletheia. I used the dataset created for the competitions mentioned above: 120 transcribed and ground-truthed pages from eight manuscripts digitised and made available through QDL. This dataset is now freely accessible through the British Library’s Research Repository.

Transkribus recommends creating a training set of at least 75 pages (between 5,000 and 15,000 words); however, I was interested to find out a few things. First, the methods submitted for the RASM2019 competition worked on a training set of 20 pages, with an evaluation set of 100 pages. Therefore, I wanted to see how Transkribus’ HTR+ engine dealt with the same scenario. It should be noted that the RASM2019 methods were evaluated using PRImA’s evaluation methods, which is not the case with Transkribus’ evaluation method – therefore, the results shown here are not directly comparable, but give some idea of how Transkribus performed on the same training set.

I created four different models to see how Transkribus’ recognition algorithms deal with a growing training set. The models were created as follows:

  • Training model of 20 pages, and evaluation set of 100 pages
  • Training model of 50 pages, and evaluation set of 70 pages
  • Training model of 75 pages, and evaluation set of 45 pages
  • Training model of 100 pages, and evaluation set of 20 pages

The graphs below show each of the four iterations, from top to bottom:

CER of 26.80% for a training set of 20 pages

CER of 19.27% for a training set of 50 pages

CER of 15.10% for a training set of 75 pages

CER of 13.57% for a training set of 100 pages

The results can be summed up in a table:

| Training Set (pp.) | Evaluation Set (pp.) | Character Error Rate (CER) | Character Accuracy |
|---|---|---|---|
| 20 | 100 | 26.80% | 73.20% |
| 50 | 70 | 19.27% | 80.73% |
| 75 | 45 | 15.10% | 84.90% |
| 100 | 20 | 13.57% | 86.43% |

 

Indeed, the accuracy improved with each iteration of training – the more training data the neural networks in Transkribus’ HTR+ engine have, the better the results. With a training set of 100 pages, Transkribus managed to automatically transcribe the remaining 20 pages with an accuracy rate of 86.43% – which is pretty good for historical handwritten Arabic script.
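Character accuracy here is simply 100% minus the CER. As an illustration of what the metric measures (not Transkribus' or PRImA's actual evaluation code), CER can be computed as the edit distance between the ground-truth and automatic transcriptions, divided by the length of the ground truth:

```python
# A minimal sketch of Character Error Rate as normalised Levenshtein distance.
# This illustrates the metric, not the evaluation code used by Transkribus.
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between the two strings, divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

# One wrong character in eleven gives a CER of about 9%.
print(character_error_rate("كتاب المدخل", "كتاب المدحل"))
```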

As a next step, we could consider (1) adding more ground-truthed pages from our manuscripts to increase the size of the training set, and by that improve HTR accuracy; (2) adding other open ground truth datasets of handwritten Arabic to the existing training set, and checking whether this improves HTR accuracy; and (3) running a few manuscripts from QDL through Transkribus to see how its HTR+ engine transcribes them. If accuracy is satisfactory, we could see how to scale this up and make those transcriptions openly available and easily accessible.

In the meantime, I’m looking forward to participating at the OpenITI AOCP workshop entitled “OCR and Digital Text Production: Learning from the Past, Fostering Collaboration and Coordination for the Future,” taking place at the University of Maryland next week, and catching up with colleagues on all things Arabic OCR/HTR!

 

03 October 2019

BL Labs Symposium (2019): Book your place for Mon 11-Nov-2019

Posted by Mahendra Mahey, Manager of BL Labs

The BL Labs team are pleased to announce that the seventh annual British Library Labs Symposium will be held on Monday 11 November 2019, from 9:30 - 17:00* (see note below) in the British Library Knowledge Centre, St Pancras. The event is FREE, and you must book a ticket in advance to reserve your place. Last year's event was the largest we have ever held, so please don't miss out and book early!

*Please note, that directly after the Symposium, we have teamed up with an interactive/immersive theatre company called 'Uninvited Guests' for a specially organised early evening event for Symposium attendees (the full cost is £13 with some concessions available). Read more at the bottom of this posting!

The Symposium showcases innovative and inspiring projects which have used the British Library’s digital content. Last year's Award winners drew attention to artistic, research, teaching & learning, and commercial activities that used our digital collections.

The annual event provides a platform for the development of ideas and projects, facilitating collaboration, networking and debate in the Digital Scholarship field, as well as focusing on the creative reuse of the British Library's and other organisations' digital collections and data in many other sectors. Read what groups of Master's students in Library and Information Science from City, University of London (#CityLIS) said about the Symposium last year.

We are very proud to announce that this year's keynote will be delivered by scientist Armand Leroi, Professor of Evolutionary Biology at Imperial College, London.

Professor Armand Leroi from Imperial College
will be giving the keynote at this year's BL Labs Symposium (2019)

Professor Armand Leroi is an author, broadcaster and evolutionary biologist.

He has written and presented several documentary series on Channel 4 and BBC Four. His latest documentary was The Secret Science of Pop for BBC Four (2017), presenting the results of an analysis of over 17,000 Western pop songs from 1960 to 2010 from the US Billboard Hot 100 charts, carried out together with colleagues from Queen Mary University of London, with further work published through the Royal Society. Armand has a special interest in how we can apply techniques from evolutionary biology to ask important questions about culture, humanities and what is unique about us as humans.

Previously, Armand presented Human Mutants, a three-part documentary series about human deformity for Channel 4, and wrote an award-winning accompanying book, Mutants: On Genetic Variety and the Human Body. He also wrote and presented a two-part series, What Makes Us Human, also for Channel 4. On BBC Four, Armand presented the documentaries What Darwin Didn't Know and Aristotle's Lagoon, also releasing the book The Lagoon: How Aristotle Invented Science, looking at Aristotle's impact on science as we know it today.

Armand's keynote will reflect on his interest and experience in applying techniques he has used over many years from evolutionary biology, such as bioinformatics, data-mining and machine learning, to ask meaningful 'big' questions about culture, humanities and what makes us human.

The title of his talk will be 'The New Science of Culture'. Armand will follow in the footsteps of previous prestigious BL Labs keynote speakers: Dan Pett (2018); Josie Fraser (2017); Melissa Terras (2016); David De Roure and George Oates (2015); Tim Hitchcock (2014); Bill Thompson and Andrew Prescott in 2013.

The symposium will be introduced by the British Library's new Chief Librarian, Liz Jolly. The day will include an update and exciting news from Mahendra Mahey (BL Labs Manager at the British Library) about the work of BL Labs, highlighting innovative collaborations it has been working on, including how it is working with Labs around the world to share experiences, knowledge and lessons learned. There will be news from the Digital Scholarship team about the exciting projects they have been working on, such as Living with Machines and other initiatives, together with a special insight from the British Library’s Digital Preservation team into how they attempt to preserve our digital collections and data for future generations.

Throughout the day, there will be several announcements and presentations showcasing work from projects nominated for the BL Labs Awards 2019, which recognise work that has used the British Library’s digital content in artistic, research, educational and commercial activities.

There will also be a chance to find out who has been nominated and recognised for the British Library Staff Award 2019 which highlights the work of an outstanding individual (or team) at the British Library who has worked creatively and originally with the British Library's digital collections and data (nominations close midday 5 November 2019).

As is our tradition, the Symposium will have plenty of opportunities for networking throughout the day, culminating in a reception for delegates and British Library staff to mingle and chat over a drink and nibbles.

Finally, we have teamed up with the interactive/immersive theatre company 'Uninvited Guests', who will give a specially organised performance for BL Labs Symposium attendees directly after the symposium. This participatory performance will take the audience on a journey through a world that is on the cusp of a technological disaster. Our period of history could vanish forever from human memory because digital information will be wiped out for good. How can we leave a trace of our existence to those born later? Don't miss out on a chance to book this unique event at 5pm, specially organised to coincide with the end of the BL Labs Symposium. For more information, and for booking (spaces are limited), please visit here (the full cost is £13, with some concessions available). Please note, if you are unable to join the 5pm show, there will be another performance at 19:45 the same evening (book here for that one).

So don't forget to book your place for the Symposium today as we predict it will be another full house again and we don't want you to miss out.

We look forward to seeing new faces and meeting old friends again!

For any further information, please contact [email protected]

13 September 2019

Results of the RASM2019 Competition on Recognition of Historical Arabic Scientific Manuscripts

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Twitter as @BL_AdiKS.

 

Earlier this year, the British Library in collaboration with PRImA Research Lab and the Alan Turing Institute launched a competition on the Recognition of Historical Arabic Scientific Manuscripts, or in short, RASM2019. This competition was held in the context of the 15th International Conference on Document Analysis and Recognition (ICDAR2019). It was the second competition of this type, following RASM2018 which took place in 2018.

The Library has an extensive collection of Arabic manuscripts, comprising almost 15,000 works. We have been digitising several hundred manuscripts as part of the British Library/Qatar Foundation Partnership, making them available on the Qatar Digital Library. A natural next step would be the creation of machine-readable content from scanned images, for enhanced search and whole new avenues of research.

Running a competition helps us identify software providers and tool developers, as well as introducing us to the specific challenges that pattern recognition systems face when dealing with historic, handwritten materials. For this year’s competition we provided a ground truth set of 120 images and associated XML files: 20 pages to be used to train text recognition systems to automatically identify Arabic script, and 100 pages to evaluate the training.

Aside from providing larger training and evaluation sets, for this year’s competition we’ve added an extra challenge – marginalia. Notes written in the margins are often less consistent and less coherent than main blocks of text, and can go in different directions. The competition set out three different challenges: page segmentation, text line detection and Optical Character Recognition (OCR). Tackling marginalia was a bonus challenge!

We had just one submission for this year’s competition – RDI Company, Cairo University, who previously participated in 2018 and did very well. RDI submitted three different methods, and participated in two challenges: text line segmentation and OCR. When evaluating the results, PRImA compared established systems used in industry and academia – Tesseract 4.0, ABBYY FineReader Engine 12 (FRE12), and Google Cloud Vision API – to RDI’s submitted methods. The evaluation approach was the same as last year’s, with PRImA evaluating page analysis and recognition methods using different evaluation metrics, in order to gain an insight into the algorithms.

 

Results

Challenge 1 - Page Layout Analysis

The first challenge set out to identify regions in a page, and find out where blocks of text are located. RDI did not participate in this challenge, therefore an analysis was made only of the common industry software mentioned above. The results can be seen in the chart below:

Chart showing RASM2019 page segmentation results

 

Google did relatively well here, and the results are quite similar to last year’s. Despite dealing with the more challenging marginalia text, Google’s previous accuracy score (70.6%) has gone down only very slightly to a still impressive 69.3%.

Example image showing Google’s page segmentation

 

Tesseract 4 and FRE12 scored very similarly, with Tesseract decreasing from last year’s 54.5%. Interestingly, FRE12’s performance on text blocks including marginalia (42.5%) was better than last year’s FRE11 performance without marginalia, scoring at 40.9%. Analysis showed that Tesseract and FRE often misclassified text areas as illustrations, with FRE doing better than Tesseract in this regard.

 

Challenge 2 - Text Line Segmentation

The second challenge looked into segmenting text into distinct text lines. RDI submitted three methods for this challenge, all of which returned the text lines of the main text block (as they did not wish to participate in the marginalia challenge). Results were then compared with Tesseract and FineReader, and are reflected below:

Chart showing RASM2019 text line segmentation results

 

RDI did very well with its three methods, with an accuracy level ranging between 76.6% and 77.6%. However, despite not attempting to segment marginalia text lines, their methods did not perform as well as last year’s method (which achieved 81.6% accuracy). Their methods did seem to detect some marginalia, though very little overall, as seen in the screenshot below.

Example image showing RDI’s text line segmentation results

 

Tesseract and FineReader again scored lower than RDI, both with decreasing accuracy compared to RASM2018’s results (Tesseract 4 with 44.2%, FRE11 with 43.2%). This is due to the additional marginalia challenge. The Google method does not detect text lines; therefore, the text line chart above does not include its results.

 

Challenge 3 - OCR Accuracy

The third and last challenge was all about text recognition, tackling the correct identification of characters and words in the text. Evaluation for this challenge was conducted four times: 1) on the whole page, including marginalia, 2) only on main blocks of text, excluding marginalia, 3) using the original texts, and 4) using normalised texts. Text normalisation was performed for both ground truth and OCR results, due to the historic nature of the material, occasional unusual spelling, and use/lack of diacritics. All methods performed slightly better when not tested on marginalia; accuracy rates are demonstrated in the charts below:

Chart showing OCR accuracy results for main text body only (normalised, no marginalia)

Chart showing OCR accuracy results for all text regions (normalised, with marginalia)

 

It is evident that there are minor differences in the character accuracies for the three RDI methods, with RDI2 performing slightly better than the others. When comparing the OCR accuracy between texts with and without marginalia, there are slightly higher success rates for the latter, though the difference is not significant. This means that tested methods performed on the marginalia almost as well as they did on the main text, which is encouraging.

Compared with RASM2018’s results, RDI’s results are good but not as good as last year’s (85.44% accuracy), likely a result of adding marginalia to the recognition challenge. Google performed very well too, considering they did not specifically train or optimise for this competition. Tesseract’s results went down from 30.45% to 25.13%, and FineReader Engine 12 performed better than its previous version FRE11, going up from 12.23% to 17.53% accuracy. However, this is still very low, as handwritten texts are not part of their target material.
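The normalisation step mentioned above is worth a brief illustration. The exact rules used in the evaluation are not spelled out here, but a typical Arabic normalisation (shown below as an assumption, not PRImA's implementation) strips the short-vowel diacritics and tatweel and folds common letter variants together before strings are compared:

```python
# A typical Arabic normalisation step, shown as an assumption rather than the
# rules actually used in the RASM2019 evaluation: remove short-vowel diacritics
# and tatweel, and fold common letter variants together.
import re

DIACRITICS = re.compile("[\u064B-\u0652\u0670]")   # harakat and dagger alif
TATWEEL = "\u0640"

def normalise_arabic(text: str) -> str:
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)   # alif variants -> bare alif
    text = text.replace("\u0649", "\u064A")                 # alif maqsura -> ya
    text = text.replace("\u0629", "\u0647")                 # ta marbuta -> ha
    return text

print(normalise_arabic("الكِتَاب"))   # prints the word without its diacritics
```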

 

Further Thoughts

RDI-Corporation has its own historical Arabic handwritten and typewritten OCR system, which has been built using different historical manuscripts. Its methods have done well, given the very challenging nature of the documents. Neither Tesseract nor ABBYY FineReader produces usable results, but that’s not surprising since they are both optimised for printed texts and target contemporary material rather than historical manuscripts.

As next steps, we would like to test these materials with Transkribus, which produced promising results for early printed Indian texts (see e.g. Tom Derrick’s blog post – stay tuned for some even more impressive results!), and potentially Kraken as well. All ground truth will be released through the Library’s future Open Access repository (now in testing phase), as well as through the website of IMPACT Centre for Competence. Watch this space for any developments!

 

14 June 2019

Palestine Open Maps mapathon: follow up and data usage experiments

This guest post is by Majd Al-Shihabi, a systems design engineer and urban planning graduate student at the American University of Beirut. He is the inaugural recipient of the Bassel Khartabil Free Culture Fellowship. You can find him on Twitter as @majdal.

 

Last Saturday, the British Library hosted a mapathon run by the Palestine Open Maps team to vectorise the content of 155 maps made at 1:20,000 scale by the British Mandate of Palestine.

Before the mapathon itself, I visited the maps collection at the Library, and after working with the maps for almost two years, I finally saw the original maps in physical form.

About 35 mappers participated in the mapathon, and they vectorised content covering most of historic Palestine. The flashing features in the animation below are the ones created through the mapathon.

Animation of the features vectorised during the mapathon (hosted on imgur.com)

They include hundreds of features, including cisterns, schools, police stations, places of worship, parts of the road network, residential areas, and more.

Some of the features, such as towns, had Wikipedia articles and Wikidata items, which we linked to the map data as well.

Often, we are asked what happens to the data that we produce through these mapathons. First and foremost, it is available for download here, under an Open Data Commons Attribution License.

The data is already being used by other projects. For example, Ahmad Barclay, a partner in the Palestine Open Maps project, has collaborated with the Palestinian Oral History Archive, to map all landmarks mentioned in testimonies by Palestinians recounting life in Palestine before the 1948 Nakba. The result is a map that serves as a spatial way of navigating oral history. View the map here.

 

19 March 2019

BL Labs 2018 Commercial Award Runner Up: 'The Seder Oneg Shabbos Bentsher'

This guest blog was written by David Zvi Kalman on behalf of the team that received the runner up award in the 2018 BL Labs Commercial category.


The bentsher is a strange book, both invisible and highly visible. It is not among the better-known Jewish books, like the prayerbook, Hebrew Bible, or haggadah. You would be hard-pressed to find a general-interest bookstore selling a copy. Still, enter the house of a traditional Jew and you’d likely find at least a few, possibly a few dozen. In Orthodox communities, the bentsher is arguably the most visible book of all.

Bentshers are handbooks containing the songs and blessings, including the Grace after Meals, that are most useful for Sabbath and holiday meals, as well as larger gatherings. They are, as a rule, quite small. These days, bentshers are commonly given out as party favors at Jewish weddings and bar/bat mitzvahs, since meals at those events require them anyway. Many bentshers today have personalized covers relating the events at which they were given.

Bentshers have never gone out of print. By this I mean that printing began with the invention of the printing press and has never stopped. They are small, but they have always been useful. Seder Oneg Shabbos, the version which I designed, was released 500 years after the first bentsher was published. It is, in a sense, a Half Millennium Anniversary Special Edition.


Bentshers, like other Jewish books, could be quite ornate; some were written and illustrated by hand. Over the years, however, bentshers have become less and less interesting, largely in order to lower the unit cost. In order to make it feasible for wedding planners to order hundreds at a time, all images were stripped from the books, the books themselves became very small, and any interest in elegant typography was quickly eliminated. My grandfather, who designed custom covers for wedding bentshers, simply called the book, “the insert.” Custom prayerbooks were no different from custom matchbooks.

This particular bentsher was created with the goal of bucking this trend; I attempted to give the book the feel of some of the Jewish books and manuscripts of the past, using the research I was able to gather as a graduate student in the field of Jewish history. Doing this required a great deal of image research; for this, the British Library’s online resources were incredibly valuable. Of the more than one hundred images in the book, a plurality are from the British Library’s collections.

https://data.bl.uk/hebrewmanuscripts/

https://www.bl.uk/hebrew-manuscripts


In addition to its visual element, this bentsher differs from others in two important ways. First, it contains ritual language that is inclusive of those in the LGBTQ community, and especially of those conducting same-sex weddings. In addition, the book contains songs not just in Hebrew but in Yiddish as well; this was a homage to two Yiddishists who aided in creating the bentsher’s content. The bentsher was first used at their wedding.


More here: https://shabb.es/sederonegshabbos/

Watch David accepting the runner up award and talking about the Seder Oneg Shabbos Bentsher on our YouTube channel (clip runs from 5.33 to 7.26): 

David Zvi Kalman was responsible for the book’s design, including the choice of images. He is a doctoral candidate at the University of Pennsylvania, where he focuses on the relationship between Jewish history and the history of technology. Sarah Wolf is a specialist in rabbinics and is an assistant professor at the Jewish Theological Seminary of America. Joshua Schwartz is a doctoral student at New York University, where he studies Jewish mysticism. Sarah and Joshua were responsible for most of the book’s translations and transliterations. Yocheved and Yudis Retig are Yiddishists and were responsible for the book’s Yiddish content and translations.

Find out more about Digital Scholarship and BL Labs. If you have a project which uses British Library digital content in innovative and interesting ways, consider applying for an award this year! The 2019 BL Labs Symposium will take place on Monday 11 November at the British Library.
