Digital scholarship blog

218 posts categorized "Experiments"

05 February 2018

8th Century Arabic science meets today's computer science

Or, Announcing a Competition for the Automatic Transcription of Historical Arabic Scientific Manuscripts 

“An impartial view of Digital Humanities (DH) scholarship in the present day reveals a stark divide between ‘the West and the rest’…Far fewer large-scale DH initiatives have focused on Asia and the non-Western world than on Western Europe and the Americas…Digital databases and text corpora – the ‘raw material’ of text mining and computational text analysis – are far more abundant for English and other Latin alphabetic scripts than they are for Chinese, Japanese, Korean, Sanskrit, Hindi, Arabic and other non-Latin orthographies…Troves of unread primary sources lie dormant because no text mining technology exists to parse them.”

-Dr. Thomas Mullaney, Associate Professor of Chinese History at Stanford University

Supporting the use of Asian & African Collections in digital scholarship means shining a light on this stark divide and seeking ways to close the gap. In this spirit, we are excited to announce the ICFHR2018 Competition on Recognition of Historical Arabic Scientific Manuscripts.


The Competition

Drawing together experts from the British Library, The Alan Turing Institute, the Qatar Digital Library and the PRImA Research Lab, our aim in launching this competition is to play an active role in advancing the state of the art in handwritten text recognition technologies for Arabic. For our first challenge we are focussing on finding an optimal solution for accurately and automatically transcribing handwritten historical Arabic scientific manuscripts.

Though such technologies are still in their infancy, unlocking historical handwritten Arabic manuscripts for large-scale text analysis has the potential to truly transform research. In conjunction with the competition we hope to build, and make freely available, a substantial image and ground-truth dataset to support continued efforts in this area.

Enter the Competition

Organisers

  • Apostolos Antonacopoulos, Professor of Pattern Recognition, University of Salford, and Head of the Pattern Recognition and Image Analysis (PRImA) research lab
  • Christian Clausner, Research Fellow at the PRImA research lab
  • Nora McGregor, Digital Curator at the British Library, Asian & African Collections
  • Daniel Lowe, Curator at the British Library, Arabic Collections
  • Daniel Wilson-Nunn, PhD student at the University of Warwick and Turing PhD Student based at The Alan Turing Institute
  • Bink Hallum, Arabic Scientific Manuscripts Curator at the British Library/Qatar Foundation Partnership

Further reading

For more on recent Digital Research Team text recognition and transcription projects see:

 

This post is by Nora McGregor, Digital Curator, British Library. She is on Twitter as @ndalyrose.

01 February 2018

BL Labs 2017 Symposium: A large-scale comparison of world music corpora with computational tools, Research Award Winner


By Maria Panteli, Emmanouil Benetos, and Simon Dixon from the Centre for Digital Music, Queen Mary University of London

The comparative analysis of world music cultures has been the focus of several ethnomusicological studies in the last century. With the advances of Music Information Retrieval and the increased accessibility of sound archives, large-scale analysis of world music with computational tools is today feasible. We combine music recordings from two archives, the Smithsonian Folkways Recordings and the British Library Sound Archive, to create one of the largest world music corpora studied so far (8,200 geographically balanced recordings sampled from a total of 70,000 recordings). This work won the 2017 British Library Labs Award in the Research category.

Our aim is to explore relationships of music similarity between different parts of the world. The history of cultural exchange goes back many years and music, an essential cultural identifier, has travelled beyond country borders. But is this true for all countries? What if a country is geographically isolated or its society resisted external musical influence? Can we find such music examples whose characteristics stand out from other musics in the world? By comparing folk and traditional music from 137 countries we aim to identify geographical areas that have developed a unique musical character.

Maria Panteli fig 1

Methodology: Signal processing and machine learning methods are combined to extract meaningful music representations from the sound recordings. Data mining methods are applied to explore music similarity and identify outlier recordings.

We use digital signal processing tools to extract music descriptors from the sound recordings capturing aspects of rhythm, timbre, melody, and harmony. Machine learning methods are applied to learn high-level representations of the music and the outcome is a projection of world music recordings to a space respecting music similarity relations. We use data mining methods to explore this space and identify music recordings that are most distinct compared to the rest of our corpus. We refer to these recordings as ‘outliers’ and study their geographical patterns. More details on the methodology are provided here.
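The outlier detection described above can be sketched in a few lines. This is a simplified stand-in, not the study's actual pipeline: here each recording is a standardised feature vector, and recordings far from the corpus centroid (beyond a percentile cutoff, an illustrative assumption) are flagged as outliers.

```python
import numpy as np

def find_outliers(features, threshold_percentile=95):
    """Flag recordings whose feature vectors lie far from the corpus centroid.

    features: (n_recordings, n_features) array of music descriptors.
    Returns a boolean mask marking the outlier recordings.
    """
    # Standardise each feature dimension (zero mean, unit variance)
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-9
    z = (features - mu) / sigma
    # Distance of each recording from the centroid in standardised space
    dist = np.linalg.norm(z, axis=1)
    # Recordings beyond the chosen percentile are treated as outliers
    cutoff = np.percentile(dist, threshold_percentile)
    return dist > cutoff
```

With a 95th-percentile cutoff, roughly 5% of recordings are flagged; the geographical patterns come from grouping those flags by country.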

 

  Maria Panteli fig 2

 

Distribution of outliers per country: The colour scale corresponds to the normalised number of outliers per country, where 0% indicates that none of the recordings of the country were identified as outliers and 100% indicates that all of the recordings of the country are outliers.
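The normalisation in the caption is simply each country's outlier count divided by its total number of recordings; as a minimal sketch (function and input names are illustrative):

```python
def outlier_percentage(outlier_counts, total_counts):
    """Normalised outliers per country, as a percentage.

    outlier_counts: country -> number of recordings flagged as outliers.
    total_counts:   country -> total number of recordings sampled.
    """
    return {country: 100.0 * outlier_counts[country] / total_counts[country]
            for country in total_counts}
```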

We observed that out of 137 countries, Botswana had the most outlier recordings compared to the rest of the corpus. Music from China, characterised by bright timbres, was also found to be relatively distinct compared to music from its neighbouring countries. Analysis with respect to different features revealed that African countries such as Benin and Botswana had the largest number of rhythmic outliers, with recordings often featuring polyrhythms. Harmonic outliers originated mostly from South and Southeast Asian countries such as Pakistan and Indonesia, and African countries such as Benin and Gambia, with recordings often featuring inharmonic instruments such as the gong and bell. You can explore and listen to music outliers in this interactive visualisation. The datasets and code used in this project are included in this link.

Maria Panteli fig 3

Interactive visualisation to explore and listen to music outliers.

This line of research makes a large-scale comparison of recorded music possible, a significant contribution to ethnomusicology, and one we believe will help us better understand the music cultures of the world.

Posted by British Library Labs.

 

29 January 2018

BL Labs 2017 Symposium: Face Swap, Artistic Award Runner Up

Blog post by Tristan Roddis, Director of web development at Cogapp.

The genesis of this entry to the BL Labs Awards 2017 (Artistic Award Runner Up) can be traced back to an internal Cogapp hackathon in July. There I paired up with my colleague Jon White to create a system that was to be known as “the eyes have it”: the plan was to show the user's webcam feed with two boxes overlaid for eyes, and they would have to move their face into position, whereupon the whole picture would morph into a portrait painting that had its eyes in the same locations.

So we set to work using OpenCV and Python to detect faces and eyes in both live video and a library of portraits from the National Portrait Gallery.

We quickly realised that this wasn’t going to work:

Green rectangles are what OpenCV thinks are eyes. I have too many. 

It turns out that eye detection is a bit too erratic, so we changed tack and only considered the whole face instead. I created a Python script to strip out the coordinates for faces from the portraits we had to hand, and another that would do the same for an individual frame from the webcam video sent from the browser to Python using websockets. Once we had both of these coordinates, the Python script sent the data back to the web front end, where Jon used the HTML <canvas> element to overlay the cropped portrait face exactly over the detected webcam face. As soon as we saw this in action, we realized we’d made something interesting and amusing!


And that was it for the first round of development. By the end of the day we had a rudimentary system that could successfully overlay faces on video. You can read more about that project on the Cogapp blog, or see the final raw output in this video:

A couple of months later, we heard about the British Library Labs Awards, and thought we should re-purpose this fledgling system to create something worth entering.

The first task was to swap out the source images for some from the British Library. Fortunately, the one million public domain images that the Library published on Flickr include 6,948 that have been tagged as “people”. So it was a simple matter to use a Flickr module for Python to download a few hundred of these and extract the face coordinates as before.

Once that was done, I roped in another colleague, Neil Hawkins, to help me add some improvements to the front-end display. In particular:

  • Handling more than one face in shot
  • Displaying the title of the work
  • Displaying a thumbnail image of the full source image

And that was it! The final result can be seen in the video below of us testing it in and around our office. We also plugged in a laptop running the system to a large monitor in the BL conference centre so that BL Labs Symposium delegates could experience it first-hand.

If you want to know more about this, please get in touch! Tristan Roddis [email protected]

A clip of Tristan receiving the Award is below (starts at 8:42 and finishes at 14:10).

 

23 January 2018

Using Transkribus for handwritten text recognition with the India Office Records

In this post, Alex Hailey, Curator, Modern Archives and Manuscripts, describes the Library's work with handwritten text recognition.

National Handwriting Day seems like a good time to introduce the Library’s initial work with the Transkribus platform to produce automatic Handwritten Text Recognition models for use with the India Office Records.

Transkribus is produced and supported as part of the READ project, and provides a platform 'for the automated recognition, transcription and searching of historical documents'. Users upload images and then identify areas of writing (text regions) and lines within those regions. Once a page has been segmented in this way, users transcribe the text to produce a 'ground truth' transcription – an accurate representation of the text on the page. The ground truth texts and images are then used to train a recurrent neural network to produce a tool to transcribe texts from images: a Handwritten Text Recognition (HTR) model.

Page segmented using the automated line identification tool. The document structure tree can be seen in the left panel.

After hearing about the project at the Linnean Society’s From Cabinet to Internet conference in 2015, we decided to run a small pilot project using material digitised as part of the Botany in British India project.

Producing ground truth text and Handwritten Text Recognition (HTR) models

We created an initial set of ground truth training data for 200 images, produced by India Office curators and with the help of a PhD student. This data was sent to the Transkribus team to produce our first HTR model. We also supplied material for the construction of a dictionary to be used alongside the HTR, based on the text from the botany chapter of Science and the Changing Environment in India 1780-1920 and contemporary botanical texts.

The accuracy of an HTR model can be determined by generating an automated transcription, correcting any errors, and then comparing the two versions. The Transkribus comparison tool calculates a Character Error Rate (CER) and a Word Error Rate (WER), and also provides a handy visualisation. With our first HTR model we saw an average CER of 30% and WER of 50%, which reflected the small size of the training set and the number of different hands across the collections.
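Transkribus computes these measures internally, but both are straightforward to reproduce: CER and WER are the edit (Levenshtein) distance between the ground truth and the automatic transcription, divided by the length of the ground truth, counted over characters and over word tokens respectively. A minimal illustrative implementation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions,
    deletions and substitutions all cost 1)."""
    n = len(hyp)
    prev = list(range(n + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n]

def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

A single wrong character in one word raises the WER for that whole word, which is why WER is always at least as high as CER on the same page.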

(Transkribus recommends using collections with one or two consistent hands, but we thought we would push on regardless to get an idea of the challenges when using complex, multi-authored archives).

WER and CER are quite unforgiving measures of accuracy: the image above has 18.5% WER and 9.5% CER.

For our second model we created an additional 500 pages of ground truth text, resulting in a training set of 83,358 words over 14,599 lines. We saw a marked improvement in results with this second HTR model – an average WER of 30%, and CER of 15%.

Graph showing the learning curve for our second HTR model, measured in CER

Improvements in the automatic layout detection and the ability to run the HTR over images in batch mean that we can now generate ground truth more quickly by correcting computer-produced transcriptions than we could through a fully manual process. We have since generated and corrected an additional 200 pages of transcriptions, and have expanded the training dataset for our next HTR model.

Lessons learned and next steps

We have now produced over 800 pages of corrected transcriptions using Transkribus, and have a much better idea of the challenges that the India Office material poses for current HTR technologies. Pages with margins and inconsistent paragraph widths prove challenging for the automatic layout detection, although the line identification has improved significantly, and tends to require only minor corrections (if any). Faint text, numerals, and tabulated text appeared to pose problems for our HTR models, as did particularly elaborate or lengthy ascenders and descenders.

More positively, we have signed a Memorandum of Understanding with the READ project, and are now able to take part in the exciting conversations around the transcription and searching of digitised manuscript materials, which we can hopefully start to feed into developments at the Library. The presentations from the recent Transkribus Conference are a good place to start if you want to learn more.

The transcriptions will be made available to researchers via data.bl.uk, and we are also planning to use them to test the ingest and delivery of transcriptions for manuscript material via the Universal Viewer.

By Alex Hailey, Curator, Modern Archives and Manuscripts

If you liked this post, you might also be interested in The good, the bad, and the cross-hatched on the Untold Lives blog.

22 January 2018

BL Labs 2017 Symposium: Data Mining Verse in 18th Century Newspapers by Jennifer Batt

Dr Jennifer Batt, Senior Lecturer at the University of Bristol, reported on an investigation that used text- and data-mining methods to find verse in the British Library's Burney Collection of digitised eighteenth-century newspapers, recovering a complex, expansive, ephemeral poetic culture that has been lost to us for well over 250 years. The collection equates to around 1 million pages, in roughly 700 bound volumes covering 1,271 titles of newspapers and news pamphlets published in London, as well as some English provincial, Irish and Scottish papers, and a few examples from the American colonies.

A video of her presentation is available below:

Jennifer's slides are available on SlideShare by clicking on the image below or following the link:

Datamining for verse in eighteenth-century newspapers

https://www.slideshare.net/labsbl/datamining-for-verse-in-eighteenthcentury-newsapers 

 

 

30 December 2017

The Flitch of Bacon: An Unexpected Journey Through the Collections of the British Library

Digital Curator Dr. Mia Ridge writes: we're excited to feature this guest post from an In the Spotlight participant. Edward Mills is a PhD student at the University of Exeter working on Anglo-Norman didactic literature. He also runs his own (somewhat sporadic) blog, ‘Anglo-Normantics’, and can be found Tweeting, rather more frequently, at @edward_mills.

Many readers of [Edward's] blog will doubtless be familiar with the work being done by the Digital Scholarship team, of which one particularly remarkable example is the ‘In the Spotlight’ project. The idea behind the project, for anyone who may have missed it, is absolutely fascinating: to create crowd-sourced transcriptions of part of the Library's enormous collection of playbills. The part of the project that I've been most involved with so far is concerned with titles, and it's a two-part process: first, the title is identified among the (numerous) lines of text on the page; then, once this has been verified by multiple volunteers, it is fed back into the database as an item for transcription.

In the Spotlight interface

Often, though, the titles alone are more than sufficient to pique my interest. One such intriguing morsel came to light during a recent transcribing stint, when I found myself faced with a title that raised even more questions than Love, Law, & Physic:

Playbill for a performance of The Flitch of Bacon

In my day-job, I'm actually a medievalist, which meant that any play entitled The Flitch of Bacon was bound to catch my attention. The ‘flitch’ refers to an ancient – and certainly medieval – custom in Dunmow, Essex, wherein couples who could prove that they had never once regretted their marriage for a year and a day would be awarded a ‘flitch’ (side) of bacon in recognition of their fidelity. I first came across the custom of these ‘flitch trials’ while watching an episode of the excellent Citation Needed podcast, and was intrigued to learn from there that references to the trials existed as far back as Chaucer (more on which later). The trials have an unbroken tradition stretching back centuries, and videos from 1925, 1952 and 2012 go some way towards demonstrating their continuing popularity. What the British Library project revealed, however, was that the flitch also served as the driver for artistic creation in its own right. A little bit of digging revealed that the libretto to the 1776 Flitch of Bacon farce has been digitised as part of the British Library's own collections, and the lyrics are every bit as spectacular as one might expect them to be.

Rev. Henry Bate, The Flitch of Bacon: A Comic Opera in Two Acts (London: T. Evans, 1779), p. 24.

So far, so … unique. But, of course, the medievalist that dwells deep within me couldn’t resist digging into the history of the tradition, and once again the British Library’s collections came up trumps. The official website for the Dunmow Flitch Trials (because of course such a thing exists) proudly asserts that ‘a reference … can even be found within Chaucer’s 14th-century Canterbury Tales‘, which of course can easily be checked with a quick skim through the Library’s wonderful catalogue of digitised manuscripts. The Wife of Bath’s Prologue opens with the titular wife describing her attitude towards her first three husbands, whom she ‘hadde […] hoolly in myn honde’. She keeps them so busy that they soon come to regret their marriage to her, forfeiting their right to ‘the bacoun …that som men fecche in Essex an Donmowe’ in the process:

‘The bacoun was nought fet for hem I trowe / That som men fecche in Essex an Donmowe’. From the Wife of Bath’s Tale (British Library, MS Harley 7334, fol. 89r).

Chaucer’s reference to the flitch custom is frequently taken, along with William Langland’s allusion in Piers Plowman to couples who ‘do hem to Donemowe […] To folwe for the fliche’, to be the earliest reference to the tradition that can be found in English literature. Once again, though, the British Library’s collections can help us to put this particular statement to the test; as you’ve probably guessed by now, they show that there is indeed an earlier reference to the custom waiting to be found.


Our source for this precocious French-language reference is MS Harley 4657. Like many surviving medieval manuscripts, this codex is often described as a ‘miscellany’: that is, a collection of shorter works brought together into a single volume. In the case of Harley 4657, the book appears to have been designed as a coherent whole, with the texts copied together at around the same time and sharing quires with each other; this is perhaps explained by the fact that the texts contained within it are all devotional and didactic in nature. (Miscellanies that were, by contrast, put together at a later date are known as recueils factices – another useful term, along with the ‘flitch of bacon’, to slip into conversation with friends and family members.) The bulk of the book is taken up by the Manuel des pechez, a guide to confession that was later translated into English by Robert Manning as Handling Synne. It’s in this text that the flitch custom makes an appearance, as part of a description of how many couples do not deserve any recompense for loyalty on account of their mutual mistrust (fol. 21):

The flitch custom in the Manuel des pechez (British Library, MS Harley 4657, fol. 21)

17 October 2017

Imaginary Cities – Collaborations with Technologists

Posted by Mahendra Mahey (Manager of BL Labs) on behalf of Michael Takeo Magruder (BL Labs Artist/Researcher in Residence).

In developing the Imaginary Cities project, I enlisted two long-standing colleagues to help collaboratively design the creative-technical infrastructures required to realise my artistic vision.

The first area of work sought to address my desire to create an automated system that could take a single map image from the British Library's 1 Million Images from Scanned Books Flickr Commons collection and from it generate an endless series of ever-changing aesthetic iterations. This initiative was undertaken by the software architect and engineer David Steele, who developed a server-side program to realise this concept.

David's server application links to a curated set of British Library maps through their unique Flickr URLs. The high-resolution maps are captured and stored by the server and, through a pre-defined algorithmic process, are transformed into ultra-high-resolution images that appear as mandala-esque ‘city plans’. This process of aesthetic transformation is executed once per day and is affected by two variables. The first is simply the passage of time, while the second is based on external human or network interaction with the original source maps in the digital collection (such as changes to metadata tags, view counts, etc.).


Time-lapse of algorithmically generated images (showing days 1, 7, 32 and 152) constructed from a 19th-century map of Paris

The second challenge involved transforming the algorithmically created 2D assets into real-time 3D environments that could be experienced through leading-edge visualisation systems, including VR headsets. This work was led by the researcher and visualisation expert Drew Baker, and was done using the 3D game development platform Unity. Drew produced a working prototype application that accessed the static image ‘city plans’ generated by David’s server-side infrastructure, and translated them into immersive virtual ‘cityscapes’.

The process begins with the application analysing an image bitmap and converting each pixel into a 3D geometry that is reminiscent of a building. These structures are then textured and aligned in a square grid that matches the original bitmap. Afterwards, the camera viewpoint descends into the newly rezzed city and can be controlled by the user.
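Drew's prototype was built in Unity, but the core pixel-to-building mapping can be sketched language-agnostically: each pixel's grid position becomes a building footprint, and (as one illustrative assumption about the mapping) its brightness sets the building's height.

```python
import numpy as np

def bitmap_to_city(bitmap, max_height=50.0):
    """Turn a 2D greyscale bitmap into a grid of box 'buildings'.

    Each pixel becomes one building: a unit footprint at the pixel's
    (x, z) grid position, with height proportional to the pixel's
    brightness (0-255). Returns a list of (x, z, height) tuples that
    a 3D engine could instantiate as boxes.
    """
    buildings = []
    for z, row in enumerate(np.asarray(bitmap)):
        for x, value in enumerate(row):
            height = (float(value) / 255.0) * max_height
            buildings.append((x, z, height))
    return buildings
```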

Analysis and transformation of the source image bitmap
View of the procedurally created 3D cityscape

At present I am still working with David and Drew to refine and expand these amazing systems that they have created. Moving forward, our next major task will be to successfully use the infrastructures as the foundation for a new body of artwork.

You can see me present this work at the British Library Labs Symposium 2017, in the British Library Conference Centre Auditorium in London, on Monday 30 October 2017. For more information and to book (registration is FREE), please visit the event page.

About the collaborators:

David Steele

David Steele is a computer scientist based in Arlington, Virginia, USA specialising in progressive web programming and database architecture. He has been working with a wide range of web technologies since the mid-nineties and was a pioneer in pairing cutting-edge clients to existing corporate infrastructures. His work has enabled a variety of advanced applications from global text messaging frameworks to re-entry systems for the space shuttle. He is currently Principal Architect at Crunchy Data Solutions, Inc., and is involved in developing massively parallel backup solutions to protect the world's ever-growing data stores.

Drew Baker

Drew Baker is an independent researcher based in Melbourne, Australia. Over the past 20 years he has worked in the visualisation of archaeology and cultural history. His explorations in 3D digital representation of spaces and artefacts as a research tool for both virtual archaeology and broader humanities applications laid the foundations for the London Charter, establishing internationally recognised principles for the use of computer-based visualisation by researchers, educators and cultural heritage organisations. He is currently working with a remote community of Indigenous Australian elders from the Warlpiri nation in the Northern Territory's Tanami Desert, digitising their intangible cultural heritage assets for use within the Kurdiji project – an initiative that seeks to improve mental health and resilience in the nation's young people through the use of mobile technologies.

26 September 2017

BL Labs Symposium (2017), Mon 30 Oct: book your place now!


Posted by Mahendra Mahey, BL Labs Manager

The BL Labs team are pleased to announce that the fifth annual British Library Labs Symposium will be held on Monday 30 October, from 9:30 - 17:30 in the British Library Conference Centre, St Pancras. The event is FREE, although you must book a ticket in advance. Don't miss out!

The Symposium showcases innovative projects which use the British Library’s digital content, and provides a platform for development, networking and debate in the Digital Scholarship field.

Josie Fraser will be giving the keynote at this year's Symposium

This year, Dr Adam Farquhar, Head of Digital Scholarship at the British Library, will launch the Symposium and Josie Fraser, Senior Technology Adviser on the National Technology Team, based in the Department for Digital, Culture, Media and Sport in the UK Government, will be presenting the keynote. 

There will be presentations from BL Labs Competition (2016) runners up, artist/researcher Michael Takeo Magruder about his 'Imaginary Cities' project and lecturer/researcher Jennifer Batt about her 'Datamining verse in Eighteenth Century Newspapers' project.

After lunch, the winners of the BL Labs Awards (2017) will be announced, followed by presentations of their work. The Awards celebrate researchers, artists, educators and entrepreneurs from around the world who have made use of the British Library's digital content and data, in each of the Awards' categories:

  • BL Labs Research Award. Recognising a project or activity which shows the development of new knowledge, research methods or tools.
  • BL Labs Artistic Award. Celebrating a creative or artistic endeavour which inspires, stimulates, amazes and provokes.
  • BL Labs Commercial Award. Recognising work that delivers or develops commercial value in the context of new products, tools or services that build on, incorporate or enhance the British Library's digital content.
  • BL Labs Teaching / Learning Award. Celebrating quality learning experiences created for learners of any age and ability that use the British Library's digital content.
  • BL Labs Staff Award. Recognising an outstanding individual or team who have played a key role in innovative work with the British Library's digital collections.  

The Symposium's endnote will be followed by a networking reception to conclude the event, at which delegates and staff can mingle over a drink.

Tickets are going fast, so book your place for the Symposium today!

For any further information please contact [email protected]
