29 April 2015
The British Library Machine Learning Experiment
The British Library Big Data Experiment is an ongoing collaboration between British Library Digital Research and UCL Department of Computer Science, facilitated by UCL Centre for Digital Humanities, that enables and engages students in computer science with humanities research and digital libraries as part of their core assessed work.
The experiment plays host to undergraduate and postgraduate student projects that provide the Digital Research team with an experimental test-bed for developing, exploring and exploiting technical infrastructure and digital content in ways that may benefit humanities researchers. It enables Computer Science students to develop skills in a new (and often foreign) domain, and it encourages critical thinking and questioning of their assumptions about the role of library and humanities scholars through real-world, complex projects that stretch and develop both their technical abilities and their understanding of user requirements. Further, having Computer Science students engage with Humanities scholars as a routine part of this work creates deeper mutual understanding of research needs and discipline-specific practices.
The 'big data' in question here is a collection of circa 68,000 digitised Public Domain volumes from the 16th to 19th centuries. The data contains both text derived from optical character recognition and over 1 million illustrations, of which little is known apart from the size of each image and the volume and page on which it appears (for more on the dataset see Ben O'Steen 'A million first steps').
The latest output from the project - the British Library Machine Learning Experiment - is led by a BSc systems engineering module team (Durrant, Rafdi, Sarraf). Together the team designed a public service built around a range of open source services and software (MongoDB, Heroku, Node.js, Weka). This service indexes a subset of the 1 million image collection using tags generated by two public image recognition APIs (Alchemy and Imagga) and a bespoke algorithm. The service returns confidence values and implements features that allow users not only to search for tags but also to browse by tag and by frequently co-occurring tags. The interface even allows a user to tag a random image themselves to see how quickly image recognition APIs can assign tags to images.
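To make the idea of searching by tag and browsing by co-occurring tags more concrete, here is a minimal sketch in Python. It is not the team's code (the service itself is built on Node.js and MongoDB), and everything in it is an assumption made for illustration: the sample image ids, the tags and confidence values, the 0.5 confidence threshold, and the function names.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical tagger output: image id -> list of (tag, confidence) pairs.
# These ids, tags and scores are invented for illustration only.
SAMPLE_TAGS = {
    "img-001": [("church", 0.91), ("building", 0.84), ("tree", 0.40)],
    "img-002": [("bird", 0.88), ("tree", 0.75)],
    "img-003": [("church", 0.79), ("tree", 0.66), ("building", 0.30)],
}


def build_index(image_tags, min_confidence=0.5):
    """Map each tag to {image id: confidence}, dropping low-confidence tags."""
    index = defaultdict(dict)
    for image_id, tags in image_tags.items():
        for tag, confidence in tags:
            if confidence >= min_confidence:
                index[tag][image_id] = confidence
    return index


def search(index, tag):
    """Return the image ids carrying a tag, highest confidence first."""
    hits = index.get(tag, {})
    return sorted(hits, key=hits.get, reverse=True)


def co_occurrences(image_tags, min_confidence=0.5):
    """Count how often each pair of confident tags appears on the same image."""
    counts = Counter()
    for tags in image_tags.values():
        kept = sorted({tag for tag, conf in tags if conf >= min_confidence})
        counts.update(combinations(kept, 2))
    return counts


def related_tags(counts, tag):
    """List tags that co-occur with the given tag, most frequent first."""
    related = Counter()
    for (first, second), n in counts.items():
        if first == tag:
            related[second] += n
        elif second == tag:
            related[first] += n
    return [other for other, _ in related.most_common()]


index = build_index(SAMPLE_TAGS)
counts = co_occurrences(SAMPLE_TAGS)
```

With the sample data, `search(index, "church")` lists the two church images in descending order of confidence, and `related_tags(counts, "church")` returns the tags that share an image with 'church'. Raising or lowering the hypothetical `min_confidence` threshold trades recall against the number of doubtful tags that reach the index.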
The British Library Machine Learning Experiment can be found at http://blbigdata.herokuapp.com/. A video demonstration detailing the service functionality is embedded below. It is clear from using the experimental service that machine learning approaches to image recognition remain a maturing field. Nevertheless, as was underscored by a British Library Labs event last year on large scale image analysis (see my notes from the event), significant advances have been made in recent years. Searches of the British Library Machine Learning Experiment for the tags 'animal', 'bird', or 'church' confirm this trend.
Code from the British Library Machine Learning Experiment is available for reuse under an MIT licence. As this project is very much an experiment, we welcome your feedback via this blog, an email, or GitHub.
Rafdi, Muhammad; Sarraf, Ali; Durrant, James; Baker, James (2015). British Library Machine Learning Experiment. Zenodo. 10.5281/zenodo.17168
Curator, Digital Research
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exceptions: embeds to and from external sources