Digital scholarship blog

Enabling innovative research with British Library digital collections


04 September 2015

What makes the Crowdsourcing Arcade Machine tick?


Can crowdsourcing be done in public? I've spent a few days building a large arcade-style cabinet, tough and rugged enough for the general public to interact with. There is no external keyboard or mouse, but otherwise you can think of it as a normal computer.

The joystick and two buttons are a constraint, intended to encourage more casual applications and use. Can a machine that looks like it has come from the 1980s help with crowdsourcing applications? Are there any games that can both run within these constraints AND provide data about cultural collections?

To start this conversation properly, we have just launched a Game Jam: https://itch.io/jam/britishlibrary. This is open to anyone who wants to write something that fits with this machine. We are interested in prototypes, fully functioning games, or even just ideas of what might make for a fun game. The only key point is that there is some aspect of the game which might tell us something interesting and new about our collections.

Specification:

  • Raspberry Pi 2 - quad-core 900MHz by default, but can be overclocked if necessary.
    • Running Raspbian by default, but it will run whatever flavour of OS is needed for a game.
  • 4:3 LCD screen, up to 1280x1024 screen resolution.
  • It should have a wifi connection in most locations (however, as you might expect with wifi, it may not work all the time!)
  • Illuminated marquee
  • Stereo sound (Speakers above the screen)
  • Joystick - movement is mapped to the up, down, left, and right cursor keys
  • Two input buttons - also mapped to key presses, Left Ctrl and Left Alt by default, but these can be changed if necessary (see the input-handling sketch after this list).
  • Up to two auxiliary buttons - on the front of the cabinet, also mapped to key presses.
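
To give a flavour of how a game might read these controls, here is a minimal input-handling sketch, assuming Pygame on Raspbian (one option among many, not a requirement of the Jam); the key constants simply mirror the default mappings listed above.

    # A minimal input-handling sketch for the cabinet, assuming Pygame on Raspbian.
    # The joystick arrives as the four cursor keys and the two buttons as
    # Left Ctrl / Left Alt, exactly as the I-PAC2 presents them to the operating system.
    import pygame

    pygame.init()
    screen = pygame.display.set_mode((1280, 1024))  # the cabinet's 4:3 LCD
    clock = pygame.time.Clock()

    KEY_NAMES = {
        pygame.K_UP: 'joystick up',
        pygame.K_DOWN: 'joystick down',
        pygame.K_LEFT: 'joystick left',
        pygame.K_RIGHT: 'joystick right',
        pygame.K_LCTRL: 'button one',
        pygame.K_LALT: 'button two',
    }

    running = True
    while running:
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                running = False
            elif event.type == pygame.KEYDOWN and event.key in KEY_NAMES:
                print(KEY_NAMES[event.key])  # swap in your game logic here
        clock.tick(30)  # cap at 30 frames per second, gentle on the Pi

    pygame.quit()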




From top to bottom: Raspberry Pi 2, amplifier, and power source (5V and 12V)




The underside of the control panel and the bottom of the mounted LCD. The cabinet uses standard arcade controls (Happ brand in this case) and an I-PAC2, which maps them onto keyboard presses for convenience.

13 August 2015

Fin; or reflections on thirty months of Digital Research

Thirty months ago I joined the British Library Digital Research Team. In that time we (often with the folks from British Library Labs) have achieved a huge amount, not least putting over one million public domain images on Flickr, developing our internal training provision, and repurposing British Library collections to enrich the education and outlook of computer science and game design students. This week I say goodbye.


The Digital Research Team was created in 2010 with a broad mission that covers everything from enabling computational analysis of large scale digitised collections and creative reuse of openly licenced collections to advocacy of clear data citation and digital skills training. I have often summed up our role by saying that we are here to ensure that the British Library's digital collections are used in ways that go beyond looking at them on a webpage: an approach orientated around openness, data, and creativity that is at the forefront of the British Library's vision.


I came to the team from academia and a background in studying long eighteenth-century satirical prints. My data was small and my perspectives narrow, but my eyes, ears, and mind were open. And they needed to be, for in my first month in the job the British Library celebrated enhanced powers to collect non-print materials published in the UK. In effect this meant that this library of around 170 million things had the power to collect the UK web domain. Since then the library has collected over 2 billion web pages, fundamentally changing our collection profile (see the UK Web Archive blog for more) and making the British Library a place as full of data as of books. Even the beloved manuscript, I soon learnt, was not 'safe' from the bitstream, for also changing our collection profile was the small but growing volume of floppy disks, CD-ROMs, hard-drives, and email archives that are the archives of life in the 'Information Age'. And these personal digital archives are more than just collections of 'proper' born-digital documents typed up on personal computers: they include software, browser-caches, spam, and downloads folders; in fact they include every bit on every disk, captures of whole computing environments that can be booted up to offer an experiential window into a person's interaction with their machine.


I say can, but in most cases they aren't. For as unpublished material these archives, like their paper counterparts, can only be made available to readers once we are sure we have complied with things like the Data Protection Act, a time-consuming process that requires people to examine each and every digital object. This clash of possibilities speaks to two overarching themes of my thirty months with the Digital Research Team. The first is the gap that often appears between well thought out established practice and the demands of large and/or complex digital collections: in the case of born-digital manuscript collections, responsibilities to both readers and depositors compete when faced with hundreds of thousands of files. The second is the important - but often forgotten - role of decisions made by people in the creation, management, and marshalling of large and/or complex digital collections. This role may be self-evident. But data does tend to flatten and depersonalise. And interfaces to data tend to emphasise those qualities in their haste to ensure that experiences are smooth, that tensions recede from view. As someone trained to trace the provenance of evidence and to examine the role of agency and power in humanistic phenomena, I see it as important to put the personal back into our use of data. Why? Well, when you search Explore the British Library and Google Books you don't just search databases of 56 million things and over 30 million books respectively; rather you search accumulations of human labour, expertise, and decision making shaped (and constrained) by local, temporal, and organisational priorities and worldviews. When you browse Wikipedia, Wikimedia Commons, or Wikisource you rely on the product of human labour mediated through community guidelines and practices that - perhaps inevitably - introduce prejudices. When you use any computational process to take data in and push data out, the bit in the middle isn't the work of a machine but the work of people instructing a machine, people - as Mia Ridge, Ramon Amaro, and the Software Sustainability Institute, among others, remind us - with opinions, perspectives, fears, and dreams. And when you seek solace in a standard, you seek solace in something that, as a product of human agency, can never wholly be neutral.


This may all sound a bit negative. But my point is that many of the achievements of the Digital Research Team stem from this sort of thinking, an approach that is deeply critical of techno-evangelist perspectives on the role of digital collections, methods, and approaches in society and culture. We don't assume that digital technology is the solution, but rather that an approach that sees people using digital technology is one solution among many possible solutions. My job over the last thirty months has been to collaborate with amazing people both in and outside the British Library to choose the right solutions. As I move to a new position outside the British Library, I look forward to seeing the fruits of these and future decisions appear on the Digital Scholarship Blog.

James Baker -- Curator, Digital Research -- @j_w_baker

05 August 2015

Crowdsourcing as Interesting Decisions: Update from BL Labs 2015 Competition Winner

Posted by Mahendra Mahey (BL Labs Manager) on behalf of Adam Crymble, a Lecturer in Digital History at the University of Hertfordshire and one of the winners of the 2015 British Library Labs Competition, who describes the current progress of his project, ‘Mechanical Curator Arcade’.

When I was nine years old my friend Robbie and I spent an inordinate amount of time in the local video game arcade, and far more money than either of us would like to admit. We watched enviously as the teenagers hogged the Street Fighter II machine near the entrance. Robbie and I retreated deeper into the arcade, where we found a favourite in The Simpsons Arcade Game.

 

We even beat it once.

Like many children of the 1970s, 80s, and 90s, we had video games as a staple of our formative years. Many of us have developed a superhuman ability to stare at screens for long periods without blinking. We know instinctively that there is something behind this wall, and that some combination of buttons will help us discover it:

Image: a wall in a video game

But how many of us know why a game is fun? I only recently began to ask myself that question, and I came across a quote attributed to renowned video game maker Sid Meier, the creator of the Civilization franchise. Meier noted that 'a game is a series of interesting choices'.

Not everyone agrees with that definition, but it's a surprisingly simple and astute observation. Games lay down a series of rules - they generate the conditions of a virtual universe. We learn the rules, and our challenge is to win the game by making choices that lead us through that world, to victory.

But a game is about more than just choices. A game is about losing. Or at least, the threat of losing. If we make the wrong choice - jump on a prickly enemy, for example - we're punished. We die.

This revelation has been important for me, because for the past few months I've been trying to make crowdsourcing fun. Crowdsourcing is an increasingly common practice amongst historians, whereby a simple but repetitive task - such as transcription or the tagging of a huge set of images - is shared across a large number of volunteers. It adheres to the adage, 'many hands make light work'. Like games, crowdsourcing is inherently about choices. Depending on the task, the volunteer makes a choice. If they're transcribing handwritten documents, they have to decide what word they see on the screen. If they're asked to tag a historic image, they have to decide on the appropriate tag.

In order to make crowdsourcing more fun, some projects have attempted to offer a series of incentives. High scores and leaderboards are popular now in 'gamified' crowdsourcing experiences. But I've yet to come across a crowdsourcing game in which you can REALLY lose. It's all carrot, and no stick, and that's why it's no fun.

Counterintuitive, perhaps, but once you hit the age of 5 and your competitive streak kicks in, it's the threat of losing that makes you want to win. And this is where crowdsourcing faces its biggest challenge if we want users to have a 'fun' experience. Because in order for you to lose, the maker of the game needs to know when you've done something wrong - when you've broken the rules of the virtual universe. That's easy enough for Super Mario, because the game is programmed to check when you've bumped into a bad guy, or fallen down a hole. But in crowdsourcing, we have no idea if you've given us the right answer - if you've tagged the image correctly, or transcribed the word right. If we knew that, we wouldn't have to ask you to do it in the first place. That means we can't punish you consistently. And it means you won't have fun the minute you realise that. Because at that point, your interesting decisions become meaningless and any correct information you provide comes down to your good will rather than your desire to win.

That's where we currently stand in our efforts to make crowdsourcing fun. It's a big challenge, but it's one I believe someone out there can tackle. So in the spirit of crowdsourcing, we're turning to the crowd, and we're hosting a virtual 'Game Jam' from 4-11 September 2015 to engage with amateur video game makers everywhere who think they've got the answer.

To help them get started with an appropriate crowdsourcing task, we've put together a sample set of historic images - around 100 to 200 illustrations each of people, music, architecture, flora, fauna and even cycling - along with several hundred images that we know very little about. We thought this might help to validate the results of the crowdsourced content.

The sample link is: http://bl-labs.github.io/arcadeinterface/sample_images.html
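
As one illustration of how that validation might work in practice, the sketch below scores a player's answers against the subset of images whose categories are already known. The dictionaries are stand-ins; the sample set linked above would first need to be mapped into this shape.

    # A sketch of validating crowdsourced tags against the labelled sample images.
    # The dictionaries below are illustrative stand-ins, not the real sample data.
    known_tags = {
        'img_0001': 'flora',
        'img_0002': 'music',
        'img_0003': 'architecture',
    }

    def score_player(answers, known=known_tags):
        """Return the share of a player's answers that match the known subset.

        `answers` maps image ids to the tag the player chose; images we know
        nothing about are skipped, since we have no way of marking them.
        """
        checkable = {img: tag for img, tag in answers.items() if img in known}
        if not checkable:
            return None  # nothing we can verify
        correct = sum(1 for img, tag in checkable.items() if tag == known[img])
        return correct / len(checkable)

    print(score_player({'img_0001': 'flora', 'img_0002': 'fauna', 'img_9999': 'map'}))
    # 0.5 -- one right, one wrong, one unverifiable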

An ideal game draws a random image from the set and, through gameplay, the player tells us something about the content of the image. Perhaps they choose from our limited set of tags (flora, fauna, mineral, human portrait, landscape, manmade - e.g. machine, buildings, ship - abstract, artistic, music, map), or game makers can opt to be more creative.
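
As a bare sketch of that core loop, stripped of any actual gameplay: pick a random image, offer the limited tag set, record the choice. Everything here is illustrative; a real entry would render the image on screen and take its input from the joystick and two buttons rather than a keyboard prompt.

    # A bare-bones version of the crowdsourcing loop: random image, tag choice, record.
    # File names, the CSV output and the keyboard prompt are all illustrative assumptions.
    import csv
    import random

    TAGS = ['flora', 'fauna', 'mineral', 'human portrait', 'landscape',
            'manmade', 'abstract', 'artistic', 'music', 'map']

    def play_round(image_ids, results_path='tags.csv'):
        image_id = random.choice(image_ids)
        print('Image:', image_id)
        for number, tag in enumerate(TAGS):
            print(' ', number, tag)
        choice = TAGS[int(input('Pick a tag number: '))]
        with open(results_path, 'a', newline='') as handle:
            csv.writer(handle).writerow([image_id, choice])

    play_round(['img_0001', 'img_0002', 'img_0003'])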

If we like what we see, we've set aside up to £500 (courtesy of the Andrew W. Mellon Foundation) to work with someone to polish their game and release it as part of our 'Mechanical Curator Arcade Game', a 1980s-style arcade console that we're planning to install in the British Library this autumn. The Game Jam is open to anyone, but only those over the age of 18 are eligible to work for us.

All completed games (whether they fit the crowdsourcing theme or not) will also be eligible to enter the British Library Labs Awards, with a chance to win an additional £500 in prizes, as long as they use British Library digital content such as the sounds and images from the open collections.

If you're up for the challenge, you can find out more on our Game Jam event page. We're looking forward to working with one of you; get in touch at [email protected] if you'd like to discuss ideas. We're here to listen and learn.

 

24 July 2015

British Library Labs Project Awards (2015): Call for entries!

Posted by Hana Lewis, BL Labs Project Officer @BL_Labs

The British Library Labs Awards (2015) recognise and promote work that uses the British Library's digital collections and data.

The Awards acknowledge exceptional work within three categories: Research, Creativity and Entrepreneurship.

Research

This category is for work produced within the context of a research project or activity. These entries will demonstrate the development of new knowledge related to content, research methods, or research tools.

Creativity

This category is for work that uses the British Library's digital content in the context of artistic or creative endeavours. Such entries will inspire, stimulate, amaze and provoke.

Entrepreneurship

The final category is for work that delivers or develops commercial value. These entries are likely to be in the context of new products, tools, or services that build on, incorporate, or enhance the British Library's digital content to produce commercial value.

Entries can be submitted until Monday 14th September 2015 (midnight BST).

The submission process is simple and further information can be found through the following link:

About the awards and how to apply.

Each proposal will be assessed by an independent panel of experienced researchers, experts and British Library staff.

Shortlisted entrants will be contacted via email by Monday 12 October 2015 and invited to participate in our annual Symposium on Monday 2 November 2015, where the winners will be announced. At the Symposium, each of the three category winners will receive a £500 prize and an opportunity to promote their work!

Some really fantastic work has already been produced using our digital content, so please spread the word and let’s keep those entries rolling in!

03 July 2015

Turning research questions into computational queries: outputs from the 'Enabling Complex Analysis of Large Scale Digital Collections' project

'Enabling Complex Analysis of Large Scale Digital Collections', a project funded by the Jisc Research Data Spring, empowers researchers to turn their research questions into computational queries and gathers social and technical requirements for infrastructures and services that allow computational exploration of big humanities data. Melissa Terras, Professor of Digital Humanities at UCL and Principal Investigator for the project, blogged in May about initial work to align our data - ALTO XML for 60k+ 17th, 18th, and 19th century books - with the performance characteristics of UCL's High Performance Computing Facilities. We have been learning a huge amount about the complexities associated with redeploying architectures designed to work with scientific data (massive yet structured) to the processing of humanities data (less massive but unstructured). As part of this learning, in June we ran two workshops to which we invited a small, hand-picked group of researchers (from doctoral candidates to mid-career scholars) with queries they wanted to ask of the data that couldn't be satisfied by the sort of search and discovery orientated graphical user interfaces typically served up to them.
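
For a sense of what working with this data involves in practice, the sketch below pulls word strings and illustration sizes out of a single ALTO XML page using nothing but the Python standard library. It is a simplified illustration rather than the project's own code: element and attribute names follow the ALTO schema, but namespaces and layout vary between digitisation batches.

    # A simplified sketch of reading one ALTO XML page: collect the word strings
    # and express each illustration's area as a share of the page. Not the
    # project's code; ALTO namespaces vary, so we match on bare element names.
    import xml.etree.ElementTree as ET

    def local_name(tag):
        """Strip any XML namespace so we can match on the bare element name."""
        return tag.rsplit('}', 1)[-1]

    def parse_alto_page(path):
        words, illustration_areas, page_area = [], [], 0.0
        for _, elem in ET.iterparse(path):
            name = local_name(elem.tag)
            if name == 'Page':
                page_area = float(elem.get('WIDTH', 0)) * float(elem.get('HEIGHT', 0))
            elif name == 'String':
                words.append(elem.get('CONTENT', ''))
            elif name == 'Illustration':
                illustration_areas.append(
                    float(elem.get('WIDTH', 0)) * float(elem.get('HEIGHT', 0)))
        shares = [area / page_area for area in illustration_areas if page_area]
        return words, shares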

The researchers were clustered into three groups by their interests, with one group looking for words/strings over time, a second for words/strings in context, and a third for patterns relating to non-textual elements. Each group rotated between three workstations. At one workstation James Hetherington worked with them to realise their questions as queries that returned useful derived data. At a second they collaborated with Martin Zaltz Austwick to explore and experiment with ways in which they could represent the data visually. And at a third workstation David Beavan captured their thoughts on the process (such as, does the time taken to wait for results to return impact on your interpretation of those results?), their sense of how computational queries could enrich their research, and their learning outcomes in terms of next steps.

Chart: librarian books and occurrences

Some very sensible best practices emerged from this work: the need to build multiple datasets (counts of books per year, words per year, pages per book, words per book) to normalise results against in different ways; the necessity of explaining and clearly documenting the decisions taken when processing the data (for example taking the earliest year found in the metadata for a given book as the publication year, even if we know that to be incorrect); and the value of having a fixed, definable chunk of data for researchers to work with and explain their results in relation to (and in turn for us, the risks associated with adding more data to the pot at a later date).
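
A hypothetical sketch of those normalisation tables, assuming the per-book metadata and word counts have already been derived from the ALTO XML (the project's own derived datasets live on its GitHub repositories):

    # Build books-per-year and words-per-year tables to normalise results against,
    # and express raw hits (disease words, say) as occurrences per 1,000 words.
    # `book_metadata` maps book id -> earliest publication year in the metadata;
    # `word_counts` maps book id -> number of words. Both are assumed inputs.
    from collections import Counter

    def totals_per_year(book_metadata, word_counts):
        books_per_year = Counter(book_metadata.values())
        words_per_year = Counter()
        for book_id, year in book_metadata.items():
            words_per_year[year] += word_counts.get(book_id, 0)
        return books_per_year, words_per_year

    def per_thousand_words(hits_per_year, words_per_year):
        return {year: 1000 * hits / words_per_year[year]
                for year, hits in hits_per_year.items() if words_per_year.get(year)}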

Moreover, we have outputs on our GitHub repos that you can work with. We have queries (written in Python) that provide a framework from which you might search for words, phrases, or non-textual elements in this or comparable collections of digital text. We have data from searches across the whole collection on occurrences of disease-related words, on the contexts in which librarians appear, and on the location and relative size within the page of every non-textual element (ergo, in most cases, illustration). And we have visualisations, with associated code and IPython Notebooks, of these results. These include a graph of disease references over time per 1,000 words (an interactive version is available if you download this html and open it in your browser); a point map charting the size over time of circa 1 million figures (as a percentage of the size of the page they appear in); and, moving our macroscope closer, graphs that show the size of images across the length of single books, mapping the illustrative 'heartbeat' of those books, alongside a hacky workflow for getting to that point.

The next step is to package these outputs up as 'recipe books' demonstrating the steps needed to work with large and complex digital collections. We hope that the community - Systems Architects designing services, Research Software Engineers collaborating in humanities research, Humanists dabbling with data and code - can learn from these, build them into their workflows, and push forward our collective ability to make the best of these digital collections.

James Baker -- Curator, Digital Research -- @j_w_baker

27 May 2015

Digital Conversations @ BL: Digital Music Analysis

Last week the BL Digital Research team organised another Digital Conversations event to discuss research projects and trends in digital music analysis. The theme could not have been more timely, as we had just heard the news that the Library has been awarded a £9.5M grant from the Heritage Lottery Fund, as part of the BL’s Save Our Sounds campaign, to digitise and provide access to 500,000 rare, unique and at-risk sound recordings from our Sound Archive and other key audio collections in the UK.

Dr. Tillman Weyde kicked off the event by presenting some interesting findings from the Digital Music Lab, an AHRC funded project aimed at developing new software infrastructure to help musicologists enquire into large collections of audio files, comparing and interpreting results by applying innovative methodological approaches to musicology research. By analysing thousands of sound recordings and metadata from the BL, CHARM and I Like Music datasets, researchers are now able to discover common patterns shared by specific musical genres, compare information on relationships between different musical styles, and visualise changes in tonality, pitch and tempo as applied to a variety of genres as well as within a single piece recorded by various artists at different times and in different locations. One of the outcomes of this project was the development of an open Web interface that shows the general public the various ways in which musical genres can be compared according to specific music parameters.

Aquiles Alencar-Brayner introducing the speakers

Prof. David Rowland and Dr. Simon Brown spoke about the Listening Experience Database project, aimed at creating a database of transcribed personal accounts – mainly from manuscripts and printed sources – describing public responses to music. The LED database is a successful example of the importance of crowdsourcing activities for collecting and generating new data. So far the project has received 10,000 entries from the public, and researchers involved in the project are interested in expanding the community of contributors so as to add more information to the database. If you are interested in contributing to this project on a more regular basis, or in learning more about the contribution process generally, please send an email to [email protected].

Prof. Mark Plumbley spoke about the ESRC funded project “Musical Audio Repurposing using Source Separation”, led by Queen Mary University of London. The aim of the project is to develop new methods for musical audio source separation, focussing on soloing and remixing of content to be generated during the project. Researchers involved in this project will also develop a software infrastructure to identify and extract different sounds from a single recording, such as, for example, the separation of each instrument in an orchestra recording or the extraction of different sounds in environmental and wildlife audio files, which will become available to researchers by the end of the project in 2017.

Our colleague, Dr. Sandra Tuppen, discussed the Big Data History of Music, another AHRC funded project involving the British Library in partnership with Royal Holloway, aimed at bringing together the world’s biggest datasets on published sheet music, music manuscripts and classical concerts (in excess of 5 million records). Through statistical analysis, manipulation and visualisation of this data, the project will develop new methods for researching music history in innovative ways, associating information from various library catalogues to analyse long-term patterns in music trends, music dissemination and popularity, the development of musical taste, performances, and relationships and influences between composers since the 15th century. As Sandra remarked, humans create catalogues and catalogues (as well as humans) change over time, hence the importance for today’s researchers of understanding how early music data has been collected and described over the last seven centuries. The BL catalogue of printed music used for the Big Data History of Music is available for download and re-use under a CC0 licence at the British Library open data page.

The last speaker of the evening, Dr Erinma Ochu, discussed the Hookedonmusic project she has been involved in, which aims to collect information on what makes a tune catchy for the general public. The data used for the project is based on a crowdsourcing activity via a Web-based game interface that presents music extracts to the player, who decides which tunes are most associated with memories of past experiences. So far 175,000 people have played the Hookedonmusic game, helping to build the research database of musical memory. Amongst many interesting and multifaceted results (did you know that the catchiest tune since the 1950s, according to the information provided by the players, is Wannabe by the Spice Girls?), Hookedonmusic is helping researchers to better understand how long-term memory is triggered in Alzheimer’s patients through the connection between life events and the music with which they are associated, so as to support the treatment of individuals suffering from memory loss. Have a go at the game and bring back the good moments you lived through music.

The event, chaired by Prof. Stephen Cottrell, Head of the Music Department at City University London, raised interesting points for debate with the audience. The main message of the evening, at least from my perspective, was that the interdisciplinary work these projects are promoting, by putting together musicologists, computer scientists, engineers, archivists and content curators, is an essential step in demonstrating how important digital scholarship is for today’s researchers – no matter what discipline we work in!

 

Aquiles Alencar-Brayner

Curator, Digital Research

@AquilesBrayner

29 April 2015

The British Library Machine Learning Experiment

The British Library Big Data Experiment is an ongoing collaboration between British Library Digital Research and UCL Department of Computer Science, facilitated by UCL Centre for Digital Humanities, that enables and engages students in computer science with humanities research and digital libraries as part of their core assessed work.

The experiment plays host to undergraduate and postgraduate student projects that provide the Digital Research team with an experimental test-bed for developing, exploring and exploiting technical infrastructure and digital content in ways that may benefit humanities researchers. It enables Computer Science students to develop skills in a new (and often foreign) domain, and encourages critical thinking and the questioning of their assumptions about the role of library and humanities scholars, through real-world, complex projects that stretch and develop both their technical abilities and their understanding of user requirements. Further, having Computer Science students engage with Humanities scholars as a routine part of this work creates deeper mutual understanding of research needs and discipline-specific practices.

The 'big data' in question here is a collection of circa 68k 16th–19th century Public Domain digitised volumes. The data contains both optical character recognition derived text and over 1 million illustrations, of which little is known apart from the size of the images and the book and page on which they appear (for more on the dataset see Ben O'Steen's 'A million first steps').

The latest output from the project - the British Library Machine Learning Experiment - is led by a BSc systems engineering module team (Durrant, Rafdi, Sarraf). Together the team designed a public service built around a range of open source services and software (MongoDB, Heroku, Node.js, Weka). This service indexes a subset of the 1 million image collection using tags generated by two public image recognition APIs (Alchemy and Imagga) and a bespoke algorithm. Confidence values are returned, and features are implemented that allow users not only to search for tags but also to browse by tag and by frequently co-occurring tags. The interface even allows a user to tag a random image themselves to see how quickly image recognition APIs can assign tags to images.
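
To make the 'frequently co-occurring tags' feature concrete, here is a small sketch of how such co-occurrences might be derived from an image-to-tags index; the plain dictionary stands in for the MongoDB collection the live service actually uses, so treat it as an illustration rather than the team's implementation.

    # Derive frequently co-occurring tags from an image-to-tags index.
    # The dictionary below is a stand-in for the service's MongoDB collection.
    from collections import Counter

    def co_occurring_tags(index, tag, top_n=10):
        """Count which other tags most often appear alongside `tag`."""
        counts = Counter()
        for tags in index.values():
            if tag in tags:
                counts.update(t for t in set(tags) if t != tag)
        return counts.most_common(top_n)

    index = {
        'image_001': ['bird', 'animal', 'tree'],
        'image_002': ['church', 'building'],
        'image_003': ['bird', 'sky', 'animal'],
    }
    print(co_occurring_tags(index, 'bird'))  # [('animal', 2), ('tree', 1), ('sky', 1)]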


The British Library Machine Learning Experiment can be found at http://blbigdata.herokuapp.com/. A video demonstration detailing the service functionality is embedded below. It is clear from using the experimental service that machine learning approaches to image recognition remain a maturing field. Nevertheless, as was underscored by a British Library Labs event last year on large scale image analysis (see my notes from the event), significant advances have been made in recent years. Searches of the British Library Machine Learning Experiment for the tags 'animal', 'bird', or 'church' confirm this trend.

Code from the British Library Machine Learning Experiment is available for reuse under an MIT licence. As this project is very much an experiment, we welcome your feedback via this blog, an email, or GitHub.

Rafdi, Muhammad; Sarraf, Ali; Durrant, James; Baker, James (2015). British Library Machine Learning Experiment. Zenodo. doi:10.5281/zenodo.17168

James Baker

Curator, Digital Research

@j_w_baker

---

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exceptions: embeds to and from external sources

25 March 2015

Enabling Complex Analysis of Large Scale Digital Collections

Jisc have announced the projects that have been funded through their Research Data Spring programme. One of those chosen is 'Enabling Complex Analysis of Large Scale Digital Collections', a project led by Melissa Terras (Professor of Digital Humanities, UCL) in collaboration with the British Library Digital Research team.

Research Data Spring aims to find new technical tools, software, and service solutions which will improve researchers’ workflows and the use and management of their data. Following an invitational sandpit event in Birmingham last month aimed at encouraging co-design, 'Enabling Complex Analysis of Large Scale Digital Collections' was chosen from over 40 proposed projects to proceed to a three month development phase.

Our rationale for the project is that lots of money has been spent digitising heritage collections and that - as well as being objects that can be presented online for research and public use and reuse - digitised heritage collections are data. The problem, of course, is that non-computationally trained scholars often don't know what to ask of large quantities of data, commonly do not have access to high performance computing facilities, and find the exemplar workflows that they need hard to come by. As a consequence, support from content providers for this category of work is regularly ad hoc, and substantial investment in it is difficult to justify. 'Enabling Complex Analysis of Large Scale Digital Collections' aims to address this fundamental problem by extending research data management processes in order to enable novel research and a deeper understanding of emerging research needs. In the initial three month pilot period we will index a collection of circa 60,000 public domain digitised books (see 'A Million First Steps') at UCL Research IT Services and work with a small number of researchers to turn their research questions into computational analyses. The outputs from each research scenario - including derived data, queries, documentation, and indicative visualisations - will be made available as citeable, CC-BY workflow packages suitable for teaching, self-learning, and reuse. Moreover these workflows will deepen understanding of complex, poorly structured, and heterogeneous humanities data and the questions researchers could ask of that data, highlighting through use cases the potential for process and service development in the cultural sector. Details of the proposed work for after the initial three month phase are on the Figshare document embedded above.

We are also delighted that two other projects with British Library involvement have been funded through the Research Data Spring. 'Unlocking the UK's thesis data through persistent identifiers' will investigate integrating ORCID personal identifiers and DataCite DOIs into our ever-growing and unique UK thesis collection. 'Methods for Accessing Sensitive Data', otherwise known as AMASED, will adapt and implement DataSHIELD technology in order to (legally) circumvent key copyright, licensing, and privacy obstacles preventing analysis of digital datasets in the humanities and academic publishing. The British Library will supply the same circa 60,000 public domain digitised books to this project to test the extension of DataSHIELD to textual data.

James Baker

Curator, Digital Research

@j_w_baker

---

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exceptions: embeds to and from external sources
