Digital scholarship blog

Enabling innovative research with British Library digital collections

205 posts categorized "Experiments"

11 September 2020

BL Labs Public Awards 2020: enter before NOON GMT Monday 30 November 2020! REMINDER

The sixth BL Labs Public Awards 2020 formally recognises outstanding and innovative work that has been carried out using the British Library’s data and / or digital collections by researchers, artists, entrepreneurs, educators, students and the general public.

The closing date for entering the Public Awards is NOON GMT on Monday 30 November 2020 and you can submit your entry any time up to then.

Please help us spread the word! We want to encourage anyone interested to submit over the next few months. Who knows, you could even win fame and glory: priceless! We really hope to have another year of fantastic projects, inspired by our digital collections and data, to showcase at our annual online awards symposium on 15 December 2020 (which is open for registration too).

This year, BL Labs is commending work in four key areas that have used or been inspired by our digital collections and data:

  • Research - A project or activity that shows the development of new knowledge, research methods, or tools.
  • Artistic - An artistic or creative endeavour that inspires, stimulates, amazes and provokes.
  • Educational - Quality learning experiences created for learners of any age and ability that use the Library's digital content.
  • Community - Work that has been created by an individual or group in a community.

What kind of projects are we looking for this year?

Whilst we are really happy for you to submit your work on any subject that uses our digital collections, in this significant year we are particularly interested in entries that focus on anti-racist work or on projects about the lockdown / global pandemic. We are also keen to receive submissions that have used Jupyter Notebooks to carry out computational work on our digital collections and data.

After the submission deadline has passed, entries will be shortlisted and selected entrants will be notified via email by midnight on Friday 4th December 2020. 

A prize of £150 in British Library online vouchers will be awarded to the winner, and £50 in the same format to the runner up, in each Awards category at the Symposium. Of course, entering is at the very least a chance to showcase your work to a wide audience, and in the past this has often resulted in major collaborations.

The talent of the BL Labs Awards winners and runners up over the last five years has led to the production of a remarkable and varied collection of innovative projects, described in our 'Digital Projects Archive'. In 2019, the Awards commended work in four main categories – Research, Artistic, Community and Educational:

BL Labs Award Winners for 2019
(Top-Left) Full-Text search of Early Music Prints Online (F-TEMPO) - Research, (Top-Right) Emerging Formats: Discovering and Collecting Contemporary British Interactive Fiction - Artistic
(Bottom-Left) John Faucit Saville and the theatres of the East Midlands Circuit - Community commendation
(Bottom-Right) The Other Voice (Learning and Teaching)

For further detailed information, please visit BL Labs Public Awards 2020, or contact us at [email protected] if you have a specific query.

Posted by Mahendra Mahey, Manager of British Library Labs.

04 August 2020

Having a Hoot for International Owl Awareness Day

Who doesn’t love owls? Here at the British Library we certainly do.

Often used as a symbol of knowledge, they are the perfect library bird. A little owl is associated with, and frequently depicted alongside, Athena, the Greek goddess of wisdom. The University of Bath even awarded Professor Yoda, a European eagle owl, a library card in recognition of his valuable service deterring seagulls from nesting on their campus.

The British Library may not have issued a reader pass to an owl (as far as I am aware!), but we do have a wealth of owl sound recordings in our wildlife and environmental sounds collection. You can read about and listen to some of these here.

Little Owl calls recorded by Nigel Tucker in Somerset, England (BL ref 124857)

Owls can also be discovered in our UK Web Archive. Our UK Web Archivists recently examined the Shine dataset to explore which UK owl species is the most popular on the archived .uk domain. Read here to find out which owl is the winner.

They also curate an Online Enthusiast Communities in the UK collection, which features bird watching and some owl related websites in the Animal related hobbies subsection. If you know of websites that you think should be included in this collection, then please fill in their online nomination form.

Here in Digital Scholarship I recently found many fabulous illustrations of owls in our Mechanical Curator Flickr image collection of over a million Public Domain images. So to honour owls on International Owl Awareness Day, I put together an owl album.

These owl illustrations are freely available, without copyright restrictions, for all types of creative projects, including digital collages. My colleague Hannah Nagle blogged about making collages recently and provided this handy guide. For finding more general images of nature for your collages, you may find it useful to browse other Mechanical Curator themed albums, such as Flora & Fauna, as these are rich resources for finding illustrations of trees, plants, animals and birds.

If you creatively use our Mechanical Curator Flickr images, please do share them with us on Twitter, using the hashtag #BLdigital. We always love to see what people have done with them. Plus, if you use any of our owls today, remember to include the #InternationalOwlAwarenessDay hashtag too!

We also urge you to be eagle-eyed (sorry, wrong bird!) and look out for some special animated owls during the 4th August, like this one below, which uses both sounds and images taken from our collections. These have been created by Carlos Rarugal, our arty Assistant Web Archivist, and will be shared from the Wildlife, Web Archive and Digital Scholarship Twitter accounts.


Video created by Carlos Rarugal,  using Tawny Owl hoots recorded by Richard Margoschis in Gloucestershire, England (BL ref 09647) and British Library digitised image from page 79 of "Woodland Wild: a selection of descriptive poetry. From various authors. With ... illustrations on steel and wood, after R. Bonheur, J. Bonheur, C. Jacque, Veyrassat, Yan Dargent, and other artists"

One of the benefits of making digital art is that there is no risk of spilling paint or glue on your furniture! As noted in this tweet from Damyanti Patel: "Thanks for the instructions, my kids were entertained & I had no mess to clean up after their art so a clear win win, they really enjoyed looking through the albums". I honestly did not ask them to do this, but it is really cool that her children included this fantastic owl in the centre of one of their digital collages:

I quite enjoy it when my library life and goth life connect! During the covid-19 lockdown I have attended several online club nights. A few months ago I was delighted to see that one of these, How Did I Get Here? Alternative 80s Night!, regularly uses the British Library Flickr images to create their event flyers, using illustrations of people in strange predicaments to complement the name of their club, like this sad lady sitting inside a bird cage in the flyer below.

Their next online event is Saturday 22nd August and you can tune in here. If you are a night owl, you could even make some digital collages, while listening to some great tunes. Sounds like a great night in to me!

Illustration of a woman sitting in a bird cage with a book on the floor just outside the cage
Flyer image for How Did I Get Here? Alternative 80s Night!

This post is by Digital Curator Stella Wisdom (@miss_wisdom).

22 July 2020

World of Wikimedia

During recent months of working from home, the Wikimedia family of platforms, including Wikidata and Wikisource, have enabled many librarians and archivists to do meaningful work, to enhance and amplify access to the collections that they curate.

I’ve been very encouraged to learn from other institutions and initiatives who have been working with these platforms. So I recently invited some wonderful speakers to give a “World of Wikimedia” series of remote guest lectures for staff, to inspire my colleagues in the British Library.

Circle of logos from the Wikimedia family of platforms
Logos of the Wikimedia Family of platforms

Stuart Prior from Wikimedia UK kicked off this season with an introduction to Wikimedia and the projects within it, and how it works with galleries, libraries, archives and museums. He was followed by Dr Martin Poulter, who had been the Bodleian Library’s Wikimedian In Residence. Martin shared his knowledge of how books, authors and topics are represented in Wikidata, how Wikidata is used to drive other sites, including Wikipedia, and how Wikipedia combines data and narrative to tell the world about notable books and authors.

Continuing with the theme of books, Gavin Willshaw spoke about the benefits of using Wikisource for optical character recognition (OCR) correction and staff engagement. He gave an overview of the National Library of Scotland’s fantastic project to upload 3,000 digitised Scottish Chapbooks to Wikisource during the Covid-19 lockdown, focusing on how the project came about, its impact, and how the Library plans to take activity in this area forward in the future.

Illustration of two 18th century men fighting with swords
Tippet is the dandy---o. The toper's advice. Picking lilies. The dying swan, shelfmark L.C.2835(14), from the National Library of Scotland's Scottish Chapbooks collection

Closing the World of Wikimedia season, Adele Vrana and Anasuya Sengupta gave an extremely thought provoking talk about Whose Knowledge? This is a global multilingual campaign, which they co-founded, to centre the knowledges of marginalised communities (the majority of the world) online. Their work includes the annual #VisibleWikiWomen campaign to make women more visible on Wikipedia, which I blogged about recently.

One of the silver linings of the covid-19 lockdown has been that I’ve been able to attend a number of virtual events that I would not have been able to travel to if they had been physical events. These have included the LD4 Wikidata Affinity Group online meetings, a biweekly Zoom call on Tuesdays at 9am PDT (5pm BST).

I’ve also remotely attended some excellent online training sessions: “Teaching with Wikipedia: a practical 'how to' workshop”, run by Ewan McAndrew, Wikimedian in Residence at The University of Edinburgh, and “Wikimedia and Libraries - Running Online Workshops”, organised by the Chartered Institute of Library and Information Professionals in Scotland (CILIPS) and presented by Dr Sara Thomas, Scotland Programme Coordinator for Wikimedia UK, and previously the Wikimedian in Residence at the Scottish Library and Information Council. From attending the latter, I learned of an online “How to Add Suffragettes & Women Activists to Wikipedia” half day edit-a-thon event taking place on the 4th July, organised by Sara, Dr t s Beall and Clare Thompson from the Protests and Suffragettes project. This is a wonderful project, which recovers and celebrates the histories of women activists in Govan, Glasgow.

We have previously held a number of in person Wikipedia edit-a-thon events at the British Library, but this was the first time that I had attended one remotely, via Zoom, so this was a new experience for me. I was very impressed with how it had been organised: using break out rooms for newbies and more experienced editors, building multiple short comfort breaks into the schedule, and having very do-able bite size tasks, which were achievable in the time available. They used a comprehensive, but easy to understand, shared spreadsheet for managing the tasks that attendees were working on. This is definitely an approach and a template that I plan to adopt and adapt for any future edit-a-thons I am involved in planning.

Furthermore, it was a very fun and friendly event. The organisers had created We Can [edit]! Zoom background template images for attendees to use, and I learned how to use twinkles on videocalls! This is when attendees raise both hands and wiggle their fingers pointing upwards, to indicate agreement with what is being said, without causing a soundclash. This hand signal has been borrowed from the American Sign Language word for applause; it is also used by the Green Party and the Occupy Movement.

With enthusiasm fired up from my recent edit-a-thon attending experience, last Saturday I joined the online Wikimedia UK 2020 AGM. Lucy Crompton-Reid, Chief Executive of Wikimedia UK, gave updates on changes in the global Wikimedia movement, such as implementing the 2030 strategy, rebranding Wikimedia, the Universal Code of Conduct and plans for Wikipedia’s 20th birthday. Lucy also announced that the three trustees who stood for the board, Kelly Foster, Nick Poole and Doug Taylor, were all elected. Nick and Doug have both been on the board since July 2015 and were re-elected. I was delighted to learn that Kelly is a new trustee joining the board for the first time, as she has previously been a trainer at BL Wikipedia edit-a-thon events, and she coached me to create my first Wikipedia article on Coventry godcakes at a Wiki-Food and (mostly) Women edit-a-thon in 2017.

In addition to these updates, Gavin Willshaw gave a keynote presentation about the NLS Scottish chapbooks Wikisource project that I mentioned earlier, and there were three lightning talks: Andy Mabbett, 'Wiki Hates Newbies'; Clare Thompson, Lesley Mitchell and Dr t s Beall, 'Protests and Suffragettes: Highlighting 100 years of women’s activism in Govan, Glasgow, Scotland'; and Jason Evans, 'An update from Wales'.

Before the event ended, there was a 2020 Wikimedia UK annual awards announcement, where libraries and librarians did very well indeed:

  • UK Wikimedian of the Year was awarded to librarian Caroline Ball for education work and advocacy at the University of Derby (do admire her amazing Wikipedia dress in the embedded tweet below!)
  • Honourable Mention to Ian Watt for outreach work, training, and efforts around Scotland's COVID-19 data
  • Partnership of the Year was given to National Library of Scotland for the WikiSource chapbooks project led by Gavin Willshaw
  • Honourable Mention to University of Edinburgh for work in education and Wikidata
  • Up and Coming Wikimedian was a joint win to Emma Carroll for work on the Scottish Witch data project and Laura Wood Rose for work at University of Edinburgh and on the Women in Red initiative
  • Michael Maggs was given an Honorary Membership, in recognition of his very significant contribution to the charity over a number of years.

Big congratulations to all the winners. Their fantastic work, and also in Caroline's case, her fashion sense, is inspirational!

For anyone interested, the next online event that I’m planning to attend is a #WCCWiki Colloquium organised by The Women’s Classical Committee, which aims to increase the representation of women classicists on Wikipedia. Maybe I’ll virtually see you there…

This post is by Digital Curator Stella Wisdom (@miss_wisdom).

10 June 2020

International Conference on Interactive Digital Storytelling 2020: Call for Papers, Posters and Interactive Creative Works

It has been heartening to see many joyful responses to our recent post featuring The British Library Simulator; an explorable, miniature, virtual version of the British Library’s building in St Pancras.

If you would like to learn more about our Emerging Formats research, which is informing our work in collecting examples of complex digital publications, including works made with Bitsy, then my colleague Giulia Carla Rossi (who built the Bitsy Library) is giving a Leeds Libraries Tech Talk on Digital Literature and Interactive Storytelling this Thursday, 11th June at 12 noon, via Zoom.

Giulia will be joined by Leeds Libraries Central Collections Manager, Rhian Isaac, who will showcase some of Leeds Libraries' exciting collections, and also Izzy Bartley, Digital Learning Officer from Leeds Museums and Galleries, who will talk about her role in making collections interactive and accessible. Places are free, but please book here.

If you are a researcher, or writer/artist/maker, of experimental interactive digital stories, then you may want to check out the current call for submissions for The International Conference on Interactive Digital Storytelling (ICIDS), organised by the Association for Research in Digital Interactive Narratives, a community of academics and practitioners concerned with the advancement of all forms of interactive narrative. The deadline for proposing Research Papers, Exhibition Submissions, Posters and Demos has been extended to the 26th June 2020. Submissions can be made via the ICIDS 2020 EasyChair Site.

The ICIDS 2020 dates, 3-6 November, on a photograph of Bournemouth beach

ICIDS showcases and shares research and practice in game narrative and interactive storytelling, including the theoretical, technological, and applied design practices. It is an interdisciplinary gathering that combines computational narratology, narrative systems, storytelling technology, humanities-inspired theoretical inquiry, empirical research and artistic expression.

For 2020, the special theme is Interactive Digital Narrative Scholarship, and ICIDS will be hosted by the Department of Creative Technology of Bournemouth University (also hosts of the New Media Writing Prize, which I have blogged about previously). Their current intention is to host a mixed virtual and physical conference. They are hoping that the physical meeting will still take place, but all talks and works will also be made available virtually for those who are unable to attend physically due to the COVID-19 situation. This means that if you submit work, you will still need to register and present your ideas, but for those who are unable to travel to Bournemouth, the conference organisers will be making allowances for participants to contribute virtually.

ICIDS also includes a creative exhibition, showcasing interactive digital artworks, which for 2020 will explore the curatorial theme “Texts of Discomfort”. The exhibition call is currently seeking interactive digital artworks that generate discomfort through their form and/or their content, which may also inspire radical changes in the way we perceive the world.

Creatives are encouraged to mix technologies, narratives, points of view, to create interactive digital artworks that unsettle interactors’ assumptions by tackling the world’s global issues; and/or to create artworks that bring to a crisis interactors’ relation with language, that innovate in their way to intertwine narrative and technology. Artworks can include, but are not limited to:

  • Augmented, mixed and virtual reality works
  • Computer games
  • Interactive installations
  • Mobile and location-based works
  • Screen-based computational works
  • Web-based works
  • Webdocs and interactive films
  • Transmedia works

Submissions to the ICIDS art exhibition should be made using this form by 26th June. Any questions should be sent to [email protected]. Good luck!

This post is by Digital Curator Stella Wisdom (@miss_wisdom).

21 May 2020

The British Library Simulator

The British Library Simulator is a mini game built using the Bitsy game engine, where you can wander around a pixelated (and much smaller) version of the British Library building in St Pancras. Bitsy is known for its compact format and limited colour-palette - you can often recognise your avatar and the items you can interact with by the fact they use a different colour from the background.

The British Library building depicted in Bitsy
The British Library Simulator Bitsy game

Use the arrow keys on your keyboard (or the WASD buttons) to move around the rooms and interact with other characters and objects you meet on the way - you might discover something new about the building and the digital projects the Library is working on!

Bitsy works best in the Chrome browser and if you’re playing on your smartphone, use a sliding movement to move your avatar and tap on the text box to progress with the dialogues.

Most importantly: have fun!

The British Library, together with the other five UK Legal Deposit Libraries, has been collecting examples of complex digital publications, including works made with Bitsy, as part of the Emerging Formats Project. This collection area is continuously expanding, as we include new examples of digital media and interactive storytelling. The formats and tools used to create these publications are varied, and allow for innovative and often immersive solutions that could only be delivered via a digital medium. You can read more about freely-available tools to write interactive fiction here.

This post is by Giulia Carla Rossi, Curator of Digital Publications (@giugimonogatari).

20 May 2020

Bringing Metadata & Full-text Together

This is a guest post by enthusiastic data and metadata nerd Andy Jackson (@anjacks0n), Technical Lead for the UK Web Archive.

In Searching eTheses for the openVirus project we put together a basic system for searching theses. This only used the information from the PDFs themselves, which meant the results looked like this:

openVirus EThOS search results screen

The basics are working fine, but the document titles are largely meaningless, the last-modified dates are clearly suspect (26 theses in the year 1600?!), and the facets aren’t terribly useful.

The EThOS metadata has much richer information that the EThOS team has collected and verified over the years. This includes:

  • Title
  • Author
  • DOI, ISNI, ORCID
  • Institution
  • Date
  • Supervisor(s)
  • Funder(s)
  • Dewey Decimal Classification
  • EThOS Service URL
  • Repository (‘Landing Page’) URL

So, the question is, how do we integrate these two sets of data into a single system?

Linking on URLs

The EThOS team supplied the PDF download URLs for each record, but we need a common identifier to merge these two datasets. Fortunately, both datasets contain the EThOS Service URL, which looks like this:

https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755301

This (or just the uk.bl.ethos.755301 part) can be used as the ‘key’ for the merge, leaving us with one data set that contains the download URLs alongside all the other fields. We can then process the text from each PDF, and look up the URL in this metadata dataset, and merge the two together in the same way.
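As a minimal sketch of that merge (an illustration, not the code actually used for this work), the EThOS identifier can be pulled out of each Service URL and used as a dictionary key. The field names `ethos_url`, `title` and `pdf_url`, and the example.ac.uk download URL, are assumptions made for the example:

```python
import re

# Matches the identifier part of an EThOS Service URL, e.g. uk.bl.ethos.755301
UIN_PATTERN = re.compile(r"uin=(uk\.bl\.ethos\.\d+)")


def ethos_key(service_url):
    """Return the EThOS identifier from a Service URL, or None if absent."""
    match = UIN_PATTERN.search(service_url)
    return match.group(1) if match else None


def merge_on_key(metadata_rows, download_rows):
    """Attach the PDF download URL to each metadata row sharing the same key."""
    downloads = {ethos_key(row["ethos_url"]): row["pdf_url"] for row in download_rows}
    merged = []
    for row in metadata_rows:
        key = ethos_key(row["ethos_url"])
        merged.append({**row, "key": key, "pdf_url": downloads.get(key)})
    return merged


# Hypothetical rows: only the Service URL comes from the post.
metadata_rows = [
    {"ethos_url": "https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755301",
     "title": "Example thesis title"},
]
download_rows = [
    {"ethos_url": "https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755301",
     "pdf_url": "http://repository.example.ac.uk/thesis.pdf"},
]

merged = merge_on_key(metadata_rows, download_rows)
print(merged[0]["key"], merged[0]["pdf_url"])
```

Rows without a matching download URL keep `pdf_url` as `None`, which makes any gaps between the two datasets easy to count.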

Except… it doesn’t work.

The web is a messy place: those PDF URLs may have been direct downloads in the past, but now many of them are no longer simple links, but chains of redirects. As an example, this original download URL:

http://repository.royalholloway.ac.uk/items/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf

Now redirects (HTTP 301 Moved Permanently) to the HTTPS version:

https://repository.royalholloway.ac.uk/items/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf

Which then redirects (HTTP 302 Found) to the actual PDF file:

https://repository.royalholloway.ac.uk/file/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf

So, to bring this all together, we have to trace these links between the EThOS records and the actual PDF documents.

Re-tracing Our Steps

While the crawler we built to download these PDFs worked well enough, it isn’t quite as sophisticated as our main crawler, which is based on Heritrix 3. In particular, Heritrix offers detailed crawl logs that can be used to trace crawler activity. This functionality would be fairly easy to add to Scrapy, but that’s not been done yet. So, another approach is needed.

To trace the crawl, we need to be able to look up URLs and then analyse what happened. In particular, for every starting URL (a.k.a. seed) we want to check if it was a redirect and if so, follow that URL to see where it leads.

We already use content (CDX) indexes to allow us to look up URLs when accessing content. In particular, we use OutbackCDX as the index, and then the pywb playback system to retrieve and access the records and see what happened. So one option is to spin up a separate playback system and query that to work out where the links go.

However, as we only want to trace redirects, we can do something a little simpler. We can use the OutbackCDX service to look up what we got for each URL, and use the same warcio library that pywb uses to read the WARC record and find any redirects. The same process can then be repeated with the resulting URL, until all the chains of redirects have been followed.

This leaves us with a large list, linking every URL we crawled back to the original PDF URL. This can then be used to link each item to the corresponding EThOS record.
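The redirect-following step can be sketched in a few lines of Python. This is an illustration under simplifying assumptions rather than the production code: a plain dictionary called `redirects` stands in for the OutbackCDX look-ups and WARC record reads, and the function names are hypothetical. The URLs are the Royal Holloway example from earlier in this post.

```python
def resolve_chain(start_url, redirects, max_hops=10):
    """Follow redirects from start_url, returning every URL visited in order."""
    chain = [start_url]
    while chain[-1] in redirects and len(chain) <= max_hops:
        next_url = redirects[chain[-1]]
        if next_url in chain:  # guard against redirect loops
            break
        chain.append(next_url)
    return chain


def build_lookup(seeds, redirects):
    """Map every URL reached during the crawl back to the seed it came from."""
    lookup = {}
    for seed in seeds:
        for url in resolve_chain(seed, redirects):
            lookup[url] = seed
    return lookup


BASE = "repository.royalholloway.ac.uk"
ITEM = "bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf"

seed = f"http://{BASE}/items/{ITEM}"
# Stand-in for what the index and WARC records would tell us (source -> target).
redirects = {
    seed: f"https://{BASE}/items/{ITEM}",                     # HTTP 301
    f"https://{BASE}/items/{ITEM}": f"https://{BASE}/file/{ITEM}",  # HTTP 302
}

lookup = build_lookup([seed], redirects)
print(lookup[f"https://{BASE}/file/{ITEM}"])
```

The `max_hops` limit and the loop check are defensive choices for the sketch; a real crawl can contain redirect loops, so some bound of this kind seems prudent.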

This large look-up table allowed the full-text and metadata to be combined. It was then imported into a new Solr index that replaced the original service, augmenting the records with the new metadata.

Updating the Interface

The new fields are accessible via the same API as before – see this simple search as an example.

The next step was to update the UI to take advantage of these fields. This was relatively simple, as it mostly involved exchanging one field name for another (e.g. from last_modified_year to year_i), and adding a few links to take advantage of the fact we now have access to the URLs to the EThOS records and the landing pages.
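For readers who want to query the new fields directly, here is a rough sketch of building a faceted query URL. It assumes the API is a standard Solr select handler, which is a guess based on the Solr index described above. The host name and path are hypothetical placeholders, and `year_i` is the only field name taken from this post.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the real search API address.
SOLR_SELECT_URL = "https://search.example.org/solr/ethos/select"


def build_facet_query(terms, facet_field="year_i", rows=10):
    """Build a Solr select URL that searches for terms and facets on one field."""
    params = [
        ("q", terms),                  # the search terms
        ("rows", rows),                # number of results to return
        ("facet", "true"),             # switch faceting on
        ("facet.field", facet_field),  # count results per value of this field
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


query_url = build_facet_query("protective equipment")
print(query_url)
```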

The result can be seen at:

EThOS Faceted Search Prototype

The Results

This new service provides a much better interface to the collection, and really demonstrates the benefits of combining machine-generated and manually curated metadata.

New openVirus EThOS search results interface
New improved openVirus EThOS search results interface

There are still some issues with the source data that need to be resolved at some point. In particular, there are now only 88,082 records, which indicates that some gaps and mismatches emerged during the process of merging these records together.

But it’s good enough for now.

The next question is: how do we integrate this into the openVirus workflow? 

 

14 May 2020

Searching eTheses for the openVirus project

This is a guest post by Andy Jackson (@anjacks0n), Technical Lead for the UK Web Archive and enthusiastic data-miner.

Introduction

The COVID-19 outbreak is an unprecedented global crisis that has prompted an unprecedented global response. I’ve been particularly interested in how academic scholars and publishers have responded:

It’s impressive how much has been done in such a short time! But I also saw one comment that really stuck with me:

“Our digital libraries and archives may hold crucial clues and content about how to help with the #covid19 outbreak: particularly this is the case with scientific literature. Now is the time for institutional bravery around access!”
– @melissaterras

Clearly, academic scholars and publishers are already collaborating. What could digital libraries and archives do to help?

Scale, Audience & Scope

Almost all the efforts I’ve seen so far are focused on helping scientists working on the COVID-19 response to find information from publications that are directly related to coronavirus epidemics. The outbreak is much bigger than this. In terms of scope, it’s not just about understanding the coronavirus itself. The outbreak raises many broader questions, like:

  • What types of personal protective equipment are appropriate for different medical procedures?
  • How effective are the different kinds of masks when it comes to protecting others?
  • What coping strategies have proven useful for people in isolation?

(These are just the examples I’ve personally seen requests for. There will be more.)

Similarly, the audience is much wider than the scientists working directly on the COVID-19 response. It ranges from medical professionals wanting to know more about protective equipment, to journalists looking for context and counter-arguments.

As a technologist working at the British Library, I felt like there must be some way I could help this situation. Some way to help a wider audience dig out any potentially relevant material we might hold?

The openVirus Project

While looking out for inspiration, I found Peter Murray-Rust’s openVirus project. Peter is a vocal supporter of open source and open data, and had launched an ambitious attempt to aggregate information relating to viruses and epidemics from scholarly publications.

In contrast to the other efforts I’d seen, Peter wanted to focus on novel data-mining methods, and on pulling in less well-known sources of information. This dual focus on text analysis and on opening up underutilised resources appealed to me. And I already had a particular resource in mind…

EThOS

Of course, the British Library has a very wide range of holdings, but as an ex-academic scientist I’ve always had a soft spot for EThOS, which provides electronic access to UK theses.

Through the web interface, users can search the metadata and abstracts of over half a million theses. Furthermore, to support data mining and analysis, the EThOS metadata has been published as a dataset. This dataset includes links to institutional repository pages for many of the theses.

Although doctoral theses are not generally considered to be as important as journal articles, they are a rich and underused source of information, capable of carrying much more context and commentary than a brief article[1].

The Idea

Having identified EThOS as a source of information, the idea was to see if I could use our existing UK Web Archive tools to collect and index the full-text of these theses, build a simple faceted search interface, and perform some basic data-mining operations. If that worked, it would allow relevant theses to be discovered and passed to the openVirus tools for more sophisticated analysis.

Preparing the data sources

The links in the EThOS dataset point to the HTML landing-page for each thesis, rather than to the full text itself. To get to the text, the best approach would be to write a crawler to find the PDFs. However, it would take a while to create something that could cope with the variety of ways the landing pages tend to be formatted. For machines, it’s not always easy to find the link to the actual thesis!

However, many of the universities involved have given the EThOS team permission to download a copy of their theses for safe-keeping. The URLs of the full-text files are only used once (to collect each thesis shortly after publication), but have nevertheless been kept in the EThOS system since then. These URLs are considered transient (i.e. likely to ‘rot’ over time) and come with no guarantees of longer-term availability (unlike the landing pages), so are not included in the main EThOS dataset. Nevertheless, the EThOS team were able to give me the list of PDF URLs, making it easier to get started quickly.

This is far from ideal: we will miss theses that have been moved to new URLs, and from universities that do not take part (which, notably, includes Oxford and Cambridge). This skew would be avoided if we were to use the landing-page URLs provided for all UK digital theses to crawl the PDFs. But we need to move quickly.

So, while keeping these caveats in mind, the first task was to crawl the URLs and see if the PDFs were still there…

Collecting the PDFs

A simple Scrapy crawler was created, one that could read the PDF URLs and download them without overloading the host repositories. The crawler itself does nothing with them, but by running behind warcprox the web requests and responses (including the PDFs) can be captured in the standardised Web ARChive (WARC) format.

For 35 hours, the crawler attempted to download the 130,330 PDF URLs. Quite a lot of URLs had already changed, but 111,793 documents were successfully downloaded. Of these, 104,746 were PDFs.
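To put those figures in proportion, the short calculation below derives the success rates from the three counts and the 35-hour duration quoted above; nothing else is assumed.

```python
urls_attempted = 130_330
documents_downloaded = 111_793
pdfs = 104_746
crawl_hours = 35

# Share of the original URLs that still returned a document.
download_rate = documents_downloaded / urls_attempted
# Share of the downloaded documents that were PDFs.
pdf_share = pdfs / documents_downloaded
# Share of the original URLs that yielded a PDF.
overall_pdf_rate = pdfs / urls_attempted
# Average request rate across all hosts combined.
requests_per_second = urls_attempted / (crawl_hours * 3600)

print(f"Downloaded: {download_rate:.1%}")
print(f"PDFs among downloads: {pdf_share:.1%}")
print(f"PDFs among all URLs: {overall_pdf_rate:.1%}")
print(f"Average rate: {requests_per_second:.2f} requests per second")
```

In other words, roughly 86% of the URLs still returned a document and about 80% returned a PDF, at an average of around one request per second across all hosts.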

All the requests and responses generated by the crawler were captured in 1,433 WARCs each around 1GB in size, totalling around 1.5TB of data.

Processing the WARCs

We already have tools for handling WARCs, so the task was to re-use them and see what we get. As this collection is mostly PDFs, Apache Tika and PDFBox do most of the work, but the webarchive-discovery wrapper helps run them at scale and adds additional metadata.

The WARCs were transferred to our internal Hadoop cluster, and in just over an hour the text and associated metadata were available as about 5GB of compressed JSON Lines.

A Legal Aside

Before proceeding, there’s a legal problem that we need to address. Despite being freely available over the open web, the rights and licenses under which these documents are being made available can be extremely varied and complex.

There’s no problem gathering the content and using it for data mining. The problem is that there are limitations on what we can redistribute without permission: we can’t redistribute the original PDFs, or any close approximation.

However, collections of facts about the PDFs are fine.

But for the other openVirus tools to do their work, we need to be able to find out what each thesis is about. So how can we make this work?

One answer is to generate statistical summaries of the contents of the documents. For example, we can break the text of each document up into individual words, and count how often each word occurs. These word frequencies are no substitute for the real text, but are redistributable and suitable for answering simple queries.
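A minimal sketch of that counting step is shown below. The tokenisation rule (lower-casing and splitting on anything that is not a letter, digit or apostrophe) is an assumption made for illustration; the actual job may well have split words differently.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Lower-case the text, split it into words, and count each word."""
    # Assumed tokenisation: runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


sample = "Face masks reduce transmission. Masks are cheap."
for word, count in word_frequencies(sample).most_common(3):
    print(word, count)
```

The resulting counts say how often each word appears, but not where, which is what makes them a summary rather than a copy of the text.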

These simple queries can be used to narrow down the overall dataset, picking out a relevant subset. Once the list of documents of interest is down to a manageable size, an individual researcher can download the original documents themselves, from the original hosts[2]. As the researcher now has local copies, they can run their own tools over them, including the openVirus tools.

Word Frequencies

A second, simpler Hadoop job was created, post-processing the raw text and replacing it with the word frequency data. This produced 6GB of uncompressed JSON Lines data, which could then be loaded into an instance of the Apache Solr search tool [3].
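The post-processing step can be pictured as a line-by-line transformation of the JSON Lines output. The sketch below is plain Python rather than a Hadoop job, and the field names text and word_counts are hypothetical, since the actual record layout is not described here.

```python
import json
import re
from collections import Counter


def strip_text(record):
    """Return a copy of the record with the raw text replaced by word counts."""
    # "text" and "word_counts" are hypothetical field names.
    out = {key: value for key, value in record.items() if key != "text"}
    words = re.findall(r"[a-z0-9']+", record.get("text", "").lower())
    out["word_counts"] = dict(Counter(words))
    return out


def convert_lines(lines):
    """Convert an iterable of JSON Lines strings, one record per line."""
    for line in lines:
        if line.strip():  # skip blank lines
            yield json.dumps(strip_text(json.loads(line)), sort_keys=True)


example = ['{"url": "https://example.org/thesis.pdf", "text": "virus spreads virus"}']
for converted in convert_lines(example):
    print(converted)
```

The other metadata fields pass through untouched, so each output record still identifies its source document while no longer carrying the text itself.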

While Solr provides a user interface, it’s not really suitable for general users, nor is it entirely safe to expose to the World Wide Web. To mitigate this, the index was built on a virtual server well away from any production systems, and wrapped with a web server configured in a way that should prevent problems.

The API this provides (see the Solr documentation for details) enables us to find which theses include which terms. Here are some example queries:
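As a rough illustration of the kind of request involved, the sketch below builds a query URL using Solr's standard select parameters (q, fl, rows and wt). The base URL, the core name theses and the field names words, id and title are placeholders, not those of the real service.

```python
from urllib.parse import urlencode

# Placeholder endpoint: not the address of the real service.
SOLR_SELECT_URL = "https://solr.example.org/solr/theses/select"


def build_query_url(terms, rows=10):
    """Build a Solr select URL matching documents that contain all the terms."""
    # "words" is a hypothetical field holding the indexed terms.
    query = " AND ".join(f"words:{term}" for term in terms)
    params = {"q": query, "fl": "id,title", "rows": rows, "wt": "json"}
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(build_query_url(["face", "mask"]))
```

Fetching a URL of this shape from a Solr instance would return a JSON response listing the matching documents.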

This is fine for programmatic access, but with a little extra wrapping we can make it more useful to more people.

APIs & Notebooks

For example, I was able to create live API documentation and a simple user interface using Google’s Colaboratory:

Using the openVirus EThOS API

Google Colaboratory is a proprietary platform, but those notebooks can be exported as more standard Jupyter Notebooks. See here for an example.

Faceted Search

Having carefully exposed the API to the open web, I was also able to take an existing browser-based faceted search interface and modify it to suit our use case:

EThOS Faceted Search Prototype

Best of all, this is running on the Glitch collaborative coding platform, so you can go look at the source code and remix it yourself, if you like:

EThOS Faceted Search Prototype – Glitch project

Limitations

The main limitation of using word-frequencies instead of full-text is that phrase search is broken. Searching for face AND mask will work as expected, but searching for “face mask” doesn’t.
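A small sketch shows why: once a document is reduced to word counts, word order is gone, so texts containing the same words in a different order become indistinguishable. The sample sentences and helper names below are invented for illustration.

```python
import re
from collections import Counter


def to_counts(text):
    """Reduce a text to word counts, discarding word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def matches_all(counts, terms):
    """AND query: true if every term occurs at least once."""
    return all(counts[term] > 0 for term in terms)


doc_a = to_counts("She wore a face mask in the clinic")
doc_b = to_counts("The mask hid the look on her face")

# Both documents satisfy the AND query, although only the first
# actually contains the phrase "face mask".
print(matches_all(doc_a, ["face", "mask"]))
print(matches_all(doc_b, ["face", "mask"]))

# With order discarded, these two texts have identical counts.
print(to_counts("face mask") == to_counts("mask face"))
```

All three lines print True, so nothing in the counts can tell a phrase match from a coincidental pair of words.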

Another problem is that the EThOS metadata has not been integrated with the raw text search. This would give us a much richer experience, like accurate publication years and more helpful facets[4].

In terms of user interface, the faceted search UI above is very basic, but for the openVirus project the API is likely to be of more use in the short term.

Next Steps

To make the search more usable, the next logical step is to attempt to integrate the full-text search with the EThOS metadata.

Then, if the results look good, we can start to work out how to feed the results into the workflow of the openVirus tool suite.

 


1. Even things like negative results, which are informative but can be difficult to publish in article form. ↩︎

2. This is similar to the data-sharing pattern used by Twitter researchers. See, for example, the DocNow Catalogue. ↩︎

3. We use Apache Solr a lot so this was the simplest choice for us. ↩︎

4. Note that since writing this post, this limitation has been rectified. ↩︎

 

24 April 2020

BL Labs Learning & Teaching Award Winners - 2019 - The Other Voice - RCA

Innovations in sound and art

Dr Matt Lewis, Tutor of Digital Direction, and Dr Eleanor Dare, Reader of Digital Media, both at the School of Communication at the Royal College of Art, and Mary Stewart, Curator of Oral History and Deputy Director of National Life Stories at the British Library, reflect on an ongoing and award-winning collaboration (posted on their behalf by Mahendra Mahey, BL Labs Manager).

In spring 2019, based in both the British Library and the Royal College of Art School of Communication, seven students from the MA Digital Direction course participated in an elective module entitled The Other Voice. After listening in-depth to a selection of oral history interviews, the students learnt how to edit and creatively interpret oral histories, gaining insight into the complex and nuanced ethical and practical implications of working with other people’s life stories. The culmination of this collaboration was a two-day student-curated showcase at the British Library, where the students displayed their own creative and very personal responses to the oral history testimonies.

The module was led by Eleanor Dare (Head of Programme for MA Digital Direction, RCA), Matt Lewis (Sound Artist and Musician and RCA Tutor) and Mary Stewart (British Library Oral History Curator). We were really pleased that over 100 British Library staff took the time to come to the showcase, engage with the artwork and discuss their responses with the students.

Eleanor reflects:

“The students have benefited enormously from this collaboration, gaining a deeper understanding of the ethics of editing, the particular power of oral history and of course, the feedback and stimulation of having a show in the British Library.”

We were all absolutely delighted that the Other Voice group were the winners of the BL Labs Teaching and Learning Award 2019, presented in November 2019 at a ceremony at the British Library Knowledge Centre.  Two students, Karthika Sakthivel and Giulia Brancati, also showcased their work at the 2019 annual Oral History Society Regional Network Event at the British Library - and contributed to a wide ranging discussion reflecting on their practice and the power of oral history with a group of 35 oral historians from all over the UK.  The collaboration has continued as Mary and Matt ran ‘The Other Voice’ elective in spring 2020, where the students adapted to the Covid-19 Pandemic, producing work under lockdown, from different locations around the world. 

Here is just a taster of the amazing works the students created in 2019, which made them worthy winners of the BL Labs Teaching and Learning Award 2019.

Karthika Sakthivel and Giulia Brancati were both inspired by the testimony of Irene Elliot, who was interviewed by Dvora Liberman in 2014 for an innovative project on Crown Court Clerks. They were both moved by Irene’s rich description of her mother’s hard work bringing up five children in 1950s Preston.

On the way back by Giulia Brancati

Giulia created On the way back, an installation featuring two audio points: one with excerpts of Irene’s testimony and another with an audio collage inspired by Irene’s description. Two old-fashioned telephones played the audio, which the listener absorbed while curled up in an armchair in a fictional front room. It was a wonderfully immersive experience.

Irene Elliot's testimony interwoven with the audio collage (C1674/05)
Audio collage and photography © Giulia Brancati.
Listen here

Giulia commented:

“In a world full of noise and overwhelming information, to sit and really pay attention to someone’s personal story is an act of mindful presence. This module has been a continuous learning experience in which ‘the other voice’ became a trigger for creativity and personal reflection.”

Memory Foam by Karthika Sakthivel

Inspired by Irene’s testimony Karthika created a wonderful sonic quilt, entitled Memory Foam.

Karthika explains,

“There was power in Irene’s voice, enough to make me want to sew - something I’d never really done on my own before. But in her story there was comfort, there was warmth and that kept me going.”

Illustrated with objects drawn from Irene's memories, each square of the patchwork quilt encased conductive fabric that triggered audio clips. Upon touching each square, the corresponding story would play.

Karthika further commented,

“The initial visitor interactions with the piece gave me useful insights that enabled me to improve the experience in real time by testing alternate ways of hanging and displaying the quilt. After engaging with the quilt guests walked up to me with recollections of their own mothers and grandmothers – and these emotional connections were deeply rewarding.”

Karthika, Giulia and the whole group were honoured that Irene and her daughter Jayne travelled from Preston to come to the exhibition. Karthika said:

“It was the greatest honour to have her experience my patchwork of her memories. This project for me unfurled yards of possibilities, the common thread being - the power of a voice.”

Irene and her daughter Jayne experiencing Memory Foam © Karthika Sakthivel.
Irene's words activated by touching the lime green patch with lace and a zip (top left of the quilt) (C1674/05)
Listen here

Meditations in Clay by James Roadnight and David Sappa

Listening to ceramicist Walter Keeler's memories of making a pot inspired James Roadnight and David Sappa to travel to Cornwall and record new oral histories to create Meditations in Clay. This was an immersive documentary that explores what we, as members of this modern society, can learn from the craft of pottery - a technology as old as time itself. The film combines interviews conducted at the Bernard Leach pottery with audio-visual documentation of the St Ives studio and its rugged Cornish surroundings.


Meditations in Clay, video montage © James Roadnight and David Sappa.

Those attending the showcase were bewitched as they watched the landscape documentary on the large screen and engaged with the selection of listening pots, which when held to the ear played excerpts of the oral history interviews.

James and David commented,

“This project has taught us a great deal about the deep interview techniques involved in Oral History. Seeing visitors at the showcase engage deeply with our work, watching the film and listening to our guided meditation for 15, 20 minutes at a time was more than we could have ever imagined.”

Beyond Form

Raf Martins responded innovatively to Jonathan Blake’s interview describing his experiences as one of the first people in the UK to be diagnosed with HIV. In Beyond Form Raf created an audio soundscape of environmental sounds and excerpts from the interview which played alongside a projected 3D hologram based on the cellular structure of the HIV virus. The hologram changed form and shape when activated by the audio – an intriguing visual artefact that translated the vibrant individual story into a futuristic medium.

Jonathan Blake's testimony interwoven with environmental soundscape (C456/104) Soundscape and image © Raf Martins.
Listen here

Stiff Upper Lip

Also inspired by Jonathan Blake’s interview was Stiff Upper Lip, a short film by Kingsley Tao that used clips of the interview to explore sexuality, identity and reactions to health and sickness.

Donald in Wonderland

Donald Palmer’s interview with Paul Merchant contained a wonderful and warm description of the front room that his Jamaican-born parents ‘kept for best’ in 1970s London. Alex Remoleux created a virtual reality tour of the reimagined space, entitled Donald in Wonderland, where the viewer could point to various objects in the virtual space and launch the corresponding snippet of audio.

Alex commented,

“I am really happy that I provided a Virtual Reality experience, and that Donald Palmer himself came to see my work. In the picture below you can see Donald using the remote in order to point and touch the objects represented in the virtual world.”

Donald Palmer describes his parents' front room (C1379/102)
Interviewee Donald Palmer wearing the virtual reality headset, exploring the virtual reality space (pictured) created by Alex Remoleux.
Listen here

Showcase at the British Library

The reaction to the showcase from the visitors and British Library staff was overwhelmingly positive, as shown by this small selection of comments. We were incredibly grateful to interviewees Irene and Donald for attending the showcase too. This was an excellent collaboration: RCA students and staff alike gained new insights into the significance and breadth of the British Library Oral History collection and the British Library staff were bowled over by the creative responses to the archival collection.

Examples of feedback from British Library showcase of 'The Other Voice' by Royal College of Art

With thanks to the MA Other Voice cohort Giulia Brancati, Raf Martins, Alexia Remoleux, James Roadnight, Karthika Sakthivel, David Sappa and Kingsley Tao, RCA staff Eleanor Dare and Matt Lewis & BL Oral History Curator Mary Stewart, plus all the interviewees who recorded their stories and the visitors who took the time to attend the showcase.
