Digital scholarship blog

Enabling innovative research with British Library digital collections

Introduction

Tracking exciting developments at the intersection of libraries, scholarship and technology.

21 May 2020

The British Library Simulator

The British Library Simulator is a mini game built using the Bitsy game engine, where you can wander around a pixelated (and much smaller) version of the British Library building in St Pancras. Bitsy is known for its compact format and limited colour-palette - you can often recognise your avatar and the items you can interact with by the fact they use a different colour from the background.

The British Library building depicted in Bitsy
The British Library Simulator Bitsy game

Use the arrow keys on your keyboard (or the WASD buttons) to move around the rooms and interact with other characters and objects you meet on the way - you might discover something new about the building and the digital projects the Library is working on!

Bitsy works best in the Chrome browser. If you’re playing on your smartphone, use a sliding movement to move your avatar and tap on the text box to progress through the dialogue.

Most importantly: have fun!

The British Library, together with the other five UK Legal Deposit Libraries, has been collecting examples of complex digital publications, including works made with Bitsy, as part of the Emerging Formats Project. This collection area is continuously expanding, as we include new examples of digital media and interactive storytelling. The formats and tools used to create these publications are varied, and allow for innovative and often immersive solutions that could only be delivered via a digital medium. You can read more about freely-available tools to write interactive fiction here.

This post is by Giulia Carla Rossi, Curator of Digital Publications (@giugimonogatari).

20 May 2020

Bringing Metadata & Full-text Together

This is a guest post by enthusiastic data and metadata nerd Andy Jackson (@anjacks0n), Technical Lead for the UK Web Archive.

In Searching eTheses for the openVirus project we put together a basic system for searching theses. This only used the information from the PDFs themselves, which meant the results looked like this:

openVirus EThOS search results screen

The basics are working fine, but the document titles are largely meaningless, the last-modified dates are clearly suspect (26 theses in the year 1600?!), and the facets aren’t terribly useful.

The EThOS metadata has much richer information that the EThOS team has collected and verified over the years. This includes:

  • Title
  • Author
  • DOI, ISNI, ORCID
  • Institution
  • Date
  • Supervisor(s)
  • Funder(s)
  • Dewey Decimal Classification
  • EThOS Service URL
  • Repository (‘Landing Page’) URL

So, the question is, how do we integrate these two sets of data into a single system?

Linking on URLs

The EThOS team supplied the PDF download URLs for each record, but we need a common identifier to merge these two datasets. Fortunately, both datasets contain the EThOS Service URL, which looks like this:

https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755301

This (or just the uk.bl.ethos.755301 part) can be used as the ‘key’ for the merge, leaving us with one dataset that contains the download URLs alongside all the other fields. We can then process the text from each PDF, look up its URL in this merged metadata, and combine the two in the same way.
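As a rough sketch of this kind of merge, the snippet below pulls the uin value out of an EThOS Service URL and uses it as the join key. Only the URL format comes from the post: the field names (ethos_url, pdf_url, title), the sample title and the example.org download URL are hypothetical stand-ins for illustration.

```python
from urllib.parse import urlparse, parse_qs


def ethos_key(service_url):
    """Return the 'uin' query parameter, e.g. 'uk.bl.ethos.755301', or None."""
    query = parse_qs(urlparse(service_url).query)
    values = query.get("uin")
    return values[0] if values else None


def merge_on_key(metadata_rows, download_rows):
    """Join two lists of dicts on the EThOS key; unmatched rows are dropped."""
    downloads = {ethos_key(row["ethos_url"]): row for row in download_rows}
    merged = []
    for row in metadata_rows:
        key = ethos_key(row["ethos_url"])
        if key is not None and key in downloads:
            merged.append({**row, "id": key, "pdf_url": downloads[key]["pdf_url"]})
    return merged


# Hypothetical sample rows: one from each dataset, sharing a Service URL.
SERVICE_URL = "https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755301"
metadata = [{"ethos_url": SERVICE_URL, "title": "An example thesis"}]
downloads = [{"ethos_url": SERVICE_URL, "pdf_url": "http://example.org/thesis.pdf"}]

merged = merge_on_key(metadata, downloads)
```

Keying on the bare identifier rather than the whole URL means small differences in how the Service URL is written (for example http versus https) would not break the join.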

Except… it doesn’t work.

The web is a messy place: those PDF URLs may have been direct downloads in the past, but now many of them are no longer simple links, but chains of redirects. As an example, this original download URL:

http://repository.royalholloway.ac.uk/items/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf

Now redirects (HTTP 301 Moved Permanently) to the HTTPS version:

https://repository.royalholloway.ac.uk/items/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf

Which then redirects (HTTP 302 Found) to the actual PDF file:

https://repository.royalholloway.ac.uk/file/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf

So, to bring this all together, we have to trace these links between the EThOS records and the actual PDF documents.

Re-tracing Our Steps

While the crawler we built to download these PDFs worked well enough, it isn’t quite as sophisticated as our main crawler, which is based on Heritrix 3. In particular, Heritrix offers detailed crawl logs that can be used to trace crawler activity. This functionality would be fairly easy to add to Scrapy, but that’s not been done yet. So, another approach is needed.

To trace the crawl, we need to be able to look up URLs and then analyse what happened. In particular, for every starting URL (a.k.a. seed) we want to check if it was a redirect and if so, follow that URL to see where it leads.

We already use content (CDX) indexes to allow us to look up URLs when accessing content. In particular, we use OutbackCDX as the index, and then the pywb playback system to retrieve and access the records and see what happened. So one option is to spin up a separate playback system and query that to work out where the links go.

However, as we only want to trace redirects, we can do something a little simpler. We can use the OutbackCDX service to look up what we got for each URL, and use the same warcio library that pywb uses to read the WARC record and find any redirects. The same process can then be repeated with the resulting URL, until all the chains of redirects have been followed.
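The redirect-following loop can be sketched roughly as follows. This is an illustration rather than the code we actually ran: the lookup function is a hypothetical stand-in for "ask OutbackCDX where the record is, then read the status and Location header from the WARC record", and here it is backed by a small dictionary holding the Royal Holloway example from earlier in this post.

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)

# Stand-in for the index + WARC lookup: URL -> (HTTP status, Location header).
CRAWLED = {
    "http://repository.royalholloway.ac.uk/items/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf":
        (301, "https://repository.royalholloway.ac.uk/items/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf"),
    "https://repository.royalholloway.ac.uk/items/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf":
        (302, "https://repository.royalholloway.ac.uk/file/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf"),
    "https://repository.royalholloway.ac.uk/file/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf":
        (200, None),
}


def trace_redirects(seed_url, lookup, max_hops=10):
    """Follow redirects from seed_url and return the list of URLs visited."""
    chain = [seed_url]
    url = seed_url
    for _ in range(max_hops):
        result = lookup(url)
        if result is None:
            break  # this URL was never crawled, so the trail ends here
        status, location = result
        if status not in REDIRECT_CODES or not location:
            break  # not a redirect: we have reached the final response
        url = urljoin(url, location)  # the Location header may be relative
        if url in chain:
            break  # guard against redirect loops
        chain.append(url)
    return chain


SEED = "http://repository.royalholloway.ac.uk/items/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf"
chain = trace_redirects(SEED, CRAWLED.get)
```

For the example seed, the chain holds the three URLs shown above, ending at the /file/ URL that serves the actual PDF.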

This leaves us with a large list, linking every URL we crawled back to the original PDF URL. This can then be used to link each item to the corresponding EThOS record.

This large look-up table allowed the full-text and metadata to be combined. It was then imported into a new Solr index that replaced the original service, augmenting the records with the new metadata.
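A minimal sketch of that combining step is shown below, assuming each redirect chain is a list of URLs that starts with the original PDF URL. The document fields (url, text, title) and the example.org URLs are hypothetical, and the real Solr import is not shown.

```python
def build_lookup(chains):
    """Map every URL in each redirect chain back to its seed (first) URL."""
    table = {}
    for chain in chains:
        seed = chain[0]
        for url in chain:
            table.setdefault(url, seed)  # keep the first seed seen for a URL
    return table


def combine(fulltext_docs, lookup, metadata_by_seed):
    """Attach metadata to each full-text document; report those with no match."""
    combined, unmatched = [], []
    for doc in fulltext_docs:
        seed = lookup.get(doc["url"])
        meta = metadata_by_seed.get(seed)
        if meta is None:
            unmatched.append(doc["url"])
            continue
        # Curated metadata wins if both sides define the same field.
        combined.append({**doc, **meta})
    return combined, unmatched


# Hypothetical data: one chain, one matching document and one orphan.
chains = [["http://example.org/a.pdf", "https://example.org/a.pdf", "https://example.org/file/a.pdf"]]
metadata_by_seed = {"http://example.org/a.pdf": {"title": "An example thesis"}}
docs = [
    {"url": "https://example.org/file/a.pdf", "text": "full text of the thesis"},
    {"url": "https://example.org/orphan.pdf", "text": "no metadata for this one"},
]

lookup = build_lookup(chains)
combined, unmatched = combine(docs, lookup, metadata_by_seed)
```

Collecting the unmatched URLs separately, rather than silently dropping them, makes it easier to see where gaps and mismatches creep in during a merge like this.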

Updating the Interface

The new fields are accessible via the same API as before – see this simple search as an example.

The next step was to update the UI to take advantage of these fields. This was relatively simple, as it mostly involved exchanging one field name for another (e.g. from last_modified_year to year_i), and adding a few links to take advantage of the fact we now have access to the URLs to the EThOS records and the landing pages.
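To illustrate the kind of change involved, here is a small sketch that builds a faceted Solr query and swaps the old field name for the new one. The field names last_modified_year and year_i are the ones mentioned above; the host, the core name and the search term are hypothetical, and this is not the actual UI code.

```python
from urllib.parse import urlencode

# Old field name -> new field name (taken from the change described above).
FIELD_RENAMES = {"last_modified_year": "year_i"}


def facet_query(base_url, text, facet_fields):
    """Build a Solr select URL, translating any old facet field names."""
    params = [("q", text), ("wt", "json"), ("facet", "true")]
    params += [("facet.field", FIELD_RENAMES.get(name, name)) for name in facet_fields]
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr core; the real service URL is not reproduced here.
url = facet_query("http://localhost:8983/solr/ethos/select", "coronavirus", ["last_modified_year"])
```

Keeping the renames in one mapping means any other fields that change can be added without touching the query-building code.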

The result can be seen at:

EThOS Faceted Search Prototype

The Results

This new service provides a much better interface to the collection, and really demonstrates the benefits of combining machine-generated and manually curated metadata.

New improved openVirus EThOS search results interface

There are still some issues with the source data that need to be resolved at some point. In particular, there are now only 88,082 records, which indicates that some gaps and mismatches emerged during the process of merging these records together.

But it’s good enough for now.

The next question is: how do we integrate this into the openVirus workflow?

18 May 2020

Tree Collage Challenge

Today is the start of Mental Health Awareness Week (18-24 May 2020) and this year’s theme is kindness. In my opinion this starts with being kinder to yourself, and there are many ways to do this. As my colleague Hannah Nagle reminded me in her recent blog post, creative activities can help you to relax, lift your mood and enable you to express yourself. I also find that spending time in green spaces and appreciating nature is of great benefit to my mental wellbeing. UK mental health charity Mind promote ecotherapy and have a helpful section on their website all about nature and mental health.

I appreciate that it is not always possible for people to get outside to enjoy nature, especially during the current coronavirus pandemic. However, there are ways to bring nature into our homes, such as listening to recordings of bird songs, looking at pictures, and watching videos of wildlife and landscapes. For more ideas on digital ways of connecting to nature, I suggest checking out “Nature and Wellbeing in the Digital Age” by Sue Thomas, who believes we don’t need to disconnect from the internet to reconnect with the earth, sea and sky.

Furthermore, why not participate in this year’s Urban Tree Festival (16-24 May 2020), which is completely online? There is a wide programme of talks and activities, including meditation, daily birdsong, virtual tours, radio and a book club. The festival also includes some brilliant art activities.

Urban Tree Festival logo with a photograph depicting a tree canopy
Urban Tree Festival 2020

Save Our Street Trees Northampton have invited people to create a virtual urban forest in their windows, by building a tree out of paper, then adding leaves every day to slowly build up a tree canopy. People are then encouraged to share photos of their paper trees on social media tagging them #NewLeaf.

Another Urban Tree Festival art project is Branching out with Ruth Broadbent, where people are invited to co-create imaginary trees by observing and drawing selected branches and foliage from sections of different trees. These might be seen from gardens or windows, from photos or from memory.

Paintings and drawings of trees are also celebrated in Europeana’s Trees in Art online gallery, which the festival launched today. It showcases artworks depicting trees in urban and rural landscapes, drawn from the digitised collections of museums, galleries, libraries and archives across Europe, including tree book illustrations from the British Library.

Thumbnail pictures of paintings of trees from a website gallery
Europeana Trees in Art online gallery

Not wanting to be left out of the fun, here at the British Library, we have set a Tree Collage Challenge, which invites you to make artistic collages featuring trees and nature, using our book illustrations from the British Library’s Flickr account.

This collection of over a million Public Domain images can be used by anyone for free, without copyright restrictions. The images are illustrations taken from the pages of 17th, 18th and 19th century books. You can read more about them here.

As a starting point for finding images for your collages, you may find it useful to browse themed albums. In particular, the Flora & Fauna albums are rich resources for finding trees, plants, animals and birds.

To learn how to make digital collages, see the handy guide my colleague Hannah Nagle has written to help get you started. You can download this here.

We hope you have fun and we can’t wait to see your collage creations! So please post your pictures to Twitter and Instagram using #GreatTree and #UrbanTreeFestival. British Library curators will be following the challenge with interest and showcasing their favourite tree collages in future blog posts, so watch this space!

This post is by Digital Curator Stella Wisdom (@miss_wisdom).