Digital scholarship blog

Introduction

Tracking exciting developments at the intersection of libraries, scholarship and technology.

06 July 2020

Archivists, Stop Wasting Your Ref-ing Time!

“I didn’t get where I am today by manually creating individual catalogue references for thousands of archival records!”

One of the most laborious yet necessary tasks of an archivist is the generation of catalogue references. This was once the bane of my life. But I now have a technological solution, which anyone can download and use for free.

Meet ReG: the newest team member of the Endangered Archives Programme (EAP). He’s not as entertaining as Reginald D Hunter. She’s not as lyrical as Regina Spektor. But like 1970s sitcom character Reggie Perrin, ReG provides a logical solution to the daily grind of office life - though less extreme and hopefully more successful.

Two pictures of musicians, Reginald Hunter and Regina Spektor

Reginald D Hunter (left), [Image originally posted by Pete Ashton at https://flickr.com/photos/51035602859@N01/187673692]; Regina Spektor (right), [Image originally posted by Beny Shlevich at https://www.flickr.com/photos/17088109@N00/417238523]

Reggie Perrin’s boss CJ was famed for his “I didn’t get where I am today” catchphrase, and as EAP’s resident GJ, I decided to employ my own ReG, without whom I wouldn’t be where I am today. Rather than writing this blog, my eyes would be drowning in metadata, my mind gathering dust, and my ears fleeing from the sound of colleagues and collaborators banging on my door, demanding to know why I’m so far behind in my work.

Image of two men at their offices from British sitcom The Rise and Fall of Reginald Perrin

CJ (left) [http://www.leonardrossiter.com/reginaldperrin/12044.jpg] and Reginald Perrin (right) [https://www.imdb.com/title/tt0073990/mediaviewer/rm1649999872] from The Rise and Fall of Reginald Perrin.

The problem

EAP metadata is created in spreadsheets by digitisation teams all over the world. It is then processed by the EAP team in London and ingested into the British Library’s cataloguing system.

When I joined EAP in 2018 one of the first projects to process was the Barbados Mercury and Bridgetown Gazette. It took days to create all of the catalogue references for this large newspaper collection, which spans more than 60 years.

Microsoft Excel’s fill down feature helped automate part of this task, but repeating this for thousands of rows is time-consuming and error-prone.

I needed to find a solution to this.

During 2019 I established new workflows to semi-automate several aspects of the cataloguing process using OpenRefine - but OpenRefine is primarily a data cleaning tool, and its difficulty in understanding hierarchical relationships meant that it was not suitable for this task.

Learning to code

For some time I toyed with the idea of learning to write computer code using the Python programming language. I dabbled with free online tutorials. But it was tough to make practical sense of these generic tutorials, hard to find time, and my motivation dwindled.

When the British Library teamed up with The National Archives and Birkbeck University of London to launch a PG Cert in Computing for Information Professionals, I jumped at the chance to take part in the trial run.

It was a leap certainly worth taking because I now have the skills to write code for the purpose of transforming and analysing large volumes of data. And the first product of this new skillset is a computer program that accurately generates catalogue references for thousands of rows of data in mere seconds.

The solution - ReG in action

By coincidence, one of the first projects I needed to catalogue after creating this program was another Caribbean newspaper digitised by the same team at the Barbados Archives Department: The Barbadian.

This collection was a similar size and structure to the Barbados Mercury, but the generation of all the catalogue references took just a few seconds. All I needed to do was:

Open ReG
Enter the project ID for the collection (reference prefix)
Enter the filename of the spreadsheet containing the metadata

And Bingo! All my references were generated in a new file.

How it works in a nutshell

The basic principle of the program is that it reads a single column in the dataset, which contains the hierarchical information. In the example above, it read the “Level” column.

It then uses this information to calculate the structured numbering of the catalogue references, which it populates in the “Reference” column.

Reference format

The generated references conform to the following format:

Each reference begins with a prefix that is common to the whole dataset. This is the prefix that the user enters at the start of the program. In the example above, that is “EAP1251”.
Forward slashes ( / ) are used to indicate a new hierarchical level.
Each record is assigned its own number relative to its sibling records, and that number is shared with all of the children of that record.

In the example above, the reference for the first collection is formatted: EAP1251/1

The reference for the first series of the first collection is formatted: EAP1251/1/1

The reference for the second series of the first collection is: EAP1251/1/2

No matter how complex the hierarchical structure of the dataset, the program will quickly and accurately generate references for every record in accordance with this format.
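The numbering rules above can be illustrated with a short Python sketch. This is not ReG's actual code: the function name, the four-term hierarchy and the sample "Level" values are assumptions made for illustration, and the sketch rejects rows that skip a level, which ReG itself may handle differently.

```python
def generate_references(prefix, levels, hierarchy):
    """Build one catalogue reference per row from its hierarchical level."""
    depth_of = {term: depth for depth, term in enumerate(hierarchy)}
    counters = []      # counters[d] is the number of the current record at depth d
    references = []
    for level in levels:
        depth = depth_of[level]          # KeyError if the term is not in the hierarchy
        if depth > len(counters):
            # Simplification in this sketch: a record may not skip a level.
            raise ValueError(f"'{level}' has no parent record above it")
        if depth == len(counters):
            counters.append(1)           # first child of the preceding record
        else:
            del counters[depth + 1:]     # leave any deeper branch
            counters[depth] += 1         # next sibling at this depth
        references.append("/".join([prefix, *map(str, counters)]))
    return references


# Hypothetical hierarchy and "Level" column values.
hierarchy = ["Collection", "Series", "File", "Item"]
levels = ["Collection", "Series", "File", "File", "Series", "File"]
references = generate_references("EAP1251", levels, hierarchy)
for level, reference in zip(levels, references):
    print(f"{level:<12}{reference}")
```

Because each record only needs the counters of its ancestors, the whole column can be processed in a single pass from top to bottom, which is why thousands of rows take only seconds.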

Download for wider re-use

While ReG was designed primarily for use by EAP, it should work for anyone that generates reference numbers using the same format.

For users of the Calm cataloguing software, ReG could be used to complete the “RefNo” column, which determines the tree structure of a collection when a spreadsheet is ingested into Calm.

With wider re-use in mind, some settings can be configured to suit individual requirements.

For example, you can configure the names of the columns that ReG reads and generates references in. For EAP, the reference generation column is named “Reference”, but for Calm users, it could be configured as “RefNo”.

Users can also configure their own hierarchy. You have complete freedom to set the hierarchical terms applicable to your institution and complete freedom to set the hierarchical order of those terms.

It is possible that some minor EAP idiosyncrasies might preclude reuse of this program for some users. If this is the case, by all means get in touch; perhaps I can tweak the code to make it more applicable to users beyond EAP - though some tweaks may be more feasible than others.

Additional validation features

While generating references is ReG's core function, it also includes several validation features to help you spot and correct problems with your data.

Unexpected item in the hierarchy area

For catalogue references to be calculated, all the data in the level column must match a term within the configured hierarchy. The program therefore checks this; if it finds a discrepancy, it notifies the user and offers two options to proceed.

Option 1: Rename unexpected terms

First, users have the option to rename any unexpected terms. This is useful for correcting typographical errors, such as this example - where “Files” should be “File”.

Option 2: Build a one-off hierarchy

Alternatively, users can create a one-off hierarchy that matches the terms in the dataset. In the following example, the unexpected hierarchical term “Specimen” is a bona fide term. It is just not part of the configured hierarchy.

Rather than being forced to quit the program and amend the configuration file, the user can simply establish a new, one-off hierarchy within the program.

This hierarchy will not be saved for future instances. It is just used for this one-off occasion. If the user wants “Specimen” to be recognised in the future, the configuration file will also need to be updated.
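The check and the two options described above might look something like the following Python sketch. The function names, the configured hierarchy and the position given to "Specimen" are all hypothetical; ReG's own implementation and prompts are not shown in this post.

```python
def find_unexpected_terms(levels, hierarchy):
    """Return the level terms that are not in the hierarchy, in order of first appearance."""
    known = set(hierarchy)
    unexpected = []
    for term in levels:
        if term not in known and term not in unexpected:
            unexpected.append(term)
    return unexpected


def rename_terms(levels, renames):
    """Option 1: replace unexpected terms using a mapping of old term to new term."""
    return [renames.get(term, term) for term in levels]


# Hypothetical configured hierarchy and "Level" column with a typo.
configured = ["Collection", "Series", "File", "Item"]
levels = ["Collection", "Series", "Files", "File"]

problems = find_unexpected_terms(levels, configured)       # the typo is reported
corrected = rename_terms(levels, {"Files": "File"})        # Option 1: fix the typo

# Option 2: a one-off hierarchy used for this run only (the order is an assumption).
one_off = ["Collection", "Series", "File", "Specimen"]
specimen_levels = ["Collection", "Series", "File", "Specimen"]
remaining = find_unexpected_terms(specimen_levels, one_off)
```

Keeping the one-off hierarchy as an ordinary variable, rather than writing it back to the configuration, mirrors the behaviour described above: it is forgotten as soon as the run ends.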

Single child records

To avoid redundant information, it is sometimes advisable for an archivist to eliminate single child records from a collection. ReG will identify any such records, notify the user, and give them three options to proceed:

A. Delete single child records
B. Delete the parents of single child records
C. Keep the single child records and/or their parents

Depending on how the user chooses to proceed, ReG will produce one of three results, which affects the rows that remain and the structure of the generated references.

In this example, the third series in the original dataset contains a single child - a single file.

The most notable result is option B, where the parent was deleted. Looking at the “Level” column, the single child now appears to be a sibling of the files from the second series. But the reference number indicates that this file is part of a different branch within the tree structure.

This is more clearly illustrated by the following tree diagrams.

This functionality means that ReG will help you spot any single child records that you may otherwise have been unaware of.

But it also gives you a means of creating an appropriate hierarchical structure when cataloguing in a spreadsheet. If you intentionally insert dummy parents for single child records, ReG can generate references that map the appropriate tree structure and then remove the dummy parent records in one seamless process.
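As a rough illustration of the single child check, the Python sketch below works on references that have already been generated. The function names and sample references are hypothetical, and the sketch assumes that a child keeps its original reference when its parent is deleted; how ReG itself numbers the remaining records is not stated here.

```python
from collections import Counter


def parent_of(reference):
    """Return the parent's reference: everything before the last slash."""
    return reference.rsplit("/", 1)[0]


def find_single_children(references):
    """Return the records that are the only child of their parent record."""
    child_counts = Counter(parent_of(ref) for ref in references)
    known = set(references)
    # Top-level records are skipped because their parent (the bare prefix) is not a row.
    return [
        ref for ref in references
        if parent_of(ref) in known and child_counts[parent_of(ref)] == 1
    ]


def delete_parents_of_single_children(references):
    """Drop each parent that has only one child, keeping the child's reference unchanged."""
    parents = {parent_of(ref) for ref in find_single_children(references)}
    return [ref for ref in references if ref not in parents]


# Hypothetical collection in which the third series contains a single file.
references = [
    "EAP1251/1",
    "EAP1251/1/1", "EAP1251/1/1/1", "EAP1251/1/1/2",
    "EAP1251/1/2", "EAP1251/1/2/1", "EAP1251/1/2/2",
    "EAP1251/1/3", "EAP1251/1/3/1",
]
single_children = find_single_children(references)
without_parents = delete_parents_of_single_children(references)
```

In this sketch the surviving file still carries the number of its deleted parent, so its reference continues to place it on a separate branch from the files of the second series.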

And finally ...

If you’ve got this far, you probably recognise the problem and have at least a passing interest in finding a solution. If so, please feel free to download the software, give it a go, and get in touch.

If you spot any problems, or have any suggested enhancements, I would welcome your input. You certainly won’t be wasting my time - and you might just save some of yours.

Download links

For making this possible, I am particularly thankful to Jody Butterworth, Sam van Schaik, Nora McGregor, Stelios Sotiriadis, and Peter Wood.

This blog post is by Dr Graham Jevon, Endangered Archives Programme cataloguer. He is on Twitter as @GJHistory.

Posted by Digital Research Team at 10:05 AM

Tags

Data, Digital scholarship, Projects, Research collaboration

15 June 2020

Marginal Voices in UK Digital Comics

I am an AHRC Collaborative Doctoral Partnership student based at the British Library and Central Saint Martins, University of the Arts London (UAL). The studentship is funded by the Arts and Humanities Research Council’s Collaborative Doctoral Partnership Programme.

Supervised jointly by Stella Wisdom from the British Library, Roger Sabin and Ian Hague from UAL, my research looks to explore the potential for digital comics to take advantage of digital technologies and the digital environment to foster inclusivity and diversity. I aim to examine the status of marginal voices within UK digital comics, while addressing the opportunities and challenges these comics present for the British Library’s collection and preservation policies.

A cartoon strip of three vertical panel images, in the first a caravan is on the edge of a cliff, in the second a dog asleep in a bed, in the third the dog wakes up and sits up in bed

Digital comics have been identified as complex digital publications, meaning this research project is connected to the work of the broader Emerging Formats Project. On top of embracing technological change, digital comics have the potential to reflect, embrace and contribute to social and cultural change in the UK. Digital comics change not only how stories are told, but also whose stories are told.

One of the comic creators whose work I have recently been examining is Jaime Huxtable, a Welsh cartoonist/illustrator based in Worthing, West Sussex. He has worked on a variety of digital comics projects, from webcomics to interactive comics, and also runs various comics-related workshops.

Samir's Christmas by Jaime Huxtable. This promotional comic strip was created for Freedom From Torture’s 2019 Christmas Care Box Appeal and was made into a short animated video by Hands Up. Copyright © Jaime Huxtable

My thesis will explore whether the ways UK digital comics are published and consumed mean that they can foreground marginal, alternative voices, similar to the way underground comix and zine culture have. Comics scholarship has focused on the technological aspects of digital comics, meaning their potentially significant contribution to reflecting and embracing social and cultural change in the UK has not been explored. I want to establish whether the fact that digital comics can circumvent traditional gatekeepers means they provide space to foreground marginal voices. I will also explore the challenges and opportunities digital comics might present for legal deposit collection development policy.

As well as being a member of the Comics Research Hub (CoRH) at UAL, I have already begun working with colleagues from the UK Web Archive, and hope to be able to make a significant contribution to the Web Comic Archive. Issues around collection development and management are central to my research, and I feel very fortunate to be based at the British Library, with the chance to learn from and hopefully contribute to practice here.

If anyone would like to know more about my research, or recommend any digital comics for me to look at, please do contact me at [email protected] or @thmsgbhrt on Twitter. UK digital comic creators and publishers can use the ComicHaus app to send their digital comics directly to The British Library digital archive. More details about this process are here.

This post is by British Library collaborative doctoral student Thomas Gebhart (@thmsgbhrt).

Posted by Digital Research Team at 9:56 AM

Tags

Collaborations, Comics-unmasked, Contemporary Britain, Legal deposit, Literature, Projects, Research collaboration, Writing

12 June 2020

Making Watermarks Visible: A Collaborative Project between Conservation and Imaging

Some of the earliest documents being digitised by the British Library Qatar Foundation Partnership are a series of ship’s journals dating from 1605 - 1705, relating to the East India Company’s voyages. Whilst working with these documents, conservators Heather Murphy and Camille Dekeyser-Thuet noticed within the papers a series of interesting examples of early watermark design. Curious about the potential information these could give regarding the journals, Camille and Heather began undertaking research, hoping to learn more about the date and provenance of the papers, trade and production patterns involved in the paper industry of the time, and the practice of watermarking paper. There is a wealth of valuable and interesting information to be gained from the study of watermarks, especially within a project such as the BLQFP which provides the opportunity for study within both IOR and Arabic manuscript material. We hope to publish more information relating to this online with the Qatar Digital Library in the form of Expert articles and visual content.

The first step within this project involved tracing the watermark designs with the help of a light sheet in order to begin gathering a collection of images to form the basis of further research. It was clear that in order to make the best possible use of the visual information contained within these watermarks, they would need to be imaged in a way which would make them available to audiences in both a visually appealing and academically beneficial form, beyond the capabilities of simply hand tracing the designs.

Hand tracings of the watermark designs

This began a collaboration with two members of the BLQFP imaging team, Senior Imaging Technician Jordi Clopes-Masjuan and Senior Imaging Support Technician Matt Lee, who, together with Heather and Camille, were able to devise and facilitate a method of imaging and subsequent editing which enabled new access to the designs. The next step involved the construction of a bespoke support made from Vivak (commonly used for exhibition mounts and stands). This inert plastic is both pliable and transparent, which allowed the simultaneous backlighting and support of the journal pages required to successfully capture the watermarks.

Creation of the Vivak support

Imaging of pages using backlighting

Studio setup for capturing the watermarks

Before capturing, Jordi suggested we create two comparison images of the watermarks. This involved capturing the watermarks as they normally appear on the digitised image (almost or completely invisible), and how they appear illuminated when the page is backlit. The theory behind this was quite simple: “to obtain two consecutive images from the same folio, in the exact same position, but using a specific light set-up for each image”.

By doing so, the idea was for the first image to appear in the same way as the standard, searchable images on the QDL portal. To create these standard image captures, the studio lights were placed near the camera with incident light towards the document.

The second image was taken immediately after, but this time only backlight was used (light behind the document). In using these two different lighting techniques, the first image allowed us to see the content of the document, but the second image revealed the texture and character of the paper, including conservation marks, possible corrections to the writing, as well as the watermarks.

One unexpected complication during imaging was that, due to the varying texture and thickness of the papers, the power of the backlight had to be readjusted for each watermark.

First image taken under normal lighting conditions

Second image of the same page taken using backlighting

https://www.qdl.qa/en/archive/81055/vdc_100000001273.0x000342

Before settling on this approach, we also investigated other imaging techniques:

Multispectral photography: by capturing the same folio under different lights (from UV to IR) the watermarks, along with other types of hidden content such as faded ink, would appear. However, it was decided that this process would take too long for the number of watermarks we were aiming to capture.
Light sheet: Although these types of light sheets are extremely slim and slightly flexible, we experienced some issues when trying the double capture, as on many occasions the light sheet was not flexible enough, and was “moving” the page when trying to reach the gutter (for successful final presentation of the images it was mandatory that the folio on both captures was still).

Once we had successfully captured the images, Photoshop proved vital in allowing us to increase the contrast of the watermark and make it more visible. Because every captured image was different, the approach to editing each one also differed. This required varying adjustments of levels, curves, saturation or brightness, combined with different blending modes to attain the best result. In the end, the tools used were not as important as the final image. The last stage within Photoshop was for both images of the same folio to be cropped and exported with the exact same settings, allowing the comparative images to match as precisely as possible.

The next step involved creating a digital line drawing of each watermark. Matt Lee, a Senior Imaging Support Technician, imported the high-resolution image captures onto an iPad and used the Procreate drawing app to trace the watermarks with a stylus pen. To develop an approach that provided accurate and consistent results, Matt first tested brushes and experimented with line qualities and thicknesses. Selecting the Dry Ink brush, he traced the light outlines of each watermark on a separate transparent layer. The tracings were initially drawn in white to highlight the designs on paper and these were later inverted to create black line drawings that were edited and refined.

Tracing the watermarks directly from the screen of an iPad provided a level of accuracy and efficiency that would be difficult to achieve on a computer with a graphics tablet, trackpad or computer mouse. There were several challenges in tracing the watermarks from the image captures. For example, the technique employed by Jordi was very effective in highlighting the watermarks, but it also made the laid and chain lines in the paper more prominent and these would merge or overlap with the light outline of the design.

Some of the watermarks also appeared distorted, incomplete or had handwritten text on the paper which obscured the details of the design. It was important that the tracings were accurate and some gaps had to be left. However, through the drawing process, the eye began to pick out more detail and the most exciting moment was when a vague outline of a horse revealed itself to be a unicorn with inset lettering.

Vector image of unicorn watermark

In total 78 drawings of varying complexity and design were made for this project. To preserve the transparent backgrounds of the drawings, they were exported first as PNG files. These were then imported into Adobe Illustrator and converted to vector drawings that can be viewed at a larger size without loss of image quality.

Vector image of watermark featuring heraldic designs

Once the drawings were complete, we had three images for each page: the ‘traditional view’ (the page as it would normally appear), the ‘translucid view’ (the same page backlit and showing the watermark) and the ‘translucid + white view’ (the translucid view plus an additional overlay of the digitally traced watermark in place on the page).

Traditional view

Translucid view

Translucid view with watermark highlighted by digital tracing

Using a multiple slider tool, Jordi was able to display these images on an offline website. This enabled us to demonstrate the tool to our team and present the watermarks in the way we had wished from the beginning, allowing people to both study and appreciate the designs.

This is a guest post by Heather Murphy, Conservator, Jordi Clopes-Masjuan, Senior Imaging Technician and Matt Lee, Senior Imaging Support Technician from the British Library Qatar Foundation Partnership. You can follow the British Library Qatar Foundation Partnership on Twitter at @BLQatar.

Posted by Digital Research Team at 7:28 AM

Tags

Digital scholarship, Middle East, Projects

10 June 2020

International Conference on Interactive Digital Storytelling 2020: Call for Papers, Posters and Interactive Creative Works

It has been heartening to see many joyful responses to our recent post featuring The British Library Simulator; an explorable, miniature, virtual version of the British Library’s building in St Pancras.

If you would like to learn more about our Emerging Formats research, which is informing our work in collecting examples of complex digital publications, including works made with Bitsy, then my colleague Giulia Carla Rossi (who built the Bitsy Library) is giving a Leeds Libraries Tech Talk on Digital Literature and Interactive Storytelling this Thursday, 11th June at 12 noon, via Zoom.

Giulia will be joined by Leeds Libraries Central Collections Manager, Rhian Isaac, who will showcase some of Leeds Libraries' exciting collections, and also Izzy Bartley, Digital Learning Officer from Leeds Museums and Galleries, who will talk about her role in making collections interactive and accessible. Places are free, but please book here.

If you are a researcher, or writer/artist/maker, of experimental interactive digital stories, then you may want to check out the current call for submissions for The International Conference on Interactive Digital Storytelling (ICIDS), organised by the Association for Research in Digital Interactive Narratives, a community of academics and practitioners concerned with the advancement of all forms of interactive narrative. The deadline for proposing Research Papers, Exhibition Submissions, Posters and Demos has been extended to 26th June 2020; submissions can be made via the ICIDS 2020 EasyChair Site.

ICIDS showcases and shares research and practice in game narrative and interactive storytelling, including the theoretical, technological, and applied design practices. It is an interdisciplinary gathering that combines computational narratology, narrative systems, storytelling technology, humanities-inspired theoretical inquiry, empirical research and artistic expression.

For 2020, the special theme is Interactive Digital Narrative Scholarship, and ICIDS will be hosted by the Department of Creative Technology of Bournemouth University (also hosts of the New Media Writing Prize, which I have blogged about previously). Their current intention is to host a mixed virtual and physical conference. They are hoping that the physical meeting will still take place, but all talks and works will also be made available virtually for those who are unable to attend physically due to the COVID-19 situation. This means that if you submit work, you will still need to register and present your ideas, but for those who are unable to travel to Bournemouth, the conference organisers will be making allowances for participants to contribute virtually.

ICIDS also includes a creative exhibition, showcasing interactive digital artworks, which for 2020 will explore the curatorial theme “Texts of Discomfort”. The exhibition call is currently seeking interactive digital artworks that generate discomfort through their form and/or their content, and which may also inspire radical changes in the way we perceive the world.

Creatives are encouraged to mix technologies, narratives, points of view, to create interactive digital artworks that unsettle interactors’ assumptions by tackling the world’s global issues; and/or to create artworks that bring to a crisis interactors’ relation with language, that innovate in their way to intertwine narrative and technology. Artworks can include, but are not limited to:

Augmented, mixed and virtual reality works
Computer games
Interactive installations
Mobile and location-based works
Screen-based computational works
Web-based works
Webdocs and interactive films
Transmedia works

Submissions to the ICIDS art exhibition should be made using this form by 26th June. Any questions should be sent to [email protected]. Good luck!

This post is by Digital Curator Stella Wisdom (@miss_wisdom)

Posted by Digital Research Team at 10:47 AM

Tags

Contemporary Britain, Digital scholarship, Events, Experiments, Games, Legal deposit, Literature, Visual arts, Writing

29 May 2020

IIIF Week 2020

As a founding member of the International Image Interoperability Framework (IIIF) Consortium, we at the British Library are looking forward to the upcoming IIIF Week, a programme of free online events taking place during 1-5 June.

IIIF Week sessions will discuss digital strategy for cultural heritage, introduce IIIF’s capabilities and community through introductory presentations and demonstrations of use cases, and explore the future of IIIF and digital research needs more broadly.

Converting the IIIF annual conference into a virtual event held using Zoom provides an opportunity to bring together a wider group of the IIIF community. It enables many to attend, including myself, who otherwise would not have been able to join the in-person event in Boston due to budget, travel restrictions, and other obligations.

Both IIIF newbies and experienced implementers will find events scheduled at convenient times, to allow attendees to form regional community connections in their parts of the world. Attendees can sign up for all events during the week, or just the ones that interest them. Proceedings will be in English unless otherwise indicated, and all sessions will be recorded, then made available following the conference on the IIIF YouTube channel.

To those who know me, it will come as no surprise that I’m especially looking forward to the Fun with IIIF session on Friday 5 June, 4-5pm BST, facilitated by Tristan Roddis from Cogapp. Most of the uses of the International Image Interoperability Framework (IIIF) have focused on scholarly and research applications. This session, however, will look at the opposite extreme: the state of the art for creating playful and fun applications of the IIIF APIs, from tile puzzles to arcade games, via terapixel fractals, virtual galleries, 3D environments, and the Getty's really cool Nintendo Animal Crossing integration.

Hey #AnimalCrossing fans! Hang Van Gogh’s “Irises” on your wall, wear Manet’s “Spring” on your shirt, or decorate your island with ancient sculpture.

Add art to your game using Getty’s new custom pattern-making tool: https://t.co/0014ri2rTO pic.twitter.com/iWoJCdj3jr
— Getty (@GettyMuseum) April 16, 2020

In addition to the IIIF Week programme, there is a free workshop on getting started with IIIF, aimed at anyone wanting more in-depth, practical, hands-on teaching, in the week following the online conference. This pilot course will run over 5 days between 8-12 June. Participation is limited to 25 places, available on a first come, first served basis. It will cover:

Getting started with the Image API
Creating IIIF Manifests with the Bodleian manifest editor
Annotating IIIF resources and setting up an annotation server
Introduction to various IIIF tools and techniques for scholarship

Tutors will assist participants to create a IIIF project and demonstrate it on a Zoom call at the end of the week.

You can view and sign up for IIIF Week events at https://iiif.io/event/2020/iiifweek/. All attendees are expected to adhere to the IIIF Code of Conduct and encouraged to join the IIIF-Week Slack channel for ongoing questions, comments, and discussion (you’ll need to join the IIIF Slack first, which is open to anyone).

To follow and participate in more open discussion on Twitter, use the hashtags #IIIF and #IIIFWeek, and if you have any specific questions about the event, please get in touch with the IIIF staff at [email protected].

See you there :-)

This post is by Digital Curator Stella Wisdom (@miss_wisdom)

Posted by Digital Research Team at 9:53 AM

Tags

Collaborations, Digital scholarship, Events, Games

21 May 2020

The British Library Simulator

The British Library Simulator is a mini game built using the Bitsy game engine, where you can wander around a pixelated (and much smaller) version of the British Library building in St Pancras. Bitsy is known for its compact format and limited colour-palette - you can often recognise your avatar and the items you can interact with by the fact they use a different colour from the background.

The British Library building depicted in Bitsy

The British Library Simulator Bitsy game

Use the arrow keys on your keyboard (or the WASD buttons) to move around the rooms and interact with other characters and objects you meet on the way - you might discover something new about the building and the digital projects the Library is working on!

Bitsy works best in the Chrome browser and if you’re playing on your smartphone, use a sliding movement to move your avatar and tap on the text box to progress with the dialogues.

Most importantly: have fun!

Play The British Library Simulator by GiuliaC on itch.io: https://giuliac.itch.io/the-british-library-simulator

The British Library, together with the other five UK Legal Deposit Libraries, has been collecting examples of complex digital publications, including works made with Bitsy, as part of the Emerging Formats Project. This collection area is continuously expanding, as we include new examples of digital media and interactive storytelling. The formats and tools used to create these publications are varied, and allow for innovative and often immersive solutions that could only be delivered via a digital medium. You can read more about freely-available tools to write interactive fiction here.

This post is by Giulia Carla Rossi, Curator of Digital Publications (@giugimonogatari).

Posted by Digital Research Team at 9:03 AM

Tags

Contemporary Britain, Experiments, Games, Legal deposit

20 May 2020

Bringing Metadata & Full-text Together

This is a guest post by enthusiastic data and metadata nerd Andy Jackson (@anjacks0n), Technical Lead for the UK Web Archive.

In Searching eTheses for the openVirus project we put together a basic system for searching theses. This only used the information from the PDFs themselves, which meant the results looked like this:

openVirus EThOS search results screen

The basics are working fine, but the document titles are largely meaningless, the last-modified dates are clearly suspect (26 theses in the year 1600?!), and the facets aren’t terribly useful.

The EThOS metadata has much richer information that the EThOS team has collected and verified over the years. This includes:

Title
Author
DOI, ISNI, ORCID
Institution
Date
Supervisor(s)
Funder(s)
Dewey Decimal Classification
EThOS Service URL
Repository (‘Landing Page’) URL

So, the question is, how do we integrate these two sets of data into a single system?

Linking on URLs

The EThOS team supplied the PDF download URLs for each record, but we need a common identifier to merge these two datasets. Fortunately, both datasets contain the EThOS Service URL, which looks like this:

https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755301

This (or just the uk.bl.ethos.755301 part) can be used as the ‘key’ for the merge, leaving us with one data set that contains the download URLs alongside all the other fields. We can then process the text from each PDF, and look up the URL in this metadata dataset, and merge the two together in the same way.
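As a rough illustration of that merge, here is a minimal Python sketch. The field names (`ethos_url`, `pdf_url`, `title`), the sample title and the example.org download URL are hypothetical and not the real EThOS schema; only the EThOS Service URL pattern comes from the example above.

```python
from urllib.parse import urlparse, parse_qs


def ethos_id(service_url):
    """Return the 'uin' value (e.g. uk.bl.ethos.755301) from an EThOS Service URL."""
    query = parse_qs(urlparse(service_url).query)
    values = query.get("uin")
    return values[0] if values else None


def merge_on_ethos_id(metadata_rows, download_rows):
    """Join two lists of dicts on the EThOS ID; rows without a usable ID are skipped."""
    merged = {}
    for row in metadata_rows:
        key = ethos_id(row["ethos_url"])
        if key:
            merged[key] = dict(row)
    for row in download_rows:
        key = ethos_id(row["ethos_url"])
        if key in merged:
            merged[key]["pdf_url"] = row["pdf_url"]
    return merged


# Hypothetical sample rows, for illustration only.
metadata = [{
    "ethos_url": "https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755301",
    "title": "An example thesis title",
}]
downloads = [{
    "ethos_url": "https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755301",
    "pdf_url": "http://repository.example.org/thesis.pdf",
}]
records = merge_on_ethos_id(metadata, downloads)
```

Keying on the parsed `uin` value rather than the whole URL means small differences in how the Service URL is written should not prevent a match.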

Except… it doesn’t work.

The web is a messy place: those PDF URLs may have been direct downloads in the past, but now many of them are no longer simple links, but chains of redirects. As an example, this original download URL:

http://repository.royalholloway.ac.uk/items/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf

Now redirects (HTTP 301 Moved Permanently) to the HTTPS version:

https://repository.royalholloway.ac.uk/items/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf

Which then redirects (HTTP 302 Found) to the actual PDF file:

https://repository.royalholloway.ac.uk/file/bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf

So, to bring this all together, we have to trace these links between the EThOS records and the actual PDF documents.

Re-tracing Our Steps

While the crawler we built to download these PDFs worked well enough, it isn’t quite as sophisticated as our main crawler, which is based on Heritrix 3. In particular, Heritrix offers detailed crawl logs that can be used to trace crawler activity. This functionality would be fairly easy to add to Scrapy, but that’s not been done yet. So, another approach is needed.

To trace the crawl, we need to be able to look up URLs and then analyse what happened. In particular, for every starting URL (a.k.a. seed) we want to check if it was a redirect and if so, follow that URL to see where it leads.

We already use content (CDX) indexes to allow us to look up URLs when accessing content. In particular, we use OutbackCDX as the index, and then the pywb playback system to retrieve and access the records and see what happened. So one option is to spin up a separate playback system and query that to work out where the links go.

However, as we only want to trace redirects, we can do something a little simpler. We can use the OutbackCDX service to look up what we got for each URL, and use the same warcio library that pywb uses to read the WARC record and find any redirects. The same process can then be repeated with the resulting URL, until all the chains of redirects have been followed.
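The chain-following loop can be sketched in a few lines of Python. This is only an outline under stated assumptions: the `lookup` callable is a hypothetical stand-in for the OutbackCDX query plus WARC record read, and here it is backed by a plain dictionary holding the Royal Holloway example from earlier (the final 200 status is assumed).

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


def trace_redirects(seed, lookup, max_hops=10):
    """Follow redirects from a seed URL and return the list of URLs visited.

    lookup(url) should return (status_code, location) for a captured URL,
    or None if the URL was never captured.
    """
    chain = [seed]
    url = seed
    for _ in range(max_hops):
        result = lookup(url)
        if result is None:
            break  # nothing captured for this URL
        status, location = result
        if status not in REDIRECT_CODES or not location:
            break  # reached a non-redirect response
        url = urljoin(url, location)  # Location headers may be relative
        if url in chain:
            break  # redirect loop
        chain.append(url)
    return chain


# A dictionary standing in for the real index and WARC lookups.
base = "repository.royalholloway.ac.uk"
item = "bf7a78df-c538-4bff-a28d-983a91cf0634/1/10090181.pdf"
captures = {
    f"http://{base}/items/{item}": (301, f"https://{base}/items/{item}"),
    f"https://{base}/items/{item}": (302, f"https://{base}/file/{item}"),
    f"https://{base}/file/{item}": (200, None),
}
chain = trace_redirects(f"http://{base}/items/{item}", captures.get)
```

The `max_hops` limit and the loop check are defensive additions so that a misbehaving server cannot keep the trace running forever.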

This leaves us with a large list, linking every URL we crawled back to the original PDF URL. This can then be used to link each item to the corresponding EThOS record.

This large look-up table allowed the full-text and metadata to be combined. It was then imported into a new Solr index that replaced the original service, augmenting the records with the new metadata.
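To make the look-up step concrete, here is a small Python sketch of how such a table might be built and used. The data structures and field names (`url`, `text`, `title`) are hypothetical, as are the example.org URLs; the real pipeline and Solr import are not shown.

```python
def build_lookup(chains):
    """Map every URL in each redirect chain back to the chain's first (seed) URL."""
    lookup = {}
    for chain in chains:
        seed = chain[0]
        for url in chain:
            lookup.setdefault(url, seed)  # keep the first seed seen for a URL
    return lookup


def combine(fulltext_docs, lookup, metadata_by_seed):
    """Attach metadata to each full-text document; also return the unmatched URLs."""
    combined, unmatched = [], []
    for doc in fulltext_docs:
        seed = lookup.get(doc["url"])
        record = metadata_by_seed.get(seed)
        if record is None:
            unmatched.append(doc["url"])
            continue
        combined.append({**record, "text": doc["text"]})
    return combined, unmatched


# Hypothetical sample data, for illustration only.
chains = [["http://repo.example.org/a.pdf", "https://repo.example.org/file/a.pdf"]]
metadata_by_seed = {"http://repo.example.org/a.pdf": {"title": "An example thesis title"}}
docs = [
    {"url": "https://repo.example.org/file/a.pdf", "text": "Full text of the thesis"},
    {"url": "https://repo.example.org/file/b.pdf", "text": "Text with no known record"},
]
combined, unmatched = combine(docs, build_lookup(chains), metadata_by_seed)
```

Collecting the unmatched URLs separately gives a simple way to see which documents could not be linked to a record.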

Updating the Interface

The new fields are accessible via the same API as before – see this simple search as an example.

The next step was to update the UI to take advantage of these fields. This was relatively simple, as it mostly involved exchanging one field name for another (e.g. from last_modified_year to year_i), and adding a few links to take advantage of the fact we now have access to the URLs to the EThOS records and the landing pages.
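For illustration, a faceted query against the new field might be built like this in Python. The base URL and core name are hypothetical placeholders rather than the real service address; `facet`, `facet.field`, `rows` and `wt` are standard Solr query parameters, and `year_i` is the field name mentioned above.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def facet_query_url(base_url, text, facet_field="year_i", rows=10):
    """Build a Solr select URL that searches for text and facets on one field."""
    params = {
        "q": text,
        "rows": rows,
        "facet": "true",
        "facet.field": facet_field,
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint; substitute the address of your own Solr core.
url = facet_query_url("http://localhost:8983/solr/etheses/select", "coronavirus")
params = parse_qs(urlparse(url).query)  # parsed back to check what will be sent
```

Switching the UI from `last_modified_year` to `year_i` then amounts to changing the `facet_field` value passed in.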

The result can be seen at:

EThOS Faceted Search Prototype

The Results

This new service provides a much better interface to the collection, and really demonstrates the benefits of combining machine-generated and manually curated metadata.

New openVirus EThOS search results interface

New improved openVirus EThOS search results interface

There are still some issues with the source data that need to be resolved at some point. In particular, there are now only 88,082 records, which indicates that some gaps and mismatches emerged during the process of merging these records together.

But it’s good enough for now.

The next question is: how do we integrate this into the openVirus workflow?

Posted by Digital Research Team at 9:01 AM

Tags

Collaborations, Contemporary Britain, Data, Digital scholarship, Experiments, Science

18 May 2020

Tree Collage Challenge

Today is the start of Mental Health Awareness Week (18-24 May 2020) and this year’s theme is kindness. In my opinion this starts with being kinder to yourself and there are many ways to do this. As my colleague Hannah Nagle reminded me in her recent blog post, creative activities can help you to relax, lift your mood and enable you to express yourself. I also find that spending time in green spaces and appreciating nature is of great benefit to my mental wellbeing. UK mental health charity Mind promote ecotherapy and have a helpful section on their website all about nature and mental health.

However, I appreciate that it is not always possible for people to get outside to enjoy nature, especially during the current coronavirus pandemic. Fortunately, there are ways to bring nature into our homes, such as listening to recordings of bird songs, looking at pictures, and watching videos of wildlife and landscapes. For more ideas on digital ways of connecting to nature, I suggest checking out “Nature and Wellbeing in the Digital Age” by Sue Thomas, who believes we don’t need to disconnect from the internet to reconnect with the earth, sea and sky.

Furthermore, why not participate in this year’s Urban Tree Festival (16-24 May 2020), which is completely online? There is a wide programme of talks and activities, including meditation, daily birdsong, virtual tours, radio and a book club. The festival also includes some brilliant art activities.

Urban Tree Festival 2020

Save Our Street Trees Northampton have invited people to create a virtual urban forest in their windows, by building a tree out of paper, then adding leaves every day to slowly build up a tree canopy. People are then encouraged to share photos of their paper trees on social media tagging them #NewLeaf.

Want to #craft some #family time & celebrate #nature on your doorstep? Join our UK-wide #NewLeaf project for @UrbanTreeFest & grow a virtual #urbanforest in your window May 16-24! Go to https://t.co/JtMCn8kj4u for more info 💚🌳 #urbantreefestival #saveourstreettrees #trees #tree pic.twitter.com/FWd7CYoFNN
— SaveOurStreetTrees (@SaveStreetTrees) May 14, 2020

Another Urban Tree Festival art project is Branching out with Ruth Broadbent, where people are invited to co-create imaginary trees by observing and drawing selected branches and foliage from sections of different trees. These might be seen from gardens or windows, from photos or from memory.

Paintings and drawings of trees are also celebrated in Europeana’s Trees in Art online gallery, launched by the festival today. It showcases artworks depicting trees in urban and rural landscapes, drawn from the digitised collections of museums, galleries, libraries and archives across Europe, including tree book illustrations from the British Library.

Thumbnail pictures of paintings of trees from a website gallery

Europeana Trees in Art online gallery

Not wanting to be left out of the fun, here at the British Library, we have set a Tree Collage Challenge, which invites you to make artistic collages featuring trees and nature, using our book illustrations from the British Library’s Flickr account.

This collection of over a million Public Domain images can be used by anyone for free, without copyright restrictions. The images are illustrations taken from the pages of 17th, 18th and 19th century books. You can read more about them here.

As a starting point for finding images for your collages, you may find it useful to browse the themed albums. In particular, the Flora & Fauna albums are rich resources for finding trees, plants, animals and birds.

To help you get started with making digital collages, my colleague Hannah Nagle has written a handy guide. You can download this here.

We hope you have fun and we can’t wait to see your collage creations! So please post your pictures to Twitter and Instagram using #GreatTree and #UrbanTreeFestival. British Library curators will be following the challenge with interest and showcasing their favourite tree collages in future blog posts, so watch this space!

This post is by Digital Curator Stella Wisdom (@miss_wisdom)

Posted by Digital Research Team at 10:25 AM

Tags

Collaborations, Events, Printed books, Visual arts