Digital scholarship blog

Enabling innovative research with British Library digital collections


13 March 2024

Rethinking Web Maps to present Hans Sloane’s Collections

A post by Dr Gethin Rees, Lead Curator, Digital Mapping...

I have recently started a community fellowship working with geographical data from the Sloane Lab project. The project, titled A Generous Approach to Web Mapping Sloane’s Collections, deals with the collection of Hans Sloane, amassed in the eighteenth century and a foundation collection for the British Museum and, subsequently, the Natural History Museum and the British Library. The aim of the fellowship is to create interactive maps that enable users to view the global breadth of Sloane’s collections, to discover collection items and to click through to their web pages. The Sloane Lab project, funded by the UK’s Arts and Humanities Research Council as part of the Towards a National Collection programme, has created the Sloane Lab knowledge base (SLKB), a rich and interconnected knowledge graph of this vast collection. My fellowship seeks to link and visualise digital representations of British Museum and British Library objects in the SLKB, and I will be guided by project researchers Andreas Vlachidis and Daniele Metilli from University College London.

Photo of a bust sculpture of a man in a curled wig on a red brick wall
Figure 1. Bust of Hans Sloane in the British Library.

The first stage of the fellowship is to use data science methods to extract place names from the records of Sloane’s collections that exist in the catalogues today. These records will then be aligned with a gazetteer, a list of places and associated data, such as the World Historical Gazetteer (https://whgazetteer.org/). Alignment yields coordinates in the form of latitude and longitude, meaning the places can be displayed on a map; the fellowship will draw on the Peripleo web map software to do this (https://github.com/britishlibrary/peripleo).
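To make the pipeline concrete, here is a minimal sketch of the extract-and-align step in Python. It assumes spaCy’s small English model for the named-entity step (the fellowship does not prescribe a specific toolkit), and it leaves the gazetteer lookup as a placeholder, since the exact alignment workflow against the World Historical Gazetteer is not described here.

```python
# A minimal sketch: pull candidate place names from catalogue record text.
# spaCy and its "en_core_web_sm" model are assumptions for illustration
# (install with: python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_place_names(record_text: str) -> list[str]:
    """Return candidate place names (GPE/LOC entities) from a record."""
    doc = nlp(record_text)
    return [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]

record = "Collected in Jamaica and shipped to London, 1689."  # toy example
for place in extract_place_names(record):
    # Alignment step (placeholder): match each name against a gazetteer
    # entry to obtain latitude and longitude for display in Peripleo.
    print(place)  # -> "Jamaica", "London"
```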

Image of a rectangular map with circles overlaid on locations
Figure 2. Web map using the Web Mercator projection, from the Georeferencer.


The fellowship also aims to critically evaluate the use of mapping technologies (e.g. Google Maps Embed API, Mapbox GL, Leaflet) to present cultural heritage collections on the web. One area that I will examine is the use of the Web Mercator projection as the standard option for presenting humanities data using web maps. A map projection is a method of representing part of the surface of the earth on a plane (flat) surface. The transformation from a sphere or similar to a flat representation always introduces distortion. There are innumerable projections, or ways to make this transformation, each suited to different purposes, with its own strengths and weaknesses. Web maps are predominantly used for navigation, and the Web Mercator projection is well suited to this purpose as it preserves angles.

Image of a rectangular map with circles illustrating that countries nearer the equator are shown as relatively smaller
Figure 3. Map of the world based on the Mercator projection, including indicatrices to visualise local distortions to area. By Justin Kunimune. Source: https://commons.wikimedia.org/wiki/File:Mercator_with_Tissot%27s_Indicatrices_of_Distortion.svg Used under CC-BY-SA-4.0 license.

However, this does not necessarily mean it is the right projection for presenting humanities data. Indeed, it is unsuitable for the aims and scope of the Sloane Lab: first, because of well-documented visual compromises, such as the inflation of landmasses like Europe at the expense of, for example, Africa and the Caribbean, which not only hamper visual analysis but also recreate and reinforce global inequities and injustices. Second, the Mercator projection has a history entangled with processes like colonialism, empire and slavery that also shaped Hans Sloane’s collections. The fellowship therefore examines the use of other projections, such as those that preserve distance and area, to represent contested collections and collecting practices in interactive mapping libraries like Leaflet or OpenLayers. Geography is intimately connected with identity, and digital maps thus offer powerful opportunities for presenting cultural heritage collections. The fellowship examines how reinvention of a commonly used visualisation form can foster thought-provoking engagement with Sloane’s collections, and how this might be applied to visualise the geography of heritage more widely.
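The scale of the distortion is easy to demonstrate. The snippet below (a sketch, assuming the pyproj library, which is not part of the fellowship’s stated toolchain) projects the same one-degree cell at the equator and at 60°N into Web Mercator and into an Albers equal-area projection, and compares their footprints.

```python
# Compare how two projections treat a 1° x 1° cell at the equator vs 60°N.
from pyproj import Transformer

def cell_footprint(transformer, lon, lat):
    """Approximate projected area (m^2) of a 1-degree cell at (lon, lat)."""
    x0, y0 = transformer.transform(lon, lat)
    x1, y1 = transformer.transform(lon + 1, lat + 1)
    return abs(x1 - x0) * abs(y1 - y0)

mercator = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
albers = Transformer.from_crs(
    "EPSG:4326", "+proj=aea +lat_1=20 +lat_2=50 +datum=WGS84", always_xy=True
)

for name, t in [("Web Mercator", mercator), ("Albers equal-area", albers)]:
    ratio = cell_footprint(t, 0, 60) / cell_footprint(t, 0, 0)
    print(f"{name}: 60°N cell is {ratio:.2f}x the area of the equatorial cell")
# Web Mercator draws the 60°N cell roughly twice as large as the equatorial
# one, when in truth it is about half the size (the equal-area answer):
# a fourfold relative inflation, growing worse towards the poles.
```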

Image of a curved map that represents the relative size of countries more accurately
Figure 4. Map of the world based on the Albers equal-area projection, including indicatrices to visualise local distortions to area. By Justin Kunimune. Source: https://commons.wikimedia.org/wiki/File:Albers_with_Tissot%27s_Indicatrices_of_Distortion.svg Used under CC-BY-SA-4.0 license.

21 September 2023

Convert-a-Card: Helping Cataloguers Derive Records with OCLC APIs and Python

This blog post is by Harry Lloyd, Research Software Engineer in the Digital Research team, British Library. You can sometimes find him at the Rose and Crown in Kentish Town.

Last week Dr Adi Keinan-Schoonbaert delved into the invaluable work that she and others have done on the Convert-a-Card project since 2015. In this post, I’m going to pick up where she left off, and describe how we’ve been automating parts of the workflow. When I joined the British Library in February, Victoria Morris and former colleague Giorgia Tolfo had prototyped programmatically extracting entities from transcribed catalogue cards and searching by title and author in the OCLC WorldCat database for any close matches. I have been building on this work, addressing the last yellow rectangle below, “Curator disambiguation and resolution”: namely, how curators choose between OCLC results and develop a MARC record fit for ingest into British Library systems.

A flow chart of the Convert-a-card workflow. Digital catalogue cards to Transkribus to bespoke language model to OCR output (shelfmark, title, author, other text) to OCLC search and retrieval and shelfmark correction to spreadsheet with results to curator disambiguation and resolution to collection metadata ingest
The Convert-a-Card workflow at the start of 2023

 

Entity Extraction

We’re currently working with the digitised images from two drawers of cards, one Urdu and one Chinese. Adi and Giorgia used a layout model on Transkribus to successfully tag different entities on the Urdu cards. The transcribed XML output then had ‘title’, ‘shelfmark’ and ‘author’ tags for the relevant text, making them easy to extract.

On the left an image of an Urdu catalogue card, on the right XML describing the transcribed text, including a "title" tag for the title line
Card with layout model and resulting XML for an Urdu card, showing the `structure {type:title;}` parameter on line one

The same method didn’t work for the Chinese cards, possibly because the cards are less consistently structured. There is, however, consistency in the vertical order of entities on the card: shelfmark comes above title comes above author. This meant I could reuse some code we developed for Rossitza Atanassova’s Incunabula project, which reliably retrieved title and author (and occasionally an ISBN).

Two Chinese cards side-by-side, with different layouts.
Chinese cards. Although the layouts are variable, shelfmark is reliably the first line, with title and author following.
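The vertical-order heuristic is simple enough to sketch. The snippet below is an illustrative reconstruction rather than the project’s actual code: it reads the transcribed lines from a Transkribus PAGE XML export in document order and assigns shelfmark, title and author by position.

```python
# Assign entities by vertical position: shelfmark, then title, then author.
import xml.etree.ElementTree as ET

def lines_from_page_xml(path: str) -> list[str]:
    """Return the line-level Unicode transcriptions, in document order."""
    root = ET.parse(path).getroot()
    lines = []
    for el in root.iter():
        if el.tag.endswith("}TextLine"):
            for child in el:  # the line-level TextEquiv holds the text
                if child.tag.endswith("}TextEquiv"):
                    for unicode_el in child:
                        if unicode_el.tag.endswith("}Unicode") and unicode_el.text:
                            lines.append(unicode_el.text.strip())
    return lines

def card_entities(path: str) -> dict:
    lines = lines_from_page_xml(path)
    # Vertical order on the Chinese cards: shelfmark above title above author.
    return dict(zip(("shelfmark", "title", "author"), lines))
```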

 

Querying OCLC WorldCat

With the title and author for each card, we were set up to query WorldCat, but how to do this when there are over two thousand cards in these two drawers alone? Victoria and Giorgia made impressive progress combining Python wrappers for the Z39.50 protocol (PyZ3950) and the MARC format (pymarc). With their prototype, a lot of googling of ASN.1, BER and Z39.50, and a couple of quiet weeks drifting through the web of references between the two packages, I built something that could turn a table of titles and authors for the Chinese cards into a list of MARC records. I had also brushed up on enough UTF-8 to work out why none of the Chinese characters were encoding correctly, and fix it.

For all that I enjoyed trawling through it, Z39.50 is, in the words of a 1999 tutorial, “rather hard to penetrate” and nearly 35 years old. PyZ3950, the Python wrapper, hasn’t been maintained for two years, and making any changes to the code is a painstaking process. While Z39.50 remains widely used for transferring information between libraries, that doesn’t mean there aren’t better ways of doing things, and in the name of modernity OCLC offer a suite of APIs for their services. Crucially, there are endpoints on their Metadata API that allow search and retrieval of records in MARCXML format. As the British Library maintains a cataloguing subscription to OCLC, we have access to the APIs, so all that’s needed is a call to the OCLC OAuth server, a search on the Metadata API using title and author, then retrieval of the MARCXML for any results. This is very straightforward in Python, and with the Requests package and about ten lines of code we can have our MARCXML matches.
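Those ten-ish lines look roughly like the sketch below. The OAuth client-credentials flow and the search-then-retrieve pattern are the point; the endpoint paths, parameter names and response fields are illustrative placeholders rather than a verbatim copy of OCLC’s API documentation.

```python
# Sketch of the OAuth + Metadata API flow with Requests. URLs, parameters
# and response fields below are placeholders, not OCLC's documented API.
import requests

TOKEN_URL = "https://oauth.oclc.org/token"           # OCLC OAuth server
SEARCH_URL = "https://metadata.api.oclc.org/search"  # placeholder path

def worldcat_marcxml(title, author, key, secret):
    # 1. Client-credentials grant gets us a bearer token.
    token = requests.post(
        TOKEN_URL, auth=(key, secret),
        data={"grant_type": "client_credentials", "scope": "WorldCatMetadataAPI"},
    ).json()["access_token"]
    headers = {"Authorization": f"Bearer {token}"}
    # 2. Search the Metadata API by title and author...
    results = requests.get(
        SEARCH_URL, headers=headers,
        params={"q": f"ti:{title} AND au:{author}"},
    ).json()
    # 3. ...then retrieve the MARCXML for each candidate match.
    return [
        requests.get(record["recordUrl"], headers=headers).text
        for record in results.get("records", [])
    ]
```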

Selecting Matches

At all stages of the project we’ve needed someone to select the best match for a card from WorldCat search results. This responsibility currently lies with curators and cataloguers from the relevant collection area. With that audience in mind, I needed a way to present MARC data from WorldCat so curators could compare the MARC fields for different matches. The solution needed to let a cataloguer choose a card, show the card and a table with the MARC fields for each WorldCat result, and ideally provide filters so curators could use domain knowledge to filter out bad results. I put out a call on the cross-government data science network, and a colleague in the 10DS data science team suggested Streamlit.

Streamlit is a Python package that allows fast development of web apps without needing to be a web app developer (which is handy as I’m not one). Adding Streamlit commands to the script that processes WorldCat MARC records into a dataframe quickly turned it into a functioning web app. The app reads in a dataframe of the cards in one drawer and their potential WorldCat matches, and presents it as a table of cards to choose from. You then see the image of the card you’re working on and a MARC field table for the relevant WorldCat matches. This side-by-side view makes it easy to scan across a particular MARC field and exclude matches that have, for example, the wrong physical dimensions. There’s a filter for cataloguing language, sort options for things like the number of subject access fields and the total number of fields, and the ability to remove bad matches from view. Once the cataloguer has chosen a match they can save it to the original dataframe, or note that there were no good matches, or only a partial match.

Screenshot from the Streamlit web app, with an image of a Chinese catalogue card above a table containing MARC data for different WorldCat matches relating to the card.
Screenshot from the Streamlit Convert-a-Card web app, showing the card and the MARC table curators use to choose between matches. As the cataloguers are familiar with MARC, providing the raw fields is the easiest way to choose between matches.
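For a flavour of how little code a Streamlit prototype needs, here is a stripped-down sketch of the selection interface. The column names and CSV layout are invented for illustration; the real app works from the dataframe of cards and matches described above.

```python
# streamlit_app.py - run with: streamlit run streamlit_app.py
# Column names and the CSV are illustrative, not the project's real schema.
import pandas as pd
import streamlit as st

cards = pd.read_csv("drawer_with_matches.csv")  # one row per card/match pair

card_id = st.selectbox("Choose a card", cards["card_id"].unique())
matches = cards[cards["card_id"] == card_id]

st.image(matches["card_image_path"].iloc[0])  # the digitised card itself

# Optional filter, so curators can use domain knowledge to cut bad results.
langs = st.multiselect("Cataloguing language",
                       matches["cataloguing_language"].unique())
if langs:
    matches = matches[matches["cataloguing_language"].isin(langs)]

# Side-by-side MARC fields, one row per WorldCat match, for easy scanning.
st.dataframe(matches.drop(columns=["card_image_path"]))

chosen = st.radio("Best match", ["No good match", *matches["oclc_number"]])
if st.button("Save decision"):
    st.write(f"Saved: {chosen}")  # the real app writes back to the dataframe
```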

After some very positive initial feedback, we sat down with the Chinese curators and had them test the app out. That led to a fun, interactive, user-experience-focussed feedback session, and a whole host of GitHub issues on the repository for bugs and design suggestions. Behind-the-scenes discussions on where to host the app and data are ongoing and not straightforward, but this has been a remarkably easy product to prototype, and I’m optimistic it will provide a lightweight, gentle-learning-curve complement to full deriving software like Aleph (the Library’s main cataloguing system).

Next Steps

The project currently uses a range of technologies in Transkribus, the OCLC APIs, and Streamlit, and tying these together has in itself been a success. Going forward, we can look forward to extracting non-English text from the cards, and to the richer list of entities this would make available. Working with the OCLC APIs has been a learning curve, and they’re not working perfectly yet, but they represent a relatively accessible option compared to Z39.50. My hope for the Streamlit app is that it will be a useful tool beyond the project, wherever someone wants to use WorldCat to help derive records from minimal information. We still have challenges to overcome in terms of design, data storage, and hosting, but these discussions should have their own benefits in making future development easier. The goal for the automation part of the project is a smooth flow of data from Transkribus, through OCLC, and on to the curators, and while it’s not perfect, we’re definitely getting there.

14 September 2023

What's the future of crowdsourcing in cultural heritage?

The short version: crowdsourcing in cultural heritage is an exciting field, rich in opportunities for collaborative, interdisciplinary research and practice. It includes online volunteering, citizen science, citizen history, digital public participation, community co-production, and, increasingly, human computation and other systems that will change how participants relate to digital cultural heritage. New technologies like image labelling, text transcription and natural language processing, plus trends in organisations and societies at large mean constantly changing challenges (and potential). Our white paper is an attempt to make recommendations for funders, organisations and practitioners in the near and distant future. You can let us know what we got right, and what we could improve by commenting on Recommendations, Challenges and Opportunities for the Future of Crowdsourcing in Cultural Heritage: a White Paper.

The longer version: The Collective Wisdom project was funded by an AHRC networking grant to bring experts from the UK and the US together to document the state of the art in designing, managing and integrating crowdsourcing activities, and to look ahead to future challenges and unresolved issues that could be addressed by larger, longer-term collaboration on methods for digitally-enabled participation.

Our open access Collective Wisdom Handbook: perspectives on crowdsourcing in cultural heritage is the first outcome of the project, our expert workshops were a second.

Mia (me) and Sam Blickhan launched our White Paper for comment on PubPub at the Digital Humanities 2023 conference in Graz, Austria, in July this year, with Meghan Ferriter attending remotely. Our short paper abstract and DH2023 slides are online at Zenodo.

So - what's the future of crowdsourcing in cultural heritage? Head on over to Recommendations, Challenges and Opportunities for the Future of Crowdsourcing in Cultural Heritage: a White Paper and let us know what you think! You've got until the end of September…

You can also read our earlier post on 'community review' for a sense of the feedback we're after - in short, what resonates, what needs tweaking, what examples could we include?

To whet your appetite, here's a preview of our five recommendations (to find out why we make those recommendations, you'll have to read the White Paper):

  • Infrastructure: Platforms need sustainability. Funding should not always be tied to novelty, but should also support the maintenance, uptake and reuse of well-used tools.
  • Evidencing and Evaluation: Help create an evaluation toolkit for cultural heritage crowdsourcing projects; provide ‘recipes’ for measuring different kinds of success. Shift thinking about value from output/scale/product to include impact on participants' and community well-being.
  • Skills and Competencies: Help create a self-guided skills inventory assessment resource, tool, or worksheet to support skills assessment, and develop workshops to support their integrity and adoption.
  • Communities of Practice: Fund informal meetups, low-cost conferences, peer review panels, and other opportunities for creating and extending community. They should have an international reach, e.g. beyond the UK-US limitations of the initial Collective Wisdom project funding.
  • Incorporating Emergent Technologies and Methods: Fund educational resources and workshops to help the field understand opportunities, and anticipate the consequences of proposed technologies.

What have we missed? Which points do you want to boost? (For example, we discovered how many of our points apply to digital scholarship projects in general). You can '+1' on points that resonate with you, suggest changes to wording, ask questions, provide examples and references, or (constructively, please) challenge our arguments. Our funding only supported participants from the UK and US, so we're very keen to hear from folk from the rest of the world.

12 September 2023

Convert-a-Card: Past, Present and Future of Catalogue Cards Retroconversion

This blog post is by Dr Adi Keinan-Schoonbaert, Digital Curator for Asian and African Collections, British Library. She's on Mastodon as @[email protected].

 

It has been more than eight years since the British Library launched its crowdsourcing platform, LibCrowds, in June 2015, with the aim of enhancing access to our collections. The first project series on LibCrowds was called Convert-a-Card, followed by the ever-so-popular In the Spotlight project. The aim of Convert-a-Card was to convert print card catalogues from the Library’s Asian and African Collections into electronic records, for inclusion in our online catalogue Explore.

A significant portion of the Library's extensive historical collections was acquired well before the advent of standard computer-based cataloguing. Consequently, even though the Library's online catalogue offers public access to tens of millions of records, numerous crucial research materials remain discoverable solely through searching the traditional physical card catalogues. The physical cards provide essential information for each book, such as title, author, physical description (dimensions, number of pages, images, etc.), subject and a “shelfmark” – a reference to the item’s location. This information still constitutes the basic set of data to produce e-records in libraries and archives.

Card Catalogue Cabinets in the British Library’s Asian & African Studies Reading Room © Jon Ellis

 

The initial focus of Convert-a-Card was the Library’s card catalogues for Chinese, Indonesian and Urdu books – you can read more about this here and here. Scanned catalogue cards were uploaded to Flickr (and later to our Research Repository), grouped by the physical drawer in which they were originally located. Several of these digitised drawers became projects on LibCrowds.

 

Crowdsourcing Retroconversion

Convert-a-Card on LibCrowds included two tasks:

  1. Task 1 – Search for a WorldCat record match: contributors were asked to look at a digitised card and search the OCLC WorldCat database based on some of the metadata elements printed on it (e.g. title, author, publication date), to see if a record for the book already existed in some form online. If found, they selected the matching record.
  2. Task 2 – Transcribe the shelfmark: if a match was found, contributors then transcribed the Library's unique shelfmark as printed on the card.

Online volunteers worked on Pinyin (Chinese), Indonesian and Urdu records, mainly between 2015 and 2019. Their valuable contributions resulted in lists of new records which were then ingested into the Library's Explore catalogue – making these items so much more discoverable to our users. For cards only partially matched with online records, curators and cataloguers had a special area on the LibCrowds platform through which they could address some of the discrepancies in partial matches and resolve them.

An example of an Urdu catalogue card

 

After much consideration, we decided to sunset LibCrowds. However, you can see a good snapshot of it thanks to the UK Web Archive (with thanks to Mia Ridge and Filipe Bento for archiving it), or access its GitHub pages – originally set up and maintained by LibCrowds creator Alex Mendes. We have mainly been using Zooniverse for crowdsourcing projects (see for example the Living with Machines projects), and you can see here some references to these and other crowdsourcing initiatives. Sunsetting LibCrowds provided us with the opportunity to rethink Convert-a-Card and consider alternative, innovative ways to automate or semi-automate the retroconversion of these valuable catalogue cards.

 

Text Recognition

As a first step, we were looking to automate the retrieval of text from the digitised cards using OCR/machine learning. As mentioned, this text includes shelfmark, title, author, place and date of publication, and other information. If extracted accurately enough, this text could be used for WorldCat lookup, as well as for enhancement of existing records. In most cases, the text was typewritten in English, often with additional information, or translation, handwritten in other languages. To start with, we decided to focus only on the typewritten English – with the aspiration to address other scripts and languages in the future.

Last year, we ran some comparative testing with ABBYY FineReader Server (the software generally used for in-house OCR) and Transkribus, to see how accurately they perform this task. We trialled a set of cards with two different versions of ABBYY, and three different models for typewritten Latin scripts in Transkribus (Model IDs 29418, 36202, and 25849). Assessment was done by visually comparing the original text with the OCRed text, examining mainly the key areas of text which are important for this initiative, i.e. the shelfmark, author’s name and book title. For the purpose of automatically recognising the typewritten English on the catalogue cards, Transkribus Model 29418 performed better than the others – and more accurately than ABBYY’s recognition.

An example of a Pinyin card in Transkribus, showing segmentation and transcription

 

Using that as a base model, we incrementally trained a bespoke model to recognise the text on our Pinyin cards. We’ve also normalised the resulting text, for example removing spaces in the shelfmark, or excluding unnecessary bits of data. This model currently extracts the English text only, with a Character Error Rate (CER) of 1.8%. With more training data, we plan on extending this model to other types of catalogue cards – but for now we are testing this workflow with our Chinese cards.

 

Entities Extraction

Extracting meaningful entities from the OCRed text is our next step, and there are different ways to do that. One such method – if already using Transkribus for text extraction – is training and applying a bespoke P2PaLA layout analysis model. Such a model could identify text regions, improve automated segmentation of the cards, and help retrieve specific regions for further tasks. Former colleague Giorgia Tolfo tested this with our Urdu cards, with good results. Trying to replicate this for our Chinese cards was not as successful – perhaps because they are less consistent in structure.

Another possible method is by using regular expressions in a programming language. Research Software Engineer (RSE) Harry Lloyd created a Jupyter notebook with Python code to do just that: take the PAGE XML files produced by Transkribus, parse the XML, and extract the title, author and shelfmark from the text. This works exceptionally well, and in the future we’ll expand entity recognition and extraction to other types of data appearing on the cards. But for now, this information suffices to query OCLC WorldCat and see if a matching record exists.
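As a rough illustration of the regular-expression step (the shelfmark pattern here is hypothetical; Harry’s notebook works from the PAGE XML itself), folding in the kind of normalisation mentioned earlier, such as removing spaces from the shelfmark:

```python
# Illustrative entity extraction from a card's OCR text lines.
import re

def extract_entities(lines: list[str]) -> dict:
    """Pull shelfmark, title and author from a card's text lines."""
    entities = {"shelfmark": None, "title": None, "author": None}
    if lines:
        # Hypothetical shelfmark shape, e.g. "15298.a.12"; spaces introduced
        # by the OCR are normalised away before matching.
        match = re.match(r"[A-Za-z0-9]+(\.[A-Za-z0-9]+)+",
                         lines[0].replace(" ", ""))
        if match:
            entities["shelfmark"] = match.group(0)
    # Vertical order on the cards: title then author follow the shelfmark.
    if len(lines) > 1:
        entities["title"] = lines[1].strip()
    if len(lines) > 2:
        entities["author"] = lines[2].strip()
    return entities
```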

One of the 26 drawers of Chinese (Pinyin) card catalogues © Jon Ellis

 

Matching Cards to WorldCat Records

Entities extracted from the catalogue cards can now be used to search and retrieve potentially matching records from the OCLC WorldCat database. Pulling out WorldCat records matched with our card records would help us create new records to go into our cataloguing system Aleph, as well as enrich existing Aleph records with additional information. Previously done by volunteers, we aim to automate this process as much as possible.

Querying WorldCat was initially done using the Z39.50 protocol – the same one originally used in LibCrowds. This is a client-server communications protocol designed to support the search and retrieval of information in a distributed network environment. Building on an excellent start by Victoria Morris and Giorgia Tolfo, who developed a prototype that uses PyZ3950 and pymarc to query WorldCat, Harry refined the code and tested it successfully for data search and retrieval. Moving forward, we are likely to use the OCLC API for this – which should be a lot more straightforward!

 

Curator/Cataloguer Disambiguation

Getting potential matches from WorldCat is brilliant, but we would like to have an easy way for curators and cataloguers to make the final decision on the ideal match – which WorldCat record would be the best one as a basis to create a new catalogue record on our system. For this purpose, Harry is currently working on a web application based on Streamlit – an open source Python library that enables the building and sharing of web apps. Staff members will be able to use this app by viewing suggested matches, and selecting the most suitable ones.

I’ll leave it up to Harry to tell you about this work – so stay tuned for a follow-up blog post very soon!

 

11 September 2023

Join the British Library's Universal Viewer Product Team

The British Library has been a leading contributor to IIIF, the International Image Interoperability Framework, and the Universal Viewer for many years. We're about to take the next step in this work - and you can join us! We are recruiting for a Product Owner, a Research Software Engineer and a Senior Test Engineer (deadline 03 January 2024). 

In this post, Dr Mia Ridge, product owner for the Universal Viewer (UV) 2015-18, and Dr Rossitza Atanassova, UV business owner 2019-2023, share some background information on how new posts advertised for a UV product team will help shape the future of the Viewer at the Library and contribute to international work on the UV, IIIF standards and activities.

A lavishly decorated page from a fourteenth-century manuscript, 'The Sherborne Missal', showing an illuminated capital with the Virgin Mary holding the baby Jesus, surrounded by the three Kings, with other illuminations in the margins and the text.
Detail from Add MS 74236 'The Sherborne Missal' displayed in the Universal Viewer

 The creation of a Universal Viewer product team is part of wider infrastructure changes at the British Library, and marks a shift from contributing via specific UV development projects to thinking of the Viewer as a product. We'll continue to work with the Open Collective while focusing on Library-specific issues to support other activities across the organisation. 

Staff across the Library have contributed to the development of the Universal Viewer, including curators, digitisation teams and technology staff. Staff engage through bespoke training delivered by the IIIF Consortium, participation at IIIF workshops and conferences, and experimentation with new tools, such as the digital storytelling tool Exhibit, to engage wide audiences. Other Library work with IIIF includes a collaboration with Zooniverse to enable items to be imported to Zooniverse via IIIF manifests, making crowdsourcing more accessible to organisations with IIIF items. Most recently, with funding from the Andrew W. Mellon Foundation, we updated the UV to play audio from the British Library sound collections.

Over half a million items from the British Library's collections are already available via the Universal Viewer, and that number grows all the time. Work on the UV has already let us retire around 35 other image viewers, significantly reducing maintenance overheads and creating a more consistent experience for our readers.

However, there's a lot more to do! User expectations change as people use other document and media viewers, whether that's other IIIF tools like Mirador or the latest commercial streaming video platforms. We also need to work on some technical debt, ensure accessibility standards are met, improve infrastructure, and consolidate services for the benefit of users. Future challenges include enhancing UV capabilities to display annotations, formats such as newspapers, and complex objects such as 3D.

A view of the Library's image viewer, showing an early nineteenth century Javanese palm-leaf manuscript inside its decorated wooden covers. To the left of the image there is a list with the thumbnails of the manuscript leaves and to the right the panel displays bibliographic information about the item.
British Library Universal Viewer displaying Add MS 12278

 If you'd like to work in collaboration with an international open source community on a viewer that will reach millions of users around the world, one of these jobs may be for you!

Product Owner (job reference R00000196)

Ensure the strategic vision, development, and success of the project. Your primary goal will be to understand user needs, prioritise features and enhancements, and collaborate with the development team and community to deliver a high-quality open source product. 

Research Software Engineer (job reference R00000197)

Help identify requirements, and design and implement online interfaces to showcase our collections, help answer research questions, and support application of novel methods across team activities.

Senior Test Engineer (job reference R00000198)

Help devise requirements, develop high quality test cases, and support application of novel methods across team activities

To apply please visit the British Library recruitment site. Applications close on 3 January 2024. Interview dates are listed in the job ads.

Please ensure you answer all application questions (CVs cannot be submitted). At the BL we can only shortlist with information that applicants provide in response to questions on the application. Any questions about the roles or the process? Drop us a line at [email protected].

03 August 2023

My AHRC-RLUK Professional Practice Fellowship: A year on

A year ago I started work on my RLUK Professional Practice Fellowship project to analyse computationally the descriptions in the Library’s incunabula printed catalogue. As the project comes to a close this week, I would like to update on the work from the last few months leading to the publication of the incunabula printed catalogue data, a featured collection on the British Library’s Research Repository. In a separate blogpost I will discuss the findings from the text analysis and next steps, as well as share my reflections on the fellowship experience.

Since Isaac’s blogpost about the automated detection of the catalogue entries in the OCR files, a lot of effort has gone into improving the code and outputting the descriptions in the format required for the text analysis and as open datasets. With the invaluable help of Harry Lloyd, who had joined the Library’s Digital Research team as a Research Software Engineer, we verified the results and identified new rules for detecting sub-entries signalled by ‘Another Copy’ rather than a main entry heading. We also reassembled and parsed the XML files, originally split into two sets per volume for the purpose of generating the OCR, so that the entries are listed in the order in which they appear in the printed volume. We prepared new text files containing all the entries from each volume, with each entry represented as a single line of text, which I could use for the corpus linguistics analysis with AntConc. In consultation with the curator, Karen Limper-Herz, and colleagues in Collection Metadata, we agreed how best to store the data for evaluation and in preparation for updating the Library’s online catalogue.

Two women looking at the poster illustrating the text analysis with the incunabula catalogue data
Poster session at Digital Humanities Conference 2023

Whilst all this work was taking place, I started the computational analysis of the English text from the descriptions. The reason for using these partial descriptions was to separate what was merely transcribed from the incunabula from the language used by the cataloguer in their own ‘voice’. I have recorded my initial observations in the poster I presented at the Digital Humanities Conference 2023. Discussing my fellowship project with the conference attendees was extremely rewarding; there was much interest in the way I had used Transkribus to derive the OCR data, some questions about how the project methodology applies to other data, and agreement on the need to contextualise collections descriptions and reflect on any bias in the transmission of knowledge. In the poster I also highlight the importance of the cross-disciplinary collaboration required for this type of work, which resonated well with the conference theme of Collaboration as Opportunity.

I have started sharing the knowledge gained from the project with members of the GLAM community. At the British Library, Harry, Karen and I ran an informal ‘Hack & Yack’ training session showcasing the project aims and methodology through the use of Jupyter notebooks. I also enjoyed the opportunity to discuss my research at a recent Research Libraries UK Digital Scholarship Network workshop and look forward to further conversations on this topic with colleagues in the wider GLAM community.

We intend to continue to enrich the datasets to enable better access to the collection, the development of new resources for incunabula research and digital scholarship projects. I would like to end by adding my thanks to Graham Jevon, for assisting with the timely publication of the project datasets, and above all to James, Karen and Harry for supporting me throughout this project.

This blogpost is by Dr Rossitza Atanassova, Digital Curator, British Library. She is on Twitter @RossiAtanassova and Mastodon @[email protected]

 

02 August 2023

Writing tools for Interactive Fiction - an updated list

In the spring of 2020, during the first UK lockdown, I wrote an article for the British Library English and Drama blog, titled ‘Writing tools for Interactive Fiction’. Quite a few things have changed since then and as the Library launched its first exhibition on Digital Storytelling this June, it seemed like the perfect time to update this list with a few additions.

Interactive fiction (IF), or interactive narrative/narration, is defined as “software simulating environments in which players use text commands to control characters and influence the environment.”

The British Library has been collecting examples of UK interactive fiction as part of the Emerging Formats Project, which is a collaborative effort from all six UK Legal Deposit Libraries to look at the collection management requirements of complex digital publications. Lynda Clark, the British Library Innovation Fellow for Interactive Fiction, built the Interactive Narratives collection on the UK Web Archive (UKWA) during her placement. Because of Legal Deposit Regulations, most of the items in the Interactive Narratives collection can only be accessed on Library premises – which also extends to other collections in the UK Web Archive, such as the New Media Writing Prize collection.

Lynda also conducted analysis on genres, interaction patterns and tools used to build these narratives.

 

Many of these tools are free to use and don’t require any previous knowledge of programming languages. This is not meant to be an exhaustive list, but it might be a useful overview of some of the tools currently available, if you’d like to start experimenting with writing your own interactive narrative. We are also very excited to be able to offer a week-long Interactive Fiction Summer School this August at the Library, running alongside the Digital Storytelling exhibition.

For easier navigation, these are the tools included in this article:

  • Twine
  • ink/inky & inklewriter
  • Bitsy
  • Inform 7
  • ChoiceScript
  • Downpour

 

Twine

Twine is an open-source tool for writing text-based, non-linear narratives. Created by Chris Klimas in 2009, Twine is perfect for writing Choose Your Own Adventure-like stories without knowing how to code. The output is an HTML file, which facilitates publishing and distribution, as it can be run on any computer with an Internet connection and a web browser. If you have any knowledge of CSS or JavaScript it’s possible to add extra features and specific designs to your Twine story, but the standard Twine structure only requires you to type text and put brackets around the phrases that will become links in the story (linking to another passage or branching into different directions), as in the sketch below. There is an online version or a downloadable version that runs on Windows, MacOS and Linux. Twine has multiple story formats, with different features and ways to write the interactive bits of your story. The Twine Reference is a good place to start, but there is also a Twine Cookbook (containing ‘recipes’, instructions and examples to do a variety of things).
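For instance, a passage in Twine’s default Harlowe story format might look something like this small sketch (the passage names are invented):

```
You stand in the entrance hall of the library.

[[Enter the reading room]]
[[Browse the catalogue->Catalogue]]
```

The first link takes its passage name from the displayed text; the arrow form lets the displayed text and the passage name differ.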

Example of text from Cat Simulator 3000. 'You dream of mice. You dream of trout. You dream of balls of yarn. You dream of world domination. You dream of opening your own bank account. You dream of the nature of sentience.' Followed by the prompt 'Wake up'.
Some quality cat dreams.
(from Emma Winston’s Cat Simulator 3000)

 

As the most used tool in the UKWA collection, there are many examples of IF written in Twine, from cat and teatime simulators (Emma Winston’s Cat Simulator 3000 and Damon L. Wakes’ Lovely Pleasant Teatime Simulator), to stories that include a mix of video, images and audio (Chris Godber’s Glitch), and horror games made for Gothic Novel Jam using the British Library’s Flickr collection of images (Freya Campbell’s The Tower – NB some content warnings apply). Lynda Clark also authored an original story as a conclusion to her placement: The Memory Archivist incorporates many of the themes that emerged during her research and won the BL Labs Artistic Award 2019.

 

ink/inky & inklewriter

Cambridge-based video game studio inkle is behind another IF tool – or two. Ink is the scripting language used to author many of inkle’s videogames – the idea behind it is to mark up “pure-text with flow in order to produce interactive scripts”. It doesn’t require any programming knowledge and the resulting scripts are relatively easy to read. Inky is the editor to write ink scripts in – it’s free to download and lets you test your narrative as you write it. Once you’re happy with your story, you can export it for the web, as well as a JSON file. There’s a quick tutorial to walk you through the basics, as well as a full manual on how to write in ink. ink was also used to write 80 Days, another work collected by the British Library as part of the emerging formats project and currently exhibited as part of the Digital Storytelling exhibition.

A side by side showing the back end and front end of what writing in ink looks like.
A page from 80 Days, written using ink. To read in full detail, please click on the image.

 

inklewriter is an open-source, ready-to-use, browser-based IF “sketch-pad”. It is meant to be used to sketch out narratives more than to author fully-developed stories. There is no download required and the fact that it is a simple and straightforward tool to experiment with IF makes it a good fit for educators. Tutorials are included within the platform itself so that you can learn while you write.

This year’s Interactive Fiction Summer School at the British Library will teach attendees how to write interactive fiction using ink, with a focus on dialogue and writing with the player in mind. Dr. Florencia Minuzzi will lead the 5-day course, together with a number of guest speakers whose work is featured in the Digital Storytelling exhibition – including Corey Brotherson, Destina Connor, Dan Hett and Meghna Jayanth. The school runs from Monday 21st to Friday 25th August – no previous coding experience necessary!

A screenshot from 80 Days Ⓒ inkle. Two men facing each other with the prompt 'begin conversation'.
A screenshot from 80 Days Ⓒ inkle.

 

Bitsy

Bitsy is a browser-based editor for mini games developed by Adam Le Doux in 2016. It operates within clear constraints (8x8 pixel tiles, a 3-colour palette, etc.), which is actually one of the reasons why it is so beloved. You can draw and animate your own characters within your pixel grid, write the dialogue and define how your avatar (your playable character) will interact with the surrounding scenery and with other non-playable characters. Again, no programming knowledge is necessary. Bitsy is especially good for short narratives and vignette games. After completing your game, you can download it as an HTML file and then share it however you prefer. There is Bitsy Docs, as well as some comprehensive tutorials and even a one-page pamphlet covering the basics.

GIF animation from the Bitsy game 'British Library Simulator'
Shout-out to the Emerging Formats Project
(from Giulia Carla Rossi’s The British Library Simulator)

 

To play (and read) a Bitsy work you use your keyboard to move the avatar around and interact with the ‘sprites’ (interactive items, characters and scenery – usually recognisable as sporting a different colour from the non-interactive background). You can wander around a Zen garden reflecting on your impending wedding (Ben Bruce’s Zen Garden, Portland, The Day Before My Wedding), light the village fires to welcome the midwinter spirits (Ash Green’s Midwinter Spirits), experience a love story through mixtapes (David Mowatt’s She Made Me A Mix Tape), or if you’re still craving a nice cuppa you can review some imaginary tea shops (Ben Bruce’s Five Great Places to Get a Nice Cup of Tea When You Are Asleep). You can even visit a pixelated version of the British Library and discover more about our contemporary and digital collections with The British Library Simulator.

 

Inform 7

While Twine allows you to write hypertext narratives (where readers can progress through the story by clicking on a link), Inform 7 lets you write parser-based interactive fiction. Parser-based IF requires the reader to type commands (sometimes full sentences) in order to interact with the story.

A how to guide showing what text options are available for playing text based explorer games in Inform. Helpful tips like 'Try the commands that make sense! Doors are for opening; buttons are for pushing; pie is for eating!'
How to Play Interactive Fiction (An entire strategy guide on a single postcard)
Written by Andrew Plotkin – design by Lea Albaugh. This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License

 

Inform 7 is a free-to-use, open-source (as of April 2022) tool for writing interactive fiction. Originally created as Inform by Graham Nelson in 1993, the current Inform 7 was released in 2006 and uses natural language (based on English) to describe situations and interactions. The learning curve is a bit steeper than with Twine, but the natural language approach allows users with no programming experience to write code in a simplified language that reads like English text. Inform 7 also has a Recipe Book and a series of well-documented tutorials. Inform runs on Windows, MacOS and Linux and lets you output your game as HTML files.

While the current version of Inform is Inform 7, narratives using previous versions of the system are still available – Emily Short’s Galatea is always a good place to start. You could also explore mysterious ruins with your romantic interest (C.E.J. Pacian’s Love, Hate and the Mysterious Ocean Tower), play a gentleman thief (J.J. Guest’s Alias, the Magpie) or make more tea (Joey Jones’ Strained Tea).

 

ChoiceScript

ChoiceScript is a JavaScript-based scripting language developed by Adam Strong-Morse and Dan Fabulich of Choice of Games. It can be used to write choice-based interactive narratives, in which the reader has to select among multiple choices to determine how the story will unfold. The simplicity of the language makes it possible to create Choose-Your-Own-Adventure-style stories without any prior coding knowledge. The ChoiceScript source is available to download for free on the Choice of Games website (it also requires writers to have Node.js installed on their machine). Once your story is complete, you can publish it for free online. Otherwise, Choice of Games offer the possibility of publishing your work with them (they publish to various platforms, including iOS, Android, Kindle and Steam) and earning royalties from it. There is a tutorial that covers the basics, including a Glossary of ChoiceScript terms. The Choice of Games blog also includes some articles with tips on how to design and write interactive stories, especially long ones.

Genres of works built using ChoiceScript are again quite varied – from sci-fi stories exploring the relationships between writers and readers (Lynda Clark’s Writers Are Not Strangers), to crime/romantic dramas (Toni Owen-Blue’s Double/Cross) and fantasy adventures (Thom Baylay’s Evertree Inn).

 

Downpour

Downpour is a game-making tool for phones, currently in development. Created by v buckenham, it will allow users to make interactive games in minutes, using only their phone’s camera and linking images together. There is no expectation of previous programming knowledge, and by removing the need to access a computer, Downpour promises to be a very approachable tool. Release is currently planned for 2023 on iOS and Android – if you want to be notified when it launches you can sign up here.

Downpour banner (purple writing over pink background)
Downpour banner.

 

More resources

As I mentioned before, this is in no way a comprehensive list – there are a lot of other tools and platforms for writing IF, both mainstream and slightly more obscure (Ren’Py, Quest, StoryNexus, Raconteur, Genarrator, just to mention a few). Try different tools, find the one that works best for you, or use a mix of them if you prefer! Experiment as much as you like.

If you’d like to discover even more tools to build your interactive project, Everest Pipkin has an excellent list of Open source, experimental, and tiny tools.

Emily Short’s Interactive Storytelling blog also offers a round-up of very interesting links about interactive narratives.

If you want to be inspired by more independent games and interactive stories, Indiepocalypse offers a curated selection of video and/or physical games in the form of a monthly anthology.

To conclude, I’ll leave you with a quote by Anna Anthropy from her book Rise of the Videogame Zinesters:

“Every game that you and I make right now [...] makes the boundaries of our art form (and it is ours) larger. Every new game is a voice in the darkness. And new voices are important in an art form that has been dominated for so long by a single perspective. [...]

There’s nothing to stop us from making our voices heard now. And there will be plenty of voices. Among those voices, there will be plenty of mediocrity, and plenty of games that have no meaning to anyone outside the author and maybe her friends. But [...] imagine what we’ll gain: real diversity, a plethora of voices and experiences, and a new avenue for human beings to tell their stories and connect with other human beings.”

This post is by Giulia Carla Rossi, Curator for Digital Publications

02 May 2023

Detecting Catalogue Entries in Printed Catalogue Data

This is a guest blog post by Isaac Dunford, MEng Computer Science student at the University of Southampton. Isaac reports on his Digital Humanities internship project supervised by Dr James Baker.

Introduction

The purpose of this project has been to investigate and implement different methods for detecting catalogue entries within printed catalogues. For whilst printed catalogues are easy enough to digitise and convert into machine-readable data, dividing that data by catalogue entry requires converting the visual signifiers of divisions between entries - gaps in the printed page, large or upper-case headers, catalogue references - into machine-readable information. The first part of this project involved experimenting with XML-formatted data derived from the 13-volume Catalogue of books printed in the 15th century now at the British Museum (described by Rossitza Atanassova in a post announcing her AHRC-RLUK Professional Practice Fellowship project), trying to find the best ways to detect individual entries and reassemble them as data (given that the text for a single catalogue entry may be spread across multiple pages of a printed catalogue). The next part of the project involved building a complete system based on this approach to take the large volume of XML files for a volume and output all of the catalogue entries in a series of desired formats. This post describes our initial experiments with that data, the approach we settled on, and key features of our approach that you should be able to reapply to your catalogue data. All data and code can be found on the project GitHub repo.

Experimentation

The catalogue data was exported from Transkribus in two different formats: an ALTO XML schema and a PAGE XML schema. The ALTO layout encodes positional information about each element of the text (that is, where each word occurs relative to the top left corner of the page), which makes spatial analysis - such as looking for gaps between lines - possible. However, it also creates data files that are heavily encoded, meaning that it can be difficult to extract the text elements from them. The PAGE schema, by contrast, makes it easier to access the text elements in the files.

 

An image of a digitised page from volume 8 of the Incunabula Catalogue and the corresponding Optical Character Recognition file encoded in the PAGE XML Schema
Raw PAGE XML for a page from volume 8 of the Incunabula Catalogue

 

An image of a digitised page from volume 8 of the Incunabula Catalogue and the corresponding Optical Character Recognition file encoded in the ALTO XML Schema
Raw ALTO XML for a page from volume 8 of the Incunabula Catalogue

 

Spacing and positioning

One of the first approaches tried in this project was to use size and spacing to find entries. The intuition behind this is that there is generally a larger amount of white space around the headings in the text than there is between regular lines. And in the ALTO schema, there is information about the size of the text within each line as well as about the coordinates of the line within the page.

However, we found that using the size of the text line and/or the positioning of the lines was not effective for three reasons. First, blank space between catalogue entries inconsistently contributed to the size of some lines. Second, whenever there were tables within the text, there would be large gaps in spacing compared to the normal text, that in turn caused those tables to be read as divisions between catalogue entries. And third, even though entry headings were visually further to the left on the page than regular text, and therefore should have had the smallest x coordinates, the materiality of the printed page was inconsistently represented as digital data, and so presented regular lines with small x coordinates that could be read - using this approach - as headings.

Final Approach

Entry Detection

Our chosen approach uses the data in the PAGE XML schema, and is bespoke to the data for the Catalogue of books printed in the 15th century now at the British Museum as produced by Transkribus (and indeed to the version of Transkribus: having built our code around some initial exports, running it over the later volumes - which had been digitised last - threw an error due to some slight changes to the exported XML schema).

The code takes the XML input and finds entries using a content-based approach that looks for features at the start and end of each catalogue entry. After experimenting with different approaches, the most consistent way to detect the catalogue entries was to:

  1. Find the “reference number” (e.g. IB. 39624) which is always present at the end of an entry.
  2. Find a date that is always present after an entry heading.

This gave us the ability to contextually infer the presence of a split between two catalogue entries, the main limitation of which is the quality of the Optical Character Recognition (OCR) at the point where the references and dates occur in the printed volumes.
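In code, the two signals reduce to a pair of regular expressions and a single pass over the lines. The patterns below are illustrative approximations of the printed catalogue's conventions rather than the project's exact expressions (the full code is on the project GitHub repo):

```python
# Condensed sketch of entry detection from the two signals.
import re

# End-of-entry signal: a reference number such as "IB. 39624".
REFERENCE = re.compile(r"\bI[ABC]\.\s*\d{4,6}\b")
# Start-of-entry signal: a fifteenth-century date soon after the heading.
DATE = re.compile(r"\b1[45]\d{2}\b")

def split_entries(lines: list[str]) -> list[list[str]]:
    """Close an entry at each reference number; the date check guards
    against splits where the OCR has produced a spurious reference."""
    entries, current = [], []
    for i, line in enumerate(lines):
        current.append(line)
        following = " ".join(lines[i + 1 : i + 4])
        if REFERENCE.search(line) and (not following or DATE.search(following)):
            entries.append(current)
            current = []
    if current:  # trailing lines, e.g. an entry continuing on the next page
        entries.append(current)
    return entries
```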

 

An image of a digitised page with a catalogue entry and the corresponding text output in XML format
XML of a detected entry

 

Language Detection

The reason for dividing catalogue entries in this way was to facilitate analysis of the catalogue data, specifically analysis that sought to define the linguistic character of descriptions in the Catalogue of books printed in the 15th century now at the British Museum and how those descriptions changed and evolved across the thirteen volumes. As segments of each catalogue entry contain text transcribed from the incunabula that was not written by a cataloguer (and is therefore not part of their cataloguing ‘voice’), and as those transcribed sections are in French, Dutch, Old English, and other languages that a machine could detect as not being modern English, one of the extensions we implemented to further facilitate research use of the final data was to label sections of each catalogue entry by language. This was achieved using a Python library for language detection and then - for a particular output type - replacing non-English sections of text with a placeholder (e.g. NON-ENGLISH SECTION). And whilst the language detection model does not detect Old English, and as a result varies between assigning those sections labels for different languages, it was still able to break blocks of text in each catalogue entry into English and non-English sections.
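The labelling step itself is short. The post does not name the library used; langdetect is one common choice and is enough to show the shape of the logic, including the placeholder substitution for the English-only output:

```python
# Label text sections by language; swap non-English ones for a placeholder.
from langdetect import detect

def english_only(sections: list[str]) -> str:
    out = []
    for section in sections:
        try:
            label = detect(section)
        except Exception:  # very short or noisy segments can fail to classify
            label = "unknown"
        out.append(section if label == "en" else "NON-ENGLISH SECTION")
    return "\n".join(out)
```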

 

Text files for catalogue entry number IB39624 showing the full text and the detected English-only sections.
Text outputs of the full and English-only sections of the catalogue entry

 

Poorly Scanned Pages

Another extension for this system was to use the input data to try and determine whether a page had been poorly scanned: for example, where the lines in the XML input read from one column straight into another as a single line (rather than the XML reading order following the visual signifiers of column breaks). This system detects poorly scanned pages by looking at the lengths of all lines in the PAGE XML schema, establishing which lines deviate substantially from the mean line length, and, if sufficient outliers are found, marking the page as poorly scanned.
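That test comes down to a few lines of Python (the thresholds below are illustrative; the project code may choose them differently):

```python
# Flag a page as poorly scanned when too many line lengths are outliers.
from statistics import mean, stdev

def poorly_scanned(line_lengths: list[int],
                   deviations: float = 2.0, max_outliers: int = 5) -> bool:
    if len(line_lengths) < 2:
        return False  # not enough lines to judge
    mu, sigma = mean(line_lengths), stdev(line_lengths)
    outliers = [n for n in line_lengths if abs(n - mu) > deviations * sigma]
    return len(outliers) > max_outliers
```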

Key Features

The key part of this system that can be taken and applied to a different problem is the method for detecting entries. We expect that the fundamental method of looking for marks in the page content to identify the start and end of catalogue entries in the XML files would be applicable to other data derived from printed catalogues. The only parts of the algorithm that would need changing for a new system are the regular expressions used to find the start and end of the catalogue entry headings. And as long as the XML input comes in the same schema, the code should be able to consistently divide up the volumes into the individual catalogue entries.
