UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

29 May 2024

IIPC Web Archiving Spring/Summer School and Conference 2024: Report from UK Web Archive Colleagues

Nicola Bingham, Helena Byrne, Ian Cooke, Gil Hoggarth, Cameron Huggett (British Library), Caylin Smith (Cambridge University Library) and Eilidh MacGlone (National Library of Scotland).

This year’s IIPC General Assembly and Web Archiving Conference took place at the Bibliothèque nationale de France (BnF) in Paris. Before this year's conference there was an Early Scholars Spring School on Web Archives aimed at early career researchers interested in working with web archive materials.

Many UK Web Archive colleagues from Bodleian Libraries, the British Library, Cambridge University Library and National Library of Scotland attended the Spring/Summer School and the Web Archiving Conference both as delegates and presenters. In this blog post they report highlights of their conference experience.

Nicola Bingham, Lead Curator of Web Archives, British Library

The IIPC conference lived up to its reputation for being incredibly informative, inspiring, and intense! It was wonderful to reconnect with ‘old’ friends and to meet many new colleagues who are bringing diverse skills and perspectives to the field of web archiving.

As Co-Chair of the IIPC’s Content Development Group, alongside Alex Thurman of Columbia University Libraries, I delivered the keynote speech at the Early Scholars Spring School on Web Archives, which preceded the conference. Our presentation reflected on the history, importance, and legacy of the collaborative transnational web archive collections initiated by IIPC members over the past 14 years.

It was fascinating and gratifying to hear from web archive scholars about their diverse approaches and the variety of research questions they are exploring using web archives. Having worked in web archiving for 20 years, I find the increasing use of collections by researchers, particularly through data-mining approaches, especially interesting and rewarding.

Another informative highlight was the conference opening keynote speech by Pierre Bellanger, Pauline Ferrari, Jérôme Thièvre, and Sara Aubry. Pierre Bellanger, the founder and CEO of Skyrock and Skyrock.com, emphasised that "there is no freedom without memory," setting the tone for a discussion on the archiving of Skyblogs. Sara Aubry, web archiving technical lead at BnF, detailed the challenges they faced, including working with the Skyblog technical team on short notice to archive the blogs and altering web pages to display more articles and comments before the platform went offline. They managed to collect a substantial amount of content before the closure, amassing 5 million media files and providing API access for metadata extraction. This initiative highlights the importance of preserving the vernacular web, capturing personal pages rather than corporate content. The Skybox project further explores data-oriented methods of access and structural metadata to enhance discovery, with potential future projects aiming to build large language models to analyse and identify regional content within the blogs.

Helena Byrne, Curator of Web Archives, British Library

At this year's conference I presented in the Lightning Talk and Poster sessions. The abstracts are available to read on the IIPC website. IIPC WAC 2024 was a really great conference and there were so many takeaways to help improve my practice. One session I’d like to focus on for this blog post was SESSION #10: Digital Preservation. This session focused on citation practices for researchers using web archives in their research. This is an area that is not fully understood in the academic publishing world. I particularly liked the Citation Saver tool from Arquivo.pt as this is a simple but effective tool to bulk upload online citations from an academic publication. At the British Library we support a variety of researchers and the tools and methods discussed in this session will be useful to support them using web archives in their work.

Gil Hoggarth, Web Archive Technical Lead, British Library

I personally had not been able to attend the last few IIPC annual conferences, so it was fabulous to meet up and connect with old faces, and new, and learn about all the exciting projects going on. As I take a technical view (of most things), I found it particularly interesting that so many institutions were trying to establish, and expand, their web archiving services. Plus, the number of people involved in joint projects, with a combined aim but also with a community benefit in mind, was quite striking. Now, having returned to challenges ahead for The British Library and the UK Web Archive, I feel far more informed and aware of these community efforts - and have been in contact with many conference attendees to follow up!

Caylin Smith, Head of Digital Preservation, Cambridge University Libraries 

This was my second time attending the IIPC conference; I attended last year in Hilversum. I enjoy attending this conference for its presentations about solving operational challenges relating to web archiving and ones about how web archiving supports an institution’s strategic mission. 

I chaired a panel titled “Striking the Balance: Empowering Web Archivists and Researchers In Accessible Web Archives” whose presenters included Leontien Talboom (Technical Analyst on the CUL Digital Preservation team), Alice Austin (Web Archivist at Edinburgh University Library), Tom Storrar (Head of Web Archiving at The National Archives, UK), and Andrea Kocsis (Heritage and Digital Humanities researcher formerly at Northeastern University London; now Chancellor’s Fellow at the University of Edinburgh). 

This panel focused on different perspectives on using web archives, including those of a leader of a web archiving service, a web archivist, and a researcher. It highlighted evolving user expectations for web archives as well as the challenges around communicating what users can and cannot do because of technical and/or legislative requirements.

Cameron Huggett, PhD Student (CDP), British Library/Teesside University

I attended the IIPC Early Scholars Spring School on Web Archives. You can read more about my reflections on this event in this blog post: https://blogs.bl.uk/webarchive/2024/05/reflections-on-the-iipc-early-scholars-spring-school-on-web-archives-2024.html

Eilidh MacGlone, Web Archivist, National Library of Scotland

This was my second IIPC in Paris; the last was in 2014, when I was a nervous first-timer, so I was happy to take part in the new mentorship programme. It was a good way to share experience across different points in our professional arcs.

Planning my conference agenda, presentations on machine learning were at the top of my list. These outlined services to classify and retrieve items from large, complex stores of resources. I knew these would be interesting, as attempts to solve a problem with no complete answer.

Ben Charles Germain Lee spoke about working with born-digital government publications, introducing these ideas through a published experiment. The combination of text and visual analysis provides at least one way to organise retrieval from a very large collection; in the case presented, the publications were derived from the End of Term web archive. In future, these techniques could offer information retrieval to readers for collections that are too big to catalogue.

The IIPC’s Training Working Group session, led by Claire Newing (TNA) and Ricardo Basílio (Arquivo.pt), was another highlight. It gave me a chance to speak briefly on the most important thing in training colleagues (practice!) and the group shared a lot of really good ideas for training. I had the opportunity to use the information almost immediately on my return, training a colleague to self-archive. All in all, this IIPC was a conference with many good lessons.

Ian Cooke, Head of Contemporary British & Irish Publications, British Library

This year, I was struck by how big, and how varied, web archiving has become. The conference covered a huge array of topics and approaches. Many thanks to the Programme Committee, and especially to the team at BnF for being such excellent hosts. For me, the conference got off to a great start a day early, as I attended the pre-conference workshop on appraisal strategies for web archive curated collections, led by Melissa Wertheimer (Library of Congress). The hands-on session was a very clear reminder of the importance of professional librarians and archivists in creating focused and meaningful collections. The conference was also an opportunity for me to dive into some of the more technical sessions. Kristi Mukk and Matteo Cargnelutti’s (Harvard University Library) presentation on using AI to support search in web archives was both very clear and inspiring. I particularly liked Kristi’s assertion that ‘AI literacy is information literacy’ and the importance of thinking like a librarian. Katherine Boss’ (New York University Library) paper on an experimental project to preserve dynamic and database-driven websites using server-side web archiving (not something to be done at scale!) was also brilliant. Both also emphasised the importance of working collaboratively in teams, bringing principles from librarianship to work alongside software engineering in developing and testing new responses to preservation and discovery challenges.          

Conclusion

The IIPC Web Archiving Spring/Summer School and Conference 2024 at the Bibliothèque nationale de France provided a dynamic platform for exchanging ideas, learning about innovative projects, and fostering collaborations in the field of web archiving. UK Web Archive colleagues contributed significantly through presentations and active participation. This conference highlighted the evolving landscape of web archiving, emphasising the importance of preserving the vernacular web, improving researcher access, and leveraging new technologies like AI for better archival practices. As we return to our respective roles, we carry forward new insights and strengthened connections, ready to tackle the challenges ahead with renewed vigour and informed strategies.




22 May 2024

Reflections on the IIPC Early Scholars Spring School on Web Archives 2024

By Cameron Huggett, PhD Student (CDP), British Library/Teesside University

IIPC Early Scholars Spring School on Web Archives banner

My name is Cameron, and I am currently undertaking an AHRC funded Collaborative Doctoral Partnership (CDP) project, between the British Library and Teesside University. My research centres on racial discourses within association football fanzines and e-zines from c.1975 to the present, and aims to examine the broader connections between football fandom, race and identity. 

I attended the Early Scholars Spring School on Web Archives, prior to commencement of the conference, which allowed me to share knowledge with colleagues from a number of different countries, institutions and disciplines, offering new perspectives on my own research. Within this school, I was fortunate enough to be able to deliver a short lightning talk, outlining my own use of web archiving within my research into the history of racial discourses within football fanzines. This generated an engaging discussion around my methodologies and led me to reflect upon how quantitative techniques can be better adopted within historical research practices.

I also particularly enjoyed discovering more about the collections of the Bibliothèque nationale de France (BnF) and Institut national de l'audiovisuel (INA). The scope of the collections and innovative user interfaces were particularly impressive. For example, INA had created a programme that allowed the user to view a collection item, such as an election debate broadcast, alongside archived tweets relating to the event in real time.

My primary takeaway was how web archives can be innovatively employed to record the breadth and depth of online communities and discourses, as well as supplement more traditional sources within a historian’s research framework.

24 January 2024

Exploring Alternative Access: Making the Most of Web Archives During UK Web Archive Downtime

Nicola Bingham, Lead Curator of Web Archiving, British Library

The British Library is continuing to experience disruption following a cyber-attack and is working hard to restore services. Disruption to some services is, however, expected to persist for several months. In the meantime, our buildings are open and we’ve released a searchable online version of our main catalogue, which contains records of the majority of our printed collections as well as some freely available online resources. Our reference team are on hand to answer queries, advise on collection item availability and help with other ways to complete your work. Please email [email protected] or find out more. The disruption is affecting our website, online systems and services. Please see our temporary website for up-to-date information.

Despite the disruption to access to the UK Web Archive, we continue to crawl and acquire copies of websites, and to add new websites to our acquisition process, which is being undertaken with Amazon Web Services in the cloud. This ensures that the UK Web Archive collection is updated and preserved as usual.

We appreciate that for regular users of the UK Web Archive, the temporary unavailability of this valuable resource is inconvenient and disruptive. However, several alternative openly accessible web archives can serve as sources of information while the UK Web Archive is offline.

Other Openly Accessible Web Archives

Internet Archive: Known as the largest and most comprehensive web archive globally, it includes the famous Wayback Machine and boasts an extensive collection of archived web pages.

Understanding the Differences

While the Internet Archive captures a broad spectrum of global content, the UK Web Archive focuses specifically on the UK web. The UK Web Archive offers comprehensive crawls, curated collections, and secondary datasets for research. However, access is primarily restricted to legal deposit libraries, with some resources available openly.

The Internet Archive allows remote access to archived websites, but its search functionalities and scope differ from the UK Web Archive.

Memento Time Travel: This innovative platform operates under the Memento protocol, allowing users to view archived websites across various openly accessible web archives. It acts as a bridge, enabling access to past versions of web resources stored in archives such as the Internet Archive, Archive-It, UK Web Archive, archive.today, GitHub, and more. While it displays links to Mementos, it doesn’t retain the content itself.

Portuguese Web Archive (Arquivo.pt): Developed by the Portuguese Foundation for Science and Technology, this archive aims to preserve and grant access to the Portuguese web domain and its contents. It also archives a significant amount of European Union and transnational content. It's a valuable resource for preserving the digital heritage of Portugal and contributing to the preservation of European and Portuguese-language online information.

UK Government Web Archive: An openly accessible archive preserving UK central government information, encompassing videos, tweets, images, and websites dating from 1996 to the present day.

UK Parliament Web Archive: This openly accessible archive covers parliamentary websites and social media content from 2009 to the present day.

National Records of Scotland Web Archive: Offering open access, this archive allows browsing and searching of websites related to Scotland’s people and history.
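For readers who want to query Memento-compliant archives from a script, the Memento protocol (RFC 7089) asks for a past version of a page by sending an `Accept-Datetime` request header to a TimeGate. The following is a minimal Python sketch of building that header only; the function name and the example date are illustrative assumptions, and no network request is made.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(moment: datetime) -> dict:
    """Build an Accept-Datetime header (RFC 7089) for the given moment.

    Hypothetical helper for illustration; naive datetimes are assumed
    to already be in UTC.
    """
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    # Memento expects an HTTP-date expressed in GMT.
    gmt = moment.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(gmt, usegmt=True)}


# Example: ask for the version closest to midday on 6 April 2013.
print(accept_datetime_header(datetime(2013, 4, 6, 12, 0, 0)))
```

The resulting dictionary could be passed as request headers to any HTTP client when contacting a TimeGate; which TimeGate URL to use depends on the archive or aggregator and is not covered here.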

Seeking Information and Resources

While the UK Web Archive is offline, the UK Web Archive blog remains accessible and serves as a useful source of information about the archive.

Additionally, although the UK Web Archive itself might be temporarily inaccessible, its information pages have been preserved by the Internet Archive, accessible [here](https://web.archive.org/web/20240000000000*/https://www.webarchive.org.uk).
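The Internet Archive link above follows a simple pattern: a year padded with zeros, an asterisk, and then the address of the original site. As a small sketch, assuming that pattern holds for other years and sites, a calendar link can be built like this in Python (the function name is made up for illustration):

```python
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_calendar_url(target: str, year: int) -> str:
    """Return a Wayback Machine calendar link for one site and year.

    Hypothetical helper; it mirrors the link format used in this post
    rather than any documented Internet Archive API.
    """
    if not 1000 <= year <= 9999:
        raise ValueError("year must have four digits")
    # Four-digit year padded with ten zeros, followed by '*'.
    return f"{WAYBACK_PREFIX}/{year}0000000000*/{target}"


print(wayback_calendar_url("https://www.webarchive.org.uk", 2024))
```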

For those keen on delving deeper, the British Library Research Repository houses supporting documents related to the UK Web Archive, such as collection scoping documents, annual reports, statistics, and research publications. The repository can be accessed [here](https://doi.org/10.23636/hj5v-3c07).

While the UK Web Archive takes a brief hiatus, we hope these alternative resources help. And perhaps embracing these other openly accessible archives might even unveil new avenues and perspectives for exploration.

While we work hard to recover all our online services, you can find regular updates on progress published on our Knowledge Matters blog.

18 October 2023

UK Web Archive Technical Update - Autumn 2023

By Andy Jackson, Web Archive Technical Lead, British Library

This is a summary of what’s been going on since the 2023 Q2 report.

Replication

The most important achievement over the last quarter has been establishing a replica of the UK Web Archive holdings at the National Library of Scotland (NLS). The five servers we’d filled with data were shipped, and our NLS colleagues kindly unpacked and installed them. We visited a few weeks later, finishing off the configuration of the servers so they can be monitored by the NLS staff and remotely managed by us.

This replica contains 1.160 PB of WARCs and logs, covering the period up until February 2023. But, of course, we’ve continued collecting since then, and including the 2023 Domain Crawl, we already have significantly more data held at the British Library (about 160 TB more, ~1.3 PB in total). So, the next stage of the project is to establish processes to monitor and update the remote replica. Hopefully, we can update it over the internet rather than having to ship hardware back and forth, but this is what we’ll be looking into over the next few weeks.
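The totals quoted above can be checked with a little arithmetic. This is only a sketch: it assumes decimal units (1 PB = 1000 TB) and treats the rounded figures in this post as exact; the function name is illustrative.

```python
TB_PER_PB = 1000  # assumption: decimal units


def total_holdings_pb(replica_pb: float, extra_tb: float) -> float:
    """Add data collected since the replica was made to the replica size."""
    return replica_pb + extra_tb / TB_PER_PB


total = total_holdings_pb(1.160, 160)
replica_share = 1.160 / total
print(f"total: {total:.2f} PB, replica holds {replica_share:.0%} of it")
```

On these assumptions the replica holds a little under nine tenths of the current collection, which is why a process for updating it matters.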

The 2023 Domain Crawl

As reported before, this year we are running the Domain Crawl on site. It’s had some issues with link farms, which caused the number of domains to leap from around 30 million to around 175 million, crashing the crawl process.

2023 Domain Crawl queues over time, showing peak at 175 million queues.

However, we were able to clean up and restart it, and it’s been stable since then. As of the end of this quarter we’ve downloaded 2.8 billion URLs, corresponding to 183 TB of (uncompressed) data.
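As a rough sanity check on these figures, the average uncompressed size per downloaded URL can be worked out directly. The sketch below assumes decimal units (1 TB = 10^12 bytes) and treats the rounded totals as exact, which they are not; the function name is illustrative.

```python
def average_bytes_per_url(total_tb: float, url_count: float) -> float:
    """Average uncompressed bytes per URL, assuming 1 TB = 10**12 bytes."""
    bytes_per_tb = 10**12
    return total_tb * bytes_per_tb / url_count


avg = average_bytes_per_url(183, 2.8e9)
print(f"{avg / 1000:.1f} kB per URL on average")
```

That works out at roughly 65 kB per URL.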

Legal Deposit Access Service

We’ve continued to work with Webrecorder, who have added citation, search and print functionality to the ePub reader part of the Legal Deposit Access Service. This has been deployed and is available for staff testing, but we are still resolving issues around making it available for realistic testing in reading rooms across the Legal Deposit Libraries.

Browsertrix Cloud Local Deployment

We have worked out most of the issues around getting Browsertrix Cloud deployed in a way that complies with Non-Print Legal Deposit legislation and with our local policies. We are awaiting the 1.7.0 release which will include everything we need to have a functional prototype service.

Once it’s running, we can start trying out some test crawls, and work on how best to integrate the outputs into our main collection. We need some metadata protocol for marking crawls as ready for ingest, and we need to update our tools to carefully copy the results into our archival store, and support using WACZ files for indexing and access.

27 September 2023

What can you discover and access in the UK Web Archive collection?

UK Web Archiving team, British Library

The UK Web Archive collects and preserves websites from the UK. When we started collecting in 2005, we sought permission from owners to archive their websites. Since 2013, legal deposit regulations have allowed us to automatically collect all websites that we can identify as located in or originating from the UK. 

Since its inception, the UK Web Archive has collected websites using a number of different methods, with an evolving technological structure and under different legal regulations. As a result, what can be discovered and accessed is complicated and, therefore, not always easy to explain and understand. In this post we attempt to explain the concepts and terms that describe what a user will be able to find.

In the table below is a summary of the different search and access options which can be carried out via our main website (www.webarchive.org.uk). The rest of this post will go into more detail about the terms that we have used in this table.

Table of content available in the UK Web Archive

Year

In this table, ‘year’ refers to the year in which we archived a website, or web resource. This might be different to the year in which it was published or made available online. Once you have found an archived website, you can use the calendar feature to view all the instances, or ‘snapshots’ of that page (which might run over many years).  

Legal deposit regulations came into effect in April 2013. Before this date, websites were collected selectively and with the owners’ permissions. This means the amount of content we have from this earlier period is comparatively smaller, but (with some exceptions) is all available openly online. 

From 2013 onwards, we have collected all websites that we can identify as located in or originating from the UK. We do this once per year in a process that we call the ‘annual domain crawl.’

URL look-up

If you know the URL of a website you want to find in the UK Web Archive, you can use the search box at: https://www.webarchive.org.uk. The search box should recognise that you are looking for a URL, and you can also use a drop-down menu to switch between Full Text and URL search.

URL search covers the largest portion of the collection, and our index, which makes the websites searchable, is updated daily.

UKWA Search Bar September 2023
https://www.webarchive.org.uk/

Full text search

Much of the web archive collection has been indexed and allows a free-text search of the content, i.e., any word, phrase, number etc. Note: Given the amount of data in the web archive, the number of results will be very large.

Currently, full text search is available for all our automatically collected content up to 2015, and our curator selected websites up to 2017. 

Access at legal deposit libraries

Unless the website owner gives explicit permission otherwise, legal deposit regulations restrict access to archived websites to the six UK Legal Deposit Libraries. Access is in reading rooms using a library managed computer terminal.

Users will need a reader's pass to access a reading room: check the website of each Library on how to get a reader’s pass.

Online access outside a legal deposit library

We frequently request permission from website owners to allow us to make their archived websites openly accessible through our website. Where permission has been granted, these archived websites can be accessed from our website https://www.webarchive.org.uk/ from any location where you have internet access.

Additionally, we also make archived web content we can identify as having an Open Government Licence openly accessible.

From all the requests we send for open access to websites, we receive permission from approximately 25% of website owners.  However, these websites form a significant overall amount of content available in the archive. This is because they tend to be larger websites and are captured more frequently (daily, weekly, monthly etc.) over many years.

Curator selected websites

Each year, UK Web Archive curators, and other partners who we work with, identify thousands of resources on the web that are related to a particular topic or event, or that require more frequent collection than once per year.

Many of these archived websites form part of our Topics and Themes collections. We have more than 100 of these, covering general elections, sporting events, creative works, and communications between groups with shared interests or experiences. You can browse these collections to find archived web resources relating to these topics and themes. 

Annual Domain Crawl

Separate from selections made by curators, we conduct an annual ‘domain crawl’ to collect as much of the UK Web as possible. This is done under the Non-Print Legal Deposit regulations, with one ‘crawl’ completed each year. This domain crawl is largely automated and looks to archive all .uk, .scot, .wales, .cymru and .london top-level domain websites plus others that have been identified as being UK-based and in scope for collection.
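The top-level domain part of that scope can be illustrated with a short Python sketch. This is not the crawler's actual logic: the function name is hypothetical, it only checks the domain endings listed above, and it does not cover the other sites that are identified as UK-based by other means.

```python
from urllib.parse import urlsplit

# Domain endings named in this post; any other in-scope rules are out of scope here.
UK_TLDS = (".uk", ".scot", ".wales", ".cymru", ".london")


def in_scope_by_tld(url: str) -> bool:
    """Return True if the URL's host ends in one of the listed UK TLDs.

    Assumes the URL includes a scheme such as https://; otherwise no
    host can be parsed and the result is False.
    """
    host = urlsplit(url).hostname or ""
    return host.rstrip(".").endswith(UK_TLDS)


for candidate in ("https://www.bl.uk/", "https://example.scot/page", "https://example.com/"):
    print(candidate, in_scope_by_tld(candidate))
```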

21 September 2023

How YouTube is helping to drive UK Web Archive nominations

By Carlos Lelkes-Rarugal, Assistant Web Archivist, British Library

Screenshot of the UK Web Archive website 'Save a UK website' page.
https://www.webarchive.org.uk/nominate

A plethora of digital platforms now exists for all manner of online published works. YouTube itself has become more than just a platform for sharing videos: it has evolved into a platform for individuals and organisations to reach a global audience and convey powerful messages. Recently, a popular content creator on YouTube, Tom Scott, produced a short video helping to outline the purpose of Legal Deposit and, by extension, the work being carried out by UKWA.

Watch the video here: https://www.youtube.com/watch?v=ZNVuIU6UUiM

Tom Scott’s video, titled "This library has every book ever published", is a concise and authentic glimpse into the work being done by the British Library, one of the six UK Legal Deposit Libraries. The video highlighted some of the technology that enables preservation at scale, and with it the current efforts in web archiving. Dr Linda Arnold-Stratford (Head of Liaison and Governance for the Legal Deposit Libraries) stated, “The Library collection is around 170 million items. The vast majority of that is Legal Deposit”. Ian Cooke (Head of Contemporary British and Irish Publications) noted that, with the expansion of Legal Deposit to include born-digital content, “the UK Web Archive has actually become one of the largest parts of the collection. Billions of files, about one and a half terabytes of data”.

At the time of writing, the video has had over 1.4 million views. In addition, as the video continued to gain momentum, something remarkable happened. UKWA started receiving an influx of email nominations from website owners and members of the public. This was unexpected and the volume of nominations that have since come through has been impressive and unprecedented. 

The video has led to increased engagement with the public, with nominations representing an eclectic mix of websites. The comments on the video have been truly positive. We are grateful to Tom for highlighting our work, but we are also thankful and humbled that so many commenters have left encouraging messages, which are a joy to read. The British Library has the largest web archive team of all the Legal Deposit Libraries, but this is still a small team of three curators and four technical experts, and we do everything in-house, from curation to the technical side. Web archiving is a difficult task, but we are hopeful that we can continue to develop the web archive by strengthening our ties to the community and bringing together our collective knowledge.

If you know of a UK website that should be included in the archive, please nominate it here:  https://www.webarchive.org.uk/en/ukwa/info/nominate

28 July 2023

UK-Ireland Digital Humanities Association Launch Event Report from the British Library

By Helena Byrne, Curator of Web Archives, Frankie Perry, Music Manuscripts and Archives Cataloguer and Stella Wisdom, Digital Curator for Contemporary British Collections

UK-Ireland Digital Humanities Association Launch Event Banner with event details

The First Annual Event for the UK-Ireland Digital Humanities Association took place on 29th and 30th June 2023 at Senate House, University of London as well as online. The Association “aims to build a collaborative vision for the field, and create new and sustainable long-term partnerships in alignment with the international community”. The programme, set across one and a half days, covered a wide variety of topics and included an opportunity for the Community Interest Groups to meet up.

The British Library was involved in four presentations either as an individual presentation or as part of a collaborative project. In this blog post we hear back from the British Library colleagues who attended.

Helena Byrne, Curator of Web Archives

I was involved in two collaborative presentations with Sharon Healy (Maynooth University) and Juan-José Boté-Vericad (Universitat de Barcelona). Our first presentation was a lightning talk on day one called 'Finding Web Archives under the ‘Big Tent’ of DH: A Case Study of Ireland and the UK'. This presented one element of a forthcoming chapter in a WARCnet edited collection on web archiving. This presentation reviewed the provision of web archiving in postgraduate information management and digital humanities courses in Britain and Ireland. Our second presentation was part of Panel #2 on day two called 'The Potential of a Reborn Digital Archival Edition for Collating a Corpus of Archived Web Materials'. This presentation outlined a methodology for researchers without coding skills to select, collate and analyse a corpus of archived websites.

The highlight for me was Panel #3, especially the presentation 'Towards a Critical Black Digital Humanities: A Critical Librarian’s Response' by Naomi L.A Smith (University of West London). This presentation and the discussion that followed highlighted some of the challenges as well as some of the positive action steps that can be taken to ensure digital humanities research is more inclusive. 

Frankie Perry, Postdoctoral Research Assistant, InterMusE project, University of York / Music Manuscripts and Archives Cataloguer, British Library

I gave a paper with Prof. Rachel Cowgill (University of York) who is Principal Investigator on the InterMusE project – a collaborative venture between musicologists, computer scientists, and archive and library specialists funded by the AHRC’s UK-US New Directions for Digital Scholarship in Cultural Institutions programme. The British Library is an institutional partner, with Dr Rupert Ridgewell (Lead Curator, Printed Music) as Co-Investigator; the universities of Swansea and Illinois at Urbana-Champaign are further partners, and we’re also working with the University of Waikato. In our paper, we introduced the complexities of sourcing, digitising, and piecing together ephemera relating to historical musical events (e.g. concert programmes, flyers, newspaper reviews), using as our case study materials relating to the British Music Society (1918-1933) and its regional centres and branches. We showed the interface of the digital archive built for the project, which uses a combination of the Greenstone Digital Library system, the Mirador Annotation Viewer, and the SimpleAnnotationServer to make materials browsable, searchable, and interactive for musicologists and community users alike.

I really enjoyed the event and the snapshot it provided into current digital humanities research and techniques. I especially enjoyed a paper by Orla Delaney (Cambridge) on 'Database ethnography and the museum object record', and one by Lisa Griffith (Digital Repository of Ireland) and Laura Molloy (CODATA) titled 'Pathways to collaboration – creating and sharing GLAM image collections as data'.

Stella Wisdom, Digital Curator for Contemporary British Collections

My lightning talk 'Collaborating to Curate and Exhibit Complex Digital Literature' reflected on the cooperation between curators, researchers, experimental writers and creative practitioners to plan and produce the British Library’s Digital Storytelling exhibition (2 June 2023 - 15 October 2023). This hands-on display explores the ways that digital innovations have transformed and enhanced our narrative experiences. It showcases eleven examples of electronic literature that invite readers to become a part of the story themselves, through interactive narratives that respond to user input, reading experiences influenced and personalised by data feeds, and works that draw from multiple platforms and audience participation to create immersive story worlds. Preparing, and in some cases modifying, these interactive works to display them in a public gallery has only been possible through practical collaborations between Library staff and the writers and games studios who created these digital stories. I shared some insights from my experience of this co-curation work and encouraged attendees to visit the exhibition.

It was a pleasure to meet a number of people in real life who I had only previously spoken with online. A personal highlight was hearing Reham Hosny from the University of Cambridge and Minia University speak about 'DH and E-Lit Communities: Intersectional Perspectives'. In the refreshment breaks at this event I chatted with Reham about her novel, Al-Barrah (The Announcer), and she demonstrated to me how both augmented reality and hologram technologies work with the printed book to immerse readers in this thought-provoking narrative.

12 July 2023

UK Web Archive Technical Update - Summer 2023

By Andy Jackson, Web Archive Technical Lead, British Library

This is a summary of what’s been going on since the 2023 Q1 report.

At the end of the last quarter, we launched the 2023 Domain Crawl. This started well (as described in the 2023 Q1 report) but a few days later it became clear the crawl was going a bit too well. We were collecting so quickly, we started to run out of space on the temporary store we use as a buffer for incoming content.

The full story of how we responded to this situation is quite complicated, so I wrote up the detailed analysis in a separate blog post. But in short, we took the opportunity to move to a faster transfer process and switch to a widely-used open source tool called Rclone. After about a week of downtime, the crawl was up and running again, and we were able to keep up, storing and indexing all the new WARC files as they came in.

Since then, the crawl has been running pretty well, but there have been some problems…

2023 Domain Crawl Storage and Queues

The crawler uses disk space in two main ways: the database of queues of URLs to visit (a.k.a. the crawl frontier), and the results of the crawl (the WARCs and logs). The work with Rclone helped us get the latter under control, with the move from /mnt/gluster/dc2023 to sharing the main /opt drive and uploading directly to Hadoop. These uploads run daily, leading to a saw-tooth pattern as free space gets rapidly released before being slowly re-consumed.

But the frontier shares the same disk space, and can grow very large during a crawl. So it’s important we keep an eye on things to make sure we don’t run out of space. In the past, before we made some changes to Heritrix itself, it was possible for a domain crawl to consume huge amounts of disk space. Once, we hit over 100TB for the frontier, which becomes very difficult to manage. In recent domain crawls, thanks to our configuration changes, we’ve managed to get this down to more like 10TB.
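To illustrate why the frontier needs watching, here is a minimal sketch of the disk usage pattern described above. It assumes a fixed disk capacity, a constant daily volume of WARCs and logs that is fully released by the daily upload, and a frontier that grows steadily and is never released. The function name and all figures are hypothetical, not measurements from the crawl.

```python
def simulate_free_space(capacity_gb, warc_gb_per_day, frontier_gb_per_day, days):
    """Return (free_before_upload, free_after_upload) in GB for each day.

    Simplifying assumptions (hypothetical, for illustration only):
    - WARCs and logs accumulate during the day and are all released
      by the daily upload;
    - the frontier grows by a fixed amount each day and is never released.
    """
    frontier_gb = 0
    series = []
    for _ in range(days):
        frontier_gb += frontier_gb_per_day
        # Trough of the saw-tooth: a full day of crawl output is still on disk.
        free_before_upload = capacity_gb - frontier_gb - warc_gb_per_day
        # Peak of the saw-tooth: the upload has released the crawl output.
        free_after_upload = capacity_gb - frontier_gb
        series.append((free_before_upload, free_after_upload))
    return series


for day, (before, after) in enumerate(simulate_free_space(1000, 100, 50, 3), start=1):
    print(f"day {day}: {before} GB free before upload, {after} GB after")
```

With these made-up numbers, each peak of the saw-tooth is 50 GB lower than the previous one. The daily uploads keep the crawl output under control, but any growth in the frontier steadily lowers the ceiling.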

But, as you can see, around the 13th of June, we hit some kind of problem, where the apparent number of queues in the frontier started rapidly increasing, as did the rate at which we were consuming disk space. We deleted some crawler checkpoints to recover some space, as we very rarely need to restart the crawl from anything other than the most recent daily checkpoint, but this only freed up modest amounts of space. Fortunately, the aggressive frontier growth seemed to subside before we ran out of space, and the crawl is now stable again.

Unfortunately, it’s not clear what happened. Based on previous crawls, it seems unlikely that the crawler suddenly discovered many more millions of web hosts at this point in the crawl. In the past, the number of queues has been consistently up to around 20 million at most, so this leap to over 30 million is surprising. It is possible we hit some weird web structures, but it’s difficult to tell as we don’t yet have reliable tools for quickly analysing what’s going on in this situation.

Suspiciously, just prior to this problem, we resolved a different issue with the system used to record what URLs had been seen already. This had been accidentally starved of resources, causing problems when the crawler was trying to record what URLs had been seen. This led to the gaps in the crawl monitoring data just prior to the frontier growth, as the system stopped working and required some reconfiguration. It’s possible this problem left the crawler in a bit of a confused state, leading to mis-management of the frontier database. Some analysis of the crawl will be needed to work out what happened.

In the last quarter, the new URL search feature was deployed on our BETA service. Following favourable feedback on the new feature, the main https://www.webarchive.org.uk/ service has been updated to match. We hope you find the direct URL search useful.

We’ve also updated the code that recognises whether a visitor is in a Legal Deposit reading room, as it wasn’t correctly identifying readers at Cambridge University Library. Finally, there was an issue with how the CAPTCHAs on the contact and nomination forms were being validated, which has also been resolved.
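The post does not say how reading room visitors are recognised. Assuming the check is based on the visitor's IP address, a minimal sketch might look like the following. The network ranges (taken from the reserved documentation blocks), the room names and the function name are all hypothetical, and this is not the code used by the service.

```python
import ipaddress

# Hypothetical ranges from the reserved documentation address blocks.
# A real deployment would list each library's actual reading room networks.
READING_ROOM_NETWORKS = {
    "Reading Room A": [ipaddress.ip_network("192.0.2.0/24")],
    "Reading Room B": [
        ipaddress.ip_network("198.51.100.0/25"),
        ipaddress.ip_network("203.0.113.16/28"),
    ],
}


def reading_room_for(client_ip):
    """Return the name of the reading room containing client_ip, or None."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Not a parseable IP address, so treat the visitor as off-site.
        return None
    for room, networks in READING_ROOM_NETWORKS.items():
        if any(address in network for network in networks):
            return room
    return None
```

Under this assumption, a library that is missing one of its ranges from the table would see its readers treated as off-site visitors, which is one way a problem like the one described above could arise.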

Our colleagues from Webrecorder delivered the initial set of changes to the ePub renderer, making it easier to cite a paragraph of one of our Legal Deposit eBooks. Given how long the ePub format has been around, it is perhaps surprising that support for ‘obvious’ features like citations and printing is still quite immature, inconsistent and poorly-standardised. To make citation possible, we have ended up adopting the same approach as Calibre’s Reference Mode and implemented a web-based version that integrates with our access system.

We’ve also worked on updating the service documentation based on feedback from our Legal Deposit Library partners, resolved some problems with how the single-concurrent-use locks were being handled and managed, and implemented most of the translations for the Welsh language service. The translations should be complete shortly, and an updated service can then be rolled out, including the second set of changes from Webrecorder (focused on searching the text of ePub documents).

Replication to NLS

The long process of establishing a replica of our holdings at the National Library of Scotland (NLS) is finally nearing completion. We have an up-to-date replica, and have been attempting to arrange the transfer of the servers. This turned out to be a bit more complicated than we expected, so has been delayed, but should be completed in the next few weeks.

Minor Updates

For curators, one small but important fix was improving how the W3ACT curation tool validates URLs. This was thought to have been fixed already, but the W3ACT software was not using URL validation consistently, which meant it was still blocking the creation of crawl target records with top-level domains like .sport (rather than the more familiar .uk or .com etc.). On June 23rd, we released version 2.3.5 of W3ACT, which should finally resolve this issue.
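As an illustration of the general idea, the sketch below validates a URL without relying on a fixed list of familiar top-level domains, so newer ones like .sport are accepted. It is a hypothetical Python example, not W3ACT's actual validation code. The function name and the rule that a top-level domain is any alphabetic label of two or more letters (or a punycode label) are assumptions made for this sketch.

```python
import re
from urllib.parse import urlsplit

# Assumed rules for this sketch: host labels are letters, digits and hyphens
# (not starting or ending with a hyphen); a TLD is alphabetic or punycode.
_LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")
_TLD = re.compile(r"^(?:[a-z]{2,63}|xn--[a-z0-9-]{1,59})$")


def is_plausible_target_url(url):
    """Return True if url looks like an http(s) URL with a well-formed host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit rejects some malformed URLs, e.g. unbalanced IPv6 brackets.
        return False
    if parts.scheme not in ("http", "https"):
        return False
    labels = (parts.hostname or "").lower().split(".")
    if len(labels) < 2:
        # Require at least a name and a top-level domain.
        return False
    if not _TLD.match(labels[-1]):
        return False
    return all(_LABEL.match(label) for label in labels)
```

Putting the check in one shared function, and calling it everywhere a URL is accepted, is one way to avoid the kind of inconsistency described above.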

Apart from that, we also updated Apache Airflow to version 2.5.3, and leveraged our existing Prometheus monitoring system to send alerts if any of our SSL certificates are about to expire.
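The check behind such an alert is simple date arithmetic. The sketch below shows it in plain Python rather than as a Prometheus rule, since the post does not describe the actual alert configuration. The function names, the example host names and the 14-day threshold are assumptions. The expiry strings use the `notAfter` format that Python's `ssl` module reports for certificates.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now):
    """Days between `now` (a timezone-aware datetime) and a certificate's
    notAfter time, given in the 'Jan  5 09:34:43 2018 GMT' format."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now.timestamp()) / 86400.0


def certificates_needing_alert(not_after_by_host, now, threshold_days=14):
    """Return the hosts whose certificates expire within threshold_days."""
    return sorted(
        host
        for host, not_after in not_after_by_host.items()
        if days_until_expiry(not_after, now) < threshold_days
    )


# Hypothetical example data: host names and expiry dates are made up.
example_now = datetime(2018, 1, 1, 9, 34, 43, tzinfo=timezone.utc)
example_certs = {
    "a.example": "Jan  5 09:34:43 2018 GMT",
    "b.example": "Mar  1 00:00:00 2018 GMT",
}
print(certificates_needing_alert(example_certs, example_now))
```

A monitoring system would typically run a check like this on a schedule and raise the alert well before the threshold reaches zero, leaving time to renew.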