UK Web Archive blog

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

30 April 2025

Just launched - The Routledge Companion to Transnational Web Archive Studies

By Helena Byrne, Curator of Web Archives

The Routledge Companion to Transnational Web Archive Studies Edited By Susan Aasman, Anat Ben-David, Niels Brügger

On Monday, 28 April 2025, The Routledge Companion to Transnational Web Archive Studies was launched. The book “explores the untapped potential of web archives for researching transnational digital history and communication. It covers cross-border, cross-collection, and cross-institutional examination of web archives on a global scale”.

The book is an interdisciplinary collaboration and one of the last outputs from the WARCnet research network, comprising 28 chapters grouped into five sections. The last chapter in each section is a conversation in which multiple authors respond to questions set by the editors on the theme of that section.

Lead editor Susan Aasman stated “The companion contains concrete examples on how to research national web domains through a transnational perspective; provides case studies with grounded explorations of the COVID-19 crisis as a distinctly transnational event captured by web archives; offers methodological considerations while unpacking techniques and skill sets for conducting transnational web archive research; and critically engages the politics and power dynamics inherent to web archives as institutionalised collections”.

UK Web Archive curators based at the British Library, together with curators at the University of Westminster, contributed to chapters and conversations in the book. The editors stated that “The Routledge Companion to Transnational Web Archive Studies is an essential read for graduate students and scholars from internet and media studies, cultural studies, history, and digital humanities. It will also appeal to web archiving practitioners, including librarians, web curators, and IT developers”.

To celebrate the launch of the book, Routledge is offering a 20% discount with the code 25AFLY2 on http://www.routledge.com/. This code expires on 30th September 2025 and cannot be used with any other special offers.

23 April 2025

Web Archives Collections as Data at the Digital Humanities in the Nordic and Baltic Countries (DHNB) Workshop Report

By Helena Byrne, Curator of Web Archives

DHNB 2025 Conference Banner

The UK Web Archive was one of five web archive organisations represented in the Web Archive Collections as Data workshop at the Digital Humanities in the Nordic and Baltic Countries (DHNB) 2025 conference, held at the National Museum of Estonia in Tartu. The UK Web Archive has participated in the 2023, 2024 and 2025 DHNB conferences. The workshop was organised by Olga Holownia, Senior Programme Officer at the International Internet Preservation Consortium (IIPC). It served as an introduction to web archives and web archive collections as data, with a focus on use cases but also on the challenges related to producing, sharing and publishing collections as data.

The first stage of the workshop gave a brief overview of the collections as data movement within the GLAM sector and introduced the Collections as Data Checklist developed by members of the GLAM Labs community. It also covered what web archives are and where they can be accessed, how a selection of web archives are making their collections available as data, and the potential research opportunities these collections offer. The panel included Olga Holownia (IIPC), Gustavo Candela (University of Alicante), Helena Byrne (British Library), Jon Carlstedt Tønnessen (National Library of Norway), Anders Klindt Myrvoll (Royal Danish Library), Sophie Ham and Steven Claeyssens (KB, National Library of the Netherlands).

The UK Web Archive presentation promoted the recently published Datasheets for Web Archives Toolkit and the new metadata datasets that are available through the British Library Research Repository. The presentation gave an overview of how the project started, the background to how the Toolkit was prepared, and how it was implemented.

Web Archives Collections as Data Workshop at DHNB 2025. Photographer: Helena Byrne & Carmen Kurg.


The activity stage of the workshop focused on how the Collections as Data Checklist could be adapted for web archives. The participants were split into three groups. They reviewed the checklist through the lens of whether it is applicable to web archives, how it could be adapted where it does not fit, and what solutions could be developed to overcome some of the more challenging sections. There was a rich discussion amongst the groups, which benefited from having both researchers and library professionals involved in reviewing the checklist.

Web Archives Collections as Data Workshop at DHNB 2025. Photographer: Carmen Kurg.

The general consensus from the groups was that more detail may be needed to accompany the Checklist so that it can be applied to web archive collections. Some of the points on the Checklist are particularly difficult to apply to web archive collections. There was a lot of discussion on the first two points, which cover licensing and citation. These are particularly difficult for web archives because of national legislation: most web archives operate on a dark or grey access model, and most onsite terminals used to access web archives have copy and paste functions disabled, so citation can become problematic. However, the participants were positive about the potential to apply an annotated or adapted Collections as Data Checklist specifically for web archives. The brainstorming session at this workshop was the first step in starting a discussion about what resources are needed to improve the process of publishing web archive collections as data. The second of these discussions was picked up at the IIPC Web Archiving Conference in April 2025.

For a more general report from the DHNB conference, see the Digital Scholarship blog: https://blogs.bl.uk/digital-scholarship/2025/04/dhnb-2025-digital-humanities-in-the-nordic-and-baltic-countries-conference-report.html

25 November 2024

Datasheets for Web Archives Toolkit is now live

By Helena Byrne, Curator of Web Archives

Datasheets for Web Archives Toolkit

Since autumn 2022, Emily Maemura from the University of Illinois and Helena Byrne from the UK Web Archive team at the British Library have been exploring how the Datasheets for Datasets framework, devised for machine learning by Gebru et al., could be applied to web archives. To explore the research question “can we use datasheets to describe the provenance of web archives, supporting research uses?”, a series of workshops was organised in 2023.

These workshops included a card sorting exercise with participants who had expertise in web archives as well as in general information management. After the card sorting exercise there was a general discussion about using this framework to describe web archive collections.

These workshops formed the core of the guidance documentation in the Datasheets for Web Archives Toolkit, published in the British Library Research Repository.

The Toolkit

This Toolkit provides information on the creation of datasheets for web archives datasets. The datasheet concept is based on past work from Gebru et al. at Microsoft Research. The datasheet template and samples here were developed through a series of workshops with web archives curators, information professionals, and researchers during Spring and Summer 2023. The toolkit is composed of several parts including templates, examples, and guidance documents. Documents in the toolkit are available at a single DOI (https://doi.org/10.22020/rq8z-r112) and include:

  1. Toolkit Overview 
  2. Datasheets Question Guide
  3. Datasheet Blank Template

Implementation 

The UK Web Archive has implemented this framework to publish datasets from its curation software, the W3 Annotation Curation Tool (ACT). These datasets are available to view in the UK Web Archive: Data folder in the British Library Research Repository. So far only a few collections have been published, but this number will grow over the coming months.

02 October 2024

Archiving Social Media with Browsertrix

By Carlos Lelkes-Rarugal, Assistant Web Archivist

When people think of web archiving, social media is often overlooked as a content source. Although it's impossible to capture everything posted on social media platforms, at the UK Web Archive, we strive to archive the public profiles of key figures like politicians, journalists, athletes and industry leaders. However, archiving social media presents significant challenges for any institution trying to capture and preserve it. Recently, a new tool has helped us archive social media content with greater success.

This blog post outlines our approach to archiving social media for the 2024 General Election, highlighting what worked well and identifying areas for improvement.

Challenges of the Previous Workflow

In an earlier blog post, we discussed our efforts in collecting content for the 2024 General Election. While we updated the user nomination process, we still relied on the same website crawler, Heritrix. Here is a simplified version of the previous workflow:

  •       Nominate a seed
  •       Validate seed and create a metadata record
  •       Send seed and metadata to the Heritrix crawler
  •       Archive, process, and store the website
  •       Make the archived website available for viewing

This workflow enabled us to archive thousands of websites daily, thanks to Heritrix’s robust capabilities. However, despite its effectiveness at archiving static websites, Heritrix is less adept at capturing dynamic content such as maps or social media. While we can archive video, UK Non-Print Legal Deposit regulations prevent us from archiving video-streaming platforms like YouTube or TikTok.

The Challenges of Archiving Dynamic Content

Dynamic content is notoriously difficult to archive. Automated crawlers like Heritrix struggle with elements that rely heavily on JavaScript, asynchronous loading, or user interactions—common features of social media platforms. Heritrix cannot simulate these browser-based interactions, meaning critical content can be missed.

The challenge for web archiving institutions is compounded by the rapid evolution of social media platforms, which continually update their designs and policies, often implementing anti-crawling measures. For example, X (formerly Twitter) once allowed open access to its API. In April 2023, however, the platform introduced a paid API and a pop-up login requirement to view tweets, essentially blocking crawlers. This shift mirrors a broader trend among social media platforms to protect user data from unauthorised scraping and repurposing, a practice often linked to the training of AI models.

While archiving dynamic content is a known problem, finding tools capable of managing these complexities has proven difficult. Webrecorder, an open-source tool, offers one potential solution. It allows users to record their interactions within a web browser, capturing the resources loaded during the browsing session. This content is then packaged into a file, enabling the recreation of entire web pages. While Webrecorder has evolved, it is only part of the solution.

Introducing Browsertrix

Heritrix and Browsertrix both offer valuable solutions for web archiving but operate on different scales. Heritrix’s strength lies in its ability to handle a high volume of websites efficiently, but it falls short with dynamic content. Browsertrix, by contrast, excels at capturing interactive, complex web pages, though it can require more manual intervention.

Despite the increased time and effort involved, Browsertrix offers several key advantages:

  •       High-Fidelity Crawling: Browsertrix can accurately archive dynamic and interactive social media content.
  •       Ease of Use: Its user-friendly interface and comprehensive documentation made Browsertrix relatively easy for our team to adopt. Plus, its widespread use within the International Internet Preservation Consortium (IIPC) means additional support is readily available.

Archiving Social Media: A New Approach

One of the most significant challenges in archiving social media is dealing with login authentication. Most social platforms now require users to log in to access content, making it impossible for Heritrix to proceed beyond the login page. Heritrix does not create a browser environment, let alone maintain cookies or browser sessions, so it cannot simulate user browser interactions that are sometimes necessary to view or download content.

This is where Browsertrix excels. Operating within a web browser environment, Browsertrix can handle login credentials, enable browser events like drop-down menus, and capture content that loads asynchronously, such as social media posts. Essentially, it records a user’s browsing session, capturing the resources that make up the visible web page.

During the 2024 General Election, we ran Browsertrix alongside Heritrix. Heritrix handled the majority of the simpler website nominations, such as MP and party websites, while Browsertrix focused on more complex social media accounts.

Workflows and Resources for the 2024 General Election

Although we planned to integrate Browsertrix into our archiving efforts for the 2024 General Election, unforeseen delays meant that we only gained access to the tool on June 28th—just one week before polling day on July 5th. However, prior planning helped us decide on key social media accounts.

Key considerations for this workflow included:

  •       Collaboration with Legal Deposit Libraries
  •       Limited time frame
  •       Archiving multiple social media accounts
  •       Daily archiving schedules
  •       Finite Browsertrix resources

We had an organisational account with five terabytes of storage and 6,000 minutes of processing time. However, as with any web archiving, the actual crawl times and data requirements were difficult to predict due to the variable size and complexity of websites.

This is why we try to encapsulate our crawls with general parameters assigned to each seed, for example the frequency of a crawl or the data cap. In an ideal world, we would crawl every seed every minute with unlimited data, but there is a cost to everything, so our strategy relies on the expertise of curators and archivists to determine the parameters that will ensure a best-effort capture while utilising our hardware as efficiently as possible.

Using Browsertrix, the first task was to decide which social media platform to tackle first, based on how many accounts were nominated for each platform. In total, we had 138 social media accounts to archive:

  •       96 X accounts
  •       25 Facebook accounts
  •       17 Instagram accounts

X was by far the most active platform, making it a priority. After some trial and error, we found that a three-minute crawl time produced high-quality captures for most accounts. Here are some of the settings that were adjusted, in various combinations:

  •       Start URL Scope
  •       Extra URL Prefixes in Scope
  •       Exclusions
  •       Additional URLs
  •       Max Pages
  •       Crawl Time Limit
  •       Crawl Size Limit
  •       Delay After Page Load
  •       Behaviour Timeout
  •       Browser Windows
  •       Crawler Release Channel
  •       User Agent
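
As a rough illustration, the parameters for a single X account might be recorded like this. This is a sketch in Python: the keys simply mirror the setting names listed above and are our own shorthand, not Browsertrix’s actual configuration format, and the account name is hypothetical.

# Illustrative per-account crawl settings, mirroring the Browsertrix
# workflow options listed above. Keys and values are our own shorthand,
# not Browsertrix's real config schema; the account is hypothetical.
x_account_workflow = {
    "start_url": "https://x.com/ExampleCandidateMP",
    "start_url_scope": "pages in same directory",
    "exclusions": ["*/photo/*"],          # skip image lightbox pages
    "max_pages": 50,
    "crawl_time_limit_minutes": 3,        # three minutes worked well for X
    "crawl_size_limit_gb": 1,
    "delay_after_page_load_seconds": 5,
    "behaviour_timeout_seconds": 90,
    "browser_windows": 1,
}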

For X specifically, we staggered crawls by 30 minutes to avoid triggering account blocks. This came with its own challenges, as we had no system in place to manage scheduling and social media login details. For this reason, the Browsertrix application was managed solely by one experienced member of staff, rather than by the curators who nominated the accounts, so that the social media account logins and the scheduling of crawl jobs could be handled in one place. In practice, this meant using a spreadsheet detailing the numerous social media accounts with their logins and various crawling parameters, from which a staggered schedule could be derived, as sketched below.
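
A minimal sketch of that scheduling step, assuming a hypothetical accounts.csv exported from the spreadsheet with one row per account:

import csv
from datetime import datetime, timedelta

# Assign each account a crawl slot staggered by 30 minutes, as described
# above, to avoid triggering account blocks. "accounts.csv" and its
# column names are hypothetical.
STAGGER = timedelta(minutes=30)

with open("accounts.csv", newline="") as f:
    accounts = list(csv.DictReader(f))  # e.g. columns: platform, url

start = datetime.now().replace(second=0, microsecond=0)
for i, account in enumerate(accounts):
    slot = start + i * STAGGER
    print(f"{slot:%H:%M}  {account['platform']:<10}  {account['url']}")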

Quality Assurance

Quality assurance (QA) is a crucial but time-consuming aspect of web archiving, especially when dealing with dynamic content. Browsertrix offers a QA tool that generates reports analysing the quality of individual crawl jobs, including screenshot comparisons and resource analysis. However, this feature can be resource-intensive; for instance, a QA report for a single Facebook capture required approximately 30 minutes of processing time. Given our limitation of 6,000 minutes of processing time and the large volume of crawl jobs, we had to selectively perform QA on key crawl jobs rather than generating reports for every one.

Browsertrix’s extensive documentation provides more details on its QA process, which we found valuable when managing our resources effectively during this large-scale archival effort. Users can run spot checks on crawl jobs, choosing those that might benefit from a QA report; this gives a sense of how healthy the capture is, and allows the user to adjust the Browsertrix settings. Another approach is to offload the quality assurance so that it is performed outside Browsertrix. The user can download the WACZ files and interrogate them to check their contents against the live website, again carrying out spot checks to see if certain significant resources were captured. 

Looking at the live website in a web browser, users can analyse the network traffic and view what resources are loading, usually through the browser developer tools. The resources listed during network analysis include the exact URI of each resource, which can be searched for within the WACZ file. Bear in mind that this sort of comparison with the live website should be done soon after crawling has completed; otherwise you may be comparing against a URL whose content has changed significantly from what was initially crawled.
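
Because a WACZ file is a ZIP package whose indexes folder holds CDXJ index lines containing the captured URLs, this kind of spot check can be scripted. A minimal sketch, assuming the standard WACZ layout (the file names in the example are hypothetical):

import gzip
import zipfile

# Spot-check whether a resource URI seen in the live site's network
# traffic was captured, by searching the CDXJ index inside the WACZ.
def uri_in_wacz(wacz_path, uri):
    with zipfile.ZipFile(wacz_path) as wacz:
        for name in wacz.namelist():
            if name.startswith("indexes/") and name.endswith((".cdx", ".cdx.gz")):
                raw = wacz.read(name)
                if name.endswith(".gz"):
                    raw = gzip.decompress(raw)
                if uri in raw.decode("utf-8", errors="replace"):
                    return True
    return False

print(uri_in_wacz("crawl.wacz", "https://example.org/styles/main.css"))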

Some of the QA considerations which we were guided by include:

  •         If issues are found, what, if anything, can be realistically done to remedy them?
  •         Is it an issue with the crawler or with the playback software?
  •         How much time can you apportion to QA without it impacting other work?
  •         Will the time given over to QA yield an appropriate benefit?
  •         Can your QA scale?

Where to go from here?

The 2024 General Election marked the first time we used Browsertrix alongside Heritrix for social media archiving. While the process presented challenges, particularly around managing login authentication and processing constraints, Browsertrix proved to be an invaluable tool for capturing complex media. By refining our workflows and balancing the use of both crawl streams, we were able to archive a significant portion of relevant social media content. Looking forward, we will continue to develop and improve our tools and strategies; collaborating with partners and sharing our experience and knowledge by engaging with the wider web archiving community. 

18 September 2024

Creating and Sharing Collection Datasets from the UK Web Archive

By Carlos Lelkes-Rarugal, Assistant Web Archivist

We have data, lots and lots of data, which is of unique importance to researchers but presents significant challenges for those wanting to interact with it. As our holdings grow by terabytes each month, this creates hurdles both for the UK Web Archive team, who are tasked with organising the data, and for researchers who wish to access it. Given the scale and complexity of the data, how can one first begin to comprehend what they are dealing with and understand how the collection came into being?

This challenge is not unique to digital humanities. It is a common issue in any field dealing with vast amounts of data. A recent special report on the skills required by researchers working with web archives was produced by the Web ARChive studies network (WARCnet). This report, based on the Web Archive Research Skills and Tools Survey (WARST), provides valuable insights and can be accessed here: WARCnet Special Report - An overview of Skills, Tools & Knowledge Ecologies in Web Archive Research.

At the UK Web Archive, legal and technical restrictions dictate how we can collect, store and provide access to the data. To enhance researcher engagement, Helena Byrne, Curator of Web Archives at the British Library, and Emily Maemura, Assistant Professor at the School of Information Sciences at the University of Illinois Urbana-Champaign, have been collaborating to explore how and which types of datasets can be published. Their efforts include developing options that would enable users to programmatically examine the metadata of the UK Web Archive collections.

Thematic collections and our metadata

To understand this rich metadata, we first have to examine how it is created and where it is held.

Since 2005 we have used a number of applications, systems, and tools to enable us to curate websites. The most recent of these is the Annotation and Curation Tool (ACT), which enables authenticated users, mainly curators and archivists, to create metadata that define and describe targeted websites. The ACT tool also serves to help users build collections around topics and themes, such as the UEFA Women's Euro England 2022. To build collections, ACT users first input basic metadata to build a record around a website, including information such as website URLs, descriptions, titles, and crawl frequency. With this basic ACT record describing a website, additional metadata can be added, for example metadata that is used to assign a website record to a collection. One of the great features of ACT is its extensibility, allowing us, for instance, to create new collections.
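
For illustration, such a basic record might look something like this. The field names echo those in the export script later in this post; the values are invented.

# A sketch of the basic metadata entered for one target website.
# Field names follow the ACT fields used in the export script below;
# the values here are hypothetical.
target_record = {
    "title": "Example tournament fan site",
    "primary_seed": "https://www.example.co.uk/",
    "description": "Fan coverage of the tournament.",
    "language": "en",
    "crawl_frequency": "DAILY",
    "collection": "UEFA Women's Euro England 2022",
}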

These collections, which are based around a theme or an event, give us the ability to highlight archived content. The UK Web Archive holds millions of archived websites, many of which may be unknown or rarely viewed, and so to help showcase a fraction of our holdings, we build these collections which draw on the expertise of both internal and external partners.

Exporting metadata as CSV and JSON files

That’s how we create the metadata, but how is it stored? ACT is a web application, and the metadata created through it is stored in a Postgres relational database, allowing authenticated users to input metadata in accordance with the fields within ACT. As the Assistant Web Archivist, I was given the task of extracting the metadata from the database, exporting each selected collection as a CSV and JSON file. To get to that stage, the Curatorial team first had to decide which fields were to be exported.

The ACT database is quite complex, in that there are 50+ tables which need to be considered. To enable local analysis of the database, a static copy is loaded into a database administration application, in this case, DBeaver. Using the free-to-use tool, I was able to create entity relationship diagrams of the tables and provide an extensive list of fields to the curators so that they could determine which fields are the most appropriate to export.
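
The field list itself came from DBeaver, but the same inventory could also be pulled with a standard information_schema query, sketched here in Python (the connection details are hypothetical):

import psycopg2

# List every table and column in the static copy of the ACT database,
# as a starting point for curators choosing fields to export.
conn = psycopg2.connect("dbname=act_copy user=analyst")
cur = conn.cursor()
cur.execute("""
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name, ordinal_position
""")
for table, column, dtype in cur.fetchall():
    print(f"{table}.{column} ({dtype})")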

I then worked on a refined version of the list of fields, running a script for the designated Collection and pulling out specific metadata to be exported. To extract the fields and the metadata into an exportable format, I created an SQL (Structured Query Language) script which can be used to export results in both JSON and/or CSV: 

Select
    taxonomy.parent_id as "Higher Level Collection",
    collection_target.collection_id as "Collection ID",
    taxonomy.name as "Collection or Subsection Name",
    CASE
        WHEN collection_target.collection_id = 4278 THEN 'Main Collection'
        ELSE 'Subsection'
    END AS "Main Collection or Subsection",
    target.created_at as "Date Created",
    target.id as "Record ID",
    field_url.url as "Primary Seed",
    target.title as "Title of Target",
    target.description as "Description",
    target.language as "Language",
    target.license_status as "Licence Status",
    target.no_ld_criteria_met as "LD Criteria",
    target.organisation_id as "Institution ID",
    target.updated_at as "Updated",
    target.depth as "Depth",
    target.scope as "Scope",
    target.ignore_robots_txt as "Robots.txt",
    target.crawl_frequency as "Crawl Frequency",
    target.crawl_start_date as "Crawl Start Date",
    target.crawl_end_date as "Crawl End Date"
From
    collection_target
    Inner Join target On collection_target.target_id = target.id
    Left Join taxonomy On collection_target.collection_id = taxonomy.id
    Left Join organisation On target.organisation_id = organisation.id
    Inner Join field_url On field_url.target_id = target.id
Where
    collection_target.collection_id In (4278, 4279, 4280, 4281, 4282, 4283, 4284) And
    (field_url.position Is Null Or field_url.position In (0))

JSON output example for the Women’s Euro Collection
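
To produce the CSV and JSON files, the query results can be exported along these lines. This is a sketch; the connection details and file names are hypothetical.

import csv
import json
import psycopg2

# Run the collection export query against the static copy of the ACT
# database and write the results as both CSV and JSON.
conn = psycopg2.connect("dbname=act_copy user=analyst")
cur = conn.cursor()
cur.execute(open("collection_export.sql").read())  # the SQL shown above

columns = [col[0] for col in cur.description]
rows = cur.fetchall()

with open("collection.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    writer.writerows(rows)

with open("collection.json", "w") as f:
    json.dump([dict(zip(columns, row)) for row in rows], f,
              indent=2, default=str)  # default=str handles timestamps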

Accessing and using the data

The published metadata is available from the BL Research Repository within the UK Web Archive section, in the folder “UK Web Archive: Data”. Each dataset includes the metadata seed list in both CSV and JSON formats, a datasheet that gives provenance information about how the dataset was created, and a data dictionary that defines each of the data fields. The first collections selected for publication were:

  1. Indian Ocean Tsunami December 2004 (January-March 2005) [https://doi.org/10.23636/sgkz-g054]
  2. Blogs (2005 onwards) [https://doi.org/10.23636/ec9m-nj89] 
  3. UEFA Women's Euro England 2022 (June-October 2022) [https://doi.org/10.23636/amm7-4y46] 

31 July 2024

If websites could talk (part 6)

By Ely Nott, Library, Information and Archives Services Apprentice

After another extended break, we return to a conversation between UK domain websites as they try to parse out who among them should be crowned the most extraordinary…

“Where should we start this time?” asked Following the Lights. “Any suggestions?”

“If we’re talking weird and wonderful, clearly we should be considered first,” urged Temporary Temples, cutting off Concorde Memorabilia before they could make a sound.

“We should choose a website with a real grounding in reality,” countered the UK Association of Fossil Hunters.

“So, us, then,” shrugged the Grampian Speleological Group. “Or if not, perhaps the Geocaching Association of Great Britain?”

“We’ve got a bright idea!” said Lightbulb Languages, “Why not pick us?”

“There is no hurry,” soothed the World Poohsticks Championships. “We have plenty of time to think, think, think it over.”

“This is all a bit too exciting for us,” sighed the Dull Men’s Club, who was drowned out by the others.

“The title would be right at gnome with us,” said The Home of Gnome, with a little wink and a nudge to the Clown Egg Gallery, who cracked a smile.

“Don’t be so corny,” chided the Corn Exchange Benevolent Society. “Surely the title should go to the website that does the most social good?”

“Then what about Froglife?” piped up the Society of Recorder Players.

“If we’re talking ecology, we’d like to be considered!” the Mushroom enthused, egged on by Moth Dissection UK. “We have both aesthetic and environmental value.”

“Surely, any discussion of aesthetics should prioritise us,” preened Visit Stained Glass, as Old so Kool rolled their eyes.

The back and forth continued, with time ticking on until they eventually concluded that the most extraordinary site of all had to be… Saving Old Seagulls.

Check out previous episodes in this series by Hedley Sutton - Part 1, Part 2, Part 3, Part 4 and Part 5


29 May 2024

IIPC Web Archiving Spring/Summer School and Conference 2024: Report from UK Web Archive Colleagues

By Nicola Bingham, Helena Byrne, Ian Cooke, Gil Hoggarth, Cameron Huggett (British Library), Caylin Smith (Cambridge University Library) and Eilidh MacGlone (National Library of Scotland)


This year’s IIPC General Assembly and Web Archiving Conference took place at the Bibliothèque nationale de France (BnF) in Paris. Before this year's conference there was an Early Scholars Spring School on Web Archives aimed at early career researchers interested in working with web archive materials.

Many UK Web Archive colleagues from Bodleian Libraries, the British Library, Cambridge University Library and National Library of Scotland attended the Spring/Summer School and the Web Archiving Conference both as delegates and presenters. In this blog post they report highlights of their conference experience.

Nicola Bingham, Lead Curator of Web Archives, British Library

The IIPC conference lived up to its reputation for being incredibly informative, inspiring, and intense! It was wonderful to reconnect with ‘old’ friends and to meet many new colleagues who are bringing diverse skills and perspectives to the field of web archiving.

As Co-Chair of the IIPC’s Content Development Group, alongside Alex Thurman of Columbia University Libraries, I delivered the keynote speech at the Early Scholars Spring School on Web Archives, which preceded the conference. Our presentation reflected on the history, importance, and legacy of the collaborative transnational web archive collections initiated by IIPC members over the past 14 years.

It was fascinating and gratifying to hear from web archive scholars about their diverse approaches and the variety of research questions they are exploring using web archives. Having worked in web archiving for 20 years, I find the increasing use of collections by researchers, particularly through data-mining approaches, especially interesting and rewarding.

Another interesting and informative highlight was the conference opening keynote speech by Pierre Bellanger, Pauline Ferrari, Jérôme Thièvre, and Sara Aubry. Pierre Bellanger, the founder and CEO of Skyrock and Skyrock.com, emphasised that "there is no freedom without memory," setting the tone for a discussion on the archiving of Skyblogs. Sara Aubry, web archiving technical lead at BnF, detailed the challenges they faced, including working with the Skyblog technical team on short notice to archive the blogs and altering web pages to display more articles and comments before the platform went offline. They managed to collect a substantial amount of content before the closure, amassing 5 million media files and providing API access for metadata extraction. This initiative highlights the importance of preserving the vernacular web, capturing personal pages rather than corporate content. The Skybox project further explores data-oriented methods of access and structural metadata to enhance discovery, with potential future projects aiming to build large language models to analyse and identify regional content within the blogs.

Helena Byrne, Curator of Web Archives, British Library

At this year's conference I presented in the Lightning Talk and Poster sessions. The abstracts are available to read on the IIPC website. IIPC WAC 2024 was a really great conference and there were so many takeaways to help improve my practice. One session I’d like to focus on for this blog post was SESSION #10: Digital Preservation. This session focused on citation practices for researchers using web archives in their research. This is an area that is not fully understood in the academic publishing world. I particularly liked the Citation Saver tool from Arquivo.pt, as this is a simple but effective tool to bulk upload online citations from an academic publication. At the British Library we support a variety of researchers, and the tools and methods discussed in this session will be useful for supporting them in using web archives in their work.

Gil Hoggarth, Web Archive Technical Lead, British Library

I personally had not been able to attend the last few IIPC annual conferences, so it was fabulous to meet up and connect with old faces, and new, and learn about all the exciting projects going on. As I take a technical view (of most things), I found it particularly interesting that so many institutions were trying to establish, and expand, their web archiving services. Plus, the number of people involved in joint projects, with a combined aim but also with a community benefit in mind, was quite striking. Now, having returned to the challenges ahead for the British Library and the UK Web Archive, I feel far more informed and aware of these community efforts - and have been in contact with many conference attendees to follow up!

Caylin Smith, Head of Digital Preservation, Cambridge University Libraries 

This was my second time attending the IIPC conference; I attended last year in Hilversum. I enjoy attending this conference for its presentations about solving operational challenges relating to web archiving and ones about how web archiving supports an institution’s strategic mission. 

I chaired a panel titled “Striking the Balance: Empowering Web Archivists and Researchers In Accessible Web Archives” whose presenters included Leontien Talboom (Technical Analyst on the CUL Digital Preservation team), Alice Austin (Web Archivist at Edinburgh University Library), Tom Storrar (Head of Web Archiving at The National Archives, UK), and Andrea Kocsis (Heritage and Digital Humanities researcher formerly at Northeastern University London; now Chancellor’s Fellow at the University of Edinburgh). 

This panel focused on different perspectives on using web archives: as a leader of a web archiving service, as a web archivist, and as a researcher. It highlighted evolving user expectations for web archives, as well as the challenges around communicating what users can and cannot do because of technical and/or legislative requirements.

Cameron Huggett, PhD Student (CDP), British Library/Teesside University

I attended the IIPC Early Scholars Spring School on Web Archives. You can read more about my reflections on this event in this blog post: https://blogs.bl.uk/webarchive/2024/05/reflections-on-the-iipc-early-scholars-spring-school-on-web-archives-2024.html

Eilidh MacGlone, Web Archivist, National Library of Scotland

I was attending my second IIPC in Paris; the last was in 2014, when I was a nervous first-timer – so I was happy to take part in the new mentorship programme. It was a good way to share experience across different points in our professional arcs.

When I planned my conference agenda, presentations on machine learning were at the top of my list. These outlined services to classify and retrieve items from large, complex stores of resources. I knew they would be interesting, as attempts to solve a problem with no complete answer.

Ben Charles Germain Lee spoke about working with born digital government publications, introducing his ideas through a published experiment. This combination of text and visual analysis provides at least one way to organise retrieval from a very large collection; in the presented case, born digital government publications derived from the End of Term web archive. In future, these techniques could give readers a way to retrieve information from collections that are too big to catalogue.

The IIPC’s Training Working Group session, led by Claire Newing (TNA) and Ricardo Basílio (Arquivo.pt) was another highlight. It gave me a chance to speak briefly on the most important thing in training colleagues (practice!) and the group shared a lot of really good ideas for training. I had the opportunity to use the information almost immediately on my return, training a colleague to self-archive. All in all, this IIPC was a conference with many good lessons.

Ian Cooke, Head of Contemporary British & Irish Publications, British Library

This year, I was struck by how big, and how varied, web archiving has become. The conference covered a huge array of topics and approaches. Many thanks to the Programme Committee, and especially to the team at BnF for being such excellent hosts. For me, the conference got off to a great start a day early, as I attended the pre-conference workshop on appraisal strategies for web archive curated collections, led by Melissa Wertheimer (Library of Congress). The hands-on session was a very clear reminder of the importance of professional librarians and archivists in creating focused and meaningful collections. The conference was also an opportunity for me to dive into some of the more technical sessions. Kristi Mukk and Matteo Cargnelutti’s (Harvard University Library) presentation on using AI to support search in web archives was both very clear and inspiring. I particularly liked Kristi’s assertion that ‘AI literacy is information literacy’ and the importance of thinking like a librarian. Katherine Boss’ (New York University Library) paper on an experimental project to preserve dynamic and database-driven websites using server-side web archiving (not something to be done at scale!) was also brilliant. Both also emphasised the importance of working collaboratively in teams, bringing principles from librarianship to work alongside software engineering in developing and testing new responses to preservation and discovery challenges.          

Conclusion

The IIPC Web Archiving Spring/Summer School and Conference 2024 at the Bibliothèque nationale de France provided a dynamic platform for exchanging ideas, learning about innovative projects, and fostering collaborations in the field of web archiving. UK Web Archive colleagues contributed significantly through presentations and active participation. This conference highlighted the evolving landscape of web archiving, emphasising the importance of preserving the vernacular web, improving researcher access, and leveraging new technologies like AI for better archival practices. As we return to our respective roles, we carry forward new insights and strengthened connections, ready to tackle the challenges ahead with renewed vigour and informed strategies.




22 May 2024

Reflections on the IIPC Early Scholars Spring School on Web Archives 2024

By Cameron Huggett, PhD Student (CDP), British Library/Teesside University

IIPC Early Scholars Spring School on Web Archives banner

My name is Cameron, and I am currently undertaking an AHRC funded Collaborative Doctoral Partnership (CDP) project, between the British Library and Teesside University. My research centres on racial discourses within association football fanzines and e-zines from c.1975 to the present, and aims to examine the broader connections between football fandom, race and identity. 

I attended the Early Scholars Spring School on Web Archives prior to the commencement of the conference, which allowed me to share knowledge with colleagues from a number of different countries, institutions and disciplines, offering new perspectives on my own research. Within this school, I was fortunate enough to deliver a short lightning talk outlining my own use of web archiving within my research into the history of racial discourses within football fanzines. This generated an engaging discussion around my methodologies and led me to reflect upon how quantitative techniques can be better adopted within historical research practices.

I also particularly enjoyed discovering more about the collections of the Bibliothèque Nationale de France (BNF) and the Institut National de L'audiovisuel (INA). The scope of the collections and the innovative user interfaces were particularly impressive. For example, INA had created a programme that allowed the user to view a collection item, such as an election debate broadcast, alongside archived tweets relating to the event in real time.

My primary takeaway was how web archives can be innovatively employed to record the breadth and depth of online communities and discourses, as well as supplement more traditional sources within a historian’s research framework.