Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

20 April 2023

UK Web Archive Technical Update - Spring 2023

By Andy Jackson, Web Archive Technical Lead, British Library

This is a summary of what’s been going on since the 2022 Q4 report.

Summarising Our Holdings

We regularly report on our holdings so other teams across the Legal Deposit Libraries have an understanding of how much data we hold and how we grow over time. Until recently, the reporting mechanism we used did not fully take into account the storage used across different clusters, and on Amazon Web Services.

In January the old reporting mechanism was replaced with a new implementation, better integrated with our other systems and covering all storage services. The Airflow scheduler (discussed in previous reports) generates updated lists of holdings from different systems, and a Jupyter notebook is then used as a dashboard. This is made accessible via the W3ACT curation service, unlike the old system, which was only available to British Library staff.

While it doesn’t get updated automatically, there’s also an older copy of the notebook on GitHub. See UK Web Archive Holdings Summary Report. As you can see there, the UK Web Archive now holds over 1.4 PB of WARCs and logs.

Legal Deposit Access Solution

The new system for Reading Room access to Non-Print Legal Deposit material has also made steady progress. An alpha version of the system has been rolled out across all LDLs so staff can access the service for testing, and a beta service is being rolled out to run alongside the current system in reading rooms. The deployment of the services themselves has also been automated, using GitLab CI/CD to update the systems rather than relying on updating them by hand.

Staff testing raised some additional requirements to be met before the service roll-out can proceed. Working with Webrecorder to meet these requirements will be the focus for the next quarter.

UKWA Website

Edited 28th April 2023 to include translation updates.

The main website has been updated to run version 2.6.9 of our PyWB playback engine, and version 1.4.5 of the main search interface. Version 1.4.5 does not change the site’s basic functionality, but does significantly improve the Scottish Gaelic version of the site.

However, we’ve also looked at more significant changes to the public interface to the archive.

Firstly, we’d like to update to a newer version of PyWB, which now features an updated timeline and calendar display. Secondly, some experimentation with letting search engines index selected websites showed that it may be necessary to include links to the archived sites somewhere in the main site so that the crawler finds and prioritizes those URLs for indexing. To test this out, a page has been added to the site that lists any archived sites that require indexing, and that page has been included in the site map.

Finally, we’ve found a lot of queries are better answered by direct URL search than keyword search, so we wanted to find ways to better integrate PyWB’s URL search functionality with the main site. To make URL search easier to use, we want to change the main search interface on the front page of the website to spot URL searches and direct the user to the right results.
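As a rough illustration of what "spotting URL searches" could look like, here is a minimal Python sketch. It is not the UKWA implementation: the function names `looks_like_url` and `route_query` and the heuristic itself (no whitespace, plus a dotted hostname) are assumptions made purely for this example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical heuristic, for illustration only (not the UKWA code).
# A hostname here is one or more dot-terminated labels followed by a
# top-level label of at least two letters, e.g. "www.bl.uk".
_HOST_RE = re.compile(
    r"^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$", re.IGNORECASE
)


def looks_like_url(query: str) -> bool:
    """Return True if the query resembles a URL or bare hostname."""
    text = query.strip()
    if not text or any(ch.isspace() for ch in text):
        return False  # keyword searches are often several words
    if "://" not in text:
        # Add a scheme so urlsplit places the host in .hostname
        text = "http://" + text
    try:
        host = urlsplit(text).hostname
    except ValueError:
        return False
    return bool(host and _HOST_RE.match(host))


def route_query(query: str) -> str:
    """Choose which kind of search the query should be sent to."""
    return "url" if looks_like_url(query) else "fulltext"


for example in ("www.bl.uk", "https://www.bl.uk/collection", "british library"):
    print(example, "->", route_query(example))
```

A heuristic like this will misclassify some single-word queries that happen to contain a dot, so a real interface would probably still offer the user both kinds of result.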

The BETA version of the website has been updated to include these changes, and is now available for review. If you have any feedback, please let us know.

Image: The BETA homepage for the UK Web Archive, offering URL or Full Text search

Web Archive Discovery tool updates

One long-standing issue we have is that our full-text search does not contain recent material, and over the next year we hope to revisit the scaling problems we’ve seen and try to improve the situation.

As an initial step towards this, we spent some time updating our search tools. The webarchive-discovery indexer has been updated to use version 2 of Apache Tika, along with upgrades to other dependencies like the Nanite wrapper that makes it possible for us to use The National Archives’ PRONOM/DROID format identification engine. These changes are quite significant, so the version number has been bumped from 3.3.x to 3.4.x.

We are also considering an alternative workflow, where we store the extracted metadata in an intermediate form, rather than going directly to Apache Solr or Elasticsearch. To enable us to experiment with this approach, the indexer has been modified to support writing the extracted metadata to JSON Lines output files so that we can use it to support multiple forms of indexing or analysis.
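To show why an intermediate JSON Lines file is convenient, here is a small Python sketch of the general pattern: one JSON object per line, written once and then read back by any number of consumers. The field names (`url`, `content_type`, `title`) and the helper functions are hypothetical and are not taken from the webarchive-discovery output format.

```python
import io
import json


def write_jsonl(records, stream):
    """Write one JSON object per line (the JSON Lines convention)."""
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Yield one dict per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records: these field names are illustrative only.
records = [
    {"url": "http://example.org/", "content_type": "text/html", "title": "Example"},
    {"url": "http://example.org/a.pdf", "content_type": "application/pdf", "title": None},
]

# An in-memory buffer stands in for an output file.
buffer = io.StringIO()
write_jsonl(records, buffer)
buffer.seek(0)

# A second consumer can analyse the same file without a search index,
# for example by counting records per content type.
counts = {}
for rec in read_jsonl(buffer):
    counts[rec["content_type"]] = counts.get(rec["content_type"], 0) + 1
print(counts)
```

Because each line is an independent record, such files can be split, concatenated and streamed without loading everything into memory, which is what makes them suitable as input to several different indexing or analysis jobs.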

2023 Domain Crawl Preparation

As discussed in the previous report, this year we are bringing the domain crawl back on-site rather than running on the cloud. The technical preparation for this was fairly straightforward, given the deployment of the crawl is largely automated. The main change from the last on-site crawl is that we switched to using a server with plenty of fast SSD disks. The cloud crawls had shown us how much the whole thing can benefit from faster disks, so we have attempted to match that when running on our own servers.

Add some updated seed lists from Nominet and from our curators, and we are ready to roll on the anniversary of the first Non-Print Legal Deposit domain crawl. That one started on the 12th of April 2013, and so we’ve chosen that for our start date this year. This will be part of the wider celebrations from across the legal deposit libraries.

Addendum - 13th April 2023

Due to staff holidays, we are only now publishing this quarterly report, so we can add some notes on the launch of the 2023 domain crawl.

The crawl was set up on the 11th, and loaded with the 11 million seed URLs from Nominet and the 27,059 domain crawl seeds from W3ACT (including 13,460 non-UK seeds). On the morning of the 12th, the crawl was launched, and seems to be running well, at around 400 URLs per second. If the system can sustain this rate, which corresponds to around one billion URLs per month, the whole crawl should complete in 2-3 months’ time.
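The conversion from 400 URLs per second to around one billion URLs per month can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 30-day month and the function name `urls_per_month` are assumptions for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: a nominal 30-day month


def urls_per_month(urls_per_second: float, days: int = DAYS_PER_MONTH) -> float:
    """Convert a sustained crawl rate into URLs per month."""
    return urls_per_second * SECONDS_PER_DAY * days


monthly = urls_per_month(400)
print(f"{monthly:,.0f} URLs per month")

# Under the same assumption, a crawl lasting 2-3 months would cover:
for months in (2, 3):
    print(f"{months} months: {monthly * months / 1e9:.1f} billion URLs")
```

At a sustained 400 URLs per second this gives just over one billion URLs per 30-day month, in line with the figure quoted above.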

Image: Dashboard for the first 24 hours of the 2023 Domain Crawl

For more information on the anniversary of Non-Print Legal Deposit, see Celebrating ten years of collecting the UK Web Space.

Posted by Helena Byrne at 2:20 PM

Tags

Legal deposit, Web/Tech

04 April 2023

Celebrating ten years of collecting the UK Web Space

Nicola Bingham, Lead Curator, Web Archiving, British Library

This April, we are celebrating ten years of collecting and preserving digital publications in the UK such as websites, e-books, and online journals, under legal deposit regulations. The UK Web Archive forms an important part of our collecting activity, across all six legal deposit libraries. We aim to preserve a copy of every UK website that we can identify, reflecting the broad range of experience and expression across the UK.

The UK Web Archive provides a detailed insight into the evolution of online public communication over the past two decades. Communication on the web is central to understanding the history, politics, culture and society of the 21st century. However, we know that information shared publicly on the web is rapidly changed, deleted and replaced. The UK Web Archive helps people to understand current events, and the recent past, by preserving that information before it is lost.

Here are a few examples of topics and themes that we have preserved in the archive:

General elections: We have archived websites related to every UK general election since 2005. These websites provide a fascinating insight into the political campaigns, issues, and debates of each election.
London Olympics and Paralympics 2012: These websites document the planning, organisation, and events of the games, as well as the cultural and social impact they had on the UK.
Brexit: This collection documents the political, social, and economic impacts of Brexit. It contains official sources as well as voices from all sides of the debate across the UK.
Online Enthusiast Communities: This collection provides insight into hobbyists in the UK. It covers a wide range of interests from more traditional areas, such as stamp collecting and cycling, to the more esoteric, such as the UK Roundabout Appreciation Society.

The UK Web Archive is used by researchers to answer significant questions on various topics. Recent examples include:

discovering changes in word meanings over time (Barbara McGillivray, the Alan Turing Institute)
exploring the evolution of the digital economy in the UK (Prof Emmanouil Tranos, University of Bristol)
investigating nationalism, internationalism and sporting identity, through the media coverage of the 2012 Olympic Games (Caio Mello, School of Advanced Study).
preserving and exploring online communication about health during the Covid-19 pandemic (National Library of Scotland, funded by Wellcome Trust).

The UK Web Archive has been in existence since 2004. Legal deposit regulations came into effect on 6 April 2013 which increased our capacity to collect the UK’s online heritage and ensure it is available for future generations to research and study.

Prior to these regulations, we had to ‘hand pick’ websites to archive, and then could only proceed with written permission of the website owner. From 6 April 2013, the six legal deposit libraries of the UK and Ireland (the British Library, the National Library of Scotland, the National Library of Wales, the Bodleian Libraries, Cambridge University Library and the Library of Trinity College Dublin) were empowered to collect and preserve all web content that could be identified as published in the UK. Since then, we have been archiving the UK Web at the “domain” level and hold many millions of websites - or over a Petabyte of digital content. The 11th annual “domain crawl” will be launched this week.

How can I access it?
Anyone can access the UK Web Archive, free of charge, at the six UK Legal Deposit Libraries.

You can search the archive, and view thousands of openly accessible archived websites at https://www.webarchive.org.uk/

Help us build the archive
Even though we aim to collect as much of the UK Web as possible, we miss many websites as we cannot automatically identify all of them as being published in the UK. If you know of a UK website that should be preserved, please suggest it here: https://www.webarchive.org.uk/en/ukwa/info/nominate

Posted by Jason Webber at 3:24 PM

Tags

Legal deposit, Web/Tech

16 January 2023

UK Web Archive Technical Update - Winter 2022

By Andy Jackson, Web Archive Technical Lead, British Library

This is a summary of what’s been going on since the update at the start of the autumn.

2022 Domain Crawl Completion

As in previous years, the 2022 Domain Crawl continued to run right up until the end of the year. Overall, things ran smoothly, with only brief outages for upgrading the virtual server over time as the size of the frontier grew.

Because we’re running on the cloud, we are paying for how much compute capacity, RAM and disk space we’re using. So, when the crawl is young and the Heritrix3 frontier database is small, it makes sense to use a small computer. But as the crawl frontier grows, so does the amount of RAM the crawler needs to manage the frontier, so we scale up as we go.

This is one of the reasons we spent time making it possible to configure the frontier database so more house-keeping and clean-up processes are run while the crawl is running. This helps Heritrix clear disk space after it has dealt with URLs, and led to significant savings. The 2020 crawl ended up using 45TB of disk space to store the crawl state, and deleting old ‘checkpoint’ files (which can be used to revert the crawl state to a previous point in time) did not help free up more space. But after changing those configuration options, the 2021 and 2022 crawls only needed 15TB of space, and deleting checkpoints was much more effective.

2023 Domain Crawl Planning

We originally moved to the cloud to relieve pressure on the BL networks as staff switched to remote working during the pandemic. But even when COVID restrictions were eased, the library has continued to support staff working remotely where possible. Fortunately, over the last year the library has upgraded many of the network systems across both the London and Boston Spa sites, which means we now have permission to run the 2023 crawl on site.

As there is still some uncertainty as to how this will affect other network users, we are planning to begin the crawl much earlier in the year (perhaps as early as February). This gives us more time to revisit our options if something goes awry.

Internal Collections API

Working with the Archives of Tomorrow project to understand their requirements, we now have an internal API where W3ACT metadata can be downloaded for entire collections, including all sub-collections and target site metadata. Authenticated W3ACT users can retrieve these full collection extracts (including unpublished collections), which are updated daily. The JSON files are available at https://www.webarchive.org.uk/act/static/api-json-including-unpublished/collection/ for logged-in users.

The public version of the API is in the final stages of development, and should be released early in 2023. Unlike the internal API, this will not include collections that are not yet ready for publication.

W3ACT 2.3.4

Just a few days ago, W3ACT 2.3.4 was released. This included a number of tweaks and bugfixes, including correcting the CSV export feature and adding more export formats (TSV and JSON). For more details, please take a look at the associated release milestone.

There was also an issue with how W3ACT data was used, meaning the subdomains of sites with open access licences were being given the same licence as the ‘parent’ domain. This has now been resolved and access is consistent with the data in W3ACT.

Document Harvester Outage

From the 12th of December onwards, the Document Harvester had stopped picking up GOV.UK documents properly. This appears to have stemmed from some edits carried out in W3ACT, where the Watched Target that covered the GOV.UK document publication service was merged with the main GOV.UK Target (which was not Watched). This meant the crawler was no longer looking for documents from GOV.UK.

We made the GOV.UK Target into a Watched Target, and then cleared the relevant crawl logs for re-processing. Those logs have now been processed and the missed documents have been identified.

We’re looking at how this happened and will take steps to prevent this happening in the future.

Legal Deposit Access Solution

The Application Support team has been working with the Networks team and our Legal Deposit Library partners to start to roll out an initial ‘alpha’ service across all sites. This will help all library staff to try out the system and lay the foundations for a ‘beta’ service in reading rooms. The Project Manager has also been working hard to understand the likely timeline for the project and communicate this to all stakeholders, while keeping the project management triangle in mind.

Additionally, we’re working on setting up a suitable Continuous Deployment pipeline for this service using GitLab CI/CD. This will allow us to analyse, test and safely deploy new versions of the access service without having to manage the system by hand.

CDX Backfill

One of the critical components of the web archive is the content index (CDX), which is an index of all the URLs we have archived, and is required for playback to work. Ours runs on OutbackCDX (from the National Library of Australia), and a subset of its functionality is available via our API.
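To give a feel for what a content index holds, the sketch below parses a single line in the widely used 11-field, space-separated CDX layout. It assumes that layout rather than describing the exact output of the UKWA API, and the sample line, digest and WARC file name are invented for illustration.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class CdxRecord(NamedTuple):
    urlkey: str            # canonicalised, sortable form of the URL
    timestamp: datetime    # capture time (14-digit timestamp in the index)
    original: str          # URL as it was crawled
    mimetype: str
    status: str
    digest: str            # payload hash
    length: Optional[int]  # compressed record length in the WARC
    offset: Optional[int]  # byte offset of the record in the WARC
    filename: str          # WARC file holding the record


def _int_or_none(value: str) -> Optional[int]:
    """CDX files use '-' for unknown values."""
    return None if value == "-" else int(value)


def parse_cdx_line(line: str) -> CdxRecord:
    """Parse one line, assuming the common 11-field CDX layout."""
    parts = line.split()
    if len(parts) != 11:
        raise ValueError(f"expected 11 fields, got {len(parts)}")
    (urlkey, ts, original, mimetype, status, digest,
     _redirect, _meta, length, offset, filename) = parts
    return CdxRecord(
        urlkey=urlkey,
        timestamp=datetime.strptime(ts, "%Y%m%d%H%M%S"),
        original=original,
        mimetype=mimetype,
        status=status,
        digest=digest,
        length=_int_or_none(length),
        offset=_int_or_none(offset),
        filename=filename,
    )


# Invented sample line, for illustration only.
sample = ("uk,bl)/ 20230412093000 https://www.bl.uk/ text/html 200 "
          "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - - 1234 5678 example.warc.gz")
record = parse_cdx_line(sample)
print(record.original, record.timestamp.isoformat(), record.filename, record.offset)
```

The file name, offset and length are what allow a playback engine to jump straight to the right record inside a WARC, which is why playback cannot work for crawls that have not been indexed.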

In the past, we’ve had problems running large CDX indexing jobs, and this had left us in an unfortunate situation where the 2016, 2018 and 2019 domain crawls were not indexed. During the last few months, we modified the indexing process to (re)process our WARCs and ‘backfill’ the index, which has filled in those gaps.

This also showed that we could process our entire collection (i.e. over 1PB) in a reasonable time (roughly three months depending on the precise workload), which is reassuring. It will likely be necessary to re-build indexes from time to time, and it’s good to know it should be possible to do so in a reasonable amount of time. Also, the act of reading every byte of every WARC is an additional explicit proof that the files have been kept safe over all these years! We know HDFS has been systematically monitoring the files over time, but it’s nice to run an independent check.
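For a sense of scale, reading 1PB in roughly three months implies a sustained read rate that can be estimated as follows. This is a rough sketch only: it assumes a decimal petabyte (10^15 bytes) and a 90-day run, and the function name is made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PETABYTE = 10**15  # assumption: decimal petabyte


def required_throughput_mb_s(total_bytes: int, days: float) -> float:
    """Average read rate in MB/s needed to process total_bytes in the given days."""
    return total_bytes / (days * SECONDS_PER_DAY) / 1e6


rate = required_throughput_mb_s(1 * PETABYTE, 90)  # assumption: 90 days
print(f"{rate:.0f} MB/s sustained")
```

Under these assumptions the average works out at a little under 130 MB/s across the whole cluster, sustained for the full three months.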

The 2020, 2021 and 2022 domain crawls will have to wait a little longer, as they are stored on Amazon Web Services and need transferring to the British Library before they can be indexed.

Browsertrix-Cloud

Finally, we’re proud to be part of the IIPC project Browser-based Crawling For All, which contributes to the development of Browsertrix Cloud and attempts to ensure IIPC members can take advantage of it. As part of this, we proposed two sessions for next year’s IIPC conference, both of which have been accepted:

A workshop called Browser-Based Crawling For All: Getting Started with Browsertrix Cloud, aimed at helping attendees take advantage of Browsertrix Cloud. We’re particularly interested in uncovering barriers that might prevent adoption.
A panel called Browser-Based Crawling For All: The Story So Far, giving an insight into the current state of the project and of Browsertrix Cloud (including any feedback from the workshop).

Hoping to see you there!

Posted by Jason Webber at 9:00 AM

Tags

Web/Tech

12 January 2023

Changes in Nature’s Calendar – Early Bloomers

The Importance of Citizen Science in Monitoring and Adapting to Climatic Change

By Andrea Deri, Cataloguer and UKWA Climate Change Collection’s lead curator

On 1 January 2023, I had my usual walk from Folkestone Gardens via Sue Godfrey Nature Park, Deptford, London Borough of Lewisham to Greenwich Park, Royal Borough of Greenwich. Overcast, temperature in single digit, humid but calm. Trees and shrubs mostly leafless: an accentuating background to patches of bright green mosses.

I was hoping to see some flowers on winter blossoming plants, for example the bell-shaped flowers of clematis ‘Jingle Bell’ in St Alfege Church’s yard, and the spidery flowers of witch hazels in the Royal Observatory Garden in Greenwich. I was also curious what other flowers I would find, earlier than usual, triggered by the warming climate. A month earlier (1 December 2022) I had joined the annual wildflower ‘hunt’ held on the first day of winter, a survey of species in flower in my locality, Deptford’s urban area, organised since 2009 by the Creekside Education Trust and the London Natural History Society, so I expected several early bloomers. Here is Creekside’s blog post of the 2021 wildflower survey.

While the witch hazels (Fig. 1.) did not disappoint, I was in for a surprise with clematis ‘Jingle Bell’: only the silky, fluffy seedheads were left, as it had finished flowering earlier this year. I was lucky to see its last flowers on Christmas Eve 2022 (Fig. 2.). Other early flowers greeted me on a hazelnut shrub in Sue Godfrey Nature Park (Fig. 4.). But I was truly astonished to see daffodils fully opened in a park by Creekside, just across from the Creekside Discovery Centre (Fig. 3.).

Figure 1 Witch hazel (Hamamelis sp.) in flower. Photo: Andrea Deri, Royal Observatory Garden, Greenwich, London, 1 January 2023

I started searching for phenology calendars, almanacs, and any information on the blooming time of these species in my local and other areas in order to compare my observations with the “expected” (based on previous years) flowering periods. The online findings supported my assumption: I did observe earlier than expected flowerings, with the most specific data for the hazelnut.

Clematis ‘Jingle Bell’
According to the Royal Horticultural Society (RHS) clematis “Jingle Bell” flowers in winter and early spring. Compared to this broad-brush period, my observation this year suggests this individual specimen finished flowering much earlier than expected and earlier than I had observed this specimen in previous years.

Figure 2 Clematis cirrhosa “Jingle Bells” one bell-shaped flower and fluffy seedheads. Photo: Andrea Deri, St Alfege Church, Greenwich, London, 24 December 2022

Daffodil
A post on the Daffodil Society prompted me to do a search on RHS’s website for daffodils where February-March was quoted as the usual flowering period. More precise than for the clematis. Early flowering daffodil horticultural varieties, however, can bloom as early as January, stated one of the Gardeners’ World blogposts. I may have encountered an early flowering daffodil garden variety. In addition to its literary associations, this iconic flower may now also have become a conversation starter about the climate crisis. Would its freshness and brightness frame a difficult dialogue in hope?

Figure 3 Daffodils (Narcissus sp.) in flower. Photo: Andrea Deri, near Creekside Discovery Centre, Deptford, London, 1 January 2023

Hazelnut
The Woodland Trust Nature’s Calendar offered me the tool I had been really looking for: a peer-reviewed database linked to a live map that allowed me to compare my observation with those of fellow observers in the UK at day-level precision.

Figure 4 Hazelnut (Corylus avellana) in flower: crimson female flowers, yellow catkin male flowers. Photo: Andrea Deri, Sue Godfrey Nature Park, Deptford, London, 1 January 2023

Before I signed up to add my hazelnut observation, I took a screenshot of the “Add a Record” webpage on 5 January 2023 that showed the first hazelnut flower sighting on 4 January 2023. (Fig.5.)

Figure 5 Screenshot of Nature's Calendar, Woodland Trust. Photo: Andrea Deri, 20:34 GMT, 5 January 2023

Hazelnut first flowering was among the recently recorded data of Nature’s Calendar (Fig. 5.). My observation of hazelnut flowers on 1 January 2023 was not extraordinary, but it was earlier than the one featured online. Hazelnut is expected to be in flower in early January according to Nature’s Calendar (downloadable pdf). But as early as 1 January? To answer this question, I had to register to enter my data. When I entered my observation date, I received an automatic note, all in red:

“This date falls outside of the expected range

The date you have entered is unusually early or late for this species and event; please double check the record. If it’s correct we’d like to know more about your observation, so please add a comment before clicking ‘next’ to continue. If possible, a photo is very useful too. Please note that your record will not appear on the live map until it has been checked by the Nature’s Calendar team.”

For evidence, I uploaded one of my photos of the hazelnut flowers (Fig.4.) and a description of the place and circumstances. My hazelnut flowering observations may turn out to be some of the earliest this year. To prove or refute this statement I rely on the Woodland Trust’s online database, the Nature’s Calendar team’s peer-review and keen monitoring of fellow citizen scientists. This type of on-land & online live collaboration in monitoring the slightest phenological changes is gaining increasing importance in addressing local impacts of climatic changes.

Will hazelnut flower earlier and earlier in the future? Only regular visitors can answer this question by carefully monitoring the same hazelnut shrub, recording the date of the first flowers and uploading the data to Nature’s Calendar.

Nature’s Calendar invites citizen scientists to monitor a carefully selected list of species of shrubs, trees, flowers, grasses, fungi, birds, insects and amphibians throughout the year. Their changes over time will give us information on how these species (plants, animals and fungi) adapt to the unfolding climatic changes. Phenological change data contributes to better decisions in wildlife conservation, among others.

While I was browsing, I came across several websites and webpages on various other decisions and local actions related to climate change adaptation. For example: What can I do about climate change in my garden? What local residents are doing in the boroughs of Lewisham and Greenwich about the climate crisis: Climate Action Lewisham, Climate Home – a home of creativity, imagination and community activism by young people, Lewisham Climate Action Bond as an example of Local Climate Bonds, Lewisham Climate Emergency Declaration and Action Plan, CAPE Informing Local Action on Climate Change / London Borough of Lewisham, The Climate Emergency website of Royal Borough of Greenwich, Carbon Neutral Greenwich, Greenwich Climate Network.

Some of the activities and organisations were familiar to me, but I was taken aback by others: ‘How could I miss them? I live here!’ A fast-changing landscape of actions and online information. Having saved these sites for my further actions, I also realised some of this online content could be highly ephemeral. Uploading my list of URLs to the UKWA Climate Change collection saved local digital content for future research on climatic changes.

Sauntering through streets, gardens and parks has turned into an archival journey, connecting past, present and future. Fit for the first day of the year. Fit for any day, anywhere your interest, experience and local knowledge cross climatic changes.

The Natural History Museum’s community science webpage lists a broad range of UK wildlife monitoring activities related to climatic changes, including the New Year Plant Hunt of the Botanical Society of Britain and Ireland and the upcoming annual Big Garden Birdwatch (27-28 January 2023) organised by the Royal Society for the Protection of Birds since 1979.

Contribute to the web archive
Your next walk or online stroll may inspire you to nominate some of your local climate initiatives (civil society, governmental, business, media, arts and academia) to the UK Web Archive Climate Change Collection. Many thanks for your consideration.

Posted by Jason Webber at 9:51 AM

Tags

Contemporary Britain, Selection, Web/Tech

12 December 2022

Examining sports history through digitised & born digital resources

By Helena Byrne, Curator of Web Archives, The British Library

The Irish Sporting Lives workshop and symposium took place at the Ulster University campus in Belfast from 11-12 November 2022. Day one took the form of a half day workshop aimed at PhD/ECR researchers. It focused both on imparting knowledge about how to research historical figures and how to write sporting biographies. There were three sessions in the workshop:

Margaret Roberts: It’s not what you research… it’s the way that you research it: that’s what gets results
Helena Byrne: Examining sports history through digitised and born digital resources
Turlough O’Riordan & Terry Clavin: Writing sporting lives

The slide deck and speaker notes on ‘Examining sports history through digitised and born digital resources’ are now available in the British Library Shared Research Repository under a CC BY 4.0 Attribution licence.

The running time for this session was 70 minutes, therefore, many of the slides were discussed only briefly to allow more time for the activity phase of the workshop. The slides accompanying the notes can be edited by anyone to suit different session lengths. If more time is available, more time can be spent on exploring the different options discussed in the slides. As there was limited time in this workshop, no live demos were given during the presentation. The workshop focused on the subject of sport, but it could be adapted to suit any subject area.

For more general web archiving training materials at a beginner level, please see the International Internet Preservation Consortium (IIPC) Training Materials page: https://netpreserve.org/web-archiving/training-materials/

The agenda for this session covered:

Warm Up Activity
Digital Resources
Digitised Newspapers
Web Archives
Hackathon – Preserve Irish sporting heritage online.
Wrap Up Activity

The session mostly focused on using web archives and only briefly covered digitised newspapers because this was covered in more depth in the first session led by Margaret Roberts.

The warm-up activity collected anonymous information on what type of academic background the workshop participants were from, what their general level of awareness of web archives was, and in particular their awareness of the UK Web Archive. Participation in this activity was optional and not all participants responded to every question. Most of the participants came from a history background, while others came from English Literature, Law and Sports Management, or were independent researchers studying a wide variety of sports.

There were twelve responses to the question ‘Do you understand the difference between the terms digitised and born digital?’. Six respondents replied ‘yes’, while three said ‘no’ and three said ‘not sure’. The difference between these two terms was clarified in the ‘Digital Resources’ section of the presentation. More in-depth user studies on web archive research conducted by Healy et al. (2022) and Costea (2018) have highlighted that there is often confusion amongst researchers on the difference between a digital library/digital archive, a database and a web archive.

There were thirteen responses to the question ‘Have you ever used a web archive?’. Six respondents replied ‘yes’, while four said ‘no’ and three said ‘not sure’. There were twelve responses to the question ‘Have you ever used the UK Web Archive?’. Four respondents replied ‘yes’, while six said ‘no’ and two said ‘not sure’.

The session highlighted different ways that the researchers could use DIY web archiving techniques to mitigate the impact that link rot and content drift could have on their research.

In the hackathon part of the session, participants were tasked to use some of the DIY web archiving strategies discussed to preserve the Irish sporting heritage. Participants could choose from two options:

Add online content used in your research to the relevant web archives.
Review what web content has already been preserved from your area of study in the UK Web Archive Sports Collections. Then select online content from the web to nominate to the UK Web Archive.

Although there were approximately 25 minutes available at the end of this presentation for this activity, it would really need more time and, if possible, pre-workshop preparation to get the best results.

To wrap up this session, participants were asked two questions about how likely they were to use web archives in their research. Firstly, on a scale of 1 meaning very unlikely to 5 very likely, participants were asked ‘How likely are you to use a web archive as a resource for your research?’. Seven participants answered this question and the aggregated response was 4.4. Secondly, eight participants responded to the question ‘How likely are you to save content you view online in a web archive?’. This was also a scale question with 1 meaning very unlikely to 5 very likely, and the aggregated response was 3.4.

Although the workshop elicited a small sample of results, they show that there is an interest in using web archives in academic research, not just as a reference source but as a way of managing online citations in the field of sports studies. It would be beneficial to the research community if those teaching research method classes could incorporate web archive training into their classes. The training materials published through the British Library Shared Research Repository can be adapted to suit any subject area.

References:

Healy, S., Byrne, H., Schmid, K., Bingham, N., Holownia, O., Kurzmeier, M., & Jansma, R. (2022). Skills, Tools, and Knowledge Ecologies in Web Archive Research. WARCnet Special Report. Aarhus, Denmark: WARCnet, https://cc.au.dk/fileadmin/dac/Projekter/WARCnet/Healy_et_al_Skills_Tools_and_Knowledge_Ecologies.pdf

Costea, M.-D. (2018). Report on the Scholarly Use of Web Archives. Aarhus, Denmark: NetLab. Retrieved 2019-08-30, from http://netlab.dk/wp-content/uploads/2018/02/Costea_Report_on_the_Scholarly_Use_of_Web_Archives.pdf

Posted by Jason Webber at 11:00 AM

Tags

Contemporary Britain, Sports, Web/Tech

07 December 2022

Pride and Visibility in the LGBTQ+ Lives Online Collection

By Ash Green, CILIP LGBTQ+ Network, and Goldsmiths, University of London

The LGBTQ+ Lives Online UK Web Archive collection currently holds over 600 sites, web pages, blogs, etc. focused on the LGBTQ+ experience of people in the UK. Community and the coming together of individuals is a key aspect of the LGBTQ+ experience, and this is particularly reflected in sites that act as networks, sites focused on Pride events, and sites for visibility and remembrance days such as Bi Visibility Day, Lesbian Visibility Week, Trans Day of Remembrance, and the International Day Against Homophobia, Biphobia and Transphobia. These events, networks and days are there to support the community; to remind those outside the community we are part of that we exist and that we celebrate who we are; and to show that the need to highlight and address inequalities remains important, despite LGBTQ+ people having existed for millennia.

Image credit: Gotta Be Worth It, from Pexels

Examples of sites in the UK Web Archive under some of these banners include: LGBT Mummies (aiming to support LGBT+ women and people globally on the path to motherhood or parenthood); London Gaymers (a safe place for the LGBT gaming community in London and across the UK to connect with like-minded individuals); African Rainbow Family (a not-for-profit charitable organisation that supports lesbian, gay, bisexual, transgender, intersexual and queer (LGBTIQ) people of African heritage and the wider Black Asian Minority Ethnic groups); and Pride Sports (a focus on increasing participation in sport by lesbians, gay men, bisexual and transgender people as well as the wider community). As you can see from these examples, many of the informal networks focus on where other aspects of an individual’s life overlap with being an LGBTQ+ person.

We also have Pride sites archived within the collection, including both local (Pride In Surrey, Glasgow’s Mardi Gla, York Pride) and nationwide (LGBTQYMRU) events. Before the pandemic they were mainly face-to-face events, but between 2020 and 2022 there was an increase in online events as many sought to keep LGBTQ+ people connected in a safe way.

We would like to build the collection of UK sites focused on Pride and awareness/visibility days. We don’t limit our collection of sites to big organisations only – as we have said before, all LGBTQ+ content is welcome, including personal content if it is published in the UK. And even though we would like to develop the areas of the collection highlighted above, we are also still happy to receive submissions on any aspect of LGBTQ+ Lives Online. So, if you know of any online content you think we should be archiving within this collection, please nominate it here.

Under the Non-Print Legal Deposit Regulations 2013, the UKWA can archive UK-published websites, but is only able to make the archived version available to people outside the Legal Deposit Libraries’ Reading Rooms if the website owner has given permission. The UK Legal Deposit Libraries are the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Libraries and Trinity College Dublin Library. If you’re curious about what is in the LGBTQ+ collection you can browse through it here.

Posted by Jason Webber at 9:00 AM

Tags

Contemporary Britain, Legal deposit, LGBTQ+, Web/Tech

01 December 2022

History on the move: Curating a collection on the Queen’s Platinum Jubilee

By Daniela Major, PhD Student, School of Advanced Study, University of London

Note: This blog post was written before the death of Her Majesty Queen Elizabeth II. The Jubilee collection has documented the end of an extraordinary reign and will hopefully serve as a basis for future researchers to understand this historical moment.

Before I started my placement at the UK Web Archive, my project idea was to build a collection about the History of London. I had thought it would give me an opportunity to delve into history blogs and history websites, and to explore how people interpret historical events; it was, however, a Jubilee year, and the opportunity came up instead to curate a collection about this very modern event, which would, moreover, unfold as I built the collection.

The particular challenges of this exercise were very attractive to someone who still considers herself an historian. It is fairly straightforward to build a collection about events that have gone past and that have been analysed by countless historians. It is a very different thing to curate a collection about events that are happening, whose consequences remain unknown. In this sense, the Queen’s Platinum Jubilee was a great opportunity because in many ways Queen Elizabeth II already belongs to History. It is entirely possible to historicise her existence and her years in power. It is also possible to use her reign as a way to look into the making of modern Britain and modern Europe, as she was present through many key historical moments in the last 70 years.

A priority which was defined early on was representing different parts of the UK, rather than focusing only on the big cities. We looked into how towns, villages and cities were celebrating the Jubilee, what events they were organising, where street parties would take place and how councils involved local communities in the celebrations. From geographical representation came the necessity to represent different voices and opinions, both from the UK and the Commonwealth. It was vital the collection didn’t turn out to be laudatory. Future researchers would be interested in knowing whether there was resistance to the monarchy and whether consensus was real or fabricated.

As with so many questions in History, the answer is both yes and yes. Yes, there is resistance, but yes there is genuine and even widespread appreciation for the Queen.

For the majority of my academic career, I have looked to the past to study it. Historians are used to questioning the archives. We have to question the silences and the omissions; we have to remember who created records, who kept them, and why. Curating this collection placed me firmly on the other side of these interrogations. I was the one deciding what should go into the collection, what should be kept for posterity. The web is vast, and content is being produced every minute of every hour. It is not conceivable to include everything. The responsibility is enormous, but it made me all the more aware of the need to hear different sides, so as to not exclude voices which have often been silenced in the past.

The Web affords researchers the possibility to glimpse into facets of life and points of view that many previous historical records have omitted. It is a rich source with enormous democratic potential, and one which will become even more essential in the years to come; it must be protected and looked after. The work that web archivists do, and that I have been privileged enough to take part in, is vital to safeguard the history of the present and the future.

View the Queen's Platinum Jubilee, 2022 collection

Also the Queen's Diamond Jubilee, 2012 collection

Posted by Jason Webber at 9:01 AM

Tags

Contemporary Britain, Jubilee, Legal deposit, Web/Tech

30 November 2022

If Websites Could Talk - Part 5

By Hedley Sutton, Team Leader, Asian & African Studies Reference Services

Check out previous episodes in this series - Part 1, Part 2, Part 3 and Part 4.

Over a year has passed since we last eavesdropped on the ongoing debate among U.K. domain websites as to which of them deserves to be recognised as the most extraordinary site of all.

“We think we should be considered,” said *Heritage Cast Iron Radiators*. “We’re not a site that you come across every day.”

“Agreed, but you could surely say the same about us,” retorted the *Carrotworkers’ Collective*. “What do you reckon, *Angelfish Opinions*?”

There was no response, the latter being in deep conversation about matters piscine with the *Catfish Study Group*.

“Let’s hear it for the mammals!” cried *Platypus Research*. “You’re with us, *Led by Donkeys*, are you not? And you, *Absolute Dogs*? Not quite sure if you count, *Hatching Dragons*?”

“We insects always get overlooked,” muttered the *British Bee Veterinary Association*.

“We know how you feel,” commiserated *Polly Parrot Rescue UK*.

“What about us?” said the *UK Soft Power Group*. “Our charm, our intelligence …”

“Look, we want to take this tired debate to a whole new dimension,” said the *Quantum Communications Hub*. “With the help of the *Cosmic Shambles Network*, nothing can possibly stop us!”

“That’s not quite fair,” said the *Tuneless Choir*. “If you’re going to work together on your bid, then we might well hook up with the *London Vegetable Orchestra*.”

“Wait a minute – two can play at that game,” said the *Museum of Human Kindness*. “Can’t they, *Empathy Museum*?”

Fortunately at this point the *Centre for Effective Dispute Resolution* made a useful suggestion. It was decided that the fairest way forward was for candidate sites to first contact the *UK Anonymisation Network*, and then let the *Academy of Experts* make the final choice.

And thus it came to pass that the chosen site was … *Much Better Adventures*.

Posted by Jason Webber at 7:46 PM

Tags

Web/Tech