UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites


Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

11 July 2022

UK Web Archive Technical Update - Summer 2022

By Andy Jackson, Web Archive Technical Lead, The British Library

Following on from the spring quarterly update, we’ve been able to make some good progress despite being understaffed during this period.

Hadoop storage and replication
We are still in the process of replicating content onto a second Hadoop cluster, to be moved to the National Library of Scotland. The cluster capacity is 1PB, and it’s now about 70% full. Next steps will involve double-checking the files have been replicated correctly, and planning the relocation of the servers.

Legal Deposit Access Solution
There has been significant progress on developing the new reading room access system for the UK Web Archive and other Legal Deposit content. The Webrecorder team has delivered an initial version of the NPLD Player app, which will be needed to access Legal Deposit material on some reading room access terminals. Once some final issues have been addressed, and some documentation added, we can start to plan the roll out in detail.

Before that, we need the centralized services deployed, which use our PyWB system to render PDFs and ePubs as well as archived web pages. The Webrecorder team have implemented most of the necessary changes to PyWB, and we have been working towards deploying the new access services, in partnership with the British Library’s Application Support team.

The whole project team has been busy planning, capturing use cases and test cases, considering security issues, publishing internal communications about the work, and responding to feedback from those communications. There are still a few areas of uncertainty, which means we don’t yet have a solid time-scale for the full transition to the new services, but this should become clear over the next few months.

Web Crawlers
While the core crawl system has not been changed in the last quarter, we have made improvements to how the crawls are launched and how the current Document Harvester is implemented.

Specifically, all services have now been moved from our older workflow system to our new Airflow platform (as mentioned in the 2022-01 technical update). This means these automated tasks are now easier to monitor and manage. In particular, the older workflow system has been struggling for some months due to the large number of tasks involved in the Document Harvester workflow. The underlying tools have been heavily refactored to make sure the document identification and extraction processes are much more efficient and reliable.

W3ACT (Annotation and Creation Tool)
While W3ACT itself has not been updated during the last quarter, the version of PyWB it uses has been updated to the latest 2.6.7 release.

UKWA Website
The new searchable Topics & Themes page is now live, making it much easier to explore our curated collections. We’ve found a few minor issues, such as some collections not appearing on the page, but we’ll work on ironing these out over the next few weeks.

To help us update our website with confidence, we’ve made a number of improvements to our automated testing system. This has been refactored to make it easier to run, and extended to cover almost all critical web services and APIs. As well as making changes easier to implement, this also means we can automatically run the test suite every morning, and will be alerted if anything isn’t working as expected (UKWA staff and partners can access the most recent test report at this URL).

This new test suite includes experimental support for running the Pa11y accessibility evaluation tool, and including the results in the test report. In time, this will help us ensure any changes we make to the website do not negatively affect the accessibility of the site (at least to the extent that automated testing can determine).

Archive of Tomorrow
Finally, we’ve enjoyed starting to get into some detailed conversations with our Archive of Tomorrow project colleagues. Among other things, these conversations will help drive our nascent UKWA API work, by helping us explore how best to make our curated collections and other data and metadata available for re-use. These discussions also reminded us to polish off some updates to our screen-shotting services, which means the Twitter and Open Graph social card support we’ve added to our playback pages should now be significantly more responsive and reliable.

To find out more about the Archive of Tomorrow project, you can check out this IIPC blog post: Archive of Tomorrow – Capturing online health (mis)information.

05 July 2022

What to expect on the UK Web Archive blog during UEFA Women’s Euro England 2022

By Helena Byrne, Curator of Web Archives, British Library

The UEFA Women's Euro 2022 competition is taking place across England from July 6 to July 31, 2022. We are collecting websites about the UEFA Women’s Euro 2022 from around the UK.

You can view the UEFA Women’s Euro England 2022 collection here:  https://www.webarchive.org.uk/en/ukwa/collection/4278

a blue banner image with the British Library, Inspired by England 2022, the National Football Museum and the UK Web Archive. A female football player kicking a ball and the text, Can you help us preserve football history? We are collecting websites about the UEFA Women’s EURO 2022. Nominate a website for us to archive QR code and link to the nomination form: https://www.webarchive.org.uk/en/ukwa/info/nominate

Over the next few weeks there will be a number of guest blog posts from the UK Web Archive and collaborators from around the UK. 

First up, we will have a blog post from the National Library of Scotland and the National Library of Wales. Neither Scotland nor Wales qualified for this edition of the tournament, but as part of the UK Web Archive, both national libraries will be contributing to the collection and ensuring that any fan events taking place are preserved. 

From 18 July, a number of blog posts will be published each week for the rest of the month. There will be a guest blog post from the Public Record Office of Northern Ireland (PRONI), who will be contributing a range of content from Northern Ireland. The team from Northern Ireland made history by qualifying for their first UEFA Women’s Euro tournament.

There will be a series of blog posts from the tournament’s Arts and Heritage partners in the host cities. There were three specially commissioned projects to celebrate the rich history of women’s football and its players and to encourage more people to be inspired by the tournament. These blog posts will also include updates from across the UEFA Women’s Euro England 2022 host cities. These blog posts will give a summary of their local cultural programme activities, as well as an overview of what websites they nominated to the collection that are important for telling the story of the UEFA Women’s Euro England 2022 tournament in their area.

The final blog post in the series will be published in late September. It will reflect on the collection activities and give an overview of some personal favourites from the curator of the web archive collection, Helena Byrne.

Get involved 
Anyone can suggest UK published websites to be included in the UK Web Archive by filling in our nomination form: https://www.webarchive.org.uk/en/ukwa/info/nominate 

29 June 2022

What content should I nominate on the UEFA Women’s Euro to the UK Web Archive?

By Helena Byrne, Curator of Web Archives, British Library

a blue banner image with the UK Web Archive, British Library, Inspired by England 2022 and the National Football Museum. A female football player kicking a ball and the text, Can you help us preserve football history? We are collecting websites about the UEFA Women’s EURO 2022. Nominate a website for us to archive:

The UEFA Women's Euro 2022 competition is taking place across England from July 6 to July 31, 2022. We are collecting websites about the 2022 UEFA Women’s EURO from around the UK. You can view the collection here:  

https://www.webarchive.org.uk/en/ukwa/collection/4278 

This blog post runs through some examples of the type of content you might like to nominate to the collection. 

We archive websites:

  1. That are on a .uk or other UK geographic top-level domain such as .scot or .cymru.
  2. That are published in the UK.

We do not archive:

  1. Online Sound or Video platforms, in which audio-visual material is the predominant content.
  2. Private Intranets and Emails.
  3. Personal data in social networking sites or websites only available to restricted groups.
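As a rough illustration of the first rule, the sketch below checks whether a URL sits on one of the domains named above. The function name and the suffix list are illustrative assumptions, and the check is deliberately incomplete: it cannot tell whether a site on another domain is published in the UK, which is the second rule.

```python
from urllib.parse import urlsplit

# Suffixes taken from the list above. The real scoping rules cover more cases,
# for example sites published in the UK on other domains.
UK_SUFFIXES = (".uk", ".scot", ".cymru")


def on_uk_domain(url: str) -> bool:
    """Return True if the URL's host ends in one of the listed UK suffixes."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return host.endswith(UK_SUFFIXES)
```

A site such as http://www.thefa.com/ fails this check even though it is published in the UK, which is why nominations and manual selection matter alongside the automated crawl.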

We archive as much openly available online content as we can identify as being published in the UK. Archiving is carried out through a mix of automated processes, such as an annual domain crawl, and manual selection by the UK Web Archive teams, as well as through the public nomination form.

UEFA Women’s Euro England 2022
For the UEFA Women’s Euro England 2022 we want content that specifically refers to the tournament. Some websites might only have a subsection or even just one page dedicated to the tournament so you can nominate that specific URL. 

We add the following types of web content to the collection:

  1. Full website
  2. Subsection of a website
  3. Individual page from a website
  4. Event page
  5. Twitter accounts

Unfortunately, due to technical challenges, the only social media content we can successfully archive is Twitter. If you know of any high-profile Twitter accounts that aren’t personal accounts of ordinary people, please nominate them.

Examples of some website content we have added so far include:

Full website
Have you seen any new websites set up just for the UEFA Women’s Euro 2022 tournament? Most websites will, at most, just have a dedicated subsection or page for the tournament. Some websites such as the official sponsor, Visa, highlight the tournament on their home page in the run-up to and during the tournament. This is why we have added the whole website to the collection, as it is easy for the user to navigate from the home page of the archived website during the tournament to the dedicated section for the tournament. 

Subsection of a website
The FA website has a subsection dedicated to UEFA Women’s Euro 2022. The earliest captures of this subsection are from July 2020 which you can view here:

https://www.webarchive.org.uk/wayback/archive/20200726095218/http://www.thefa.com/competitions/uefa-womens-euro-2022 

a screenshot of the UEFA Women’s Euro 2022 subsection of the FA website from July 26 2020. The text reads Women’s Euro set for 2022. The UEFA Women’s Euro 2021 in England is postponed until the summer of 2022

Link to archived website: https://www.webarchive.org.uk/wayback/archive/20200726095218/http://www.thefa.com/competitions/uefa-womens-euro-2022 
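The archived links in this post appear to follow a common pattern: a 14-digit capture timestamp followed by the original URL. The sketch below splits a link into those two parts. The pattern is inferred from the links shown here and is an assumption, not a documented specification of the playback service; the function name is hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the archived links in this post; it is an assumption,
# not a documented specification of the UKWA playback service.
WAYBACK_RE = re.compile(r"/wayback/archive/(\d{14})/(.+)$")


def parse_archived_url(url: str):
    """Split an archived link into (capture time, original URL)."""
    match = WAYBACK_RE.search(url)
    if match is None:
        raise ValueError(f"not a recognised archived URL: {url}")
    # The timestamp reads as year, month, day, hour, minute, second.
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured, match.group(2)
```

Applied to the FA link above, this gives a capture time of 26 July 2020 and the original address on www.thefa.com.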

Individual page from a website
In some cases there is just one page on a website relevant to the collection subject. When thinking about women’s football, the Royal Philharmonic Orchestra (RPO) doesn’t always come top of the list of potential websites. However, they have partnered with the FA to ‘engage fans in a range of musical opportunities and public events celebrating the history, ethos and future of women’s football’. What other websites have you seen that have posted an article about the UEFA Women’s Euro 2022 tournament? 

You can listen back to the archived versions of the anthems on the RPO website here: https://www.webarchive.org.uk/wayback/archive/20220621111257/https://www.rpo.co.uk/rpo-resound/womens-euro-anthem 

Event pages:
There are lots of events going on around the UEFA Women’s Euro 2022. These range from official events to fan-led events and venues organising their own events, such as talks, book launches or watch parties for the matches. Eventbrite is one of the most popular platforms for ticketing these events, but have you seen any other platforms or websites?

A search on Eventbrite for Euro 2022 in the United Kingdom on the day of writing comes back with 500 pages.

Twitter accounts:
Archived copies of Twitter accounts are only accessible through a reading room, but you can view what we have selected here: https://www.webarchive.org.uk/en/ukwa/collection/4284

We have already added the Twitter accounts of the players for England, Northern Ireland and other players based in the UK. However, we may have missed some, so please let us know through the nomination form.

Get involved 
Anyone can suggest UK published websites to be included in the UK Web Archive by filling in our nomination form.

15 June 2022

Breaking the News - News collections in the Web Archive

By Jason Webber, Web Archive Engagement Manager, British Library

The British Library is currently running the wonderful ‘Breaking the News’ exhibition. If you’ve not seen it yet, make sure you check it out. It is open until Sun 21 Aug 2022. The exhibition explores how the news has impacted and influenced our society. This exploration includes modern digital forms of news, much of which is contained in the UK Web Archive (UKWA).

Breaking The News

The ‘News’ collection in UKWA contains over 2700 news sites that we archive. The scope ranges from major national news outlets (BBC, Guardian, Daily Mail etc.) to many local and even hyper-local news websites. The collection includes one newspaper, The Independent, that ceased being a print paper to become exclusively a digital one.

The majority of these archived news sites and Twitter accounts can only be viewed in reading rooms of UK Legal Deposit Libraries. Many, however, are openly available to view from home. Let's see some examples:

Local news
In addition to major national news outlets we collect thousands of local and hyper-local news websites. Many towns, suburbs and villages maintain a local news website and we do our best to archive them.

Brixton blog

Bristol cable

Archived website - Bristol Cable

Cranfield and Marston Vale Chronicle 

International
Whilst the focus of our collection is UK-based news, we do also collect some international or overseas publications. Tristan da Cunha, one of the remotest places on earth, maintains a news website for its residents.

Irish news - TheJournal.ie

Tristan da Cunha News 


About journalism
As well as news outlets aimed at us, the public, we also collect websites for journalists themselves.

The Bureau of Investigative Journalism

Media helping media


You can discover everything we have collected in the News collection via our website.

If you know of a UK news website (this might be about your local area), nominate it to the UK Web Archive.

31 May 2022

Can you help the UK Web Archive preserve football history?

By Helena Byrne, Curator of Web Archiving, British Library

image of a female footballer kicking a ball on a blue background

The UEFA Women's Euro 2022 competition is taking place across England from July 6 to July 31, 2022. We’re collecting websites about the 2022 UEFA Women’s EUROs. Nominate a website for us to archive – it’s free and easy to do.

Since the launch of the UK Web Archive in 2005, this is the second time that England has hosted the Women’s European Championships. England hosted the 2005 edition of the tournament, but this is the first time that the UK Web Archive has a dedicated collection on the event. In late 2017, the UK Web Archive started to formally curate sports websites by establishing three main collections on sport. They are the Sports Collection, Sports: Football and Sports: International Events.

The Sports: Football collection is divided into subsections based on the code of football and was given its own collection as football is the most popular sport in the UK. The final collection in this series is Sports: International Events, which documents major sporting events mostly hosted in the UK. It is in this collection that the UEFA Women's Euros England 2022 collection will sit.

The British Library is working in partnership with the official Women's Euros cultural programme led by the FA, the National Football Museum and the five other UK Legal Deposit Libraries that make up the UK Web Archive to curate this collection but we also want fans to get involved. 

text that says in partnership and then the logos for the British Library, Inspired by England 2022, National Football Museum and the UK Web Archive

This collection has six subsections that cover events both on and off the playing field:

Cultural Programme: Any websites and social media accounts related to the cultural programme during the tournament. This includes arts, heritage and learning events.

Fans: Websites, blogs and social media accounts written by fans of the sport.

Organisational Bodies/Venues: Football Association, Irish Football Association, match stadiums and local government websites.

Press Media & Comment: News and comment, including the UEFA Women's Euro England 2022 landing pages on BBC and other media websites etc.

Sponsors: UK Websites and news articles relating to some of the official sponsors of the UEFA Women's Euro England 2022.

Teams: Websites and social media accounts of players based in the UK. This will mostly be made up of players from England and Northern Ireland but also a few players from the other countries that qualified for the competition and live in the UK.

We need your help to ensure that information, discussion and creative output related to women’s football are preserved for future generations. Anyone can suggest UK published websites to be included in the UK Web Archive by filling in our nominations form: www.webarchive.org.uk/en/ukwa/info/nominate

30 May 2022

What UKWA did at the IIPC Web Archive Conference 2022

By Jason Webber, Engagement Manager, The British Library

Between 18 and 25 May 2022, we had the biggest annual event in the world of web archiving - The IIPC General Assembly and Web Archive Conference. Some of the sessions were for members only but many were free and open for anyone to attend.

IIPC conference banner

Here are the UKWA staff and research partners who gave presentations at the conference with links to their pre-recorded talks that have been uploaded to our YouTube channel.


23 May 2022

Building Event Collections from Web Archives

By Sara Abdollahi, PhD student, L3S Research Center

The world frequently experiences events, such as terrorist attacks, Brexit, and the migrant crisis, which have resulted in a vast amount of event-centric information on the web. Researchers, particularly digital humanities researchers and social scientists who analyse the significant events that influence and shape our societies, can benefit from web archives that reflect the perception of events as they happened at the time.

The Research challenge
Web archiving services provide a preserved state of the web that facilitates its study in the future. The ever-growing size of web archives is one of the main challenges in accessing information for specific research. It is often difficult or even impossible for researchers to find the documents they require. Typically, web archives offer interfaces that let users access the information they need through keyword search. Researchers can then type the name of the event they are interested in and retrieve a list of web documents containing that keyword. The returned results are often overwhelming due to their quantity, potential redundancy, and irrelevance, so an additional, intensive cleaning phase is needed to isolate the relevant web documents.

The UK Web Archive (UKWA), as well as some other web archives, offers manually curated event-centric collections to address this issue, but these can be considerably time-consuming to create. More importantly, these collections might not cover all necessary information related to a specific event.

A Potential Solution
To address the mentioned challenge, I propose automatically building event collections from web archives using knowledge graphs. Knowledge graphs such as Wikidata and DBpedia are collections of interlinked real-world entities and concepts.

In this research, I utilise the EventKG knowledge graph which provides structured information about events, their characteristics, and relationships (e.g., sub-events) and can thus be used as a resource for extending and diversifying the search space when building event collections.

Take the Arab Spring as an example: the Tunisian Revolution, the Bahraini protests of 2011, and the 2011 Yemeni revolution are three of its sub-events. The figure below demonstrates an example of using EventKG to create event collections for the Arab Spring.

Building Event collections diagram

By utilising sub-events to expand the initial user query, a more diverse initial set of documents can be retrieved. This process leads to increased precision and coverage of the final event collection. Traditional methods might miss documents related to sub-events if those documents do not mention the main event. To advance such methods, I demonstrate the impact of event-centric features and relations from a knowledge graph on building event collections.
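The idea of expanding a query with sub-events can be sketched in a few lines. This is a toy illustration only: the lookup table stands in for a real EventKG query, the function names are hypothetical, and plain substring matching replaces whatever retrieval and ranking the actual research uses.

```python
# Toy stand-in for a knowledge graph lookup. The sub-events are the ones named
# in the example above; a real system would query EventKG instead.
SUB_EVENTS = {
    "Arab Spring": [
        "Tunisian Revolution",
        "Bahraini protests of 2011",
        "2011 Yemeni revolution",
    ],
}


def expand_query(event: str) -> list[str]:
    """Return the event name followed by its known sub-events."""
    return [event] + SUB_EVENTS.get(event, [])


def retrieve(documents: dict[str, str], event: str) -> set[str]:
    """Return ids of documents mentioning the event or any of its sub-events."""
    terms = [term.lower() for term in expand_query(event)]
    return {
        doc_id
        for doc_id, text in documents.items()
        if any(term in text.lower() for term in terms)
    }
```

With this expansion, a document that only mentions the Tunisian Revolution is still retrieved for the query "Arab Spring", whereas a search on the main event name alone would miss it.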

Sara is giving a presentation of this project at IIPC Web Archive Conference 2022 (session 15) - Register for free.

17 May 2022

UK Web Archive Technical Update - Spring 2022

By Andy Jackson, Web Archive Technical Lead, British Library

Hadoop storage and replication
With the live services happily running off both the old and new Hadoop clusters, we have been focusing on setting up and populating our third Hadoop cluster, destined for the National Library of Scotland.

The Legal Deposit libraries have worked together to fund this additional, independent copy of the UK Web Archive holdings. This is primarily for the purposes of preservation, as having a further copy managed by a separate team and organisation will help ensure our records are not lost or damaged. Longer-term, this system can also function as an independent access and research platform, and this is something we hope to explore as part of the Archive of Tomorrow project.

As there is a petabyte of content to replicate, we were initially concerned that the process of migrating the data would take an extremely long time, and possibly put an unsustainable load on our internal network infrastructure. Happily, these worries were unfounded: over the last six weeks, we’ve replicated about 300TB of WARCs, and this has not caused any noticeable network capacity problems. We’ve also been able to start running cluster jobs that calculate checksums for the files on both ends of the replication, so we can verify everything is working.
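To illustrate the kind of check this enables, here is a minimal sketch that compares two checksum manifests, one per cluster. It is not the actual cluster job: the function names, the manifest shape (a mapping from file path to checksum) and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large WARCs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_manifests(source: dict[str, str], replica: dict[str, str]):
    """Compare two {path: checksum} manifests (hypothetical format).

    Returns (missing, mismatched): paths absent from the replica, and paths
    present on both sides whose checksums differ.
    """
    missing = sorted(p for p in source if p not in replica)
    mismatched = sorted(
        p for p in source if p in replica and source[p] != replica[p]
    )
    return missing, mismatched
```

An empty result on both lists would suggest the replica matches the source for every file in the manifest; anything else points at files to copy again.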

Computer server

Legal Deposit Access Solution
The current system for accessing Non-Print Legal Deposit material in our reading rooms has accessibility problems, and is being replaced with two components:

  • An enhanced version of PyWB that can render PDFs and ePubs.
  • An ‘NPLD Player’ app that will allow the content to be accessed from reading room PCs that have not been set up to prevent copies of items being accidentally taken away.

Both components are being developed through a contract with Webrecorder.

This quarter has mostly been about laying the groundwork for this (like writing deployment documentation), so we might make more progress next quarter.

Crawlers
We use web browsers to render a lot of seed pages, and this now represents a significant amount of data and includes a lot of duplication of common files and media. To mitigate this, we have enabled deduplication for the browser-based crawling.
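The post does not say how the deduplication is implemented, so the following is only a sketch of the general idea: hash each downloaded payload and store the content once, recording later identical copies as references to the first. The class and method names are hypothetical.

```python
import hashlib


class PayloadDeduplicator:
    """Remember payload digests so repeated content can be stored once."""

    def __init__(self):
        # Maps a payload digest to the URL of the first capture that had it.
        self._seen = {}

    def check(self, url: str, payload: bytes):
        """Return the URL of an earlier identical payload, or None if new."""
        digest = hashlib.sha1(payload).hexdigest()
        if digest in self._seen:
            return self._seen[digest]
        self._seen[digest] = url
        return None
```

A common script or font fetched while rendering many different seed pages would be stored the first time and reported as a duplicate on each later fetch.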

We’ve also improved monitoring of the process of moving WARCs and logs to Hadoop, so we can spot if backlogs are building up.

Annotation and Curation Tool (W3ACT)
For the core W3ACT service, the only changes have been to fix the links to QA Wayback that were being misdirected to the wrong URL, and to upgrade PyWB to 2.6.4.1.

However, we have been working on embedding additional services behind the W3ACT login. These include:

  • A way to view the logs from the W3ACT crawls.
  • An instance of SolrWayback, configured to search full text indexes from the W3ACT crawls.

Our Danish colleagues have been very helpful, collaborating with us to augment SolrWayback so it could be run with our systems. There are still some gaps (e.g. the internal playback part does not work reliably as our old Solr indexes do not provide all the fields SolrWayback needs) but it’s still very valuable as a way of exploring and evaluating how we might work in the future.

One gap, however, is that we haven’t yet replaced the Storage Report with one that is up-to-date and runs across both clusters (ukwa-notebook-apps#12). That should be done early in April.

UKWA Website
The majority of the work has focused on finishing the 'high-level category' view of the UKWA Topics and Themes, finalizing the design and pulling together the translations. 

In addition, like QA Wayback, the public PyWB service has been updated to 2.6.4.1, and we’ve shifted the services to new hardware.

Finally, we have been laying the groundwork for regular automated regression testing, including testing for accessibility issues. Once established, this will be a huge help, allowing us to modify our services with more confidence, knowing that if we accidentally break any critical functionality, the test suite will catch the problem early. This is particularly important as preparation for larger changes, like integrating static documentation and translations into the main website (ukwa-services#48).

Google Sheets Add-On No Longer Available
A while ago, we experimented with an add-on for Google Sheets that provided a way to query web archive holdings from an online spreadsheet (this COPTR link provides some additional information).

Unfortunately, this has become unavailable due to a particular kind of digital obsolescence: changes to Google’s policies. To make it work again, we have to modify our formal policies and documentation in a way that meets Google’s specific requirements. Realistically, due to other work taking priority, it’s likely to be some time before we are able to look at restoring it.

Read the previous UKWA Technical update (Jan 2021) blog post