UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

30 May 2022

What UKWA did at the IIPC Web Archive Conference 2022

By Jason Webber, Engagement Manager, The British Library

Between 18 and 25 May 2022, the biggest annual event in the world of web archiving took place: the IIPC General Assembly and Web Archive Conference. Some of the sessions were for members only, but many were free and open for anyone to attend.

IIPC conference banner

Here are the UKWA staff and research partners who gave presentations at the conference with links to their pre-recorded talks that have been uploaded to our YouTube channel.

 

 

23 May 2022

Building Event Collections from Web Archives

By Sara Abdollahi, PhD student, L3S Research Center

The world frequently experiences events, such as terrorist attacks, Brexit, and the migrant crisis, that have resulted in a vast amount of event-centric information on the web. Researchers, particularly digital humanities researchers and social scientists who analyse the significant events that influence and shape our societies, can benefit from web archives that reflect the perception of events as they happened at the time.

The Research challenge
Web archiving services provide a preserved state of the web that facilitates its study in the future. The ever-growing size of web archives is one of the main challenges in accessing information for specific research. It is often difficult or even impossible for researchers to find the documents they need. Typically, web archives offer interfaces that let users find information through keyword search. Researchers can type the name of the event they are interested in and retrieve a list of web documents containing that keyword. The returned results are often overwhelming due to their quantity, potential redundancy, and irrelevance, and need an additional, intensive cleaning phase to arrive at a more relevant set of web documents.

The UK Web Archive (UKWA), as well as some other web archives, offers manually collected event-centric collections to solve this issue, but these can be considerably time-consuming to create. More importantly, these collections might not cover all necessary information related to a specific event.

A Potential Solution
To address the mentioned challenge, I propose automatically building event collections from web archives using knowledge graphs. Knowledge graphs such as Wikidata and DBpedia are collections of interlinked real-world entities and concepts.

In this research, I utilise the EventKG knowledge graph which provides structured information about events, their characteristics, and relationships (e.g., sub-events) and can thus be used as a resource for extending and diversifying the search space when building event collections.

Take the Arab Spring as an example: the Tunisian Revolution, the Bahraini protests of 2011, and the 2011 Yemeni revolution are three of its sub-events. The figure below demonstrates an example of using EventKG to create event collections for the Arab Spring.

Building Event collections diagram

By utilising sub-events to expand the initial user query, a more diverse initial set of documents can be retrieved. This process leads to increased precision and coverage of the final event collection. Traditional methods might miss documents related to sub-events if those documents do not mention the main event. To advance such methods, I demonstrate the impact of event-centric features and relations from a knowledge graph on building event collections.
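
The query-expansion idea can be sketched in a few lines of Python. This is only a minimal illustration, not the project's implementation: the SUB_EVENTS mapping stands in for a lookup against EventKG, the small DOCUMENTS list stands in for a web archive index, and the function names are hypothetical.

```python
# Hypothetical stand-in for sub-event relations looked up in a knowledge graph.
SUB_EVENTS = {
    "Arab Spring": [
        "Tunisian Revolution",
        "Bahraini protests of 2011",
        "2011 Yemeni revolution",
    ],
}

# Hypothetical stand-in for a full-text index of archived pages.
DOCUMENTS = [
    "Timeline of the Arab Spring across the region",
    "Eyewitness reports from the Tunisian Revolution",
    "Analysis of the 2011 Yemeni revolution",
    "Recipe for lemon drizzle cake",
]


def expand_query(event):
    """Return the event name plus the names of its known sub-events."""
    return [event] + SUB_EVENTS.get(event, [])


def search(terms, documents):
    """Return documents that mention at least one term (case-insensitive)."""
    lowered = [term.lower() for term in terms]
    return [doc for doc in documents if any(t in doc.lower() for t in lowered)]


baseline = search(["Arab Spring"], DOCUMENTS)
expanded = search(expand_query("Arab Spring"), DOCUMENTS)
print(len(baseline), len(expanded))
```

In this toy example the expanded query also retrieves the two pages that mention only a sub-event, which is the kind of document a keyword search on the main event alone would miss.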

Sara is giving a presentation of this project at IIPC Web Archive Conference 2022 (session 15) - Register for free.

17 May 2022

UK Web Archive Technical Update - Spring 2022

By Andy Jackson, Web Archive Technical Lead, British Library

Hadoop storage and replication
With the live services happily running off both the old and new Hadoop clusters, we have been focusing on setting up and populating our third Hadoop cluster, destined for the National Library of Scotland.

The Legal Deposit libraries have worked together to fund this additional, independent copy of the UK Web Archive holdings. This is primarily for the purposes of preservation, as having a further copy managed by a separate team and organisation will help ensure our records are not lost or damaged. Longer-term, this system can also function as an independent access and research platform, and this is something we hope to explore as part of the Archives of Tomorrow project.

As there is a petabyte of content to replicate, we were initially concerned that the process of migrating the data would take an extremely long time, and possibly put an unsustainable load on our internal network infrastructure. Happily, these worries were unfounded: over the last six weeks, we’ve replicated about 300TB of WARCs, and this has not caused any noticeable network capacity problems. We’ve also been able to start running cluster jobs that calculate checksums for the files on both ends of the replication, so we can verify everything is working.
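
To give a flavour of what this verification involves, here is a minimal sketch of comparing checksums from both ends of a replication. It is not our actual cluster job: the choice of SHA-256, the function names and the tiny temporary files standing in for WARCs are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large WARCs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_sums, replica_sums):
    """Return names missing from the replica or whose checksums differ."""
    return sorted(
        name
        for name, checksum in source_sums.items()
        if replica_sums.get(name) != checksum
    )


# Demonstration with two small temporary files standing in for WARCs.
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "original.warc.gz")
    copy = os.path.join(workdir, "copy.warc.gz")
    for path in (original, copy):
        with open(path, "wb") as handle:
            handle.write(b"example WARC payload")
    source_sums = {"example.warc.gz": sha256_of(original)}
    replica_sums = {"example.warc.gz": sha256_of(copy)}

mismatches = find_mismatches(source_sums, replica_sums)
print(mismatches)
```

An empty list means every file on the source side has a matching checksum on the replica; any name that does appear would need to be copied again.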

Computer server

Legal Deposit Access Solution
The current system for accessing Non-Print Legal Deposit material in our reading rooms has accessibility problems, and is being replaced with two components:

  • An enhanced version of PyWB that can render PDFs and ePubs.
  • An ‘NPLD Player’ app that will allow the content to be accessed from reading room PCs that have not been set up to prevent copies of items being accidentally taken away.

Both components are being developed through a contract with Webrecorder.

This quarter has mostly been about laying the groundwork for this (like writing deployment documentation), so we might make more progress next quarter.

Crawlers
We use web browsers to render a lot of seed pages, and this now represents a significant amount of data and includes a lot of duplication of common files and media. To mitigate this, we have enabled deduplication for the browser-based crawling.
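
As a rough illustration of the general idea, the sketch below deduplicates responses by hashing their payloads. It assumes duplicates are detected by content digest and is not the crawler's actual mechanism; the function name, the example URLs and the use of SHA-256 are hypothetical.

```python
import hashlib


def deduplicate(responses):
    """Split (url, payload) pairs into stored payloads and duplicate references.

    Returns (stored, duplicates): 'stored' maps digest -> payload for the
    first time each payload is seen; 'duplicates' lists (url, digest) for
    later responses whose payload had already been stored.
    """
    stored = {}
    duplicates = []
    for url, payload in responses:
        digest = hashlib.sha256(payload).hexdigest()
        if digest in stored:
            duplicates.append((url, digest))
        else:
            stored[digest] = payload
    return stored, duplicates


responses = [
    ("https://example.org/a/jquery.js", b"common library code"),
    ("https://example.org/b/jquery.js", b"common library code"),
    ("https://example.org/b/index.html", b"<html>page b</html>"),
]
stored, duplicates = deduplicate(responses)
print(len(stored), len(duplicates))
```

Here the shared script is stored once, and the second copy is recorded only as a reference to the digest of the first.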

We’ve also improved monitoring of the process of moving WARCs and logs to Hadoop, so we can spot if backlogs are building up.

Annotation and Curation Tool (W3ACT)
For the core W3ACT service, the only changes have been to fix the links to QA Wayback that were being misdirected to the wrong URL, and upgrade PyWBs to 2.6.4.1.

However, we have been working on embedding additional services behind the W3ACT login. These include:

  • A way to view the logs from the W3ACT crawls.
  • An instance of SolrWayback, configured to search full text indexes from the W3ACT crawls.

Our Danish colleagues have been very helpful, collaborating with us to augment SolrWayback so it could be run with our systems. There are still some gaps (e.g. the internal playback part does not work reliably as our old Solr indexes do not provide all the fields SolrWayback needs) but it’s still very valuable as a way of exploring and evaluating how we might work in the future.

One gap, however, is that we haven’t yet replaced the Storage Report with one that is up-to-date and runs across both clusters (ukwa-notebook-apps#12). That should be done early in April.

UKWA Website
The majority of the work has focused on finishing the 'high-level category' view of the UKWA Topics and Themes, finalizing the design and pulling together the translations. 

In addition, like QA Wayback, the public PyWB service has been updated to 2.6.4.1, and we’ve shifted the services to new hardware.

Finally, we have been laying the groundwork for regular automated regression testing, including testing for accessibility issues. Once established, this will be a huge help, allowing us to modify our services with more confidence, knowing that if we accidentally break any critical functionality, the test suite will catch the problem early. This is particularly important as preparation for larger changes, like integrating static documentation and translations into the main website (ukwa-services#48).

Google Sheets Add-On No Longer Available
A while ago, we experimented with an add-on for Google Sheets that provided a way to query web archive holdings from an online spreadsheet (this COPTR link provides some additional information).

Unfortunately, this has become unavailable due to a particular kind of digital obsolescence: changes to Google’s policies. To make it work again, we have to modify our formal policies and documentation in a way that meets Google’s specific requirements. Realistically, due to other work taking priority, it’s likely to be some time before we are able to look at restoring it.

Read the previous UKWA Technical update (Jan 2021) blog post

11 May 2022

The Queen's Platinum Jubilee in the UK Web Archive

By Daniela Major, PhD Student, School of Advanced Studies, University of London

Whether you’re an avid monarchist, a staunch republican or simply obsessed with Netflix’s “The Crown”, there is no doubt that Elizabeth II has achieved a unique place in history. The 70 years of her reign have been witness to profound changes in world politics and in British society. When she was crowned, Churchill was her Prime Minister, Khrushchev was freshly in charge of the Soviet Union and Eisenhower had just become the President of the United States.

Queen Elizabeth II

Throughout her decades as monarch, Queen Elizabeth has worked with 14 UK Prime Ministers and met 13 American Presidents. She has received state visits from countless foreign leaders, who themselves influenced the shape of 20th and 21st century history: from Charles de Gaulle to Mikhail Gorbachev.

During her reign, the United Kingdom went through dramatic changes, from the dismantling of the British Empire to referendums on Welsh devolution and Scottish independence. The Queen’s honours list depicts a country where diversity is celebrated. She’s given honours to authors such as V.S. Naipaul and Salman Rushdie, singers like Paul McCartney and Bono and artists like Paula Rego.

For many reasons, the Platinum Jubilee is a great opportunity to explore this dialogue between the present and the past. How and why we celebrate, or how and why we refuse to do so, places us in a specific historical context. In this case, right into 21st century UK, in a world in constant change.

Queens Platinum Jubilee logos

So far, we have discovered that food is a favourite in every celebration. Fortnum & Mason and the Big Jubilee Lunch are celebrating the Jubilee by sponsoring a competition awarding the best pudding – following the Victoria Sponge, named after Queen Victoria, and Coronation Chicken, created in honour of Elizabeth II’s coronation. The judges include Mary Berry of Great British Bake-Off fame, food historian Regula Ysewijn and MasterChef’s Monica Galetti.

A slew of cultural celebrations are on the cards: The Reading Agency launched the Big Jubilee Read which chose ten outstanding books from the last 7 decades. The Royal Mint has created a commemorative coin and the Royal Philharmonic Concert Orchestra gave a concert at the Royal Albert Hall. Throughout the whole of the UK, Town Councils are preparing for street parties, tree planting, and jubilee lunches.

This is where you come in. The UK Web Archive wants to know how you are choosing to remember this Jubilee.

  • Are you taking part in the Jubilee’s bake-off?
  • Are you lighting a beacon or attending a street party?
  • Are you going to a protest? Have you written about how the UK cannot have 70 more years of monarchism?

Help us remember this moment in history so that future historical sources reflect the full diversity of public activity. Help us show how people across the UK celebrate important dates, how they look back to their own past and how they celebrate their present.

If you know of a website worth keeping for posterity, nominate it and make your suggestion.

23 February 2022

International Women’s Day 2022 - save your event ad now!

By Helena Byrne, Curator Web Archives, The British Library

8th March is International Women’s Day (IWD). Originally started in the trade union movement, IWD was an important day to highlight the inequalities women face and to campaign for equal rights. In recent years, IWD has had a wider remit and includes celebrating the cultural, political, and socioeconomic achievements as well as struggles of women.

British Library Votes for Women exhibition website

British Library, Votes for Women online exhibition webpages, archived 2018

Events of all kinds are held on 8th March or close to that date to mark the occasion. Most of these events are advertised online through websites, social media and on online event platforms like Eventbrite. A simple search on Eventbrite for International Women’s Day brings up 500 pages from around the world but mostly in the UK. A little over half of those events are advertised for London.

Are you attending or organising an IWD event this year that is advertised online? Nominate that website/online advert to the UK Web Archive by filling in our ‘Save a website’ form.

Glasgow Women's Library

Glasgow Women's Library, archived 2008.

What is the UK Web Archive?
The UK Web Archive is a collaboration of the six UK legal deposit libraries working together to preserve websites for future generations. We archive websites published in the UK on a wide variety of subjects, such as politics, sports, hobbies and social issues, and have over a hundred curated collections in the UK Web Archive.

On IWD 2022 you might be interested in browsing some of our collections related to women’s rights such as Unfinished Business: The Fight for Women’s Rights (2020), Gender Equality (2018), Political Action & Communication (2015), and Women’s Issue (2005-2013).

We work with subject experts to curate our collections but also take nominations from the public so please nominate your IWD event ads or any other UK published content that you feel should be included in the UK Web Archive by filling in our ‘Save a website’ form.

07 February 2022

The Queen’s Platinum Jubilee in the UK Web Archive

By Nicola Bingham, Lead Curator, Web Archives, British Library.

6th February 2022 marks 70 years since King George VI passed away in his sleep at the royal estate at Sandringham and Princess Elizabeth, his oldest daughter and next in line to the throne (who was in Kenya at the time) became Queen. She was crowned a little over a year later as Queen Elizabeth II on June 2, 1953, at age 27.

Platinum Jubilee
2022 will see communities come together to celebrate the Queen’s Platinum Jubilee throughout the UK and Commonwealth. Celebrations will include street parties; concerts; the 'Queen's Green Canopy', an initiative to plant a tree for the jubilee; a Jubilee Pageant; 'a River of Hope', made up of flags decorated with images of hope drawn by children, which will make its way along the Mall; and the opening of the royal palaces, Sandringham and Balmoral to visitors, amongst other events. The focal point of the celebrations will be a four-day long bank holiday weekend, from Thursday 2nd to Sunday 5th June.

Queens Platinum Jubilee logos

The Queen's Platinum Jubilee logos in English and Welsh.

The UK Web Archive will be capturing a record of this momentous event in a special collection about the Platinum Jubilee.

We have a series of collections related to the Queen and other members of the royal family, dating back to 2012 when the Queen celebrated her Diamond Jubilee. The Diamond Jubilee collection was co-curated by the Royal Archives at Windsor, the UK Legal Deposit Libraries and the Institute of Historical Research. It contains archived copies of websites produced by the Royal Household, together with a wide range of related material such as blogs, commentaries and news articles, as well as anti-monarchist and opposing views. Similarly, in 2016 the Legal Deposit Libraries curated a collection to mark the Queen’s 90th Birthday.

Diamond Jubilee commemorative marmite

Source: boingboing.net/2012/04/20/maamite-is-jubilee-marmite.html

In 2021 staff at the Legal Deposit Libraries curated a collection to commemorate His Royal Highness Prince Philip the Duke of Edinburgh (10th June 1921 to 9th April 2021). It includes the websites of many of the organisations that the Duke was associated with during his lifetime, either as President, Patron, Honorary Member or in another capacity. The Duke had special interests in scientific and technological research and development, the welfare of young people, education, conservation, the environment and the encouragement of sport. The collection also includes statements from Commonwealth Organisations reflecting on the life of the Duke.

British Racing Car Drivers website with Prince Philip

Source: www.webarchive.org.uk/wayback/archive/20210421115025/http://www.brdc.co.uk/HRH-The-Prince-Philip-Duke-of-Edinburgh-KG-KT

This year we will be curating a collection to mark the Queen's Platinum Jubilee. We'll announce ways to contribute to this collection in the near future. In the meantime, if you know of, or contribute to, a UK or Commonwealth website with relevant content related to the Queen, please let us know via our ‘Save a website’ form. We would be particularly pleased to hear from you if your website features personal or community stories about the Platinum Jubilee, or previous years’ Jubilee celebrations.

19 January 2022

Explore Women’s Football in the UK Web Archive

By Helena Byrne, Curator Web Archives, The British Library

On 5 December 1921, the Football Association (FA) banned women from playing football on affiliated grounds and stated that football is “quite unsuitable for females and ought not to be encouraged” (FIFA.com). It took almost fifty years to overturn this ban. With the formation of the Women’s Football Association (WFA) in 1969, the FA came under more pressure to remove the ban. It was at the FA Council Meeting on January 19th, 1970 that the FA made the decision to rescind the Council's Resolution of 1921.

To celebrate 52 years since the ban was lifted, this blog post gives a quick overview of women’s football in the UK Web Archive (UKWA). To mark National Sporting Heritage Day back in 2018 we published a blog post outlining the UKWA sports collection policies. 

History of women's football website in the uk web archive

History of the Women's FA, archived in 2018

Sport has always been included in the UKWA since its formation in 2005. In recent years we have been blogging more about these collections. Football in all its varieties is probably the most popular sport in the UK, which is why there is a collection dedicated exclusively to football and related activities. The most developed subsection of this collection is on soccer, with almost 4,000 items in the collection. These range from individual web pages and subsections of websites to full websites, blogs and some social media platforms.

Explore the extensive Soccer collection on the UK Web Archive Website.

We have collected a wide range of content from sports clubs (amateur and professional), fan sites, football research and events. There is no distinction in the collection based on gender as all content related to the sport is treated equally. 

Accessing the UK Web Archive
Under the Non-Print Legal Deposit Regulations 2013, we can archive UK published websites but are only able to make the archived version available to people outside the Legal Deposit Libraries Reading Rooms if the website owner has given permission.

Some of the websites in UKWA that have already had permission granted include Charlton Athletic Women, Sent Her Forward and Tartan Kicks: The Magazine For Scottish Women's Football. Some examples of websites that are onsite-only access include the Crawley Old Girls (COGS), Her Game Too and Dick, Kerr Ladies FC 1917-1965: Women's Football History.

Tartan kicks website in the UK Web Archive

Tartan Kicks website, archived in 2019

As the content of UKWA has mixed access, the message ‘Viewable only on Library premises’ will appear under the title of the website if you need to visit a Legal Deposit Library to view the content. If there is no message underneath then the archived version of the website should be available on your personal device.

Get involved with preserving women’s football online with the UK Web Archive
The UK Web Archive works across the six UK Legal Deposit Libraries and with other external partners to try and bridge gaps in our subject expertise. But we can’t curate the whole of the UK web on our own; we need your help to ensure that information, discussion and creative output related to women’s football are preserved for future generations. Anyone can suggest UK published websites to be included in the UK Web Archive by filling in our nominations form.

Keep an eye on the UKWA blog and Twitter account to find out more details on our forthcoming collection to preserve the UEFA Women's Euro 2022 competition taking place across England from July 6 to July 31, 2022. 

06 January 2022

UKWA 2021 Technical update

By Andy Jackson, UKWA Technical Lead, British Library

During the last quarter of 2021, the technical services that make up the UK Web Archive underwent a lot of changes behind the scenes. These changes should help us to improve our services, so it’s worth explaining a little about what’s been going on.

Starting the Hadoop 3 Migration
Our Hadoop cluster is now quite old, and updating this to a newer version has been a long-standing issue. The old Hadoop version no longer gets updates, and is not supported by modern tools and libraries, which prevents us from making the most of what’s available.

For a long time, it was unclear how best to proceed – an in-place update seemed too risky, but a cluster-to-cluster migration appeared to require too much hardware. So, over recent years, we have spent time learning how to set up and maintain a Hadoop 3 cluster, and evaluating different migration strategies, focusing on how we might maintain service during any migration.

We eventually decided a cluster-to-cluster migration should be possible, as long as we can purchase higher-density storage so we have enough headroom to migrate content over ahead of migrating hardware. Earlier in the year, following some procurement delays, we were able to purchase and establish this new Hadoop 3 cluster, with each server providing over 450TB of raw storage (compared to about 85TB per server for the older cluster).

While this was being set up, we also had to generalize our services so that all important processes can be run across both clusters, and so that WARC records can be retrieved from either. This has been quite time-consuming, but as 2021 drew to a close (and space on the older cluster was getting tight!), we were finally able to shift things so that newly-harvested content is written to the new Hadoop 3 cluster.

Behind the scenes, our file tracking database was updated to scan both clusters and act as a record of which files are where, and to update this record hourly rather than just once per day. A new WARC Server component was created that takes Wayback requests for WARC records, uses the tracking database to work out which cluster they are on, and then grabs and returns the WARC record in question.
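
The lookup-then-fetch pattern can be sketched as follows. This is a simplified illustration rather than the real WARC Server: it assumes a request names a WARC file plus a byte offset and length, and the tracking records, cluster names and in-memory "clusters" are all hypothetical stand-ins.

```python
# Hypothetical tracking records: WARC filename -> cluster holding it.
TRACKING_DB = {
    "crawl-2019-0001.warc.gz": "hadoop-old",
    "crawl-2021-0042.warc.gz": "hadoop-3",
}

# Hypothetical in-memory "clusters": cluster name -> {filename: file bytes}.
CLUSTERS = {
    "hadoop-old": {"crawl-2019-0001.warc.gz": b"AAAArecord-oneBBBB"},
    "hadoop-3": {"crawl-2021-0042.warc.gz": b"CCrecord-twoDD"},
}


def fetch_record(filename, offset, length):
    """Find which cluster holds the file, then return the requested byte range."""
    cluster = TRACKING_DB.get(filename)
    if cluster is None:
        raise KeyError(f"{filename} is not in the tracking database")
    data = CLUSTERS[cluster][filename]
    return data[offset:offset + length]


print(fetch_record("crawl-2021-0042.warc.gz", 2, 10))
```

The caller never needs to know which cluster a file lives on, so files can move between clusters as long as the tracking records are kept up to date.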

In the future, the tracking database will be used to help orchestrate the movement of content to Hadoop 3, with hardware being shifted over as it becomes available. The new WARC Server means that we will be able to maintain an uninterrupted service throughout.

But to avoid interruption now, we also needed to enable access to the newer content on Hadoop 3 by indexing it for playback. To this end, a new CDX indexer implementation has been created that can be run on either cluster (built with Webrecorder’s Python tools rather than Java). As before, the tracking database is used to keep track of what’s been indexed, but both clusters can now be indexed promptly.

Similarly, although not fully moved into production yet, the Document Harvester document extractor and the Solr full-text indexing tasks have been re-written to be able to run on either cluster, and be more robust than the prior implementations.

At the time of writing, the main public website and the internal Storage Report have not been fully moved over to run across both systems, so there may be some slight inconsistencies there in the short term. However, we expect to resolve this in the next week or two.

Task Orchestration via Apache Airflow
This large set of changes has also been used as an opportunity to update how our critical web-archiving tasks are implemented and orchestrated. We were using the Luigi framework to define tasks and their dependencies, but over time we have found this to be problematic in a number of ways:

  • The code that performs tasks and the code that orchestrates those tasks were mixed together in the same source files. This made it very hard to work on improving any individual task on its own, and made testing difficult.
  • The Luigi task scheduling seems to be unreliable, with processes occasionally getting stuck and not making any progress, or not raising any errors on failure. This particularly affected the Document Harvester, leading to a number of outages.
  • The Luigi task management interface is not very useful. It does not make it easy to look at previous runs, and presents very little detail.
  • The way Luigi encourages task dependencies to be coded makes it very difficult to clear out those dependencies so tasks can be re-run.

Therefore, while updating the various web archive tasks, they have been modified to run under Apache Airflow.

Apache airflow

This is a popular and very widely used workflow definition and scheduling system, with both Google and Amazon offering Airflow as a fully managed cloud service, as well as a healthy open source community around it. Along with this choice of workflow platform, we have also chosen to implement each task tool as a separate standalone Python command-line program. This means:

  • Task code is separate from orchestration, can be developed independently, and tasks can be deployed as Docker containers, which keeps the underlying software dependencies apart.
  • We get to use the Airflow scheduler, which appears to be more reliable, will warn us when tasks get stuck or fail, and provides Prometheus integration for monitoring.
  • The Airflow Web UI is very detailed, allows access to task logs, summaries of runs and statistics, makes workflow management easier, and provides a framework for documenting each workflow.
  • The Airflow Web UI also makes it easy to clear the status of failed workflow runs so they can be re-run as needed.
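
To illustrate what keeping task code separate from orchestration can look like, here is a minimal sketch of a standalone command-line task. It is not one of our actual tools: the task (counting non-empty lines in a log), the function names and the temporary file used for the demonstration are all invented for this example.

```python
import argparse
import os
import tempfile


def count_entries(text):
    """The task logic itself: count non-empty lines in a log."""
    return sum(1 for line in text.splitlines() if line.strip())


def main(argv):
    """Command-line entry point that knows nothing about any scheduler."""
    parser = argparse.ArgumentParser(description="Count entries in a log file.")
    parser.add_argument("logfile", help="path to the log file to summarise")
    args = parser.parse_args(argv)
    with open(args.logfile, encoding="utf-8") as handle:
        total = count_entries(handle.read())
    print(total)
    return total


# Demonstration: call the entry point as a scheduler or shell would,
# passing only command-line arguments.
with tempfile.TemporaryDirectory() as workdir:
    log_path = os.path.join(workdir, "crawl.log")
    with open(log_path, "w", encoding="utf-8") as handle:
        handle.write("fetched a\n\nfetched b\n")
    total = main([log_path])
```

Because the task only depends on its arguments, it can be tested on its own, packaged in a container, and invoked by whichever orchestrator is in use without that orchestrator's code appearing in the task.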

Over time, we expect to move all web archiving tasks over to this system.

W3ACT
W3ACT is used by UKWA curators and other authorised users to add targets and manage quality assurance and licensing. There have only been minor updates to the W3ACT curation service lately, rolled out towards the end of December.

  • QA Wayback is now running PyWB version 2.6.3 for improved playback (e.g. ukwa-pywb#70).
  • Improvements to how the W3ACT authentication cookie is handled, resolving w3act#662.

UKWA Website
Most of the recent work on the UKWA website (www.webarchive.org.uk) user interface has focused on improving the presentation of our large set of curated collections by grouping them into categories. This work is still being discussed and developed internally, so isn’t part of the public website yet. However, we’re making good progress and hope to release a new version of the website over the coming weeks.

Apart from the interface itself, some additional work has been done to update the internal services (e.g. update PyWB to version 2.6.3 and add the WARC Server to read content from both Hadoop clusters), and move the deployment to our newer production platform. As indicated above, these updates should be rolled out shortly.

2021 Domain Crawl
As in 2020, the 2021 Domain Crawl was run on the Amazon Web Services cloud. This time, following improvements to Heritrix and building on prior experience, the crawl ran more smoothly and efficiently than in 2020, using less memory and disk space for the crawl frontier. The crawler was started up early in August for penetration testing, and then taken down while the security concerns were addressed. The actual crawl began on the 24th of August, starting with 10 million seed URLs, and the vast majority of the crawl had completed by mid-November. Most of the 27 million hosts we visited were crawled completely, but ~57,200 hosts did hit the 500MB size cap. However, some of these were content distribution networks (CDNs), i.e. services hosting resources for other sites, so some caps were lifted manually and the crawl was allowed to continue.

URL rates in UKWA domain crawl

On the 30th of December, the crawl was stopped, having processed 2.04 billion URLs and downloaded 99.6 TB of data (uncompressed). However, a lot of the CDN content remained uncollected, and would take a very long time to collect under Heritrix’s normal ‘politeness’ rules. In the future, it would be good to find a way to allow Heritrix to crawl these sites much more quickly, without having to manually intervene to decide which hosts are CDNs.
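
For a rough sense of scale, the figures above can be turned into averages. This is back-of-envelope arithmetic only: it assumes both dates fall in 2021, that 99.6 TB means decimal terabytes, and that activity was spread evenly, which it was not, since most of the crawl had completed by mid-November.

```python
from datetime import date

urls_processed = 2.04e9          # from the post
bytes_downloaded = 99.6e12       # 99.6 TB, assuming decimal terabytes
crawl_start = date(2021, 8, 24)  # from the post, year assumed
crawl_stop = date(2021, 12, 30)  # from the post, year assumed

crawl_days = (crawl_stop - crawl_start).days
average_kb_per_url = bytes_downloaded / urls_processed / 1000
average_urls_per_second = urls_processed / (crawl_days * 24 * 60 * 60)

print(crawl_days)
print(round(average_kb_per_url, 1))
print(round(average_urls_per_second))
```

Under those assumptions the crawl ran for 128 days, averaging a little under 50 KB per URL and somewhere around 180 URLs per second overall.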

At this time, it has not been decided whether the 2022 Domain Crawl will be run in the cloud or from our Boston Spa site. Either way, we expect to begin the process of transferring domain crawl 2020/2021 content from AWS to our Hadoop 3 cluster over this next year.

Upcoming work
In the next quarter (Jan-Mar 2022), as well as the future updates outlined above, we are also expecting to:

  • Receive hardware for the additional Hadoop 3 replication cluster, then start setting it up and populating it ahead of it being transferred to the National Library of Scotland later in the year
  • Improve monitoring of the process of moving WARCs and logs to Hadoop (in part to ensure we spot problems with the Document Harvester earlier)
  • Add improved reporting services, replacing the current Storage Report with one that is up-to-date and runs across both clusters (ukwa-notebook-apps#12)
  • Integrate static documentation and translations into the main website, via a simple CMS (ukwa-services#48). This will make it easier to add more pages and manage the translation of those pages to/from Welsh and Scottish Gaelic.
  • Begin implementing the NPLD Player, which we need in order to improve reading-room access across the Legal Deposit libraries. We’re currently finalizing the details of how our external partner will help us do this, and more details will be made available over the next couple of months.