UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

17 May 2022

UK Web Archive Technical Update - Spring 2022

By Andy Jackson, Web Archive Technical Lead, British Library

Hadoop storage and replication
With the live services happily running off both the old and new Hadoop clusters, we have been focusing on setting up and populating our third Hadoop cluster, destined for the National Library of Scotland.

The Legal Deposit libraries have worked together to fund this additional, independent copy of the UK Web Archive holdings. This is primarily for the purposes of preservation, as having a further copy managed by a separate team and organisation will help ensure our records are not lost or damaged. Longer-term, this system can also function as an independent access and research platform, and this is something we hope to explore as part of the Archives of Tomorrow project.

As there is a petabyte of content to replicate, we were initially concerned that the process of migrating the data would take an extremely long time, and possibly put an unsustainable load on our internal network infrastructure. Happily, these worries were unfounded: over the last six weeks, we’ve replicated about 300TB of WARCs, and this has not caused any noticeable network capacity problems. We’ve also been able to start running cluster jobs that calculate checksums for the files on both ends of the replication, so we can verify everything is working.
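As a rough illustration of the verification step described above, the sketch below compares two checksum manifests, one from each end of a replication. It is not the code we run on the clusters: the function names, the manifest layout (file path mapped to hex digest) and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List, NamedTuple


class ReplicationReport(NamedTuple):
    missing: List[str]     # on the source but absent from the replica
    mismatched: List[str]  # present on both ends, but checksums differ
    verified: List[str]    # present on both ends with identical checksums


def sha256_hex(data: bytes) -> str:
    """Checksum of a file's bytes (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def compare_manifests(source: Dict[str, str], replica: Dict[str, str]) -> ReplicationReport:
    """Compare hypothetical path -> checksum manifests from the two clusters."""
    missing, mismatched, verified = [], [], []
    for path in sorted(source):
        if path not in replica:
            missing.append(path)
        elif replica[path] != source[path]:
            mismatched.append(path)
        else:
            verified.append(path)
    return ReplicationReport(missing, mismatched, verified)


# Illustrative data only: three files on the source, two on the replica.
source = {
    "/warcs/a.warc.gz": sha256_hex(b"alpha"),
    "/warcs/b.warc.gz": sha256_hex(b"beta"),
    "/warcs/c.warc.gz": sha256_hex(b"gamma"),
}
replica = {
    "/warcs/a.warc.gz": sha256_hex(b"alpha"),
    "/warcs/b.warc.gz": sha256_hex(b"corrupted"),
}
report = compare_manifests(source, replica)
```

Anything listed as missing or mismatched would need to be copied again before the replica could be trusted as a preservation copy.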

Legal Deposit Access Solution
The current system for accessing Non-Print Legal Deposit material in our reading rooms has accessibility problems, and is being replaced with two components:

  • An enhanced version of PyWB that can render PDFs and ePubs.
  • An ‘NPLD Player’ app that will allow the content to be accessed from reading room PCs that have not been set up to prevent copies of items being accidentally taken away.

Both components are being developed through a contract with Webrecorder.

This quarter has mostly been about laying the groundwork for this (like writing deployment documentation), so we might make more progress next quarter.

Crawlers
We use web browsers to render a lot of seed pages. The resulting captures now represent a significant amount of data and include a lot of duplicated common files and media. To mitigate this, we have enabled deduplication for the browser-based crawling.
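To give a flavour of what deduplication means here, the sketch below shows one common approach in web archiving: hashing each payload and recording only a lightweight "revisit" pointer when the same bytes have been stored before. It illustrates the general idea rather than our crawler configuration; the class name, the in-memory index and the use of SHA-1 are assumptions for the example.

```python
import hashlib
from typing import Dict, Tuple


class PayloadDeduplicator:
    """Hypothetical in-memory index of payload digests seen during a crawl."""

    def __init__(self) -> None:
        # Maps payload digest -> URL under which that payload was first stored.
        self._seen: Dict[str, str] = {}

    def classify(self, url: str, payload: bytes) -> Tuple[str, str]:
        """Return ("response", url) for new content, or
        ("revisit", original_url) when identical bytes were already stored."""
        digest = hashlib.sha1(payload).hexdigest()
        original_url = self._seen.get(digest)
        if original_url is None:
            self._seen[digest] = url
            return ("response", url)
        return ("revisit", original_url)


# Illustrative data only: the same script is fetched while rendering two pages.
dedup = PayloadDeduplicator()
script = b"console.log('shared library');"
first = dedup.classify("https://example.org/a/lib.js", script)
second = dedup.classify("https://example.org/b/lib.js", script)
third = dedup.classify("https://example.org/b/page.html", b"<html></html>")
```

In this example the second fetch is classified as a revisit of the first, so the shared file would only be stored in full once.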

We’ve also improved monitoring of the process of moving WARCs and logs to Hadoop, so we can spot if backlogs are building up.

Annotation and Curation Tool (W3ACT)
For the core W3ACT service, the only changes have been to fix the links to QA Wayback that were being misdirected to the wrong URL, and upgrade PyWBs to 2.6.4.1.

However, we have been working on embedding additional services behind the W3ACT login. These include:

  • A way to view the logs from the W3ACT crawls.
  • An instance of SolrWayback, configured to search full text indexes from the W3ACT crawls.

Our Danish colleagues have been very helpful, collaborating with us to augment SolrWayback so it could be run with our systems. There are still some gaps (e.g. the internal playback part does not work reliably as our old Solr indexes do not provide all the fields SolrWayback needs) but it’s still very valuable as a way of exploring and evaluating how we might work in the future.

One gap, however, is that we haven’t yet replaced the Storage Report with one that is up to date and runs across both clusters (ukwa-notebook-apps#12). That should be done early in April.

UKWA Website
The majority of the work has focused on finishing the 'high-level category' view of the UKWA Topics and Themes, finalising the design and pulling together the translations.

In addition, like QA Wayback, the public PyWB service has been updated to 2.6.4.1, and we’ve shifted the services to new hardware.

Finally, we have been laying the groundwork for regular automated regression testing, including testing for accessibility issues. Once established, this will be a huge help, allowing us to modify our services with more confidence, knowing that if we accidentally break any critical functionality, the test suite will catch the problem early. This is particularly important as preparation for larger changes, like integrating static documentation and translations into the main website (ukwa-services#48).

Google Sheets Add-On No Longer Available
A while ago, we experimented with an add-on for Google Sheets that provided a way to query web archive holdings from an online spreadsheet (this COPTR link provides some additional information).

Unfortunately, this has become unavailable due to a particular kind of digital obsolescence: changes to Google’s policies. To make it work again, we have to modify our formal policies and documentation in a way that meets Google’s specific requirements. Realistically, due to other work taking priority, it’s likely to be some time before we are able to look at restoring it.

Read the previous UKWA Technical update (Jan 2021) blog post

11 May 2022

The Queen's Platinum Jubilee in the UK Web Archive

By Daniela Major, PhD Student, School of Advanced Studies, University of London

Whether you’re an avid monarchist, a staunch republican or simply obsessed with Netflix’s “The Crown”, there is no doubt that Elizabeth II has achieved a unique place in history. The 70 years of her reign have been witness to profound changes in world politics and in British society. When she was crowned, Churchill was her Prime Minister, Khrushchev was freshly in charge of the Soviet Union and Eisenhower had just become the President of the United States.

Throughout her decades as monarch, Queen Elizabeth has worked with 14 UK Prime Ministers and met 13 American Presidents. She has received state visits from countless foreign leaders, who themselves influenced the shape of 20th and 21st century history: from Charles de Gaulle to Mikhail Gorbachev.

During her reign, the United Kingdom went through dramatic changes, from the dismantling of the British Empire to referendums on Welsh devolution and Scottish independence. The Queen’s honours lists depict a country where diversity is celebrated. She’s given honours to authors such as V.S. Naipaul and Salman Rushdie, singers like Paul McCartney and Bono and artists like Paula Rego.

For many reasons, the Platinum Jubilee is a great opportunity to explore this dialogue between the present and the past. How and why we celebrate, or how and why we refuse to do so, places us in a specific historical context: in this case, the 21st-century UK, in a world in constant change.

So far, we have discovered that food is a favourite in every celebration. Fortnum & Mason and the Big Jubilee Lunch are celebrating the Jubilee by sponsoring a competition awarding the best pudding – following the Victoria Sponge, named after Queen Victoria, and Coronation Chicken, created in honour of Elizabeth II’s coronation. The judges include Mary Berry of Great British Bake-Off fame, food historian Regula Ysewijn and MasterChef’s Monica Galetti.

A slew of cultural celebrations are on the cards: The Reading Agency launched the Big Jubilee Read, which chose ten outstanding books from the last seven decades. The Royal Mint has created a commemorative coin and the Royal Philharmonic Concert Orchestra gave a concert at the Royal Albert Hall. Throughout the whole of the UK, town councils are preparing for street parties, tree planting, and jubilee lunches.

This is where you come in. The UK Web Archive wants to know how you are choosing to remember this Jubilee.

  • Are you taking part in the Jubilee’s bake-off?
  • Are you lighting a beacon or attending a street party?
  • Are you going to a protest? Have you written about how the UK cannot have 70 more years of monarchism?

Help us remember this moment in history so that future historical sources reflect the full diversity of public activity. Help us show how people across the UK celebrate important dates, how they look back to their own past and how they celebrate their present.

If you know of a website worth keeping for posterity, nominate it and make your suggestion.

23 February 2022

International Women’s Day 2022 - save your event ad now!

By Helena Byrne, Curator Web Archives, The British Library

8th March is International Women’s Day (IWD). Originally started in the trade union movement, IWD was an important day to highlight the inequalities women face and to campaign for equal rights. In recent years, IWD has had a wider remit and includes celebrating the cultural, political, and socioeconomic achievements as well as struggles of women.

British Library, Votes for Women online exhibition webpages, archived 2018

Events of all kinds are held on 8th March or close to that date to mark the occasion. Most of these events are advertised online through websites, social media and online event platforms like Eventbrite. A simple search on Eventbrite for International Women’s Day brings up 500 pages of events from around the world, but mostly in the UK. A little over half of those events are advertised for London.

Are you attending or organising an IWD event this year that is advertised online? Nominate that website/online advert to the UK Web Archive by filling in our ‘Save a website’ form.

Glasgow Women's Library, archived 2008.

What is the UK Web Archive?
The UK Web Archive is a collaboration of the six UK legal deposit libraries working together to preserve websites for future generations. We archive websites published in the UK on a wide variety of subjects, such as politics, sports, hobbies and social issues, and have over a hundred curated collections in the UK Web Archive.

On IWD 2022 you might be interested in browsing some of our collections related to women’s rights such as Unfinished Business: The Fight for Women’s Rights (2020), Gender Equality (2018), Political Action & Communication (2015), and Women’s Issue (2005-2013).

We work with subject experts to curate our collections but also take nominations from the public so please nominate your IWD event ads or any other UK published content that you feel should be included in the UK Web Archive by filling in our ‘Save a website’ form.