UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

19 October 2020

Exploring media events with Shine

By Caio Mello, Doctoral Researcher at the School of Advanced Study, University of London

Computer screen with some HTML code on the screen

This blogpost is a summary of the presentation I delivered with my colleague Daniela Major at the conference Engaging with Web Archives: ‘Opportunities, Challenges and Potentialities’ in September 2020. The presentation was entitled ‘Tracking and analysing media events through web archives’.

My research explores the media coverage of the Olympic Games in a cross-cultural, cross-lingual and temporal perspective. I am especially interested in comparing how the concept of 'Olympic legacy' has been approached by the Brazilian and British media considering different locations, languages and social-political contexts. I have written a bit about this before on the UK Web Archive blog in December 2019 and March 2020.

Because of its controversial nature, the term Olympic legacy is used in a variety of contexts and has multiple meanings. Considering its narrative importance in legitimizing the multi-billion investment that cities make to host these events, the main objective of this study is to explore and define the concept of Olympic legacy and how it changes over time.

Here however, I will be focusing on my experience doing a secondment at the British Library with the UK Web Archive team. I have explored the potential of using the platform Shine to track news articles on Olympic legacy.

Why Shine?

Shine is a tool to explore .uk websites archived by the Internet Archive between 1996 and April 2013. While a big part of the content of the UK Web Archive can only be accessed from inside the British Library, Shine is open access and provides us with search results and URL data that can be easier to manage.

We developed a pipeline based on five steps: searching, extraction, cleaning, filtering and visualisation. To extract information, we scraped the data using Python notebooks, looking at specific newspapers (like The Guardian) and broadcast websites (like the BBC) with the keyword “Olympic legacy”.

Having searched for URLs in Shine and extracted the results, the main challenge is cleaning. Cleaning consists of removing all the information around the main text, such as images, adverts, menus and links. After extracting just the body text of the articles, we saw that many of them did not mention Olympic legacy at all: Shine often returns results where the search terms appear only in peripheral locations of the webpage.

With the documents we needed in hand, we had to verify whether their content was relevant to our analysis. Sometimes the term Olympic legacy appears but is not related to the Rio or London Olympics, or is not the main topic of the article. This filtering demanded a huge effort of close reading to identify contexts.

In the end, we produced some charts to visualise word trends and topics that pop up around legacy. Although the Shine search results are limited in terms of time (the index stops in 2013), the tool has been very useful for conducting small-scale preliminary analysis and for building web archive and web scraping methods before applying them to huge amounts of text elsewhere.
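As a rough illustration of the cleaning and filtering steps described above, here is a minimal Python sketch. It is not the notebook code used in the project: the class and function names (`BodyTextExtractor`, `clean_article`, `is_relevant`) are hypothetical, the list of "peripheral" tags is an assumption, and it relies only on the standard library's `html.parser` rather than whatever scraping tools were actually used. It keeps the text of paragraphs outside menus, headers and footers, then checks whether the keyword appears in what remains.

```python
from html.parser import HTMLParser

# Assumption: these tags usually hold menus, adverts and other peripheral content.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class BodyTextExtractor(HTMLParser):
    """Collects the text of <p> elements that sit outside peripheral tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0      # how many peripheral tags we are nested in
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p" and self._skip_depth == 0:
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph and self._skip_depth == 0:
            self._buffer.append(data)


def clean_article(html):
    """Cleaning step: return only the main paragraph text of a page."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


def is_relevant(text, phrase="olympic legacy", min_mentions=1):
    """Filtering step (first pass): does the cleaned text mention the phrase?"""
    return text.lower().count(phrase.lower()) >= min_mentions
```

A page where the phrase only appears in a navigation menu would be dropped by `is_relevant(clean_article(html))`, while a page that discusses it in the article body would be kept. A keyword check like this can only narrow the set of documents; deciding whether legacy is the main topic of an article still requires the close reading described above.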

You can watch Caio de Castro Mello Santos & Daniela Cotta de Azevedo Major’s presentation on the EWA YouTube Channel.

*This project has received funding from the European Union’s Horizon 2020 research and innovation programme. For more information: cleopatra-project.eu.

 

14 October 2020

Engaging with Web Archives - Conference Report

By Jason Webber, Web Archive Engagement Manager, The British Library

 

Engaging with Web Archives conference banner

 

Is it possible to have a successful conference when you can no longer meet in person? Going exclusively online doesn’t seem to have stopped the ‘Engaging with Web Archives’ (EWA) Conference from being a superb experience. Co-Chairs of the event are Sharon Healy and Michael Kurzmeier, PhD students at Maynooth University.

The conference was originally planned as a more traditional, in-person event in April 2020, but the EWA team re-planned it as a completely online event on 21 and 22 September 2020. It is notable that this was the first web archiving conference in Ireland. Most talks were pre-recorded, which meant that questions could be posed in the chat box and were often answered live by the presenter during the talk. This can be a significant advantage of pre-recorded talks.

The programme was packed with high quality presentations from many areas of web archiving but here I’ll highlight a few that were UK Web Archive (UKWA) projects or used UKWA data. 

 

Highlights

 

A keynote talk was delivered by Professor Jane Winters (School of Advanced Study, University of London): Web archives as sites of collaboration. Jane has worked with the UK Web Archive extensively over many years and is one of only a few Professors in the UK training and promoting web archives to students. Jane's talk (link to YouTube).

 

Sara Day Thomson (University of Edinburgh): Developing a Web Archiving Strategy for the Covid-19 Collecting Initiative at the University of Edinburgh. Sara formerly worked for the Digital Preservation Coalition (DPC) and led a ‘Web Archiving Task Force’, and more recently has been building important collections on Covid-19 with the University of Edinburgh in partnership with UKWA. Sara's talk (link to YouTube).

 

Dr. Brendan Power (The Library of Trinity College Dublin): Leveraging the UK Web Archive in an Irish context: Challenges and Opportunities. With Trinity College Dublin being a UK Legal Deposit Library we try and work together as much as possible and this talk highlights what is possible with specific mention of the Easter Rising collection. Brendan's talk (link to YouTube).

 

Robert McNicol (Kenneth Ritchie Wimbledon Library): The UK Web Archive and Wimbledon: A Winning Combination. We try to represent as many aspects of UK life as possible including sport. This also highlights our cooperation with other libraries and archives. See the Tennis collection. Robert's talk (link to YouTube).

 

Dr. Peter Webster (Independent Scholar, Historian and Consultant): Digital archaeology in the web of links: reconstructing a late-90s web sphere. Peter has conducted several pieces of research utilising the UKWA secondary datasets. These are free and available for download. Peter's talk (link to YouTube).

 

Helena Byrne (Curator of Web Archiving, British Library): From the sidelines to the archived web: What are the most annoying football phrases in the UK? Helena is a curator in the UK Web Archive but also has a keen interest in sport and women’s football in particular. Here, Helena shows how the Trends feature (graphs) in our SHINE service can help guide research in an easy and accessible way. Helena's talk (link to YouTube).

 

Caio de Castro Mello Santos & Daniela Cotta de Azevedo Major (School of Advanced Study, University of London): Tracking and Analysing Media Events through Web Archives. Caio was a PhD placement student with UKWA as part of the Cleopatra project. Read about some of his work on this blog on Olympic legacy. Caio and Daniela's talk (link to YouTube).

 

Hannah Connell (King’s College London; British Library): Curating culturally themed collections online: The Russia in the UK Special Collection, UK Web Archive. Hannah has worked extensively on collecting one of the several diaspora community collections. In addition to Russia in the UK, there are London French and Latin America UK. Hannah's talk (link to YouTube).

 

Dr. Jessica Ogden (University of Southampton) & Emily Maemura (University of Toronto): A tale of two web archives: Challenges of engaging web archival infrastructures for research. Jessica has also worked previously with UKWA as a PhD placement on the challenges of researchers using web archives. This vital work helps guide our planning for the future. Jessica and Emily's talk (link to YouTube).

 

Dr. Olga Holownia (International Internet Preservation Consortium): IIPC: training, research, and outreach activities. Olga works full time for the IIPC but has been based within the UK Web Archive team at the British Library. We have been delighted to have worked with and been supported by the IIPC since it began (The British Library is a founding member).

 

Rosita Murchan (Public Record Office of Northern Ireland): PRONI Web Archive: A Collaborative Approach. PRONI maintains their own web archive but also collaborates with the UK Web Archive in collecting material specific to Northern Ireland. This is important as there currently is no Legal Deposit partner in Northern Ireland. Rosita’s talk (link to YouTube).

 

Summary

Whilst it is a shame not to meet people in person this conference has shown me how online conferences can be a viable way forward. I’m very much looking forward to the next one.

 

See all of the pre-recorded talks on the EWA conference YouTube Channel. You can find Engaging with Web Archives on Twitter and catch up on the conference discussion with the hashtag #EWAvirtual

 

Look out for more in-depth blog posts from EWA conference speakers over the coming weeks on the UK Web Archive blog.

 

07 October 2020

Safeguarding the Digital Legacy: the UK Web Archive is a finalist for the 2020 Digital Preservation Awards

By Ian Cooke, Head of Contemporary British Published Collections at the British Library

2020 Digital Preservation Awards logo

 

Here at the UK Web Archive we are very excited and proud to be among the finalists for the 2020 Digital Preservation Awards, in the ‘Safeguarding the Digital Legacy’ category.

Alongside the other finalists, we presented at the #WeMissiPRES conference on 23 September. We only had a few minutes, so our ‘lightning talk’ went by in a flash. Here is a slightly extended version of our presentation.

This year, the UK Web Archive celebrates its 15th anniversary. It is 15 years since we first made public an online interface to our newly-created Web Archive. It’s important to us that we date from that point as, all through our 15 years, access has been a core part of what we do, and drives how we think about preservation.

Anniversaries are important, because they offer us a point to look back, to give us a longer-term perspective on our work, but also because they prompt us to think about our values as well as our legacy.

So, thinking about our values, preservation and legacy, we want to talk about three things that we are really proud of:

 

The content matters

This has led us in everything else. Communication on the web is primarily about us, about the people and communities that we share our lives with. Preservation of the web matters, because it is vital to how we understand ourselves now, and how we understand our recent past. From our beginnings, we have made the case that the web is not trivial – it should be valued – and we continue to make that case. We do this by creating thematic collections, which put the focus on the subject not the form; by talking publicly and online about our work; and by working with researchers to understand what the archived web can tell us.

Being led by the content can result in complex and innovative technological interventions, such as the continued monitoring and refinement of our domain crawls to ensure that we are as comprehensive as we can be.

It is also about policy and engagement. It’s about making sure we understand the content, and the people creating it. We reach out to communities and groups to help create collections, and this is something we understand better as we have grown. We do this by partnering with specialist archives or community groups, or through public calls for co-operation. An example currently is our LGBTQ+ lives collection, where we are working with the LGBTQ+ network of the Chartered Institute of Library and Information Professionals in the UK and also have been using social media to call for content.

 

We work collaboratively

This has been at the heart of the UK Web Archive, which has always existed as a collaborative venture between organisations – now linking the six Legal Deposit Libraries of the UK. We also engage with our peer institutions, to learn and share experience. Collaboration is vital to build and maintain the capacity that all institutions active in web archiving need to meet the preservation challenges presented by the live web. A key part of that has been the International Internet Preservation Consortium (IIPC), where we are proud to be the host for the Programme and Communications Officer. As well as participating in conferences, workshops and hackathons, we regularly take part in the ‘Online Hours: Supporting Open Source’ calls, which are dedicated to ensuring that the IIPC’s open source initiatives are truly open to members.

We work collaboratively also with researchers, both in collection-building and in research projects using the archived web. Working with researchers helps us to understand ‘real life’ challenges, and inspires the way we build our services and communicate about them. We are immensely proud of our role in the ‘Big UK Domain Data for the Arts and Humanities’ project, which helped us build our ‘Shine’ analytical tool for full-text indexes. More recently, we have been working on research in economic geography – using our postcode data set; and with researchers from the Alan Turing Institute, to understand how our data can be used to analyse word value change over time.

Research use of the UK Web Archive has developed over time. An early, and enduring use, has been a ‘close reading’ of websites. This approach may look at one or a small number of websites and study the content, layout and functionality in detail. Sometimes these studies have a longitudinal aspect, looking at change over time. Our user interface helps researchers find individual websites, or groups of websites, that are relevant to their study. This approach has been supplemented by other research methods which attempt to understand a much larger body of content at scale. This research uses tools and data to understand communication and behaviour on the web. These methods can be mutually supportive, with the results of computational analysis of the web providing supporting context for a close reading of a small number of sites.

 

We work openly

From the start, we have seen access as a vital part of our preservation work. This includes helping us to validate the preservation actions that we have taken, and also in wider advocacy for preservation of born digital content. We seek permissions to make selected web content more openly available, and look to use existing licences to make other content available. We currently do this with content released under Open Government Licences. We also work to make sure that the data we generate about our collections is available, whether that is the full-text indexes that can be searched in our User Interface, or datasets that we have generated from earlier crawls of the UK domain. Earlier this year, we worked with the National Library of Australia, National Library of New Zealand and the historian Tim Sherratt, to develop tools (using Jupyter Notebooks) that could be re-used to analyse our openly accessible data.

Looking ahead, we want to review and update our curatorial tools to support collaboration and collection building. We want to understand what the barriers are to using the archived web in research, and share more information to help researchers understand our collections. Linked to this, we are developing a research engagement plan, which will make sure that our collections and services continue to develop to meet identified needs.

So, as we look back over our 15 year history, these are three of the things that make us proud, and will continue to inspire us: understanding the value of our collections, working in partnerships, and connecting our users and public with our collections. These are values that we know we share with the wider Digital Preservation community, so we are very grateful for this chance to join the celebration.

 

You can watch all of the presentations from this category on the #WeMissiPRES conference YouTube Channel.

 

01 October 2020

Request for Information: Metadata Management Tool for the UK Web Archive

By Helena Byrne, Curator of Web Archives at the British Library

What is a Request for Information (RFI)? 

A Request for Information (RFI) is not a tender opportunity, but is part of a market consultation exercise aiming to ensure that the procurement route selected and the options ultimately developed for any procurement are properly informed.

At the conclusion of this RFI process, the information gathered may be used to assess potential suppliers and service offers and produce a shortlist for invitation to tender or procurement under a Government Framework Arrangement. At this stage, no final decision has been taken on the precise procurement route to be followed.

 

The UK Web Archive RFI

The UK Web Archive (UKWA) is a collection shared by the six UK Legal Deposit Libraries: the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries Oxford, Cambridge University Library and the Library of Trinity College Dublin. UKWA aims to archive, preserve and give access to the historic UK web space. This is achieved through annual domain crawls, the first of which was undertaken in 2013, and more frequent crawls of key websites and specially curated collections which date back as far as 2005. These collections reflect important aspects of British culture and events that shape society.

The UKWA team based at the British Library is seeking to acquire a metadata management tool or set of tools to integrate with our web archive services. This will support the description of websites and web pages in our archive, the creation of topic-based collections and encouraging the participation of non-specialists in describing our archived web records. The intention is for this tool to handle the metadata associated with our web archiving services rather than the technical aspects of crawling and storing web content.

Our current Annotation Curation Tool (ACT) covers many functions. However, as the collection has grown and the system has aged, some of these features have become difficult to manage and response times to enquiries can be very slow. As a result, the system is becoming harder to use, and some basic functions are almost impossible to execute. ACT is a bespoke tool, and in this RFI we are looking to explore off-the-shelf options that can be adapted to suit our requirements and that can be easily modified as these requirements change over time.

 

RFI Timeline

Set out below is the proposed RFI timetable. This is intended as a guide and, whilst the British Library does not intend to depart from this timetable, it reserves the right to do so at any time.

Publish RFI: 1st October 2020
Initial responses returned by: 12th November 2020
Shortlist and clarifications: 18th November 2020
Presentations (via video conference): 26th November 2020
RFI concludes and feedback provided: 10th December 2020

 

British Library e-Tendering Service

To ensure that your organisation is involved in this project at this early stage of engagement please provide details of the most appropriate contact within your organisation’s business development team – ideally your business development director or similar – to allow us to invite them into the 001599 online Request For Information (RFI) process. Please send a named contact email address to [email protected] at your earliest convenience.

 

Edit (05/11/2020): The timeline for the RFI process was changed to incorporate an extension on the deadline for submissions. You can read the unedited version of this blog post on the UK Web Archive Website: https://www.webarchive.org.uk/wayback/archive/20201008154057/https://blogs.bl.uk/webarchive/2020/10/request-for-information-metadata-management-tool-for-the-uk-web-archive.html 

 

30 September 2020

National Sporting Heritage Day 2020

By Helena Byrne, Curator of Web Archives at the British Library

Women playing soccer

 

The 30th of September is National Sporting Heritage Day in the UK, and to celebrate we will give you a quick overview of the UK Web Archive (UKWA) sporting activities in 2020. UKWA is made up of the six UK Legal Deposit Libraries: the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Library and Trinity College Dublin Library.

Sport is a subject that shapes and reflects society. As more publications about sport move to online only, preserving this cultural record through web archiving becomes paramount. To mark the occasion back in 2018 we published a blog post outlining the UKWA sports collection policies. 

We have three collections that focus on sport that are actively curated throughout the year:

  1. Sports Collection
  2. Sport: Football 
  3. Sports: International Events

 

International Internet Preservation Consortium (IIPC)

As individual institutions, the British Library and the National Library of Scotland are members of the International Internet Preservation Consortium (IIPC) and have worked on building collaborative collections covering international events such as the Summer and Winter Olympic/Paralympic Games. 2020 marks ten years of building IIPC Olympic/Paralympic web archive collections. Since the formation of the IIPC Content Development Group (CDG) in 2015, there has been a consolidated effort to build collections both on and off the playing field. All of the IIPC collections are open access. The CDG planned to build a collection on the Tokyo 2020 Games, but due to the coronavirus pandemic the Games were rescheduled for 2021, and so was the CDG's dedicated collection. However, some content around the 2020 event was included in the Novel Coronavirus (COVID-19) collection and there will be updates made to the National Olympic and Paralympic Committees collection this year.

 

Documenting the Olympics and Paralympics

Even though Tokyo 2020 was postponed until 2021, the symposium Documenting the Olympics & Paralympics, which was supposed to be a full day face-to-face event, went online. This was a collaboration between the web archive team based at the British Library, the International Centre for Sports History and Culture (ICSHC) at De Montfort University, and the British Society of Sports History (BSSH).

A broad mix of physical, digitised and born digital resources were covered in the presentations. You can listen back to an audio recording of this symposium on the Sport in History Podcast. The full abstracts and some of the PowerPoint slides are available on the British Library Research Repository.

 

Engaging with Web Archives Conference

The Engaging with Web Archives conference brought together practitioners and web archive researchers from around the world. There were three presentations on the programme that focused on UK Web Archive sports collections. 

  1. Robert McNicol (Librarian, Kenneth Ritchie Wimbledon Library) discussed the collaboration on developing the Tennis section of the UK Web Archive Sports Collection. 
  2. Helena Byrne (Curator of Web Archives, British Library) looked at tracing the popularity of annoying football phrases on the archived .uk web space from 1996-2013. 
  3. Caio de Castro Mello Santos & Daniela Cotta de Azevedo Major (PhD students, School of Advanced Study, University of London) used the London 2012 and Rio 2016 Olympic Games as a case study to analyse media events through the UK Web Archive. 

A series of blog posts about the Engaging with Web Archives conference will be coming out in the next few weeks on the UK Web Archive blog.

 

Accessing the UK Web Archive

Under the Non-Print Legal Deposit Regulations 2013, we can archive UK published websites but are only able to make the archived version available to people outside the Legal Deposit Libraries Reading Rooms if the website owner has given permission.

 

Some of the websites in UKWA that have already had permission granted include Heritage Quay, Pride Sports UK and WheelPower. Some examples of websites that are onsite-only access include Fans Supporting Food Banks, Barnsley Yorkshire: Tour de France and The Women's Open.

 

As the content of UKWA has mixed access, the message ‘Viewable only on Library premises’ will appear under the title of the website if you need to visit a Legal Deposit Library to view the content. If there is no message underneath then the archived version of the website should be available on your personal device.

 

Get involved with preserving sports online with the UK Web Archive

We can’t curate the whole of the UK web on our own. We need your help to ensure that information, discussion and creative output related to sport are preserved for future generations. Anyone can suggest UK published websites to be included in the UK Web Archive by filling in our nominations form: https://www.webarchive.org.uk/en/ukwa/nominate

 

25 September 2020

The World of Food and the UK Web Archive

 

By Helena Byrne, Curator of Web Archives at the British Library

 

A variety of food

 

Food is a subject that transcends culture, politics and leisure practices. Thus, food has been a key part of the UK Web Archive (UKWA) since it was established in 2005.

 

Recipes, restaurant menus, food blogs and online reviews are just the start of the food-related online material that UKWA collects. Even protest and campaigning can be food related. For instance, this summer, footballer Marcus Rashford highlighted the issue of child poverty and the lack of access to food, especially during the school holidays.

 

For the last three years the British Library has been running a series of events around food. Due to the coronavirus pandemic, this year's Food Season moved online with a series of talks over the autumn period. 

 

The Food Season celebrates the British Library’s extensive food-related collections and explores the politics, pleasures and history of food. UKWA, which is a partnership of the six UK Legal Deposit Libraries, including the British Library, also has an extensive collection of food related websites. 

 

Food collections

In 2017, the Food Archive collection was established. This collection covers the following topics:

There are currently 333 websites or web pages in this collection. Some of the websites selected include Eat Like a Girl, the Good Grub Club and the Veggies Catering Campaign. Why not have a browse through the collection and nominate your favourite UK published food sites or restaurant websites to be included in the collection? Anyone can nominate a website by following this link: https://www.webarchive.org.uk/en/ukwa/info/nominate 

 

Even though there is a dedicated collection about food, it also features as a subsection in a number of other collections. ‘Food and Drink’ is a subsection in both the Festivals and Online Enthusiast Communities in the UK collections. In addition, individual food websites appear in several other collections. Websites related to food activism appear in both the Political Action and Communication collection as well as the (soccer) fan subsection of the Sport: Football Collection, as numerous supporters clubs have organised to support their local food banks. 

 

Social media is a very popular way to share food and micro-reviews of eateries. However, this is often challenging for us to archive. At present, Twitter is the only social media platform that we archive on a regular basis, but these captures are by no means comprehensive. We have experimented with other methods of archiving social media, but this is on a selective basis.

 

How can you access these archived websites?

Under the Non-Print Legal Deposit Regulations 2013, we can archive UK published websites but are only able to make the archived version available to people outside the Legal Deposit Libraries Reading Rooms if the website owner has given permission. The UK Legal Deposit Libraries are the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Library and Trinity College Dublin Library.

 

Some of the websites in UKWA that have already had permission granted include the Cake Fest Edinburgh, the Lancashire Pork Pie Appreciation Society and the Food Research Collaboration. Some examples of websites that are onsite-only access include the Biscuit Appreciation Society, the UK Menu Archive and Fans Supporting Food Banks.

 

As the content of UKWA has mixed access, the message ‘Viewable only on Library premises’ will appear under the title of the website if you need to visit a Legal Deposit Library to view the content. If there is no message underneath then the archived version of the website should be available on your personal device.

Due to the coronavirus pandemic, the reading rooms were closed for a number of weeks but are starting to reopen. This blog post gives an overview of opening hours and how to book a visit at the six UK Legal Deposit Libraries:

https://blogs.bl.uk/webarchive/2020/09/ukwa-available-in-reading-rooms-again.html 

 

We would especially like to see more food and drink nominations that reflect the multicultural nature of the UK and the many diaspora communities based here. Browse through what we have so far and please nominate more content here:

https://www.webarchive.org.uk/en/ukwa/info/nominate 

 

17 September 2020

Arnhem75 - a special collection of websites added to the UK Web Archive

 

By Marja Kingma, Curator of Germanic Collections, the British Library.

 

Book cover of 75 Years Battle of Arnhem by Laurens van Aggelen

 

Introduction

The idea to create a collection of websites about the commemoration of Arnhem75 came to RAF Museum historian Harry Raffal and myself whilst attending the seminar ‘The Arnhem Spirit - 75 years of Brits in Arnhem’, on 15 May 2019, organised by the Dutch Embassy in London. The event was part of a programme in which the Netherlands, Britain and other former Allied countries commemorated Operation Market Garden, the code name for the battle for the bridge across the Rhine at Arnhem that took place in September 1944. Allied forces consisted of British, American and Polish troops, with help from Dutch resistance.

The Battle of Arnhem 1944 is of great significance to the UK and interest in it remains strong on both sides of the North Sea.

We wanted to create a lasting memory of these events and a special collection in the UK Web Archive on the subject seemed like a good idea.

 

What is included?

We kept the scope of the project quite narrow; only websites with a focus on the commemorations that took place in Britain and the Netherlands in 2019 are included, with the exception of some websites that deal with the historic facts regarding the Battle to give it some context.

So far over 150 individual websites within the UK web domain have been identified, of which 64 were selected to go into the collection. These sites are limited to the UK web domain: they either have .uk in their domain name or, if they don’t, must be hosted in the UK or be owned by UK organisations or individuals with a postal address in the UK.

Some of the websites selected for this collection include the 23 Parachute Field Ambulance, Airborne at the Bridge and Arnhem Oosterbeek War Cemetery.

 

How can you access these archived websites?

Under the Non-Print Legal Deposit Regulations 2013, we can archive UK websites, but we can only make them available outside the UK Legal Deposit Libraries' reading rooms if the website owner has given permission. The UK Legal Deposit Libraries are the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Library and Trinity College Dublin Library.

For this collection you can view what has been selected through the UK Web Archive website but will need to visit a UK Legal Deposit Library reading room to view the archived content. The reading rooms across the Legal Deposit Libraries are starting to reopen now, with some restrictions, as you can read in this blog: https://blogs.bl.uk/webarchive/2020/09/ukwa-available-in-reading-rooms-again.html

 

How can I get involved?

You can help expand this collection by sending us a URL you think may be eligible for inclusion in the collection Arnhem75. Please go to https://www.webarchive.org.uk/en/ukwa/info/nominate to nominate a website and we’ll take it from there.

Occasionally, websites from non-UK domains can be included if they have a strong link to the UK and the website owners have given their permission to be included in the collection. Dutch organisations that were involved in the Arnhem75 commemorations are encouraged to get in touch.

We look forward to your suggestions!

 

10 September 2020

Launching the UK Web Archive 2020 Annual Domain Crawl

By Helena Byrne, Curator of Web Archives at the British Library

Today (10th September 2020) the UK Web Archive team will be pushing the big red button to kickstart the annual Domain Crawl of the UK webspace. The current coronavirus pandemic will no doubt feature strongly in this year’s crawl. This will complement the curated collection that the web archive teams across the UK Legal Deposit Libraries are contributing to. The British Library, along with the National Library of Scotland, is also selecting websites for the International Internet Preservation Consortium (IIPC) Content Development Group (CDG) Novel Coronavirus (COVID-19) collection.

What we collect

The UK Web Archive has been archiving UK published websites on a selective basis since 2005 and in 2020 is celebrating #15YearsOfUKWA. Domain Crawl 2020 is the seventh that has taken place. It wasn’t until after the implementation of the Non-Print Legal Deposit Regulations (NPLD) in April 2013 that we were able to run a broad crawl over the UK webspace. This includes anything with a .uk or other UK geographic Top Level Domain (TLD), such as .scot, .cymru or .london. It also includes websites on other TLDs that have been registered in the UK or that have been manually selected.

NPLD came into effect on the 6th April 2013 and the British Library hosted a special event to launch the first Domain Crawl. This was widely covered in the national press and you can still watch a short video from the event on The Guardian website.
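For readers who like to see a rule written down, here is a minimal Python sketch of the scope described above. It is only an illustration, not the crawler's actual logic: the function names are hypothetical, the TLD list contains just the examples named in this post, and the two flags stand in for information (UK registration, manual selection) that would have to come from elsewhere.

```python
from urllib.parse import urlsplit

# Illustrative only: the examples of UK TLDs named in this post, not a full list.
EXAMPLE_UK_TLDS = ("uk", "scot", "cymru", "london")


def has_uk_tld(url: str) -> bool:
    """Return True if the URL's host ends in one of the example UK TLDs.

    The URL needs a scheme (e.g. https://) so that the host can be parsed.
    """
    host = urlsplit(url).hostname or ""
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return tld in EXAMPLE_UK_TLDS


def in_domain_crawl_scope(url: str,
                          registered_in_uk: bool = False,
                          manually_selected: bool = False) -> bool:
    """Hypothetical scope check: UK TLD, or UK-registered, or manually selected."""
    return has_uk_tld(url) or registered_in_uk or manually_selected
```

A site on another TLD would only pass this check if one of the two flags is set, which mirrors the last sentence of the paragraph above.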

How much data is collected in the Domain Crawl?

The Domain Crawl usually runs for three months, and each year it starts at a different time of year to avoid seasonal biases. Roughly 5-10 million hosts (websites) are archived every year. However, the amount of data collected varies from year to year, and the way the data is collected and stored also changes over time. We compress the data we store, and as technology develops the amount of data that can be compressed into one terabyte changes. Last year 63.7 TB of compressed data was collected, bringing the total collected during Domain Crawls from 2013 to 2019 to 477.62 TB.

UKWA Domain Crawl 2013-2019

When can I view this content?

Due to the enormous amount of data that is collected each year from the annual Domain Crawl and our Frequent Crawls, there is a significant lag between when the content is archived and when it is made available through the UK Web Archive website. The Frequent Crawl data collected from 2013 to 2019 was 250.34 TB, bringing the combined total to 727.96 TB of compressed data. To make searching content easier, the website allows you to search across all the Selectively Crawled content from 2005 to 2013 as well as the Frequent Crawl content from 2013 to 2017 and the Domain Crawl content from 2013 to 2015.

Under the Non-Print Legal Deposit (NPLD) Regulations 2013, we can archive all UK published websites, but we can only make them available outside the Legal Deposit Libraries' Reading Rooms if the website owner has given permission.
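As a quick check of how the figures in this post fit together, the short Python sketch below adds up the totals quoted here. The variable names are our own, and the only derived number (Domain Crawl data before 2019) is simple subtraction from the figures given, not a separately published statistic.

```python
from decimal import Decimal

# Figures quoted in this post, in terabytes of compressed data.
domain_crawl_2019_tb = Decimal("63.7")
domain_crawl_2013_2019_tb = Decimal("477.62")
frequent_crawl_2013_2019_tb = Decimal("250.34")

# Combined total of Domain Crawl and Frequent Crawl data, 2013 to 2019.
combined_tb = domain_crawl_2013_2019_tb + frequent_crawl_2013_2019_tb

# Derived by subtraction: Domain Crawl data collected from 2013 to 2018.
domain_crawl_2013_2018_tb = domain_crawl_2013_2019_tb - domain_crawl_2019_tb

print(f"Combined total 2013-2019: {combined_tb} TB")
print(f"Domain Crawl 2013-2018:   {domain_crawl_2013_2018_tb} TB")
```

Decimal is used rather than floating point so that the sums match the two-decimal figures exactly.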

Due to the NPLD Regulations, access to the archived content is a mix of open and onsite access. The ‘Viewable only on Library premises’ message on individual records indicates that you have to visit one of the six UK Legal Deposit Libraries to view that content. The UK Legal Deposit Libraries are the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Library and Trinity College Dublin Library.

Follow the UK Web Archive on Twitter for the latest updates on the domain crawl and other web archiving activities!