THE BRITISH LIBRARY

UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Peter Webster (Engagement and Liaison Manager).

23 July 2014

First World War Centenary – an online legacy in partnership with the HLF


Earlier this year, we at the UK Web Archive were delighted to reach an agreement with the Heritage Lottery Fund (HLF) to enable the archiving of a very large and significant set of websites relating to the Centenary of the First World War.

Throughout the Centenary and beyond, we will be working with the HLF to take archival copies of the websites of all HLF-funded First World War Centenary projects, and to make them available to users in the Open UK Web Archive. The first of these archived sites are already available in the First World War special collection, and we hope that the collection will eventually grow to more than 1,000 websites.

HLF Funding
HLF is funding First World War projects throughout the Centenary, ranging from small community projects to major museum redevelopments. Grants start at £3,000 and funding is available through four different grant programmes: First World War: then and now (grants of £3,000 - £10,000), Our Heritage (grants of £10,000 - £100,000), Young Roots (grants of £10,000 - £50,000 for projects led by young people) and Heritage Grants (grants of more than £100,000).


Include your website
If you have HLF funding for a First World War Centenary project, please send the URL (web address) to FWWURL@hlf.org.uk with your project reference number.

If you have a UK-based WW1 website NOT funded by HLF, we would still encourage you to submit it for permanent archiving through our Nominate form.

Legacy
This set of archived websites will form a key part of our wider Centenary collection, and capture an important legacy of this most significant of anniversaries.

By Jason Webber, Web Archiving Engagement and Liaison Officer, The British Library

21 July 2014

A right to be remembered


A notice placed in a Spanish newspaper 16 years ago, relating to an individual’s legal proceedings over social security debts, appeared many years later in Google’s search results. This led to the recent landmark decision by the European Court of Justice (ECJ) to uphold the Spanish data protection regulator’s initial ruling against Google, which was required to remove the links from its index and to prevent future access to the digitised newspaper article via searches for the individual’s name.

Right to be forgotten
This “right to be forgotten” has been mentioned frequently since: the principle that an individual should be able to remove traces of past events in their life from the Internet or other records. It is a concept which has generated a great deal of legal, technical and moral wrangling, and one which is taken into account in practice but not (yet) enforced explicitly by law. In fact, the ECJ did not specifically find that there is a ‘right to be forgotten’ in the Google case, but applied existing provisions of the EU Data Protection Directive and Article 8 of the European Convention on Human Rights, the right to respect for private and family life.

Implications for UK law
At the UK Web Archive our aim is to collect and store information from the Internet and keep it for posterity. There is a question, therefore, of what the ECJ decision means for web archiving.

To answer this question, we would like to point to our existing notice and takedown policy, which allows the withdrawal of public access to, or the removal of, deposited material under specific circumstances.

There is at present no formal and general “right to be forgotten” in UK law, on which a person may demand withdrawal of the lawfully archived copy of lawfully published material, on the sole basis that they do not wish it to be available any longer. However, the Data Protection Act 1998 is applied as the legal basis for withdrawing material containing sensitive personal data which may cause substantial damage or distress to the data subject. Our policy is in line with the Information Commissioner's Office's response to the Google ruling, which recommends a focus on "evidence of damage and distress to individuals" when reviewing complaints.

Links only, not data
It is important to recognise that the context of the ECJ’s decision is Google’s activities in locating, indexing and making available links to websites containing information about an individual. It is not about the information itself: the court did not consider blocking or taking down access to the newspaper article.

The purpose of Legal Deposit is to protect and ensure the “right to be remembered” by keeping snapshots of the UK internet as the nation’s digital heritage. Websites archived for Legal Deposit are only accessible within the Legal Deposit Libraries’ reading rooms and the content of the archive is not available for search engines. This significantly reduces the potential damage and impact to individuals and the libraries’ exposure to take-down requests.

Summary
Our conclusion is that the Google case does not significantly change our current notice and take-down policy for non-print Legal Deposit material. However, we will review our practice and procedures to reflect the judgement, especially with regard to indexing, cataloguing and resource discovery based on individuals’ names.

By Helen Hockx-Yu, Head of Web Archiving, The British Library

* I would like to thank my colleague Lynn Young, the British Library’s Records Manager, whose various emails and internal papers provided much useful information for this blog post.

18 July 2014

UK Web Domain Crawl 2014 – One month update


The British Library started the annual collection of the UK Web on the 19th of June. Now that we are one month into a process which may continue for several more months, we thought we would look at the set-up and at what we have found so far.

Setting up a ‘Crawl’
Fundamentally a crawl consists of two elements: ‘seeds’ and ‘scope’. That is, a series of starting points and decisions as to how far from those starting points we permit the crawler to go. In theory, you could crawl the entire UK Web with a broad enough scope and a single seed. In practice, however, it makes more sense to have as many starting points as possible and tighten the scope, lest the crawler’s behaviour become unpredictable.
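To make the seed/scope split concrete, here is a minimal sketch in Python of how a crawler frontier might apply a scope rule to newly discovered links. It is an illustrative simplification, not our actual crawler configuration: the seed URLs, the .uk-only scope test and the fetch_and_extract_links stub are all hypothetical.

    from urllib.parse import urlparse

    # Hypothetical seeds; a real domain crawl starts from millions of hosts.
    SEEDS = [
        "http://www.example.co.uk/",
        "http://www.example.org.uk/",
    ]

    def is_in_scope(url):
        """Very rough scope test: accept only hosts under the .uk top-level domain."""
        host = urlparse(url).hostname or ""
        return host == "uk" or host.endswith(".uk")

    def fetch_and_extract_links(url):
        """Placeholder for the real fetcher and link extractor."""
        return []

    def crawl(seeds):
        frontier = list(seeds)          # URLs waiting to be fetched
        seen = set(frontier)
        while frontier:
            url = frontier.pop(0)
            for link in fetch_and_extract_links(url):
                if link not in seen and is_in_scope(link):
                    seen.add(link)      # the scope rule keeps the crawl from wandering
                    frontier.append(link)

Broadening is_in_scope while keeping a single seed would, in principle, reach the same material; in practice the long seed list does most of the work.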


Seeds
For this most recent crawl the starting seed list consisted of over 19,000,000 hosts. As it is estimated that there are actually only around 3-4 million active UK websites at this point in time, this might seem an absurdly high figure. The discrepancy arises partly from the difference between what is considered to be a 'website' and a 'domain': Nominet announced the registration of their 10,000,000th domain in 2012, but each of those domains may have many subdomains, each serving a different site, which vastly inflates the number.

While attempting to build the seed list for the 2014 domain crawl, we counted the number of subdomains per domain: the most populous had over 700,000.
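As a rough illustration of that counting exercise (the file name and the registered-domain heuristic below are ours, not the actual tooling, which would more likely rely on something like the Public Suffix List):

    from collections import Counter

    def registered_domain(host):
        """Crude approximation: keep the last three labels of a .uk host,
        e.g. 'news.example.co.uk' -> 'example.co.uk'."""
        labels = host.strip(".").split(".")
        return ".".join(labels[-3:])

    def subdomain_counts(hosts):
        """Count how many distinct hosts sit under each registered domain."""
        return Counter(registered_domain(h) for h in hosts)

    # Hypothetical usage, one host per line in a seed-list file:
    # with open("seed_hosts.txt") as f:
    #     print(subdomain_counts(line.strip() for line in f).most_common(10))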

Scope
The scope definition is somewhat simpler: Part 3 of The Legal Deposit Libraries (Non-Print Works) Regulations 2013 largely defines what we consider to be 'in scope'. The trick becomes translating this into automated decisions. For instance, the legislation rules that a work is in scope if "activities relating to the creation or the publication of the work take place within the United Kingdom". As a result, one potentially significant change for this crawl was the addition of a geolocation module. With this included, every URL we visit is tagged with both the IP address and the result of a geolocation lookup to determine which country hosts the resource. We will therefore automatically include UK-hosted .com, .biz, etc. sites for the first time.
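A minimal sketch of the geolocation idea, assuming a MaxMind-style country database and the third-party geoip2 Python library as stand-ins for whatever the crawler's own module actually uses:

    import socket
    from urllib.parse import urlparse

    import geoip2.database   # pip install geoip2; needs a country database file

    # The database path is illustrative.
    reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

    def geo_tag(url):
        """Resolve the URL's host to an IP and look up the hosting country code."""
        host = urlparse(url).hostname
        ip = socket.gethostbyname(host)
        country = reader.country(ip).country.iso_code   # e.g. 'GB'
        return ip, country

    def in_scope_by_geo(url):
        """Non-.uk hosts become in scope when the resource is hosted in the UK."""
        host = urlparse(url).hostname or ""
        if host.endswith(".uk"):
            return True
        _, country = geo_tag(url)
        return country == "GB"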

So far the crawlers have visited over 350,000 hosts that do not end in “.uk” but whose content is hosted in the UK.

Geolocation
Although we automatically consider in scope those sites served from the UK, we can also include resources from other countries (the policy for which is detailed here) in order to obtain as full a representation of a UK resource as possible. Thus far we have visited hosts in 110 different countries over the course of this year’s crawl.

With regard to the number of resources archived from each country, at the top end the UK accounts for more than every other country combined, while towards the bottom of the list we have single resources being downloaded from Botswana and Macao, among others:

Visited Countries:

1. United Kingdom
2. United States
3. Germany
4. Netherlands
5. Ireland
6. France
...
106. Macao
107. Macedonia, Republic of
108. Morocco
109. Kenya
110. Botswana

Malware
Curiously, we've discovered significantly fewer instances of malware than we did in the course of our previous domain crawl. We are admittedly still at a relatively early stage, however, and those numbers are likely to increase as the crawl progresses. The distribution has remained notably similar: most of the 400+ affected sites have only a single item of malware, while one site alone accounts for almost half of those found.

Data collected
So far we have archived approximately 10TB of data. The actual volume of data downloaded will likely be significantly higher because, firstly, all stored data are compressed and, secondly, we don't store duplicate copies of individual resources (see our earlier blog post regarding size estimates).
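As a generic sketch of those two points (the archive itself stores WARC records and has its own deduplication machinery; the digest check and gzip file below are only illustrative):

    import gzip
    import hashlib

    seen_digests = set()   # content digests of payloads already stored in this crawl

    def store_if_new(payload, out_path):
        """Write a fetched payload to a compressed file unless an identical
        copy has already been archived (deduplication by content digest)."""
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen_digests:
            return False                      # duplicate: record a reference, not the bytes
        seen_digests.add(digest)
        with gzip.open(out_path, "ab") as f:  # each write appends a new gzip member
            f.write(payload)
        return True

This is why the on-disk total understates what was actually downloaded: duplicates never reach the store, and what does reach it is compressed.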

By Roger G. Coram, Web Crawl Engineer, The British Library