UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library's web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

23 July 2014

First World War Centenary – an online legacy in partnership with the HLF

Earlier this year, we at the UK Web Archive were delighted to reach an agreement with the Heritage Lottery Fund (HLF) to enable the archiving of a very large and significant set of websites relating to the Centenary of the First World War.

Throughout the Centenary and beyond, we will be working with the HLF to take archival copies of the websites of all HLF-funded First World War Centenary projects, and to make them available to users in the Open UK Web Archive. The first of these archived sites are already available in the First World War special collection, and we hope the collection will eventually grow to more than 1,000 archived sites.

HLF Funding
HLF is funding First World War projects throughout the Centenary, ranging from small community projects to major museum redevelopments. Grants start at £3,000 and funding is available through four different grants programmes:

  • First World War: then and now (grants of £3,000 to £10,000)
  • Our Heritage (grants of £10,000 to £100,000)
  • Young Roots (grants of £10,000 to £50,000 for projects led by young people)
  • Heritage Grants (grants of more than £100,000)

Include your website
If you have HLF funding for a First World War Centenary project, please send the URL (web address) to [email protected] with your project reference number.

If you have a UK-based WW1 website NOT funded by HLF we would still encourage you to add it for permanent archiving through our Nominate form.

Legacy
This set of archived websites will form a key part of our wider Centenary collection, and capture an important legacy of this most significant of anniversaries.

By Jason Webber, Web Archiving Engagement and Liaison Officer, The British Library

21 July 2014

A right to be remembered

A notice placed in a Spanish newspaper 16 years ago, relating to an individual's legal proceedings over social security debts, appeared many years later in Google's search results. This led to the recent landmark decision by the European Court of Justice (ECJ) to uphold the Spanish data protection regulator's initial ruling against Google, which was required to remove the links from its index and to prevent future access to the digitised newspaper article via searches for the individual's name.

Right to be forgotten
This “right to be forgotten” has been mentioned frequently since: the principle that an individual should be able to remove traces of past events in their life from the Internet or other records. It is a concept which has generated a great deal of legal, technical and moral wrangling, and is taken into account in practice but not (yet) enforced explicitly by law. In fact, the ECJ did not specifically find that there is a “right to be forgotten” in the Google case, but applied existing provisions in the EU Data Protection Directive and Article 8 of the European Convention on Human Rights, the right to respect for private and family life.

Implications for UK law
At the UK Web Archive our aim is to collect and store information from the Internet and keep it for posterity. A question therefore arises: how does the ECJ decision affect web archiving?

To answer this question, we would like to point to our existing notice and takedown policy, which allows the withdrawal of public access to, or removal of, deposited material under specific circumstances.

There is at present no formal and general “right to be forgotten” in UK law on which a person may demand withdrawal of the lawfully archived copy of lawfully published material, on the sole basis that they do not wish it to be available any longer. However, the Data Protection Act 1998 is applied as the legal basis for withdrawing material containing sensitive personal data which may cause substantial damage or distress to the data subject. Our policy is in line with the Information Commissioner's Office's response to the Google ruling, which recommends a focus on "evidence of damage and distress to individuals" when reviewing complaints.

Links only, not data
It is important to recognise that the context of the ECJ's decision is Google's activities in locating, indexing and making available links to websites containing information about an individual. It is not about the information itself, and the court did not consider blocking or taking down access to the newspaper article.

The purpose of Legal Deposit is to protect and ensure the “right to be remembered” by keeping snapshots of the UK internet as the nation’s digital heritage. Websites archived for Legal Deposit are only accessible within the Legal Deposit Libraries’ reading rooms and the content of the archive is not available for search engines. This significantly reduces the potential damage and impact to individuals and the libraries’ exposure to take-down requests.

Summary
Our conclusion is that the Google case does not significantly change our current notice and take-down policy for non-print Legal Deposit material. However, we will review our practice and procedures to reflect the judgement, especially with regard to indexing, cataloguing and resource discovery based on individuals’ names.

By Helen Hockx-Yu, Head of Web Archiving, The British Library

* I would like to thank my colleague Lynn Young, British Library’s Records Manager, whose various emails and internal papers provide much useful information for this blog post.

18 July 2014

UK Web Domain Crawl 2014 – One month update

The British Library started the annual collection of the UK Web on the 19th of June. Now that we are one month into a process which may continue for several more, we thought we would look at the set-up and what we have found so far.

Setting up a ‘Crawl’
Fundamentally, a crawl consists of two elements: ‘seeds’ and ‘scope’. That is, a series of starting points and decisions as to how far from those starting points we permit the crawler to go. In theory, you could crawl the entire UK Web with a broad enough scope and a single seed. In practice, however, it makes more sense to have as many starting points as possible and to tighten the scope, lest the crawler's behaviour become unpredictable.

Seeds
For this most recent crawl the starting seed list consisted of over 19,000,000 hosts. As it is estimated that there are actually only around 3-4 million active UK websites at this point in time, this might seem an absurdly high figure. The discrepancy arises partly from the difference between what is considered to be a 'website' and a 'domain'—Nominet announced the registration of their 10,000,000th domain in 2012. However, each of those domains may have many subdomains, each serving a different site, which vastly inflates the number.

While attempting to build the seed list for the 2014 domain crawl, we counted the number of subdomains per domain: the most populous had over 700,000.
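As a rough illustration, a tally like that could be produced with a few lines of Python. This is only a sketch, not our production tooling: the hosts.txt input file is an assumption for the example, and the crude suffix rule below would in reality be replaced by something like the Public Suffix List.

```python
from collections import Counter

SECOND_LEVEL = {"co", "org", "ac", "gov", "ltd", "plc", "me", "net", "sch", "nhs"}

def registered_domain(host):
    """Crude guess at the registered domain: keep three labels for
    second-level .uk registries such as .co.uk, otherwise two."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and labels[-1] == "uk" and labels[-2] in SECOND_LEVEL:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

subdomain_counts = Counter()
with open("hosts.txt") as f:               # assumed input: one hostname per line
    for line in f:
        host = line.strip()
        if host:
            subdomain_counts[registered_domain(host)] += 1

# The most subdomain-heavy registered domains dominate the seed count.
for domain, count in subdomain_counts.most_common(10):
    print(f"{domain}\t{count}")
```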

Scope
The scope definition is somewhat simpler: Part 3 of The Legal Deposit Libraries (Non-Print Works) Regulations 2013 largely defines what we consider to be 'in scope'. The trick becomes translating this into automated decisions. For instance, the legislation rules that a work is in scope if "activities relating to the creation or the publication of the work take place within the United Kingdom". As a result, one potentially significant change for this crawl was the addition of a geolocation module. With this included, every URL we visit is tagged with both the IP address and the result of a geolocation lookup to determine which country hosts the resource. We will therefore automatically include UK-hosted .com, .biz, etc. sites for the first time.

So far, the crawlers appear to have visited over 350,000 hosts that do not end in “.uk” but whose content is hosted in the UK.
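In outline, the check works like this: resolve the host to an IP address, look that address up in a geolocation database, and treat the URL as in scope if the host either ends in “.uk” or resolves to the UK. The crawler performs this lookup itself as it fetches each URL; the sketch below only illustrates the principle, and the geoip2 library and local GeoLite2-Country.mmdb database it uses are assumptions for the example.

```python
import socket
from urllib.parse import urlparse

import geoip2.database   # assumed dependency: MaxMind's geoip2 reader
import geoip2.errors

reader = geoip2.database.Reader("GeoLite2-Country.mmdb")  # assumed local database file

def scope_check(url):
    """Return (in_scope, ip, country) for one URL: .uk hosts are in scope
    by suffix; other hosts are in scope only if their IP geolocates to the UK."""
    host = urlparse(url).hostname or ""
    if host.endswith(".uk"):
        return True, None, "GB"
    try:
        ip = socket.gethostbyname(host)                 # resolve hostname to IPv4
        country = reader.country(ip).country.iso_code   # e.g. "GB", "US", "DE"
    except (socket.gaierror, geoip2.errors.AddressNotFoundError):
        return False, None, None
    return country == "GB", ip, country

print(scope_check("http://www.example.com/"))
```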

Geolocation
Although we automatically consider in-scope those sites served from the UK, we can include resources from other countries—the policy for which is detailed here—in order to obtain as full a representation of a UK resource as possible. Thus far we have visited 110 different countries over the course of this year’s crawl.

With regard to the number of resources archived from each country, at the top end the UK accounts for more than every other country combined, while towards the bottom of the list we have single resources being downloaded from Botswana and Macao, among others:

Visited Countries:

1. United Kingdom
2. United States
3. Germany
4. Netherlands
5. Ireland
6. France
...
106. Macao
107. Macedonia, Republic of
108. Morocco
109. Kenya
110. Botswana

Malware
Curiously, we've discovered significantly fewer instances of malware than we did in our previous domain crawl, although we are admittedly still at a relatively early stage and those numbers are likely to increase over the course of the crawl. The distribution, however, has remained notably similar: most of the 400+ affected sites have only a single item of malware, while one site alone accounts for almost half of those found.

Data collected
So far we have archived approximately 10 TB of data. The actual volume of data downloaded will likely be significantly higher because, firstly, all stored data are compressed and, secondly, we don't store duplicate copies of individual resources (see our earlier blog post regarding size estimates).
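As a toy illustration of the first point, HTML tends to compress very well, so the stored (compressed) size can be much smaller than the downloaded size. This minimal sketch uses only the Python standard library; the payload and the ratio it prints are purely illustrative, not measured figures from the crawl.

```python
import zlib

# A repetitive HTML payload stands in for a typical crawled page.
payload = b"<html><body>" + b"<p>Hello, archived web!</p>" * 500 + b"</body></html>"

stored = zlib.compress(payload, 9)   # stored records are compressed
print(f"downloaded: {len(payload):,} bytes, stored: {len(stored):,} bytes "
      f"({100 * len(stored) / len(payload):.1f}% of the original)")
```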

By Roger G. Coram, Web Crawl Engineer, The British Library

11 July 2014

Researcher in focus: Saskia Huc-Hepher – French in London

Saskia is a researcher at the University of Westminster and worked with the UK Web Archive in putting together a special collection of websites. This is her experience:

Curating a special collection
Over the course of the last two years, I have enjoyed periodically immersing myself in the material culture of the French community in London as it is (re)presented immaterially on-line. In a genuinely web-like fashion, a dip into one particular internet space has invariably led me inquisitively onto others, equally enlightening, and equally expressive of the here and now of this minority community, as the one before, and in turn leading to the discovery of yet more on-line microcosms of the French diaspora.

In fact, the website curation exercise has proven to be a rather addictive activity, with “just one more” site, tantalisingly hyperlinked to the one under scrutiny, delaying the often overdue computer shutdown. These meanderings, however, have a specific objective in mind: to create a collection of websites mirroring the physical presence of the French community in London in its manifold forms, be they administrative, institutional, entrepreneurial, gastronomical, cultural or personal.

Although the collection was intended to display a variety of London French on-line discourses and genres, thereby reflecting the multi-layered realities of the French presence on-land, the aim was also that they should come together as a unified whole, given a new sense of thematic coherence through their culturo-diasporic commonality and shared “home” in the Special Collection.

Open UK Web Archive vs Non-Print Legal Deposit
One of the key challenges with attempting to pull together a unified collection has been whether it can be viewed as a whole online. For websites to be published on the Open UK Web Archive website, permission needs to be granted by the website owner. Any website already captured for the Non-print Legal Deposit (from over 3.5 million domains) can be chosen, but these can only be viewed within the confines of a Legal Deposit Library.

In theory, this would mean that the Non-print Legal Deposit websites selected for the London French collection would be accessible on-site in one of the official libraries, but – crucially – not available for open-access consultation on-line.

With regard to this collection, therefore, the practical implications of the legislation could have given rise to a fragmented entity, an archive of two halves arbitrarily divorced from one another, one housed in the ‘ivory towers’ of the research elite and the other freely available to all via the Internet: not the coherent whole I had been so keen to create.

What to select?
In addition to aiming to produce a unified corpus, it was my vision that the rationale of the curation methodology should be informed by the “ethnosemiotic” conceptual framework conceived for my overarching London French research. My doctoral work brings together the ideas of two formerly disparate thinkers, namely (and rather fittingly perhaps) those of French ethnographer, Pierre Bourdieu, and Anglophone, Gunther Kress, of the British “school” of social semiotics, whose particular focus is on multimodal meanings.

Consequently, when selecting websites to be included in the collection, or at least earmarked for permission-seeking, it was vital that I took a three-pronged approach, choosing firstly “material” that demonstrated the official on-line presence of the French in London (what Bourdieu might term the “social field” level) and secondly the unofficial, but arguably more telling, grassroots representations of the community on the ground (Bourdieusian “Habitus”), as portrayed through individuals' blogs. Thirdly, for my subsequent multimodal analysis of the sites to be effective, it would also be necessary to select sites drawing on a multiplicity of modes, for instance written text, photographic images, sound, colour, layout, etc., which all websites do by default, but which some take to greater depths of complexity than others.

Video and audio not always captured
However, in the same way that the non-print legal deposit legislation challenges the integrity of the collection as a whole, so these theoretical aspirations turned out to be rather more optimistic than I had envisaged, not least because of the technical limitations of the special collections themselves.
Despite the infinite supplies of generosity and patience from the in-house team at the British Library, the fact that special collections cannot at present accommodate material from audiovisual sites, such as on-line radio and film channels (even some audio, visual and audiovisual content from standard sites can be lost in the crawling process), is an undeniable shortcoming.
It was a particular frustration when curating this collection, as audiovisual data, often containing tacit manifestations of cultural identity, are increasingly relied upon in the 21st-century digital age and thus of considerable value now and, perhaps more importantly, for future generations.

3D-Wall visualisation tool
Since completion of the inaugural collection, one or two additional positive lessons have been learned, like the “impact” value of the 3D-Wall visualisation tool. When presenting my curation work at the Institut Français de Londres last March, before a diverse public audience composed of community members together with academics, historians, journalists, publishers and students, none of whom were thought to be familiar with the UK Web Archive, making use of the 3D Wall proved to be an effective and tangible way to connect with the uninitiated.

[Screenshot: the 3D-Wall visualisation]

It brought the collection to life, transforming it from a potentially dull and faceless list of website names to a vibrant virtual “street” of London French cyberspaces, bringing a new dimension to the term “information superhighway”. It gave the audience a glimpse of the colourful multitude of webpages making up the façades of “London French Street”, to be visited and revisited beyond the confines of my presentation.

Indeed, the appeal of the collection, as displayed through the 3D Wall, generated unanticipated interest among several key players within the institutional and diplomatic bodies of the French community in London, not least the Deputy Consul and the Head of the French Lycée, both of whom expressed a keen desire to become actively involved in the project.
They found the focus on the quality of the everyday lives of the London French community a refreshing change from the media obsession with the quantity of its members, and I am convinced that it was the 3D Wall that enabled the collection to be showcased to its full potential.

In summary
To conclude, I have found the journey from idea, through curation – with the highs and lows of selection, permission-seeking and harvesting – to ultimately “going live”, a rewarding and enlightening process.

It has offered insights into the technical and administrative challenges of attempting to archive the ephemeral on-line world so as to preserve and protect it, as well as providing a rich picture of both the formal and informal representations of ‘Frenchness’ in modern London.

The corpus of websites I have curated aims to play its part in recording the collective identity of this often overlooked minority community, giving it a presence, accessible to all, for generations to come and, as such, contributing prospectively to the collective memory of this diasporic population.

By Saskia Huc-Hepher (University of Westminster)

02 July 2014

How much of the UK's HTML is valid?

How much of the HTML in the UK web archive is valid HTML? Despite its apparent simplicity, this turns out to be a rather difficult question to answer.

What is valid HTML anyway?
What do we mean by valid?

Certainly, the W3C works to create appropriate web standards, and provides validation tools for those standards that we could re-use.

However, the web browser software that you are using has its own opinion as to what HTML can be. For example, during the ‘browser wars’, competing software products invented individual features in order to gain market share while ignoring any effort to standardise them. Even now, although the relationship between browsers is much more amicable, some of the browser vendors still maintain their own 'living standard' that is similar to, but distinct from, the W3C HTML specification. Even aside from the issue of which definition to validate against, there is the further complication that browsers have always attempted to resolve errors and problems with malformed documents (a.k.a. ‘tag soup’), and do their best to present the content anyway.

Consequently, anecdotally at least, we know that a lot of the HTML on the web is perfectly acceptable despite being invalid, and so it is not quite clear what formal validation would achieve. Furthermore, the validation process itself is quite a computationally intensive procedure, and few web archives have the resources to carry out validation at scale. Based on this understanding of the costs and benefits, we do not routinely run validation processes over our web archives.

What can we look for?
However, we do process our archives in order to index the text from the resources. As each format stores text differently, we have to perform different processes to extract the text from HTML versus, say, a PDF or Office document. Therefore, we have to identify the format of each one in order to determine how to get at the text.

In fact, to help us understand our content, we run two different identification tools, Apache Tika and DROID. The former identifies the general format, and is a necessary part of text extraction processes, whereas the latter attempts to perform a more granular identification. For example, it is capable of distinguishing between the different versions of HTML.

Ideally, one would hope that each of these tools would agree on which documents are HTML, with DROID providing a little additional information concerning the versions of the formats in use. However, it turns out that DROID takes a somewhat stricter view of what HTML should look like, whereas Tika is a little more forgiving of HTML content that strays further away from standard usage. Another way to look at this is to say that DROID attempts to partially validate the first part of an HTML page, and so the set of documents that Tika identifies as HTML but DROID does not forms a reasonable estimate of the lower bound of the percentage of invalid HTML in the collection.
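To make that concrete, the comparison reduces to counting, per year, the resources Tika calls HTML and the subset of those that DROID does not. The sketch below assumes a hypothetical identifications.csv with one row per resource (columns year, tika_mime, droid_mime); the real workflow runs at scale over the whole collection rather than over a tidy CSV file.

```python
import csv
from collections import defaultdict

tika_hits = defaultdict(int)     # per year: resources Tika identifies as HTML
droid_misses = defaultdict(int)  # per year: of those, ones DROID does not identify as HTML

with open("identifications.csv") as f:   # assumed columns: year, tika_mime, droid_mime
    for row in csv.DictReader(f):
        if row["tika_mime"] == "text/html":
            tika_hits[row["year"]] += 1
            if row["droid_mime"] != "text/html":
                droid_misses[row["year"]] += 1

# DROID 'misses' as a percentage of Tika 'hits': a lower bound on invalid HTML.
for year in sorted(tika_hits):
    pct = 100.0 * droid_misses[year] / tika_hits[year]
    print(f"{year}: {pct:.1f}% of Tika-identified HTML not recognised by DROID")
```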

Results
Based on two thirds of our 1996-2010 collection (a randomly-selected subset containing 1.7 billion of about 2.5 billion resources hosted from *.uk), we've determined the DROID 'misses' as a percentage of the Tika 'hits' for HTML, year by year, here:

[Chart: DROID ‘misses’ as a percentage of Tika ‘hits’ for HTML, by year]

From there one can see that pre-2000, at least ten percent of the archived HTML is so malformed that it's difficult to even identify it as being HTML. For 1995, the percentage rises to 95%, with only 5% of the HTML being identified as such by DROID (although note that the 1995 data only contains a few hundred resources). Post-2000 the fraction of 'misses' has dropped significantly and as of 2010 appears to be around 1%.

What next?
While it is certainly good news that we can reliably identify 99% of the HTML on the contemporary web, neither Tika nor DROID performs proper validation, and the larger question goes unanswered. While at least 1% of the current web is certainly invalid, we know from experience that the actual percentage is likely to be much higher. The crucial point, however, is that it remains unclear whether full, formal validity actually matters. As long as we can extract the text and metadata, we can ensure the content can be found and viewed, whether it is technically valid or not.

Although the utility of validation is not yet certain, we will still consider adding HTML validation to future iterations of our indexing toolkit. We may only pass a smaller, randomly selected sample of the HTML through that costly process, as this would still allow us to understand how much content has the clarity of formal validation, and thus how important the W3C (and the standards it promotes) are to the UK web. Perhaps it will tell us something interesting about standards adoption and format dynamics over time.

Written by Andy Jackson, Web Archiving Technical Lead, The British Library

24 June 2014

Your Web Archive Needs You!

With the centenary of the outbreak of World War One taking place this summer, the British Library's Web Archiving team has been working with colleagues across the Library and beyond to initiate a ‘First World War Centenary Special Collection’ of websites.

The collection is part of a wide range of centenary projects under way at the Library including:

These projects will enable thousands of people to engage with the centenary and to showcase the many significant items held by the Library relating to the war.

The Special Collection
The web archive collection will include a huge variety of websites related to the centenary, including the various events which will be taking place; resources about the history of the war; academic sites on the meaning of the conflict in modern memory and patterns of memorialisation; and critical reflections on British involvement in armed conflict more generally.

The collection will help researchers find out how the First World War shaped our society and continues to touch our lives at a personal level in our local communities and as a nation.

Archiving began in April 2014 and will continue until 2019. Some examples of websites archived so far include:

We need your help!
Do you know of a website which may be suitable for the First World War Centenary Collection? If so, we would love to hear from you, particularly if you edit or publish a WW1 themed website yourself.

Websites could include those created by museums, archives, libraries, special interest groups, universities, performing arts groups, schools and community groups, family and local history societies or individual publications. It does not cost anything to have your website archived by the British Library and involves no work on your part once nominated.

Please nominate UK based WW1 related websites through our nominate form.

If you have HLF funding for a First World War Centenary project, please send the URL (web address) to [email protected] with your project reference number.

See what we have in the WW1 special collection so far.

Written by Nicola Bingham, Web Archivist, British Library

23 June 2014

Researcher in focus: Paul Thomas - UK and Canadian Parliamentary Archives

At the UK Web Archive, we’re always delighted to learn about specific uses that researchers have been able to make of our data. One such case is from the work of Paul Thomas, a doctoral student in political science at the University of Toronto.

Paul writes:

‘The UK Web Archive has been a huge asset to my dissertation. My research examines how backbench parliamentarians in Canada, the UK and Scotland are increasingly cooperating across party lines through a series of informal organizations known as All-Party Groups (APGs). For the UK, the most important source for my research is the registry of APGs that is regularly produced by the House of Commons. The document, which is published in both web and PDF formats, provides details on the more than 500 groups that are in operation, including which MPs and Peers are involved, and what funding groups have received from outside bodies like lobbyists or charities.

‘A key part of the study involved using the registries to construct a dataset that tracked membership patterns across the various groups, and how they changed over time. Unfortunately, each time a new version of the registry is produced, the previous web copy is taken down.’

While the Parliamentary Archives keep old copies of the registry on file, they only do so in PDF – a format that is not so conducive to the extraction of information into a dataset. Paul was able to find and use successive versions from the UK Web Archive going back to 2006, including a number that were missing from the Internet Archive. Paul was also able to obtain pre-2006 versions from the Internet Archive. ‘Without the UK Web Archive, I would have first needed to purchase the past registries in PDF from the Parliamentary Archives and then painstakingly copy the details on each group into a dataset.’ Overall, Paul writes, ‘the UK Web Archive saved me an enormous amount of time in compiling my data'.

Paul recently gave a paper drawing on this data at the Annual Conference of the Canadian Political Science Association:
http://pauledwinjames.files.wordpress.com/2014/05/paul-thomas-cpsa2014v2.pdf

12 June 2014

How big is the UK web?

The British Library is about to embark on its annual task of archiving the entire UK web space. We will be pushing the button, sending out our ‘bots to crawl every British domain for storage in the UK Legal Deposit web archive. How much will we capture? Even our experts can only make an educated guess.

You’ve probably played the time-honoured village fete game where you guess how many jelly beans are in the jar and the winner gets a prize. Well, perhaps we can ask you to guess the size of the UK internet, and the nearest gets… the glory of being right. Some facts from last year might help.

2013 Web Crawl
In 2013 the Library conducted the first crawl of all .uk websites. We started with 3.86 million seeds (websites), which led to the capture of 1.9 billion URLs (web pages, documents, images). All this resulted in 30.84 terabytes (TB) of data! It took the Library's robots 70 days to collect.
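For a sense of scale, those figures imply an average rate of roughly 27 million URLs per day (a little over 300 per second) and around 16 KB of stored data per URL. A quick back-of-the-envelope check, using only the numbers above and assuming decimal terabytes:

```python
# Back-of-the-envelope figures from the 2013 domain crawl.
urls = 1.9e9      # URLs captured
data_tb = 30.84   # terabytes of data stored
days = 70         # crawl duration

urls_per_day = urls / days
urls_per_second = urls_per_day / 86_400
bytes_per_url = data_tb * 1e12 / urls   # assuming decimal (10^12-byte) terabytes

print(f"{urls_per_day / 1e6:.1f} million URLs per day ({urls_per_second:.0f} per second)")
print(f"~{bytes_per_url / 1e3:.0f} KB of stored data per URL on average")
```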

Geolocation
In addition to the .uk domains, the Library has the scope to collect websites that are hosted in the UK, so we will attempt to geolocate IP addresses to establish whether they fall within the geographical confines of the UK. This means that we will be pulling in .com, .net, .info and many other Top Level Domains (TLDs). How many extra websites? How much data? We just don’t know at this time.

De-duplication
A huge issue in collecting the web is the large number of duplicates that are captured and saved, something that can add a great deal to the volume collected. Of the 1.9 billion web pages and other resources, a significant number are probably copies, and our technical team have worked hard this time to reduce this duplication, or ‘de-duplicate’. We are, however, uncertain at the moment as to how much effect this will eventually have on the total volume of data collected.
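In broad terms, de-duplication means comparing a digest of each downloaded payload against digests already seen; if the content is byte-for-byte identical to an earlier capture, only a lightweight reference to that earlier copy is recorded rather than a second full copy. The following is a toy sketch of the principle only; the real implementation in the crawl infrastructure is considerably more involved.

```python
import hashlib

seen = {}   # payload digest -> URL of the first capture of that content

def store_or_reference(url, payload):
    """Store a payload only the first time its content is seen; later
    identical captures are recorded as references to the earlier copy."""
    digest = hashlib.sha1(payload).hexdigest()
    if digest in seen:
        return ("reference", seen[digest])
    seen[digest] = url
    return ("stored", digest)

print(store_or_reference("http://example.co.uk/a", b"<html>same content</html>"))
print(store_or_reference("http://example.co.uk/b", b"<html>same content</html>"))
```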

Predictions
In summary then, in 2014 we will be looking to collect all of the .uk domain names plus all the websites that we can find that are hosted in the UK (.com, .net, .info etc.), overall a big increase in the number of ‘seeds’ (websites). It is hard, however, to predict what effect these changes will have compared to last year. What the final numbers might be is anyone’s guess. What do you think?

Let us know in the comments below, or on Twitter (@UKWebArchive), YOUR predictions for 2014: the number of URLs, the size in terabytes (TB) and (if you are feeling very brave) the number of hosts. Note that organisations like the BBC and the NHS consist of lots of websites each but count as one ‘host’.

We want:

  • URLs (in billions)
  • Size (in terabytes)
  • Hosts (in millions) 

#UKWebCrawl2014

We will announce the winner when all the data is safely on our servers sometime in the summer. Good luck.