THE BRITISH LIBRARY

UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Peter Webster (Engagement and Liaison Manager).

11 November 2014

Collecting First World War Websites – November 2014 update


Earlier in 2014 we blogged about the new Special Collection of websites related to World War One that we’ve put together to mark the Centenary. As today is Armistice Day, commemorating the cessation of hostilities on the Western Front, it seems fitting to see what we have collected so far.


The collection has been growing steadily over the past few months and now totals 111 websites. A significant subset of the WW1 special collection comes from the output of Heritage Lottery funded projects. The collection also includes websites selected by subject specialists at the British Library and nominations from members of the public.

A wide variety of websites have been archived so far which can broadly be categorised into a few different types:

Critical reflections
These include critical reflections on British involvement in armed conflict more generally: for example, the Arming All Sides website, which features a discussion of the arms trade around WW1, and Naval-History.net, an invaluable academic resource on the history of naval conflict in the First and Second World Wars.

Artistic and literary
The First World War inspired a wealth of artistic and literary output. One example is the website dedicated to Eugene Burnand (1850-1921), a Swiss artist who created a series of pencil and pastel portraits depicting the various ‘military types’ of all races and nationalities drawn into the conflict on all sides. Burnand was a man of great humanity, and his subjects included the typical men and women who served in the War as well as those of more significant military rank.

The collection also includes websites of contemporary artists who, in connection with the Centenary, are creating work that reflects on the history of the conflict. One such artist is Dawn Cole, whose work on WW1 has focused on the archived diaries of VAD nurse Clarice Spratling, resulting in a project of live performance, spoken word and art installations.

Similar creative reflections from the worlds of theatre, film and radio can be seen in the archive. See, for example, Med Theatre: Dartmoor in WW1, an eighteen-month project investigating the effect the First World War had on Dartmoor and its communities. Pals for Life is a project based in the north-west aiming to create short films that enable local communities to learn about World War One. Subterranean Sepoys is a radio play resulting from the work of volunteers researching the forgotten stories of Indian soldiers and their British officers in the trenches of the Western Front in the first year of the Great War.

Community stories
The largest group of websites archived so far comprises projects produced by individuals or local groups telling stories of the War at a community level across the UK. The Bottesford Parish 1st World War Centenary Project focusses on 220 local recruits who served in the War, using wartime biographies, memorabilia and memories still in the community to tell their stories.

The Wylye Valley 1914 project has been set up by a Wiltshire-based local history group researching the Great War and the sudden, dramatic social and practical effects it had on the local population. In 1914, 24,000 troops descended on the Wylye Valley villages, the largest of which had a population of 500, in response to Kitchener’s appeals for recruits. These men arrived without uniforms, accommodation or any experience of organisation. The project explores the effects of the War on these men and its impact on the local communities.

An important outcome of the commemorations of the Centenary of WW1 has been the restoration and transcription of war memorials across the UK. Many local projects have used the opportunity to tell the stories of those who were lost in the conflict. Examples include the Dover War Memorial Project; the Flintshire War Memorials Project; the Leicester City, County and Rutland War Memorials project; and the St. James Toxteth War Memorials project.

Collecting continues
This shows just some of the many ways in which people are choosing to commemorate the First World War, and demonstrates the continued fascination with it.

We will continue collecting First World War websites throughout the Centenary period, to 2018 and beyond. If you own a website or know of a website about WW1 and would like to nominate it for archiving, we would love to hear from you. Please submit the details via our nominate form.

By Nicola Bingham, Web Archivist, The British Library

03 November 2014

Powering the UK Web Archive search with Solr


With hundreds of millions of webpages to search, what technologies do we use at the UK Web Archive to ensure the best service?

Solr
At the core of the UK Web Archive is the open source tool Apache Solr. To quote from their own website, ‘Solr is a very popular open source enterprise search platform that provides full-text and faceted searching’.

It is built using scalable and fault-tolerant technologies, providing distributed indexing, automated failover and recovery, and centralised configuration management. And lots more besides: put simply, the Solr project is actively developing features across all aspects of big-data search indexing and querying.
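Both the full-text and the faceted searching mentioned above are exposed through Solr's HTTP API. Here is a minimal sketch of such a query in Python; the endpoint, core name and field names ('content', 'crawl_year') are illustrative assumptions, not the archive's actual schema:

    # Query a Solr core over HTTP: full-text search plus facet counts.
    # The URL and field names below are hypothetical examples.
    import requests

    SOLR_URL = "http://localhost:8983/solr/ukwa/select"

    params = {
        "q": "content:armistice",     # full-text match on the page text
        "rows": 10,                   # return the first ten documents
        "facet": "true",              # ask Solr to compute facet counts
        "facet.field": "crawl_year",  # facet on the year of capture
        "wt": "json",
    }

    data = requests.get(SOLR_URL, params=params).json()
    print("total hits:", data["response"]["numFound"])
    # facet_fields comes back as a flat [value, count, value, count, ...] list
    print("hits per year:", data["facet_counts"]["facet_fields"]["crawl_year"])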

Open UK Web Archive
The UKWA website provides public access to more than 200 million selected UK webpages. The selection process includes gaining the website owner's permission to publish the archived site, and you can nominate a website for archiving via our Nominate a Site page.

Once a site is harvested it is stored internally on several systems to ensure the safe keeping of the data. From these stores the data is ingested into the Solr service, which analyses the metadata and content, primarily to enable fast querying of the service. Much of Solr's speed comes from the way it indexes this data: it builds an inverted index, mapping each term to the documents that contain it, rather than each document to its terms.
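A toy sketch of the idea (purely illustrative, not Solr's actual implementation):

    # Build a tiny inverted index: each term maps to the set of
    # document ids that contain it, so lookups need no full scan.
    from collections import defaultdict

    docs = {
        1: "the first world war",
        2: "the web archive",
        3: "world wide web",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    # Querying is a dictionary lookup plus a set intersection.
    print(index["web"])                   # {2, 3}
    print(index["world"] & index["web"])  # {3}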

Capable servers
To support these archived websites and provide the UK Web Archive search, we run the service on two dual-Xeon servers – an HP ProLiant DL580 G5 with 96GB of RAM and an HP ProLiant DL380 G5 with 64GB of RAM. The data is stored on a Storage Area Network (SAN) accessed over Fibre Channel connections.


The Solr service itself runs under the Apache Tomcat Java servlet container, and is split between the two physical servers as a master and slave setup – one provides the data ingest mechanism, the other provides the data querying mechanism for the public website.

Scalability
One of the benefits of using Apache Solr is that it is fairly simple to grow a system, in terms of both speed and data capacity. As the amount of web content increases, we can add more hardware to handle the extra load, because Solr was designed from the outset as a distributed service.

By Gil Hoggarth, Web Archiving Technical Services Engineer, The British Library

16 October 2014

What is still on the web after 10 years of archiving?


The UK Web Archive started archiving web content towards the end of 2004 (e.g. the Hutton Inquiry). If we want to look back at the (almost) ten years that have passed since then, can we find a way to see how much we’ve achieved? Are the URLs we’ve archived still available on the live web? Or are they long since gone? If those URLs are still working, is the content the same as it was? How has our archival sliver of the web changed?

Looking Back
One option would be to go through our archives and exhaustively examine every single URL, and work out what has happened to it. However, the Open UK Web Archive contains many millions of archived resources, and even just checking their basic status would be very time-consuming, never mind performing any kind of comparison of the content of those pages.

Fortunately, to get a good idea of what has happened, we don’t need to visit every single item. Our full-text index categorizes our holdings by, among other things, the year in which the item was crawled. We can therefore use this facet of the search index to randomly sample a number of URLs from each year the archive has been in operation, and use those to build up a picture that compares those holdings to the current web.
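As a concrete illustration, here is a minimal sketch of such a sampled, faceted query, assuming the Solr schema defines a solr.RandomSortField dynamic field (random_*) and a crawl_year field; the endpoint and field names are assumptions, not our actual configuration:

    # Randomly sample archived URLs for one crawl year via Solr.
    # Sorting on a random_<seed> field shuffles results reproducibly.
    import requests

    SOLR_URL = "http://localhost:8983/solr/ukwa/select"  # hypothetical endpoint

    def sample_urls(year, seed, n=1000):
        """Return up to n randomly ordered URLs crawled in the given year."""
        params = {
            "q": f"crawl_year:{year}",
            "sort": f"random_{seed} asc",  # solr.RandomSortField ordering
            "fl": "url",                   # fetch only the URL field
            "rows": n,
            "wt": "json",
        }
        docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]
        return [doc["url"] for doc in docs]

    samples = {year: sample_urls(year, seed=42) for year in range(2004, 2015)}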

URLs by the Thousand
Our search system has built-in support for randomizing the order of the results, so a simple script that performs a faceted search was all that was needed to build up a list of one thousand URLs for each year. A second script was used to attempt to re-download each of those URLs, and record the outcome of that process. Those results were then aggregated into an overall table showing how many URLs fell into each different class of outcome, versus crawl date, as shown below:

[Chart: sampled URLs per crawl quarter, broken down by outcome class]

Here, ‘GONE’ means that not only is the URL missing, but the host that originally served that URL has disappeared from the web. ‘ERROR’, on the other hand, means that a server still responded to our request, but that our once-valid URL now causes the server to fail.

The next class, ‘MISSING’, ably illustrates the fate of the majority of our archived content - the server is there, and responds, but no longer recognizes that URL. Those early URLs have become 404 Not Found (either directly, or via redirects). The remaining two classes show URLs that end with a valid HTTP 200 OK response, either via redirects (‘MOVED’) or directly (‘OK’).
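The classification step can be sketched with Python's requests library; the class names follow the definitions above, while the exact mapping of network errors and status codes onto them is our assumption:

    # Re-download a URL and classify the outcome, per the classes above.
    import requests

    def classify(url):
        try:
            r = requests.get(url, timeout=30, allow_redirects=True)
        except requests.exceptions.ConnectionError:
            return "GONE"      # the original host no longer responds at all
        except requests.exceptions.RequestException:
            return "ERROR"     # the request failed in some other way
        if r.status_code == 404:
            return "MISSING"   # host is alive but has forgotten the URL
        if r.status_code >= 400:
            return "ERROR"     # a once-valid URL now makes the server fail
        if r.history:
            return "MOVED"     # reached a 200 OK via one or more redirects
        return "OK"            # direct 200 OK at the original URL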

The horizontal axis shows the results over time, since late 2004, broken down by each quarter (i.e. 2004-4 is the fourth quarter of 2004). The overall trend clearly shows how the items we have archived have disappeared from the web, with individual URLs being forgotten as time passes. This is in contrast to the fairly stable baseline of ‘GONE’ web hosts, which reflects our policy of removing dead sites from the crawl schedules promptly.

Is OK okay?
However, so far this only tells us which URLs are still active - the content of those resources could have changed completely. To explore this issue, we have to dig a little deeper by downloading the content and trying to compare what’s inside.

This is very hard to do in a way that is both automated and highly accurate, simply because there are currently no reliable methods for automatically determining when two resources carry the same meaning, despite being written in different words. So, we have to settle for something that is less accurate, but that can be done automatically.

The easy case is when the content is exactly the same – we can just record that the resources are identical at the binary level. If not, we extract whatever text we can from the archived and live URLs, and compare them to see how much the text has changed. To do this, we compute a fingerprint from the text contained in each resource, and then compare those to determine how similar the resources are. This technique has been used for many years in computer forensics applications, such as helping to identify ‘bad’ software, and here we adapt the approach in order to find similar web pages.

Specifically, we generate ssdeep ‘fuzzy hash’ fingerprints, and compare them in order to determine the degree of overlap in the textual content of the items. If the algorithm is able to find any similarity at all, we record the result as ‘SIMILAR’. Otherwise, we record that the items are ‘DISSIMILAR’.
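A minimal sketch using the Python ssdeep bindings, with the SIMILAR/DISSIMILAR cut-off at any non-zero match score, as described above:

    # Compare extracted text using ssdeep 'fuzzy hash' fingerprints.
    import ssdeep

    def similarity(archived_text, live_text):
        if archived_text == live_text:
            return "SAME"                  # identical (in the full pipeline this
                                           # check is done at the binary level)
        h1 = ssdeep.hash(archived_text)    # fingerprint of the archived copy
        h2 = ssdeep.hash(live_text)        # fingerprint of the live copy
        score = ssdeep.compare(h1, h2)     # 0 (no match) .. 100 (near identical)
        return "SIMILAR" if score > 0 else "DISSIMILAR"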

Processing all of the ‘MOVED’ or ‘OK’ results in this way leads to this graph:

[Chart: ‘OK’ and ‘MOVED’ URLs per crawl quarter, broken down by similarity class (SAME, SIMILAR, DISSIMILAR)]

So, for all those ‘OK’ or ‘MOVED’ URLs, the vast majority appear to have changed. Very few are binary identical (‘SAME’), and while many of the others remain ‘SIMILAR’ at first, that fraction tails off as we go back in time.

Summarising Similarity
Combining the similarity data with the original graph, we can replace the ‘OK’ and ‘MOVED’ parts of the graph with the similarity results in order to see those trends in context:

[Chart: sampled URLs per crawl quarter, with ‘OK’ and ‘MOVED’ replaced by the similarity classes]

Shown in this way, it is clear that very few archived resources are still available, unchanged, on the current web. Or, in other words, very few of our archived URLs are cool (in the sense of the W3C’s dictum that ‘cool URIs don’t change’).

Local vs Global Trends
While this analysis helps us understand the trends and value of our open archive, it’s not yet clear how much it tells us about other collections, or global trends. Historically, the UK Web Archive has focused on high-status sites and sites known to be at risk, and these selection criteria are likely to affect the overall trends. In particular, the very rapid loss of content observed here is likely due to the fact that so many of the sites we archive were known to be ‘at risk’ (such as the sites lost during the 2012 NHS reforms). We can partially address this by running the same kind of analysis over our broader, domain-scale collections. However, that would still bias things towards the UK, and it would be interesting to understand how these trends might differ across countries, and globally.

By Andy Jackson, Web Archiving Technical Lead, The British Library