UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites


30 September 2013

Watching the UK domain crawl with Monitrix

We at the UK Web Archive have been archiving selected websites since 2004, and throughout we have worked to ensure that the quality of those archived sites is acceptably high. This involves a lot of manual effort; it means inspecting the web pages on each site, tracking down display issues, and re-configuring and re-crawling as necessary. On this basis, we have to date archived over 60,000 individual snapshots of websites over nearly a decade.

Now that the Legal Deposit legislation is in place, we are presented with a formidable challenge. As we move from thousands of sites to millions, what can we do to ensure the quality is high enough? We have the resources to manually inspect a few thousand sites a year, but that's now a drop in the ocean.

At large scale, even fairly basic checks become difficult. When there are only a few crawls running at once, it is easy to spot when the crawl of a single site fails for some unexpected reason. When we have very large numbers of sites being crawled simultaneously, and at varying frequencies, simply keeping track of what is going on at any given moment is not easy, and failed crawls can go unnoticed.

This is particularly important on those rare occasions when a web publisher contacts us with an issue about our crawling activity. We need to be able to work out straight away what has been going on and in which crawler process, and to modify its behaviour accordingly. This is why we began to develop Monitrix, a crawl monitoring component to complement our crawler.

The core idea is quite simple: Monitrix consumes the crawl log files produced by Heritrix3 and, in real time, derives statistics and metrics from that stream of crawl events. That critical information is then made available via a web-based interface.
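
To give a flavour of what this involves, here is a minimal sketch of that log-following idea in Python. It is an illustration rather than the actual Monitrix code: it assumes the usual Heritrix 3 crawl.log column order (status code in the second field, size in bytes in the third), and the log path is hypothetical.

```python
# Minimal sketch of Monitrix-style log monitoring (illustrative only).
# Assumes the standard Heritrix 3 crawl.log layout: timestamp, status,
# bytes, URI, hop path, via URI, MIME type, ...
import time
from collections import Counter

def follow(path):
    """Yield lines as they are appended to a log file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)                  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1.0)
                continue
            yield line

def monitor(log_path="crawl.log"):   # hypothetical path
    urls = 0
    total_bytes = 0
    status_counts = Counter()
    for line in follow(log_path):
        fields = line.split()
        if len(fields) < 7:
            continue                  # skip malformed lines
        status, size = fields[1], fields[2]
        urls += 1
        if size.isdigit():            # size is "-" for some entries
            total_bytes += int(size)
        status_counts[status] += 1
        if urls % 10000 == 0:
            print(f"{urls} URLs, {total_bytes / 1e9:.2f} GB, "
                  f"top statuses: {status_counts.most_common(3)}")

if __name__ == "__main__":
    monitor()
```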

 

Monitrix in action, showing graphs of data volume over time and other key indicators.

We initially trialled Monitrix during our first Legal Deposit crawl, relating to the reorganisation of the NHS in England and Wales in April. This worked very well, and the interface allowed us to track and explore the crawler activity as it happened. Simple things, like being able to flip back quickly through the chain of links that brought the crawlers to a particular site, proved very helpful in understanding the crawl's progress.
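
As a purely hypothetical illustration of that "chain of links" lookup, the sketch below walks back through the referring ("via") URIs recorded in the crawl log, assuming the URI and via URI sit in the fourth and sixth columns; it is not the Monitrix implementation.

```python
# Illustrative sketch: reconstruct the chain of links that led the crawler
# to a given URI, using the URI (field 4) and "via" URI (field 6) columns
# of a Heritrix 3 crawl.log. Not the Monitrix implementation.
def load_via_map(log_path):
    via = {}
    with open(log_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 6:
                uri, referrer = fields[3], fields[5]
                via[uri] = referrer
    return via

def link_chain(uri, via):
    """Walk back from a URI to the seed it was discovered from."""
    chain = [uri]
    while uri in via and via[uri] != "-" and via[uri] not in chain:
        uri = via[uri]
        chain.append(uri)
    return list(reversed(chain))     # seed first, target last

# Example (hypothetical file and URI):
# print(link_chain("http://example.co.uk/page", load_via_map("crawl.log")))
```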

But then came the real challenge: using Monitrix during the domain crawl. The NHS collection contained only 5,500 sites, amounting to just 1.8TB of archived data. In contrast, the domain crawl would eventually include millions of sites and over 30TB of data. Initially, Monitrix worked quite well, but as the crawl went on it became clear that it could not keep up with the sheer volume of data being pushed into it. The total number of URLs climbed into the millions, at one point being collected at a rate of 857 per second. Under this bombardment, Monitrix became slower and slower.

What was the problem? With that twenty-twenty vision that comes only with hindsight, it became abundantly clear that the architecture of the MongoDB database (on which Monitrix is based) was not well suited to this, our largest-scale use case. However, we now believe we have found at least one appropriate alternative technology, Apache Cassandra, and we are in the process of moving Monitrix over to that database system.
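
To illustrate why a column store like Cassandra suits this kind of sustained, write-heavy stream of events, here is a hedged sketch of a time-bucketed table using the DataStax Python driver. The keyspace, table and column names are invented for the example and are not Monitrix's actual data model.

```python
# Hedged sketch: a time-bucketed Cassandra table for crawl-log events,
# using the DataStax Python driver. Keyspace, table and column names are
# invented for this example; they are not Monitrix's actual data model.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # assumes a local Cassandra node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS crawl_monitor
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("crawl_monitor")

# Partitioning by an hourly bucket (e.g. '2013-06-21T14') keeps each
# partition bounded in size and spreads the sustained write load, while
# clustering by timestamp keeps events ordered for time-range queries.
session.execute("""
    CREATE TABLE IF NOT EXISTS crawl_log (
        hour_bucket text,
        log_ts      timestamp,
        uri         text,
        status      int,
        size        bigint,
        PRIMARY KEY (hour_bucket, log_ts, uri)
    )
""")

cluster.shutdown()
```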

Andy Jackson, Web Archiving Technical Lead, British Library

16 September 2013

Crawling the UK web domain

After the initial flurry of publicity surrounding the final advent of Non-Print Legal Deposit in April, we in the web archiving team at the British Library began the job of actually getting on with part of that new responsibility: that is, routinely archiving the whole of the UK web domain. This is happening in partnership with the other five legal deposit libraries for the UK: the National Library of Wales, the National Library of Scotland, Cambridge University Library, the Bodleian Libraries of the University of Oxford, and Trinity College Dublin.

We blogged back in April about how we were getting on, having captured 3.6TB of compressed data from some 191 million URIs in the first week alone.

Now, we're finished. After a staggered start on April 8th, the crawl ended on June 21st, just short of eleven weeks later. Having started off with a list of 3.8 million seeds, we eventually captured over 31TB of compressed data. At its fastest, a single crawler was visiting 857 URIs per second.

There is of course a great deal of fascinating research that could be done on this dataset, and we'd be interested in suggestions of the kinds of questions we ought to ask of it. For now, there are some interesting views we can take of the data. For example, here is the number of hosts plotted against the total volume of data.

2013 domain crawl: data volumes and hosts

This initial graph suggests that a great many domains are very small in size indeed: more than 200,000 domains yield only 64 bytes each, a minuscule amount of data. These could be sites that return no content at all, redirect elsewhere, or are "parked" domains. At the other end of the scale, there are perhaps c.50,000 domains that return 256MB of data or more.
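
For illustration only, a distribution of this kind could be derived from the crawl logs roughly as follows, binning per-host byte totals into power-of-two buckets. The column positions are assumed from the standard Heritrix 3 crawl.log layout, and this is not the code used to produce the graph above.

```python
# Illustrative sketch: distribution of total data volume per host,
# binned into power-of-two buckets (64B, 128B, ..., 256MB, ...).
# Assumes Heritrix 3 crawl.log columns: bytes in field 3, URI in field 4.
from collections import Counter, defaultdict
from urllib.parse import urlsplit

def host_volume_distribution(log_path):
    per_host = defaultdict(int)
    with open(log_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 4 or not fields[2].isdigit():
                continue
            host = urlsplit(fields[3]).hostname or "unknown"
            per_host[host] += int(fields[2])

    buckets = Counter()
    for total in per_host.values():
        bucket = 64
        while bucket < total:
            bucket *= 2
        buckets[bucket] += 1          # number of hosts in each size bucket
    return buckets

# for size, hosts in sorted(host_volume_distribution("crawl.log").items()):
#     print(f"<= {size} bytes: {hosts} hosts")
```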

It's worth remembering that this only represents those sites which we can know (in a scalable way) are from the UK, which for the most part means sites with domains ending in .uk. There are various means of determining whether a .com, .org or .net site falls within the scope of the regulations, but none of them is yet scalable; best estimates suggest that there may be half as many sites again from the UK which we are not yet capturing.
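
By way of illustration, the scalable part of that test amounts to little more than the following sketch; the real crawl scoping is considerably more involved, so treat this purely as a toy example.

```python
# Toy example of the scalable "is this host identifiably UK?" test:
# only hosts under .uk can be identified automatically at this scale.
from urllib.parse import urlsplit

def identifiably_uk(url):
    """True if the host can be recognised as UK purely from its domain name."""
    host = (urlsplit(url).hostname or "").lower()
    return host.endswith(".uk")

assert identifiably_uk("http://www.example.co.uk/page")
assert not identifiably_uk("http://example.com/")  # may well be UK, but not provably so from the name
```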

The next stages are to index all the data and then to ingest it into our Digital Library System, tasks which themselves take several weeks. We anticipate the data being available in the reading rooms of the legal deposit libraries at the very end of 2013. We plan a domain crawl at least once a year, and possibly twice if resources allow.

04 September 2013

Scaling up to archive the UK web

The non-print legal deposit legislation became effective on 6 April 2013, which has fundamentally changed the way we archive the UK web. We are now allowed to collect many more websites, enabling us to preserve the nation’s digital heritage at scale, in partnership with the other five legal deposit libraries for the UK (LDLs).

You may have noticed that not much new content has been added to the UK Web Archive recently. But we have been busy behind the scenes: crawling billions of URLs, establishing new workflows and adapting our tools. The archived websites are being made available in LDL reading rooms and some of them will also be added to the open UK Web Archive as we progress.

Our strategy consists of a mixed collection model, allowing periodic crawls of the UK web in its entirety coupled with prioritisation of the parts which are deemed curatorially important by the six LDLs. These will then receive greater attention in curation and quality checking. The components of the collection model are:

  • the annual / biannual domain crawl, intended to capture the UK domain as comprehensively as possible, providing the overview and the “big picture”;
  • key sites - those representing UK organisations and individuals which are of general interest in a particular sector of the life of the UK and/or its constituent nations;
  • news websites, containing news published frequently on the web by journalistic organisations; and
  • events-based collections, which will capture political, cultural, social and economic events of national interest.


Broad collection framework under non-print legal deposit

The legal deposit regulations allow us to archive in this way on the proviso that users may only access the archived material itself from premises controlled by one of the six LDLs. However, we are also working to provide greater access to high-level data and analytics about the archive, and we will also be seeking permission from website owners to provide online access to selected websites in the UK Web Archive.

Look out for blog posts about the collection based on the reform of the NHS in England and Wales, and our first broad UK domain crawl.

Helen Hockx-Yu is head of web archiving at the British Library