Crawling the UK web domain
After the initial flurry of publicity surrounding the final advent of Non-Print Legal Deposit in April, we in the web archiving team at the British Library began the job of actually getting on with part of that new responsibility: that is, routinely archiving the whole of the UK web domain. This is happening in partnership with the other five legal deposit libraries for the UK: the National Library of Wales, the National Library of Scotland, Cambridge University Library, the Bodleian Libraries of the University of Oxford, and Trinity College Dublin.
We blogged back in April about how we were getting on, having captured 3.6TB of compressed data from some 191 million URIs in the first week alone.
Now, we're finished. After a staggered start on April 8th, the crawl ended on June 21st, just short of eleven weeks later. Having started off with a list of 3.8 million seeds, we eventually captured over 31TB of compressed data. At its fastest, a single crawler was visiting 857 URIs per second.
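As a rough check on those figures, the length of the crawl and its average daily volume can be worked out from the dates above. This is only a sketch: it assumes the year is 2013 (the dates above give day and month only) and treats 31TB as a lower bound, and the variable names are purely illustrative.

```python
from datetime import date

# Assumption: the crawl ran in 2013; the post gives only day and month.
start = date(2013, 4, 8)
end = date(2013, 6, 21)

days = (end - start).days            # elapsed days between start and end
weeks, extra_days = divmod(days, 7)  # whole weeks plus leftover days

total_tb = 31                        # "over 31TB", so this is a lower bound
avg_tb_per_day = total_tb / days

print(f"{days} days = {weeks} weeks and {extra_days} days")
print(f"average of at least {avg_tb_per_day:.2f} TB per day")
```

Because the start was staggered, the true daily rate will have varied considerably; the average is only a coarse summary.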
There is of course a great deal of fascinating research that could be done on this dataset, and we'd be interested in suggestions of the kinds of questions we ought to ask of it. For now, there are some interesting views we can take of the data. For example, here is the number of hosts plotted against the total volume of data.
This initial graph suggests that a great many domains are very small indeed: more than 200,000 yield only 64 bytes, a minuscule amount of data. These could be sites that return no content at all, sites that redirect elsewhere, or "parked" domains. At the other end of the scale, there are perhaps 50,000 domains that return 256MB of data or more.
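A view like this can be produced by grouping hosts into size bands and counting the hosts in each band. The sketch below assumes power-of-two bands, which is a guess prompted by the 64B and 256MB figures and not a description of how our graph was actually made. The function names and the sample hosts are hypothetical.

```python
from collections import Counter

def size_bucket(total_bytes):
    """Return the smallest power of two that is >= total_bytes (minimum 1)."""
    if total_bytes <= 1:
        return 1
    return 1 << (total_bytes - 1).bit_length()

def histogram(host_sizes):
    """Count hosts per bucket, given a mapping of host -> total bytes captured."""
    return Counter(size_bucket(n) for n in host_sizes.values())

# Hypothetical sample data, not taken from the crawl.
sample = {
    "tiny-one.example.co.uk": 60,
    "tiny-two.example.co.uk": 64,
    "empty.example.org.uk": 0,
    "large.example.ac.uk": 300_000_000,
}

for bucket, count in sorted(histogram(sample).items()):
    print(f"up to {bucket} bytes: {count} host(s)")
```

Power-of-two bands keep the very small and very large hosts visible on the same chart, which a linear scale would not.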
It's worth remembering that this only represents those sites which we can know (in a scalable way) are from the UK, which for the most part means sites with domains ending in .uk. There are various means of determining whether a .com, .org, or .net site falls within the scope of the regulations, but none of them is yet scalable. Best estimates suggest that there may be half as many sites again from the UK which we are not yet capturing.
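The .uk test is the easy part to automate, and might look something like the following. This is a minimal sketch of the suffix check only, not the crawler's actual scoping logic. The function name is hypothetical, and the check deliberately says nothing about UK sites on other top-level domains.

```python
from urllib.parse import urlsplit

def is_uk_domain_url(url):
    """Return True if the URL's host name ends in .uk (suffix check only)."""
    host = urlsplit(url).hostname  # lower-cased by urlsplit; None if absent
    if host is None:
        return False
    return host.rstrip(".").endswith(".uk")

print(is_uk_domain_url("http://www.bl.uk/"))             # host ends in .uk
print(is_uk_domain_url("http://example.com/uk/"))        # .uk only in the path
print(is_uk_domain_url("http://example.uk.com/"))        # .uk is not the suffix
```

A UK site on .com fails this check, which is exactly the gap described above.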
The next stages are to index all the data and then to ingest it into our Digital Library System, tasks which themselves take several weeks. We anticipate the data being available in the reading rooms of the legal deposit libraries at the very end of 2013. We plan a domain crawl at least once a year, and possibly twice if resources allow.