Dispatches from the domain crawl #1
After the blaze of publicity surrounding the advent of Non-Print Legal Deposit, the web archiving team have been busy putting the regulations into practice. This is the first of a series of dispatches from the domain crawl, documenting our discoveries as we begin crawling the whole of the UK web domain for the first time.
Firstly, some numbers. In the first week, we acquired nearly 3.6TB of compressed data (in its raw, uncompressed form, the data is ~40% larger) from some 191 million URLs. Although we staggered the launch as a series of smaller crawls, by the end of the week we reached a sustained rate of 300Mb/s. The bulk of this was from the general crawl of the whole domain, which we kicked off with a list of 3.8 million hostnames.
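For the curious, here is the back-of-the-envelope arithmetic behind those figures, as a small Python sketch (the variable names are ours; all numbers are the approximate ones quoted above):

```python
# Rough arithmetic from the week-one figures quoted above (all approximate).
compressed_tb = 3.6                      # compressed crawl data, in TB
uncompressed_tb = compressed_tb * 1.4    # raw data is roughly 40% larger
urls_fetched = 191_000_000               # URLs acquired in the first week
seed_hostnames = 3_800_000               # hostnames used to seed the crawl

print(f"Uncompressed size: ~{uncompressed_tb:.1f} TB")
print(f"Average compressed size per URL: ~{compressed_tb * 1e12 / urls_fetched / 1024:.0f} KiB")
print(f"Average URLs per hostname: ~{urls_fetched / seed_hostnames:.0f}")
```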
At this stage it is difficult to determine our success rate - that is, how successful we are at harvesting each resource we target. This is partly because the Heritrix crawler takes what might be described as an optimistic approach to deciding which parts of a harvested page are real links to other resources (particularly when parsing JavaScript). As a result, some of the occasions on which Heritrix fails to return a resource are simply cases where there was no real resource to be had.
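To give a flavour of the problem, here is a deliberately naive, hypothetical sketch of speculative link extraction in general (not Heritrix's actual extractor code): treat any slash- or dot-bearing string literal found in JavaScript as a candidate URL.

```python
import re

# Hypothetical, simplified speculative extractor: pull anything that looks
# vaguely path-like out of JavaScript string literals. Real extractors are
# more sophisticated, but face the same fundamental ambiguity.
SPECULATIVE = re.compile(r"""["']([^"'\s]*[./][^"'\s]*)["']""")

js = """
var api  = "/ajax/latest.json";
var tmpl = "/item/" + id + "/view";           // fragments of a URL, not fetchable as-is
var mime = "application/x-shockwave-flash";   // not a link at all
"""

for match in SPECULATIVE.finditer(js):
    print("candidate:", match.group(1))

# All four candidates get queued, but only "/ajax/latest.json" points at a
# real, resolvable resource - the failures on the others are not genuine misses.
```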
At this early stage it is also hard to reliably distinguish between an erroneous response for a real resource that has since disappeared, and an occasion on which access to a real resource was blocked. Over time, we'll learn more about how best to answer some of these questions, which will hopefully start to reveal interesting things about the UK web as a whole.
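One obvious starting point, sketched below purely as an illustration (the status-code mapping is our assumption, not a settled policy), is to bucket failed fetches by HTTP status code. The difficulty is that real servers routinely blur these categories, returning 200 "soft 404" pages for missing content or 403 for pages that are perfectly public.

```python
# A naive first pass at classifying failed fetches by HTTP status code.
# Hypothetical sketch only: in practice the picture is much muddier.
def classify_failure(status_code: int) -> str:
    if status_code in (404, 410):
        return "gone"       # the resource appears to no longer exist
    if status_code in (401, 403):
        return "blocked"    # access to the resource appears to be denied
    if status_code in (429, 503):
        return "throttled"  # the server pushed back; a retry may succeed
    return "unknown"

for code in (404, 403, 503, 200):
    print(code, classify_failure(code))
```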
Roger Coram / Andy Jackson / Peter Webster
Thanks for the details. So that's about 50 URLs per hostname? If you are fishing around for ideas for another post, I would be really interested to hear how you assembled the list of hostnames and how you plan to maintain it over time, although I imagine it is a work in progress. Keep up the good work!