Dispatches from the domain crawl #1
After the blaze of publicity surrounding the advent of Non-Print Legal Deposit, the web archiving team have been busy putting the regulations into practice. This is the first of a series of dispatches from the domain crawl, documenting our discoveries as we begin crawling the whole of the UK web domain for the first time.
Firstly, some numbers. In the first week, we acquired nearly 3.6TB of compressed data (in its raw, uncompressed form, the data is ~40% larger) from some 191 million URLs. Although we staggered the launch as a series of smaller crawls, by the end of the week we reached a sustained rate of 300Mb/s. The bulk of this was from the general crawl of the whole domain, which we kicked off with a list of 3.8 million hostnames.
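A quick back-of-the-envelope check puts these figures in perspective. The numbers below are taken from this post; the 40% expansion factor is the approximate value stated above, and the calculations are purely illustrative:

```python
# Back-of-the-envelope figures from the first week of the crawl.
# All inputs come from the post; the derived values are illustrative.

compressed_tb = 3.6           # compressed data collected in week one (TB)
expansion = 1.4               # raw data is ~40% larger than compressed
urls = 191_000_000            # URLs fetched in the same period

uncompressed_tb = compressed_tb * expansion
avg_bytes_per_url = compressed_tb * 1e12 / urls   # mean compressed size per URL

# What the sustained end-of-week rate of 300 Mbit/s implies per day:
rate_bytes_per_s = 300e6 / 8
tb_per_day = rate_bytes_per_s * 86_400 / 1e12

print(f"~{uncompressed_tb:.1f} TB uncompressed")
print(f"~{avg_bytes_per_url / 1000:.0f} kB compressed per URL on average")
print(f"~{tb_per_day:.1f} TB/day at a sustained 300 Mbit/s")
```

That works out to roughly 5 TB of raw data, a little under 19 kB of compressed data per URL, and a ceiling of just over 3 TB per day at the end-of-week rate.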
At this early stage it is also hard to reliably distinguish between an erroneous response for a linked resource that has genuinely disappeared, and an occasion on which access to a live resource was blocked. Over time, we'll learn more about how best to answer some of these questions, which will hopefully start to reveal interesting things about the UK web as a whole.
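One way to frame the ambiguity is in terms of HTTP status codes: a 404 or 410 suggests a resource that has gone, while a 401 or 403 suggests access was refused, though in practice servers use these inconsistently (many return 200 with an error page, or 403 for pages that simply no longer exist). A minimal sketch of such a first-pass heuristic, entirely illustrative and not the classifier we actually use:

```python
def classify_response(status_code: int) -> str:
    """Illustrative heuristic for triaging crawl errors by HTTP status.

    Real-world servers use status codes inconsistently (e.g. "soft 404s"
    that return 200 alongside an error page), so this is only a first cut.
    """
    if status_code in (404, 410):
        return "missing"        # resource has (apparently) disappeared
    if status_code in (401, 403):
        return "blocked"        # access refused; resource may still exist
    if 500 <= status_code < 600:
        return "server-error"   # transient or permanent server failure
    if 200 <= status_code < 300:
        return "ok"
    return "other"              # redirects and anything else
```

For example, `classify_response(404)` returns `"missing"` while `classify_response(403)` returns `"blocked"`, but as noted above, only deeper inspection over time will tell us how often those labels are actually right.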
Roger Coram / Andy Jackson / Peter Webster