The British Library started the annual collection of the UK Web on the 19th of June. Now that we are one month into a process which may continue for several more, we thought we would look at the set-up and what we have found so far.
Setting up a ‘Crawl’
Fundamentally a crawl consists of two elements: ‘seeds’ and ‘scope’. That is, a series of starting points and decisions as to how far from those starting points we permit the crawler to go. In theory, you could crawl the entire UK Web with a broad enough scope and a single seed. In practice, however, it makes more sense to have as many starting points as possible and to tighten the scope, lest the crawler’s behaviour become unpredictable.
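The seed-plus-scope model can be sketched in a few lines of Python. This is purely illustrative: the hostnames are made up and the scope rule is reduced to "host ends in .uk", whereas the real crawl's rules (implemented in Heritrix) are far richer.

```python
# Minimal sketch of the seed + scope model (illustrative only; not the
# actual crawler configuration).
from urllib.parse import urlparse

SEEDS = ["http://www.example.co.uk/"]  # starting points (hypothetical)

def in_scope(url: str) -> bool:
    """Decide whether the crawler may fetch this URL.

    Here scope is simply 'host ends in .uk'; the real rules also
    consider geolocation, manually approved hosts, and so on."""
    host = urlparse(url).hostname or ""
    return host.endswith(".uk")

frontier = list(SEEDS)      # URLs waiting to be fetched
frontier.pop(0)             # fetch the seed...

# ...then queue discovered links only if they pass the scope test:
for link in ["http://news.example.co.uk/a", "http://example.com/b"]:
    if in_scope(link):
        frontier.append(link)

print(frontier)  # only the .uk link survives
```

With a single seed and a very broad `in_scope`, this loop would eventually reach most of the UK Web; tightening the predicate while multiplying the seeds keeps its behaviour predictable.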
For this most recent crawl the starting seed list consisted of over 19,000,000 hosts. As it is estimated that there are only around 3-4 million active UK websites at this point in time, this might seem an absurdly high figure. The discrepancy arises partly from the difference between what is considered a 'website' and a 'domain': Nominet announced the registration of their 10,000,000th domain in 2012. However, each of those domains may have many subdomains, each serving a different site, which vastly inflates the number.
While attempting to build the seed list for the 2014 domain crawl, we counted the number of subdomains per domain: the most populous had over 700,000.
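A count of that kind is a simple aggregation over the host list. The sketch below uses a handful of invented hostnames and naively treats the last three labels of each name as the registered domain; production code should use a public-suffix list instead, since not every .uk name has a second-level suffix.

```python
from collections import Counter

# Hypothetical sample of crawl hosts; the real seed list held over
# 19 million.
hosts = [
    "www.example.co.uk",
    "blog.example.co.uk",
    "shop.example.co.uk",
    "www.other.org.uk",
]

def registered_domain(host: str) -> str:
    """Naive registered-domain extraction: keep the last three labels
    (e.g. example.co.uk). Real code should consult a public-suffix
    list rather than assume a fixed label count."""
    return ".".join(host.split(".")[-3:])

subdomains_per_domain = Counter(registered_domain(h) for h in hosts)
print(subdomains_per_domain.most_common(1))
# [('example.co.uk', 3)]
```

Run over the full host list, the top entry of this counter is where a figure like "over 700,000 subdomains for one domain" comes from.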
The scope definition is somewhat simpler: Part 3 of The Legal Deposit Libraries (Non-Print Works) Regulations 2013 largely defines what we consider to be 'in scope'. The trick becomes translating this into automated decisions. For instance, the legislation rules that a work is in scope if "activities relating to the creation or the publication of the work take place within the United Kingdom". As a result, one potentially significant change for this crawl was the addition of a geolocation module. With this included, every URL we visit is tagged with both the IP address and the result of a geolocation lookup to determine which country hosts the resource. We will therefore automatically include UK-hosted .com, .biz, etc. sites for the first time.
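The tagging step can be sketched as follows. The lookup table and the injected resolver below are stand-ins so the example runs offline; a real deployment would resolve hosts via DNS and query a GeoIP database (MaxMind's is a common choice), which is an assumption on our part about the module's internals.

```python
import socket

# Hypothetical stand-in for a GeoIP database lookup; maps an IP
# address to an ISO country code.
GEO_DB = {"203.0.113.7": "GB", "198.51.100.9": "US"}

def geolocate(ip: str) -> str:
    return GEO_DB.get(ip, "??")

def tag_host(host: str, resolve=socket.gethostbyname) -> dict:
    """Tag a crawled host with its IP address and geolocated country,
    in the spirit of the crawl's geolocation module."""
    try:
        ip = resolve(host)
    except OSError:
        ip = None
    return {"host": host, "ip": ip,
            "country": geolocate(ip) if ip else None}

# An injected resolver keeps the sketch deterministic and offline:
record = tag_host("example.com", resolve=lambda h: "203.0.113.7")
print(record)  # {'host': 'example.com', 'ip': '203.0.113.7', 'country': 'GB'}
```

A `.com` host whose record comes back tagged "GB" is exactly the case that now falls in scope automatically.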
Currently it seems that the crawlers have visited over 350,000 hosts which do not end in “.uk” but which have content hosted in the UK.
Although we automatically consider in-scope those sites served from the UK, we can include resources from other countries—the policy for which is detailed here—in order to obtain as full a representation of a UK resource as possible. Thus far we have visited 110 different countries over the course of this year’s crawl.
With regard to the number of resources archived from each country, at the top end the UK accounts for more than every other country combined, while towards the bottom of the list we have single resources being downloaded from Botswana and Macao, among others:
1. United Kingdom
2. United States
…
107. Macedonia, Republic of
Curiously, we have discovered significantly fewer instances of malware than we did in the course of our previous domain crawl. Admittedly we are still at a relatively early stage, and those numbers are likely to increase over the course of the crawl. The distribution, however, has remained notably similar: most of the 400+ affected sites host only a single item of malware, while one site alone accounts for almost half of those found.
So far we have archived approximately 10TB of data. The actual volume of data downloaded will likely be significantly higher because, firstly, all stored data are compressed and, secondly, we don’t store duplicate copies of individual resources (see our earlier blog post regarding size estimates).
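Duplicate suppression of this kind is typically done by hashing each downloaded payload and storing it only the first time that digest is seen (in WARC terms, later identical fetches become revisit records). The exact mechanism used here is an assumption; the sketch below shows the idea with SHA-1 digests.

```python
import hashlib

seen_digests = set()     # digests of payloads already stored
stored_bytes = 0         # bytes actually written to the archive
downloaded_bytes = 0     # bytes fetched over the network

def archive(payload: bytes) -> bool:
    """Store the payload only if identical bytes haven't been stored
    before. Returns True when the payload was newly stored."""
    global stored_bytes, downloaded_bytes
    downloaded_bytes += len(payload)
    digest = hashlib.sha1(payload).hexdigest()
    if digest in seen_digests:
        return False     # duplicate: note the revisit, store nothing
    seen_digests.add(digest)
    stored_bytes += len(payload)
    return True

archive(b"<html>front page</html>")
archive(b"<html>front page</html>")   # identical fetch: not stored again
archive(b"<html>about page</html>")
print(stored_bytes, downloaded_bytes)  # stored < downloaded
```

Combined with compression of everything that is stored, this is why the bytes on disk understate the bytes actually downloaded.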
By Roger G. Coram, Web Crawl Engineer, The British Library