Geo-location in the 2014 UK Domain Crawl
In April 2013 The Legal Deposit Libraries (Non-Print Works) Regulations 2013 came into force. Of particular relevance is the section which specifies which parts of that ephemeral place we call the Web are considered to be part of "the UK":
- 18 (1) “…a work published on line shall be treated as published in the United Kingdom if:
- “(b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom.”
In more practical terms, resources are to be considered as being published in the United Kingdom if the server which serves said resources is physically located in the UK. Here we enter the realm of Geolocation.
Heritrix & Geolocation
Geolocation is the practice of determining the "real world" location of something—in our case the whereabouts of a server, given its IP address.
The web-crawler we use, Heritrix, already has many of the features necessary to accomplish this. Among its many DecideRules (a series of rules which ACCEPT or REJECT each URL, together determining whether it is to be downloaded) is the
ExternalGeoLocationDecideRule. This requires:
- A list of ISO 3166-1 country codes to be permitted in the crawl (e.g. GB, FR, DE).
- An implementation of ExternalGeoLookupInterface.

This second requirement is where our own work lies. ExternalGeoLookupInterface is essentially a basic framework on which you must hang your own implementation; ours is based on MaxMind's GeoLite2 database. Freely available under the Creative Commons Attribution-ShareAlike 3.0 Unported License, this is a small database which translates IP addresses (or, more specifically, IP address ranges) into country (or even specific city) locations.
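In practice the lookup simply delegates to MaxMind's reader, but conceptually the database is a sorted table of address ranges, and a lookup is a binary search over their start addresses. A minimal self-contained sketch of that idea (the ranges and the `country_code` helper are invented for illustration, not real GeoLite2 data):

```python
import bisect
import ipaddress

# Hypothetical, hand-made ranges purely for illustration -- the real
# implementation reads MaxMind's GeoLite2 database instead.
RANGES = [
    ("2.24.0.0",   "2.31.255.255", "GB"),
    ("5.39.0.0",   "5.39.127.255", "FR"),
    ("46.16.56.0", "46.16.63.255", "DE"),
]

# Sort ranges by integer start address so we can binary-search them.
STARTS, TABLE = [], []
for start, end, cc in sorted(RANGES, key=lambda r: int(ipaddress.ip_address(r[0]))):
    STARTS.append(int(ipaddress.ip_address(start)))
    TABLE.append((int(ipaddress.ip_address(end)), cc))

def country_code(ip):
    """Return the ISO 3166-1 code for ip, or None if no range matches."""
    n = int(ipaddress.ip_address(ip))
    i = bisect.bisect_right(STARTS, n) - 1  # rightmost range starting <= n
    if i >= 0 and n <= TABLE[i][0]:
        return TABLE[i][1]
    return None
```

Because the whole table lives in memory, each lookup is a single O(log n) search, which is what makes per-URL geolocation cheap enough to run inside a crawl.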
In our Heritrix configuration, this lookup implementation is wired into the ExternalGeoLocationDecideRule as part of the crawl's DecideRule sequence.
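The wiring in crawler-beans.cxml looks something like the sketch below. The lookup bean's class name and the exact property names are illustrative reconstructions, not verbatim configuration:

```xml
<!-- Illustrative only: class and property names may differ by version. -->
<bean id="externalGeoLookup" class="uk.bl.wap.modules.deciderules.ExternalGeoLookup">
  <!-- GeoLite2 database held in Linux shared memory -->
  <property name="database" value="/dev/shm/GeoLite2-City.mmdb"/>
</bean>
<bean id="externalGeoLookupRule"
      class="org.archive.modules.deciderules.ExternalGeoLocationDecideRule">
  <property name="lookup">
    <ref bean="externalGeoLookup"/>
  </property>
  <property name="countryCodes">
    <list>
      <value>GB</value>
    </list>
  </property>
</bean>
```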
The GeoLite2 database itself is, at around 30MB, very small. Part of the beauty of this implementation is that the entire database can be held comfortably in memory: we keep it in Linux's shared memory (/dev/shm), avoiding any disk IO when reading from it.
To test the above we performed a short, shallow test crawl of 1,000,000 seeds. We also enabled a relatively recent addition to Heritrix's DecideRules: a property which logs every scoping decision to a file.
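If memory serves, this is the logToFile property on the DecideRuleSequence bean — again a sketch rather than copy-paste-ready configuration:

```xml
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <!-- When true, every scoping decision is written to scope.log -->
  <property name="logToFile" value="true"/>
  ...
</bean>
```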
During a crawl, this creates a file,
scope.log, containing the final decision for every URI along with the specific rule which made that decision — recording, for example, which URLs were rejected outright and which were ruled in-scope by the ExternalGeoLocationDecideRule.
Parsing the full output from our test crawl, we find:
- 89,500,755 URLs downloaded in total.
- 26,072 URLs which were not on .uk domains (and which would therefore, ordinarily, not have been in scope).
- 137 distinct hosts among those URLs.
2014 Domain Crawl
The process for examining the output of our first Domain Crawl is largely unchanged from the above. The only real difference is the size: the
scope.log file gets very large when dealing with domain-scale data, as it logs not only the decision for every URL downloaded but also for every URL not downloaded (and the reason why).
Filtering scope.log for URLs accepted by the
ExternalGeoLocationDecideRule, and reducing those URLs to their distinct hosts, produces a list of every host ruled in-scope by geolocation (excluding, of course, any
.uk hosts, which are considered in scope by virtue of a different part of the legislation).
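One way to derive that list can be sketched as below — assuming, purely for illustration, that each scope.log line contains whitespace-separated fields including the decision, the deciding rule's name, and the URL last. The sample lines are invented, not real crawl output:

```python
from urllib.parse import urlparse

def in_scope_hosts(lines):
    """Collect distinct hosts ruled in-scope by ExternalGeoLocationDecideRule.

    Assumes (for illustration) whitespace-separated fields containing an
    ACCEPT/REJECT decision, the deciding rule's name, and the URL last.
    """
    hosts = set()
    for line in lines:
        fields = line.split()
        if "ACCEPT" in fields and "ExternalGeoLocationDecideRule" in fields:
            host = urlparse(fields[-1]).hostname
            if host and not host.endswith(".uk"):  # .uk is in scope anyway
                hosts.add(host)
    return sorted(hosts)

# Invented sample lines, not real scope.log output:
sample = [
    "2014-06-20T10:00:01Z ACCEPT ExternalGeoLocationDecideRule http://example.fr/page",
    "2014-06-20T10:00:02Z REJECT TooManyHopsDecideRule http://example.com/",
    "2014-06-20T10:00:03Z ACCEPT OnDomainsDecideRule http://example.co.uk/",
]
print(in_scope_hosts(sample))  # ['example.fr']
```

At domain scale the same idea applies, though the log is processed as a stream rather than held in memory.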
This process yielded a list of 2,544,426 distinct hosts ruled in-scope by geolocation.
By Roger G. Coram, Web Crawl Engineer, The British Library