Geo-location in the 2014 UK Domain Crawl
In April 2013 the Legal Deposit Libraries (Non-Print Works) Regulations 2013 came into force. Of particular relevance is the section which specifies which parts of that ephemeral place we call the Web are considered to be part of "the UK":
- 18 (1) “…a work published on line shall be treated as published in the United Kingdom if:
- “(b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom.”
In more practical terms, resources are to be considered as being published in the United Kingdom if the server which serves said resources is physically located in the UK. Here we enter the realm of Geolocation.
Heritrix & Geolocation
Geolocation is the practice of determining the "real world" location of something—in our case the whereabouts of a server, given its IP address.
The web-crawler we use, Heritrix, already has many of the features necessary to accomplish this. Among its many DecideRules (a series of ACCEPT/REJECT rules which determine whether a URL is to be downloaded) is the ExternalGeoLocationDecideRule. This requires:
- A list of ISO 3166-1 country codes to be permitted in the crawl: GB, FR, DE, etc.
- An implementation of ExternalGeoLookupInterface.
This latter ExternalGeoLookupInterface is where our own work lies. It is essentially a basic framework on which you must hang your own implementation; ours is based on MaxMind's GeoLite2 database. Freely available under the Creative Commons Attribution-ShareAlike 3.0 Unported License, this is a small database which translates IP addresses (or, more specifically, IP address ranges) into country (or even specific city) locations.
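To give a flavour of what such an implementation involves, here is a minimal sketch built on MaxMind's GeoIP2 Java API. The setter name mirrors the "database" property in the bean definition below; the lookup method signature is an assumption for illustration rather than a copy of the Heritrix interface:

```java
import java.io.File;
import java.io.IOException;
import java.net.InetAddress;

import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.exception.GeoIp2Exception;

/**
 * Sketch of a GeoLite2-backed lookup. The real uk.bl.wap.modules.deciderules
 * class implements Heritrix's ExternalGeoLookupInterface; the exact method
 * signature assumed here is illustrative.
 */
public class ExternalGeoLookup {
    private DatabaseReader reader;

    // Called by Spring via the "database" property in the crawl configuration.
    public void setDatabase(String path) throws IOException {
        this.reader = new DatabaseReader.Builder(new File(path)).build();
    }

    // Returns the ISO 3166-1 country code for the given address, or null if
    // the address is not found in the database.
    public String lookup(InetAddress address) {
        try {
            return reader.country(address).getCountry().getIsoCode();
        } catch (IOException | GeoIp2Exception e) {
            return null;
        }
    }
}
```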
Taken from our Heritrix configuration, the below shows how this is included in the crawl:
```xml
<!-- GEO-LOOKUP: specifying location of external database. -->
<bean id="externalGeoLookup" class="uk.bl.wap.modules.deciderules.ExternalGeoLookup">
  <property name="database" value="/dev/shm/geoip-city.mmdb" />
</bean>
<!-- ... ACCEPT those in the UK... -->
<bean id="externalGeoLookupRule" class="org.archive.crawler.modules.deciderules.ExternalGeoLocationDecideRule">
  <property name="lookup">
    <ref bean="externalGeoLookup" />
  </property>
  <property name="countryCodes">
    <list>
      <value>GB</value>
    </list>
  </property>
</bean>
```
The GeoLite2 database itself is, at around 30MB, very small. Part of the beauty of this implementation is that the entire database can be held comfortably in memory. As the above shows, we keep the database in Linux's shared memory, avoiding any disk IO when reading from it.
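Staging the database there is a one-off step before launching the crawl; a sketch, with an assumed source path:

```bash
# Copy the GeoLite2 database into tmpfs-backed shared memory so lookups never
# touch disk; the destination matches the bean's "database" property above.
# (The source path is illustrative.)
cp /opt/geoip/geoip-city.mmdb /dev/shm/geoip-city.mmdb
```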
Testing
To test the above we performed a short, shallow test crawl of 1,000,000 seeds. A relatively recent addition to Heritrix's DecideRules is this property:
```xml
<property name="logToFile" value="true" />
```
During a crawl, this will create a file, scope.log, containing the final decision for every URI along with the specific rule which made that decision. For example:
```
2014-11-05T10:17:39.790Z 4 ExternalGeoLocationDecideRule ACCEPT http://www.jaymoy.com/
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT https://t.co/Sz15mxnvtQ
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT http://twitter.com/2017Hull7
```
So of the above, the latter two URLs were rejected outright, while the first was ruled in-scope by the ExternalGeoLocationDecideRule.
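This format lends itself to quick one-liners; for example, a (hypothetical) tally of decisions per rule might look like:

```bash
# Count scope decisions per DecideRule; in the scope.log format shown above,
# field 3 holds the rule name and field 4 the decision.
awk '{print $4, $3}' scope.log | sort | uniq -c | sort -rn
```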
Parsing the full output from our test crawl, we find:
- 89,500,755 URLs downloaded in total.
- 26,072 URLs which were not on .uk domains (and therefore would, ordinarily, not be in scope).
- Those non-.uk URLs covered 137 distinct hosts.
2014 Domain Crawl
The process for examining the output of our first Domain Crawl is largely unchanged from the above. The only real difference is the size: the scope.log file gets very large when dealing with domain-scale data, as it logs not only the decision for every URL downloaded but for every URL not downloaded (and the reason why).
Here we can use a simple sed command (admittedly distributed via Hadoop Streaming to cope with the scale, as sketched below) to parse the logs' output:
```bash
sed -rn 's@^.+ ExternalGeoLocationDecideRule ACCEPT https?://([^/]+)/.*$@\1@p' scope.log | grep -Ev "\.uk$" | sort -u
```
This will produce a list of all the distinct hosts which were ruled in-scope by the ExternalGeoLocationDecideRule (excluding, of course, any .uk hosts, which are considered in-scope by virtue of a different part of the legislation).
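Purely as an illustration, the Hadoop Streaming equivalent might look like the following; the jar location, HDFS paths, and the sed-script file are all assumptions:

```bash
# extract-hosts.sed holds the same sed expression as above, minus the flags:
#   s@^.+ ExternalGeoLocationDecideRule ACCEPT https?://([^/]+)/.*$@\1@p
#
# MapReduce sorts mapper output and routes identical hosts to the same
# reducer, so "uniq" as the reducer stands in for "sort -u".
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input   /crawls/2014/scope.log \
    -output  /crawls/2014/geo-hosts \
    -file    extract-hosts.sed \
    -mapper  "sed -rnf extract-hosts.sed" \
    -reducer uniq

# Finally, drop .uk hosts from the merged output, as in the pipeline above:
hadoop fs -cat /crawls/2014/geo-hosts/part-* | grep -Ev "\.uk$"
```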
For the 2014 Domain Crawl, this process produced a list of 2,544,426 hosts ruled in-scope by geolocation.
By Roger G. Coram, Web Crawl Engineer, The British Library