UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

3 posts from January 2012

20 January 2012

New project: Analytical Access to the Domain Dark Archive

We're delighted to be working closely with the Institute for Historical Research (IHR) on Analytical Access to the Domain Dark Archive (AADDA), a new JISC-funded project led by the IHR in partnership with the British Library, the University of Cambridge, and King's College London.

AADDA is an 18-month project to enhance the sustainability of a substantial dark archive of UK domain websites collected between 1996 and 2010 by the Internet Archive, copies of which were recently acquired by JISC and are stored at the British Library on its behalf. More details can be found over on the IHR Digital blog.

18 January 2012

The Diamond Jubilee collection: nominations open!

We’re setting up a special collection of websites about The Queen’s Diamond Jubilee – and we want your help! 
 
This year’s Diamond Jubilee celebrates 60 years of The Queen’s reign. It’s a massive event that will be subject to exhaustive coverage, and it seems that nearly everyone has an opinion about it one way or another. It’s only the second time a British monarch has ever celebrated a Diamond Jubilee, and it’s the first time the event will have an online presence, which makes it unique in our history. So, in recognition of this, we’re setting up a special collection of archived websites to capture the Diamond Jubilee online. We’re working with many other organisations and institutions to make the collection as comprehensive as possible, including the Royal Household, the Institute for Historical Research, and the Mass Observation project. Why don’t you join us too?
 
Are you, or is your organisation, involved in a Diamond Jubilee activity? Does it have a presence on the web? Do you want to be part of a lasting legacy, a unique collection of archived websites that will record the celebrations and varying opinions held around the nation? Yes? Then we want you to get involved! Tell us about the resources you think are valuable, and make your nominations for the Diamond Jubilee collection now. Be part of something big.  
 
It’s really important that we get a wide range of perspectives on the Jubilee celebrations, with as many different types of sites as possible. Otherwise, the collection won’t be comprehensive. So whether you’re pro- or anti-royal, writing your own blog or managing your company’s online Jubilee celebrations, running a major series of events or holding a village street party – if you have a web presence, why not let us know? We’d love to hear from you.
 
Nominate a site for the Diamond Jubilee collection

03 January 2012

Techtalk: Wayback & HDFS

To process the large amount of data contained within the web archive we have for some time been using a cluster based on Apache Hadoop. The cluster is used primarily for text extraction (via Apache Tika) and various data analytics via Hadoop's MapReduce framework. It also provides a distributed filesystem, HDFS, where we currently store a copy of the entire archive in WARC format.

With the recent release of Hadoop's WebHDFS it appears that accessing data stored in HDFS via HTTP is becoming commonplace. Earlier in 2011, Cloudera announced the release of Hoop which offers a similar API. Both offer methods to request not only single files but particular blocks of data from within those files. Something we had used in the past for demonstration purposes is Wayback's "RemoteCollection"; in addition to the typically-used "CDXCollection" where indexes and (W)ARCs are local, Wayback offers the facility to request WARC files via HTTP from a remote Wayback instance. 
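As a rough illustration of the kind of HTTP request involved, here is a sketch of a WebHDFS read of a byte range within an HDFS file. The host, port and user are assumptions; the `op=OPEN`, `offset` and `length` parameters follow the WebHDFS REST API, and the path and offsets are taken from the example later in this post:

```python
# Sketch: a WebHDFS "OPEN" request that reads a byte range from a file
# in HDFS over HTTP. The host, port and user.name are illustrative.
from urllib.parse import urlencode

def webhdfs_open_url(host, path, offset, length, user):
    """Return the URL that asks WebHDFS for `length` bytes at `offset`."""
    params = urlencode({
        "op": "OPEN",        # read operation in the WebHDFS REST API
        "user.name": user,
        "offset": offset,    # byte offset within the file
        "length": length,    # number of bytes to return
    })
    return "http://%s/webhdfs/v1%s?%s" % (host, path, params)

url = webhdfs_open_url("namenode:50070",
                       "/data/60588463/60395493/WARCS/BL-60395493-0.warc.gz",
                       216696, 4018, "rcoram")
print(url)
```

A single compressed WARC record can thus be fetched without transferring the (often multi-gigabyte) file that contains it.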

As we currently store a copy of all our WARC files within HDFS, and a WARC record is essentially a block of data within a WARC file, the two technologies seem ideally suited.

Our initial experiments have been done using Hoop but there should be few changes involved to get something similar working with WebHDFS. Within Wayback's RemoteCollection.xml configuration: 

  • A 'resourceIndex' property defines the location of the remote CDX - this is configured as normal to reference another Wayback instance which has its own, local CDX. 
<property name="resourceIndex">
    <bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
        <property name="searchUrlBase" value="http://127.0.0.1/wayback/xmlquery" />
    </bean>
</property>
  • A 'resourceStore' property defines the prefix of the service from which Wayback will request files - this is configured to point to Hoop. 
<property name="resourceStore">
    <bean class="org.archive.wayback.resourcestore.SimpleResourceStore">
        <property name="prefix" value="http://127.0.0.1/hoop?" />
    </bean>
</property>

The local, client-facing Wayback installation queries the remote CDX and receives the results as XML: 

<?xml version="1.0" encoding="utf-8"?>
<wayback>
  <request>
    <startdate>19960101000000</startdate>
    <numreturned>1</numreturned>
    <type>urlquery</type>
    <enddate>20111013132720</enddate>
    <numresults>1</numresults>
    <firstreturned>0</firstreturned>
    <url>civictrustwales.org/ehd_pix/wag/wag_logo.gif</url>
    <resultsrequested>1000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>0</compressedoffset>
      <mimetype>image/gif</mimetype>
      <file>
      /data/60588463/60395493/WARCS/BL-60395493-0.warc.gz?user.name=rcoram&offset=216696&len=4018&bogus=.warc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>civictrustwales.org/ehd_pix/wag/wag_logo.gif</urlkey>
      <closest>true</closest>
      <digest>A6FHZCVHZ3PUBPLZ75FD2W6QMIN7RDPC</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>
      http://www.civictrustwales.org/ehd_pix/wag/wag_logo.gif</url>
      <capturedate>20110630141833</capturedate>
    </result>
  </results>
</wayback>

Ordinarily Wayback receives the name of the (W)ARC file, requests it from the remote server, and seeks to the relevant offset on receipt. More specifically, it appends the value of the <file> element above to the prefix defined in the "resourceStore" configuration and makes the subsequent HTTP request. By replacing this <file> value with the parameters we need to pass to Hoop, we can use Wayback to make the request.

The <file> tag above shows the amendments we have made. Hoop requires the full path in HDFS, plus the offset and length of the required data. Currently Wayback requires that the 'file' being requested ends with "\.w?arc(\.gz)" and, if it finds otherwise, will append ".arc.gz". Adding the 'bogus' parameter avoids this and forces Wayback to expect a WARC record. Note that the offset is also set to zero - we will be receiving a single record rather than a whole file in which we would have to seek.
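To make the suffix trick concrete, here is a sketch. The anchored regex reflects the "ends with" requirement described above (Wayback's internal pattern may differ slightly), and the final concatenation mirrors how the 'resourceStore' prefix and the amended <file> value combine:

```python
import re

# Anchored form of the suffix pattern quoted above - Wayback's internal
# regex may differ, but the "ends with" requirement is the key point.
WARC_SUFFIX = re.compile(r"\.w?arc(\.gz)?$")

hoop_file = ("/data/60588463/60395493/WARCS/BL-60395493-0.warc.gz"
             "?user.name=rcoram&offset=216696&len=4018")

# Without the trick the value ends in "len=4018" and fails the check...
print(bool(WARC_SUFFIX.search(hoop_file)))                      # False
# ...while the throwaway 'bogus' parameter restores the expected suffix.
print(bool(WARC_SUFFIX.search(hoop_file + "&bogus=.warc.gz")))  # True

# The full request is simply the 'resourceStore' prefix plus this value.
full_request = "http://127.0.0.1/hoop?" + hoop_file + "&bogus=.warc.gz"
```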

After making this full request (the 'resourceStore' prefix plus the Hoop data), Hoop returns the WARC record, which Wayback handles as normal and renders to the browser.

* There is an obvious limitation insofar as this requires two running instances of Wayback: one which interacts with Hoop and another which does little more than serve a CDX. Allowing the former to use a local CDX while still requesting remote files would be far simpler.

Roger Coram
Web Archiving Engineer, UK Web Archive