UK Web Archive blog

Introduction

The UK web is one of the most important aspects of the nation’s digital record. But the web is extremely vulnerable, and websites can and do disappear frequently. Preserving them, and providing access to those preserved versions, have become matters of urgency and strategic importance.


07 February 2012

New Collection: Video Games, Gaming Culture and the Impact of Games on Society


Crazy about computer games? Then nominate websites for our new video games collection!

An exciting new collection is underway to preserve information about computer games developed and played in the UK. It will include resources that document gaming culture and the impact that video games have had on wider society.

The collection is being developed by digital curation and preservation colleagues from across the Library, with additional input from staff at the National Videogame Archive. The National Videogame Archive is a collection of hardware, original software, design documents, marketing material and fan-generated ephemera housed within the National Media Museum and managed in partnership with Nottingham Trent and Bath Spa Universities. Some items from the National Videogame Archive are on public display in the Museum’s Games Lounge, an interactive gallery featuring vintage console and arcade games.

The collection will include games (e.g. disk images, executables of remakes) and information about games (e.g. maps, walkthroughs, FAQs). If we don’t capture it now and get it in the archive, then much of it is at real risk of being lost forever. We’re also very interested in collecting resources that discuss the cultural and societal impact of computer games, for example research on the impact of games on children’s development.

So how can you help? We are calling all games designers, players and enthusiasts to suggest the websites which you think should be preserved. These may include online games, forums, enthusiast sites, FAQs/walkthroughs, advertising, emulation software, research/education resources etc. We’re interested in all sorts of games and aim to capture a comprehensive view of computer game development and gaming culture in the late 20th and early 21st centuries.

If you know of any sites that you think should be included, then please let us know by filling in the nominations form. Mark your entry ‘Videogame collection nomination’ in the justification field, as well as entering any other information that might help us to appraise the site. Thanks!

 Stella Wisdom
Digital Curator, The British Library 

20 January 2012

New project: Analytical Access to the Domain Dark Archive


We're delighted to be working closely with the Institute of Historical Research (IHR) on Analytical Access to the Domain Dark Archive (AADDA), a new JISC-funded project led by the IHR in partnership with the British Library, the University of Cambridge, and King's College London.

AADDA is an 18-month project to enhance the sustainability of a substantial dark archive of UK domain websites collected between 1996 and 2010 by the Internet Archive, copies of which were recently acquired by the JISC and are stored at the British Library on their behalf. More details can be found on the IHR Digital blog.

18 January 2012

The Diamond Jubilee collection: nominations open!


We’re setting up a special collection of websites about The Queen’s Diamond Jubilee – and we want your help! 
 
60 years of The Queen's reign will be celebrated in her Diamond Jubilee this year. It’s a massive event that will be subject to exhaustive coverage, and it seems that nearly everyone has an opinion about it one way or another. It’s only the second time a British monarch has ever celebrated a Diamond Jubilee, and it’s the first time that there will be an online representation of the event. This makes it a unique event in our history. So, in recognition of this, we’re setting up a special collection of archived websites to capture the Diamond Jubilee online. We’re working with lots of other organisations and institutions to try and make the collection as comprehensive as possible, including the Royal Household, the Institute of Historical Research, and the Mass Observation project. Why don’t you join us too?
 
Are you, or is your organisation, involved in a Diamond Jubilee activity? Does it have a presence on the web? Do you want to be part of a lasting legacy, a unique collection of archived websites that will record the celebrations and varying opinions held around the nation? Yes? Then we want you to get involved! Tell us about the resources you think are valuable, and make your nominations for the Diamond Jubilee collection now. Be part of something big.  
 
It’s really important that we get a wide range of perspectives on the Jubilee celebrations, with as many different types of sites as possible. Otherwise, the collection won’t be comprehensive. So whether you’re pro- or anti-royal, writing your own blog or managing your company’s online Jubilee celebrations, running a major series of events or holding a village street party – if you have a web presence, why not let us know? We’d love to hear from you.
 
Nominate a site for the Diamond Jubilee collection

03 January 2012

Techtalk: Wayback & HDFS


In order to process the large amount of data contained within the web archive we have been using a cluster based on Apache's Hadoop for some time now. Primarily the cluster is used for text-extraction (via Tika) and various data analytics via Hadoop's MapReduce framework. The Hadoop cluster contains a distributed filesystem provided by HDFS - it is here that we currently store a copy of the entire archive in WARC format. 

With the recent release of Hadoop's WebHDFS it appears that accessing data stored in HDFS via HTTP is becoming commonplace. Earlier in 2011, Cloudera announced the release of Hoop which offers a similar API. Both offer methods to request not only single files but particular blocks of data from within those files. Something we had used in the past for demonstration purposes is Wayback's "RemoteCollection"; in addition to the typically-used "CDXCollection" where indexes and (W)ARCs are local, Wayback offers the facility to request WARC files via HTTP from a remote Wayback instance. 
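To make the idea of requesting a block of data over HTTP concrete, here is a minimal sketch in Python that builds a WebHDFS read URL for a byte range. The `op=OPEN`, `offset` and `length` parameters come from the WebHDFS REST API; the host name, port and file path are placeholders chosen for illustration, not values from our cluster.

```python
from urllib.parse import quote, urlencode


def webhdfs_open_url(host, port, hdfs_path, offset, length, user=None):
    """Build a WebHDFS OPEN URL that reads `length` bytes starting at `offset`."""
    params = {"op": "OPEN", "offset": offset, "length": length}
    if user:
        # WebHDFS accepts the requesting user as a 'user.name' query parameter.
        params["user.name"] = user
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        host, port, quote(hdfs_path), urlencode(params)
    )


# Hypothetical host, port and path, for illustration only.
url = webhdfs_open_url(
    "namenode.example.org", 50070, "/data/example/file.warc.gz", 216696, 4018
)
print(url)
```

Fetching that URL with any HTTP client should return only the requested bytes rather than the whole file.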

As we currently store a copy of all our WARC files within HDFS, and a WARC record is essentially a block of data within a WARC file, the two technologies seem ideally suited. 
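The sketch below illustrates why a byte range is enough to recover one record. It assumes the common convention that each record in a .warc.gz file is compressed as its own gzip member; the two tiny "records" are made up for the example and are not real WARC records.

```python
import gzip

# Two made-up payloads standing in for WARC records.
records = [b"WARC/1.0\r\n\r\nfirst record", b"WARC/1.0\r\n\r\nsecond record"]

# Compress each record separately and concatenate the gzip members,
# which is how a per-record compressed .warc.gz is assumed to be laid out.
members = [gzip.compress(r) for r in records]
warc_gz = b"".join(members)

# The offset and length of the second record within the file,
# i.e. the kind of values an index would hold for it.
offset = len(members[0])
length = len(members[1])

# Slicing out just that byte range yields a complete gzip member
# that can be decompressed without touching the rest of the file.
block = warc_gz[offset:offset + length]
record = gzip.decompress(block)
print(record)
```

A ranged HTTP read from HDFS plays the role of the slice here: the client receives one self-contained compressed record.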

Our initial experiments have been done using Hoop but there should be few changes involved to get something similar working with WebHDFS. Within Wayback's RemoteCollection.xml configuration: 

  • A 'resourceIndex' property defines the location of the remote CDX - this is configured as normal to reference another Wayback instance which has its own, local CDX. 
<property name="resourceIndex">
    <bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
        <property name="searchUrlBase" value="http://127.0.0.1/wayback/xmlquery" />
    </bean>
</property>
  • A 'resourceStore' property defines the prefix of the service from which Wayback will request files - this is configured to point to Hoop. 
<property name="resourceStore">
    <bean class="org.archive.wayback.resourcestore.SimpleResourceStore">
        <property name="prefix" value="http://127.0.0.1/hoop?" />
    </bean>
</property>

The local, client-facing Wayback installation queries the remote CDX and receives the results as XML: 

<?xml version="1.0" encoding="utf-8"?>
<wayback>
  <request>
    <startdate>19960101000000</startdate>
    <numreturned>1</numreturned>
    <type>urlquery</type>
    <enddate>20111013132720</enddate>
    <numresults>1</numresults>
    <firstreturned>0</firstreturned>
    <url>civictrustwales.org/ehd_pix/wag/wag_logo.gif</url>
    <resultsrequested>1000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>0</compressedoffset>
      <mimetype>image/gif</mimetype>
      <file>
      /data/60588463/60395493/WARCS/BL-60395493-0.warc.gz?user.name=rcoram&offset=216696&len=4018&bogus=.warc.gz</file>
      <redirecturl>-</redirecturl>
      <urlkey>civictrustwales.org/ehd_pix/wag/wag_logo.gif</urlkey>
      <closest>true</closest>
      <digest>A6FHZCVHZ3PUBPLZ75FD2W6QMIN7RDPC</digest>
      <httpresponsecode>200</httpresponsecode>
      <url>
      http://www.civictrustwales.org/ehd_pix/wag/wag_logo.gif</url>
      <capturedate>20110630141833</capturedate>
    </result>
  </results>
</wayback>
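For readers who want to inspect such a response outside Wayback, the following Python sketch pulls the HDFS path, offset and length out of a trimmed copy of the result above. It is an illustration only, not part of Wayback. Note one assumption: the ampersands in the <file> value are written as `&amp;` here, since a standard XML parser will reject them unescaped.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlsplit

# Trimmed copy of the response above, with ampersands escaped for the parser.
SAMPLE = """<wayback>
  <results>
    <result>
      <compressedoffset>0</compressedoffset>
      <file>
      /data/60588463/60395493/WARCS/BL-60395493-0.warc.gz?user.name=rcoram&amp;offset=216696&amp;len=4018&amp;bogus=.warc.gz</file>
      <capturedate>20110630141833</capturedate>
    </result>
  </results>
</wayback>"""


def parse_results(xml_text):
    """Return one dict per <result>, splitting <file> into path and parameters."""
    root = ET.fromstring(xml_text)
    parsed = []
    for result in root.iter("result"):
        # The element text is padded with whitespace, so strip it first.
        file_value = result.findtext("file", default="").strip()
        parts = urlsplit(file_value)
        query = parse_qs(parts.query)
        parsed.append({
            "path": parts.path,
            "offset": int(query["offset"][0]),
            "length": int(query["len"][0]),
            "compressedoffset": int(result.findtext("compressedoffset")),
            "capturedate": result.findtext("capturedate"),
        })
    return parsed


results = parse_results(SAMPLE)
print(results[0])
```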

Ordinarily Wayback will receive the name of the (W)ARC file and request this from the remote server and seek to the relevant offset on receipt. More specifically, it appends the value of the <file> element above to the prefix defined in the "resourceStore" above and makes the subsequent HTTP request. By replacing this <file> value with the parameters we need to pass to Hoop we can use Wayback to make the request. 
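The concatenation described above can be sketched in a couple of lines of Python. This is only an illustration of the string handling, using the prefix and <file> value exactly as they appear in this post; `resource_url` is a hypothetical helper, not a Wayback function.

```python
def resource_url(prefix, file_value):
    """Join the 'resourceStore' prefix and the <file> value into one request URL."""
    # The <file> text is padded with whitespace in the XML, so strip it first.
    return prefix + file_value.strip()


prefix = "http://127.0.0.1/hoop?"
file_value = """
      /data/60588463/60395493/WARCS/BL-60395493-0.warc.gz?user.name=rcoram&offset=216696&len=4018&bogus=.warc.gz"""

url = resource_url(prefix, file_value)
print(url)
```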

The <file> tag above shows the amendments we have made. Hoop requires the full path in HDFS, plus the offset and length of the required data. Currently Wayback requires that the 'file' being requested ends with "\.w?arc(\.gz)" and, if it does not, will append ".arc.gz". By adding the 'bogus' parameter we avoid this and force Wayback to expect a WARC record. Note that <compressedoffset> is also set to zero: we will be receiving a single record rather than a whole file in which we have to seek. 
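The effect of the 'bogus' parameter can be shown with a small Python approximation of the check described above. This is not Wayback's actual code: the anchored pattern and the `normalise_name` helper are assumptions written to match the behaviour as described, and the path is a placeholder.

```python
import re

# Approximation of the suffix test: the name must end in .arc or .warc,
# optionally followed by .gz.
ARC_SUFFIX = re.compile(r"\.w?arc(\.gz)?$")


def normalise_name(name):
    """Append '.arc.gz' unless the name already ends like an (W)ARC file."""
    return name if ARC_SUFFIX.search(name) else name + ".arc.gz"


# Without the extra parameter the string ends in 'len=4018',
# so the suffix would be appended and the request would be altered.
plain = "/data/example/file.warc.gz?offset=216696&len=4018"
altered = normalise_name(plain)

# With 'bogus=.warc.gz' as the final parameter the string passes unchanged.
with_bogus = plain + "&bogus=.warc.gz"
unchanged = normalise_name(with_bogus)

print(altered)
print(unchanged)
```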

After making this full request ('resourceStore' prefix + the Hoop data), Hoop returns the WARC record, which Wayback handles as normal and renders to the browser. 

* There is an obvious limitation insofar as this requires two running instances of Wayback: one which interacts with Hoop and another which does little more than serve a CDX. Allowing the former to use a local CDX while still requesting remote files would be far simpler.

Roger Coram
Web Archiving Engineer, UK Web Archive

24 December 2011

Advent Calendar: December 24th


City of Sanctuary

'City of Sanctuary is a movement to build a culture of hospitality for people seeking sanctuary in the UK'.

Archived on: December 24th 2008

Still available on live web? Yes

Archived by: The British Library

Subject classifications: Society & Culture
Society & Culture > Communities
Arts & Humanities > Religion

Special collection? No

Other instances? Yes - 6 in total (2008 - 2011)

23 December 2011

Advent Calendar: December 23rd


Marxists.org Internet Archive

'A volunteer based non-profit organisation, with the purpose of educating people around the world about Marxism'.

Archived on: December 23rd 2005

Still available on live web? Yes

Archived by: The British Library

Subject classifications: Society & Culture
Arts & Humanities > Philosophy & Ethics
Arts & Humanities > History

Special collection? No

Other instances? Yes - 4 in total (2005, 2006, 2011)

 

22 December 2011

21 December 2011