UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

22 August 2012

Visualising the UK Web Domain

The UK Web Archive is a selective archive of Websites chosen and preserved by the British Library and its partners since 2004.

“.uk” is one of the largest country-code top-level domains in the world, with 10 million registrations in March 2012. Selective archiving has many advantages, but it is costly and fails to capture a comprehensive picture of the national domain. The Legal Deposit Libraries in the UK will be able to collect Web resources at scale once the non-print Legal Deposit legislation is in place, expected sometime in 2013.

The benefits of archived Web resources can only be realised when they are actively used for research, learning and teaching. This was the impetus for us to work with the Joint Information Systems Committee (JISC) and the Internet Archive on a collaborative project which extracted a copy of UK Websites from the Internet Archive’s collection. This research dataset, supported by JISC funding, contains Websites crawled between 1996 and 2010 by the Internet Archive and is the largest historical dataset of the UK domain in existence. One of the objectives of the project is to develop visualisations and services that demonstrate how large-scale Web archive collections can be used for analytics, revealing embedded trends and patterns which could not be found by consulting historical copies of Websites individually.

The visualisations and secondary datasets have now been released on the UK Web Archive at http://www.webarchive.org.uk/ukwa/visualisation. The N-gram search is a phrase-usage visualisation tool which charts the monthly occurrence of user-defined search terms or phrases over time, as found in the JISC UK Web domain dataset (1996-2010). The link visualisation shows the relationship between domain suffixes over time. The format profile is a visualisation of the format analysis, summarising the data formats (MIME types) contained within all of the HTTP 200 OK responses. We have also released two downloadable secondary datasets which can be used to develop further applications: a list of MIME types and a postcode index.
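
To give a feel for what the N-gram search charts, here is a minimal sketch in Python of counting how often a phrase occurs per month. It is not the code behind the tool. The function name `monthly_phrase_counts`, the record layout (a 14-digit crawl timestamp paired with the page text) and the sample records are all assumptions made up for illustration.

```python
from collections import Counter


def monthly_phrase_counts(records, phrase):
    """Count case-insensitive occurrences of `phrase` per crawl month.

    `records` is an iterable of (timestamp, text) pairs. The timestamp is
    assumed to be a 14-digit string of the form YYYYMMDDhhmmss.
    """
    needle = phrase.lower()
    counts = Counter()
    for timestamp, text in records:
        month = timestamp[:6]  # keep only the YYYYMM prefix
        counts[month] += text.lower().count(needle)
    # Sort by month so the result can be charted directly as a time series.
    return dict(sorted(counts.items()))


# Hypothetical sample records, invented for this example.
records = [
    ("19990312120000", "Welcome to our web site. This Web Site is new."),
    ("19990320090000", "Contact us by email."),
    ("20050601000000", "Visit our website or the old web site."),
]

print(monthly_phrase_counts(records, "web site"))
```

Plotting these monthly totals for several phrases side by side, for example "web site" against "website", gives the kind of trend line that the N-gram search displays. A real tool would also need to normalise the counts by the amount of text crawled in each month, which this sketch leaves out.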

The JISC has also funded two additional projects which use the JISC UK Web domain dataset (1996-2010) to develop analytical access to large-scale Web archive collections. These are Analytical Access to the Domain Dark Archive and Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research. We are running a joint workshop at the Digital Research 2012 conference: Digital Research Using Web Archives. If you would like to find out more about our projects and Web archiving in general, please come along and join us.
