UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

28 May 2019

FIFA Women’s World Cup and the UK Web Archive

The 2019 FIFA Women’s World Cup will take place in France from 7 June to 7 July 2019. Although women's world cups date back as far as the early 1970s, the FIFA Women’s World Cup was only established in 1991. This is the fifth time that England have qualified for the FIFA Women’s World Cup, but it is a first for Scotland, who join England in Group D of the competition.

Traditionally, women’s sport, and football in particular, has not been well represented in the mainstream media, but this is slowly starting to change. Coverage of events such as the FIFA Women’s World Cup is increasing, and one way to gauge this is to see how many resources on the .uk web were archived. The trend graph on the UK Web Archive Shine interface, which covers all the archived .uk websites from 1996 to April 2013, shows an increase in coverage on the .uk web space for each of the World Cup years. Clicking on a point in the graph brings up a sample of up to 100 websites below it. Four competitions (1999, 2003, 2007 and 2011) were held between 1996 and 2013, with England the only UK country to qualify, in the 2007 and 2011 competitions. It is therefore not surprising that Shine Trends finds just 11 references to “FIFA Women’s World Cup” in 1999, compared with 4,930 in 2011.

FIFA-Shine-01

Link to graph.

The UK Web Archive aims to archive the UK web space. It does this through curating collections and an annual domain crawl, which has run since the Non-Print Legal Deposit Regulations came into force in April 2013. Sport is a popular subject on the web; however, it is a subject area that is underrepresented in many traditional libraries and archives. The UK Web Archive works across the six UK Legal Deposit Libraries and with other external partners to try and bridge gaps in our subject expertise. We have three curated collections related to sport, one of which is dedicated to the many codes of football. These collections don’t differentiate by gender, but the balance between male and female representation in the collections will be skewed due to the lack of gender equality that exists in all parts of society, including the news industry. According to a UNESCO report, ‘only 12 percent of sports news is presented by women worldwide, and only four percent of media content is dedicated to women's sports’.

FIFA Women Image (1)

Mega sporting events like the FIFA Women’s World Cup generate a lot of ephemeral material, both in print and online. The average lifespan of a webpage is around 100 days, and unless it is archived it could disappear forever. Have you spotted any UK-published web content related to England, Scotland, Germany, the USA or the odds-on favourite Japan? Then fill in our Public Nomination Form and it will be added soon after:

Nominate your website.

The only criteria that nominations to the UK Web Archive have to meet are that the content is published from the UK (it doesn’t have to be in English; there are multiple languages in the archive) and that it is not on a predominantly audio-visual platform such as SoundCloud or YouTube. Although social media does fall into scope for Non-Print Legal Deposit, platforms other than Twitter are very difficult to archive, and we haven’t been able to archive Facebook since 2015.

Browse through the UK Web Archive Sports: Football collection and see if we have your local club website or Twitter account, your favourite fan sites and any other football-related content you enjoy viewing. Feel free to nominate your website.

The British Library is currently hosting the (FREE) exhibition: 'An Unsuitable Game for Ladies: A Century of Women's Football' (14 May – 1 September 2019).

by Helena Byrne, Curator of Web Archiving, The British Library

29 March 2019

Collecting Interactive Fiction

Intro
Works of interactive fiction are stories where the reader/player can guide or affect the narrative in some way. This can be through turning to a specific page, as in 'Choose Your Own Adventure' books, or through clicking a link or typing text in digital works.

Archiving Interactive Fiction
Attempts to archive UK-made interactive fiction began with an exploration of the affordances of two different tools: the British Library’s own ACT (Annotation Curation Tool) and Rhizome’s Webrecorder. ACT is a system which interfaces with the Internet Archive’s Heritrix crawl engine to provide large-scale captures of the UK web, while Webrecorder focusses on much smaller-scale, higher-fidelity captures which include video, audio and other multimedia content. All types of interactive fiction (parser, hypertext, choice-based and multimodal) were tested with both ACT and Webrecorder in order to determine which tools were best suited to which types of content. It should be noted that this project is experimental and ongoing; as a result, all assertions and suggestions made here are provisional and will not necessarily affect or influence Library collection policy or the final collection. As yet, Webrecorder files do not form part of standard Library collections.

Cat_Simulator

For most parser-based works (those made with Inform 7), Webrecorder appears to work best. Obtaining captures in Webrecorder is generally more time-consuming than in ACT, as each page element has to be clicked manually (or at least the top-level page in each branch must be visited) in order to create a fully replayable record. However, this is not the case with most Inform 7 works: for the vast majority, visiting the title page and pressing the space bar was sufficient to capture the entire work. The works are then fully replayable in the capture, with users able to type any valid commands in any order. ACT failed to capture most parser works, but there were some successes. For example, Elizabeth Smyth’s Inform 7 game 1k Cupid was fully replayable in ACT, while Robin Johnson’s custom-made Aunts and Butlers also retained full functionality. Unfortunately, games made with Quest failed to capture with either tool.

Another form which appears to be currently unarchivable is works which make use of live data such as location information, maps or other online resources. Matt Bryden’s Poetry Map failed to capture in ACT, and in Webrecorder, although the poems themselves were retained, the background maps were lost. Similarly, Kate Pullinger’s Breathe was recorded successfully with Webrecorder, but naturally only the default text, rather than the adaptive, location-based information, is present. Archiving alternative resources such as blogs describing the works may be necessary for these pieces until another solution is found. However, even where these works don’t capture as intended, running them through ACT may still have benefits. A functional version of J.R. Carpenter’s This Is A Picture of Wind, which makes use of live wind data, could not be captured, but crawling it obtained a sample thumbnail which indicates how the poems display in the live version – something which would not have been possible using Webrecorder alone.

Choice-based works made with Ink generally captured well with ACT, although Isak Grozny’s dripping with the waters of SHEOL required Webrecorder. This could be due to the dynamic menus, the use of JavaScript, or because autorun has been enabled on itch.io, all of which can prevent ACT from crawling effectively. ChoiceScript games were difficult to capture with either tool for various reasons. Firstly, those which are paywalled could not be captured. Secondly, the manner in which the files are hosted appears to affect capture: when hosted as a folder of individual files rather than as a single compiled HTML file, the works could only be captured with Webrecorder’s Firefox emulator, and even then the page crashes frequently. Those which had been compiled appeared to capture equally well with either tool.

Twine works generally capture reasonably well with ACT. ACT is probably the best choice for larger Twines in particular, as capturing a large number of branches quickly becomes extremely time-consuming in Webrecorder. Works which rely on images and video to tell their story, such as Chris Godber’s Glitch, however, retain a greater degree of their functionality if recorded in Webrecorder. As that game is somewhat sprawling, a route was planned through it which would give a good idea of the game’s flavour while avoiding excessively long capture times. Webrecorder also contains an emulator of an older version of Firefox which is compatible with older JavaScript functions and Flash. This allowed for the archiving of works which would otherwise have failed to capture, such as Emma Winston’s Cat Simulator 3000 and Daniel Goodbrey’s Icarus Needs.

As alluded to above, using the two tools in tandem is probably the best way to ensure these digital works of fiction are not lost. However, creators are advised to archive their own work too, either by nominating web pages to the UKWA, capturing content with Webrecorder, or saving pages with the Internet Archive’s Wayback Machine.

By Lynda Clark, Innovation Placement, The British Library - @notagoth

21 March 2019

Save UK Published Google + Accounts Now!

The fragility of social media data was highlighted recently when Myspace accidentally deleted users’ audio and video files without warning. This almost certainly resulted in the loss of many unique and original pieces of work. It is another example of why online social media platforms should not be seen as archives: if things are important to you, they should also be stored elsewhere. The UK Web Archive can play a role in this and we do what we can to preserve websites and selected social media. We do, however, need your help!

Google+
If you have a Google+ account you will have seen the warning that the service is shutting down on 2 April 2019 and that users should download any data they want to save by 31 March 2019.

However, it’s not easy to know how to preserve data from social media accounts, and sometimes this information, without the context of the platform it was hosted on, doesn’t give the full picture. In a previous blog post we outlined the challenges involved in archiving social media. Currently the most popular social media platform in the UK Web Archive is Twitter, followed by Facebook, which we haven’t been able to successfully capture since 2015, and a limited amount of Instagram, Weibo, WeChat and Google+.

Under the 2013 Non-Print Legal Deposit Regulations we can legally only collect digital content published in the UK. As these platforms are hosted outside the UK there is no automated way to identify UK accounts, so a person has to look through and identify the profiles to be added. In general, these are profiles of politicians, public figures, people renowned in their field of study, campaign groups and institutions.

So far, we only have a handful of Google+ profiles in the UK Web Archive but we are keen to have more.

How to save your Google+ data
If you have a Google+ profile or know of other profiles published in the UK that you think should be preserved, fill in our nomination form before 29 March 2019: https://www.webarchive.org.uk/en/ukwa/info/nominate

If the profiles you want to archive are published outside the UK, you can use the 'save a website now' function on the Internet Archive website: https://archive.org/web/

By Helena Byrne, Curator of Web Archiving, The British Library

02 January 2019

Extracting Place Names from Web Archives at Archives Unleashed Vancouver

By Gethin Rees, Lead Curator of Digital Mapping, The British Library

I recently attended the Archives Unleashed hackathon in Vancouver. The fantastic Archives Unleashed project aims to help scholars research the recent past by using big data from web archives. The project organises a series of datathons where researchers collaboratively work with web archive collections over the course of two days. The participants divide into small teams with the aim of producing a piece of research using the archives that they can present at the end of the event and compete for a prize. One of the most important tools that we used in the datathon was the Archives Unleashed Toolkit (AUT).

Archives-Unleashed-project

The team I was on chose to use a dataset documenting a series of wildfires in British Columbia in 2017 and 2018 (ubc-bc-wildfires). I came to the datathon with an interest in visualising web archive data geographically: place names or toponyms contained in the text from web pages would form the core of such a visualisation. I had little experience of natural language processing before the datathon but, keen to improve my Python skills, I decided to take on the challenge in the true spirit of unleashing archives!

My plan to produce such a visualisation consisted of several steps:

1) Pre-process the web archive data (Clean)
2) Extract named entities from the text (NER)
3) Determine which are place names (Geoparse)
4) Add coordinates to place names (Geocode)
5) Visualise the place names (Map)

This blog post is concerned primarily with steps 2 and 3.
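As a rough illustration of how these steps might fit together, here is a minimal Python sketch. The function names, the regex stand-in for the NER step and the two-entry gazetteer are purely illustrative assumptions, and step 5 (mapping) is omitted; this is not the code used at the datathon.

# A minimal, hypothetical sketch of the five-step pipeline above.
import re

GAZETTEER = {  # a real run would use a full gazetteer such as GeoNames
    "Vancouver": (49.2827, -123.1207),
    "British Columbia": (53.7267, -127.6476),
}

def clean(text):
    """Step 1: strip control characters and collapse whitespace."""
    text = re.sub(r"[\x00-\x1f]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def named_entities(text):
    """Step 2: stand-in for a real NER step (Stanford NER, NLTK, spaCy...)."""
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)

def geoparse(entities):
    """Step 3: keep only the entities that appear in the gazetteer."""
    return [e for e in entities if e in GAZETTEER]

def geocode(places):
    """Step 4: attach coordinates from the gazetteer."""
    return {p: GAZETTEER[p] for p in places}

sample = "Wildfires\x1f were reported near Vancouver and across British Columbia."
print(geocode(geoparse(named_entities(clean(sample)))))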

An important lesson from the datathon for me is that Web Archive data are very messy. In order to get decent results from steps 2 and 3 it is important to really clean the data as thoroughly as possible. Luckily, the AUT contains several methods that can help to do this (outlined here). The analyses that follow were all run on the output of the AUT ‘Plain text minus boilerplate’ method.

There is a wealth of options available to achieve steps 2 and 3; the discussion that follows does not aim to be exhaustive but to evaluate the methods that we attempted in the datathon.

AUT NER

The first method we attempted was to use the AUT NER method (discussed here). The AUT does a great job of packaging up the Stanford Named Entity Recognizer for easy use with a simple Scala command. We ran the method on the AUT derivative of the 2017 section of our Wildfires dataset (around 300 MB) using the powerful virtual machines that were helpfully provided by the organisers. However, we found it difficult to get results as the analysis took a long time and often crashed the virtual machine. These problems persisted even when running the NER method on a small subset of the Wildfires dataset, making it difficult to use even on a smallish set of WARCs.

The results came back in the following format:

    (20170809,dns:www.nytimes.com,{"PERSON":[],"ORGANIZATION":[],"LOCATION":[]})

This output required further processing with a simple Python script.
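The script itself is not reproduced here, but a rough sketch of the kind of post-processing involved might look like the following, assuming the output lines follow the format shown above (the input file name is hypothetical):

import json
from collections import Counter

def parse_ner_line(line):
    """Split a '(date,domain,{...})' line into its three parts."""
    body = line.strip().lstrip("(").rstrip(")")
    date, domain, entities_json = body.split(",", 2)
    return date, domain, json.loads(entities_json)

location_counts = Counter()
with open("ner_output.txt") as results:   # hypothetical output file
    for line in results:
        date, domain, entities = parse_ner_line(line)
        location_counts.update(entities.get("LOCATION", []))

print(location_counts.most_common(20))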

When we did obtain results, the "LOCATION" arrays seemed to contain only a fraction of the total place names that appeared in the text.

AUT
- Positives: Simple to execute, tailored to web archive data
- Negatives: Time consuming, processor intensive, output requires processing, not all locations returned

Geoparser

So we next turned our attention to the Edinburgh Geoparser and the excellent accompanying tutorial that I have used to great effect on other projects. Unfortunately the analysis resulted in several errors which prevented the Geoparser from returning results, and during the time available in the datathon we were not able to resolve them. The Geoparser appeared unable to deal with the output of AUT’s ‘Plain text minus boilerplate’ method. I attempted other methods to clean the data, including changing the encoding and removing control characters. The following Python commands:

import re

# Read the plain-text derivative, dropping any UTF-8 byte-order mark.
s = open('9196-fulltext.txt', mode='r', encoding='utf-8-sig').read()
# re.sub and rstrip return new strings, so assign the results back.
s = re.sub(r'[\x00-\x1F]+', '', s)
s = s.rstrip()

removed these errors:

Error: Input error: Illegal character <0x1f> immediately before file offset 6307408
in unnamed entity at line 2169 char 1 of <stream>
Error: Expected whitespace or tag end in start tag
in unnamed entity at line 4 char 6 of <stream>

However the following error remained which we could not fix even after breaking the text into small chunks:

Error: Document ends too soon
in unnamed entity at line 1 char 1 of <stream>

I would be grateful for any input about how to overcome this error, as I would love to use the Geoparser to extract place names from WARC files in the future.

Geoparser
- Positives: well-documented, powerful software. Fairly easy to use. Excellent results with OCR or plain text.
- Negatives: didn’t seem to deal well with the scale and/or messiness of web archive data.

NLTK

My final attempt to extract place names involved using the Python NLTK library with the 'averaged_perceptron_tagger', 'maxent_ne_chunker' and 'words' packages. The initial aim was to extract the named entities from the text. A preliminary script designed to achieve this can be found here.

This extraction does not separate place names from other named entities such as proper nouns, and therefore a second stage involved checking whether the entities returned by NLTK were present in a gazetteer. We found a suitable gazetteer with a wealth of different information, and in the final hours of the datathon I attempted to hack together something to match the NER results with the gazetteer, along the lines of the sketch below.
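In outline, the two stages might look something like this. It is a hedged sketch rather than the datathon script: the three-entry gazetteer and the sample sentence are invented for illustration, and the NLTK package names are those listed above, plus 'punkt' for tokenisation.

import nltk

# Download the resources the tokeniser, tagger and chunker rely on.
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

GAZETTEER = {"Vancouver", "Kamloops", "Williams Lake"}   # toy gazetteer

def named_entities(text):
    """Stage 1: return the named entities NLTK's chunker finds in the text."""
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree if isinstance(subtree, nltk.Tree)]

def place_names(text):
    """Stage 2: keep only the entities that appear in the gazetteer."""
    return [entity for entity in named_entities(text) if entity in GAZETTEER]

print(place_names("Fires spread from Williams Lake towards Kamloops, said John Smith."))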

Unfortunately I ran out of time both to write the necessary code and to run the script over the dataset. The script badly needs improvement using dataframes and other optimisation. Notwithstanding its preliminary nature, it is clear that this method of extracting place names is slow. The quality of results is also highly dependent on the quality and size of the gazetteer. Only place names found within the gazetteer will be extracted and therefore, if the gazetteer is biased or deficient in some way, the resulting output will be skewed. Furthermore, as the gazetteer becomes larger, the extraction of place names will become painfully slow.

The method described replicates the functionality of geoparser tools yet is a little more flexible, allowing the participant to account for the idiosyncrasies of web archive data such as unusual characters.

NLTK
- positives: flexibility, works
- negatives: slow, reliant on the gazetteer, requires python skills


Concluding Remarks

Despite the travails that I have outlined, my teammates, adopting a non-programmatic approach, came up with this brilliant map by doing some nifty things with a gazetteer, Voyant Tools and QGIS.

Voyant-map

From a programmatic perspective it appears that there is still work required to develop a method to extract place names from web archive data at scale, particularly in the hectic and fast-paced environment of a datathon. The main challenge is the messiness of the data, with many tools throwing errors that were difficult to rectify. In terms of future datathons, speed of analysis and implementation is a critical consideration, as datathons aim to deal with big data in a short amount of time. Of course, the preceding discussion has hardly considered the quality of the information output by the tools. This is another essential consideration and requires further work. Another future direction would be to examine other tools such as spaCy, Polyglot and NER-Tagger, as described in this article.

 

20 December 2018

The UK Web Archive gets a fresh look

Until recently, if you wanted to research a historic UK website you may have had to look in a number of different places. There was the 'Open' UK Web Archive that contained the 15,000 or so publicly available websites collected since 2005. If you also wanted to check the vast 'Legal Deposit' web archive (containing the whole UK web space) then you would need to travel to the reading room of a UK Legal Deposit Library to see if what you needed was there. For the first time, the new UKWA website offers:

  • The ability to search the entire collection in one place
  • The opportunity to browse over 100 curated collections on a wide range of topics

Home

www.webarchive.org.uk

Who is the UK Web Archive?
UKWA is a partnership of all the UK Legal Deposit Libraries - The British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries Oxford, Cambridge University Libraries and Trinity College Dublin. The Legal Deposit Web Archive is available in the reading rooms of all these Libraries. A reader's pass for each library is required to gain access to a reading room.

How much is available now?
At the time of writing, everything that a human (curators and collaborators) has selected since 2005 is searchable. This constitutes many thousands of websites and millions of individual web pages. We will be adding the huge yearly Legal Deposit collections over the coming year - we'll let you know as they become available.

Among the many websites available are the BBC and many newspaper websites such as The Sun, The Daily Mail and The Guardian.

Do the websites look and work as they did originally?
Yes and no. Every effort is made so that websites look how they did originally, and internal links should work. However, due to a variety of technical issues many websites will look different or some elements may be missing. As a minimum, all of the text in the collection is searchable and most images should be there. Whilst we collect a considerable amount of video, much of this will not play back.

Is every UK website available?
We aim to collect every website made or owned by a UK resident; however, in reality it is extremely difficult to be comprehensive! Our annual Legal Deposit collections include every .uk (and .london, .scot, .wales and .cymru) website, plus any website on a server located in the UK. Of course, many websites are .com, .info etc. and on servers in other countries.
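As a toy illustration of the domain-name side of that scoping rule (the hostname check only; the server-location test would need something like a GeoIP lookup, which is not shown here), a sketch in Python might be:

# Hostnames ending in one of the UK-administered suffixes are in scope.
UK_SUFFIXES = (".uk", ".london", ".scot", ".wales", ".cymru")

def in_scope_by_domain(hostname):
    return hostname.lower().rstrip(".").endswith(UK_SUFFIXES)

print(in_scope_by_domain("www.bl.uk"))      # True
print(in_scope_by_domain("example.com"))    # False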

If you have or know of a UK website that should be in the archive, we encourage you to nominate it here.

Keep in touch by following us on Twitter.

By Jason Webber, Web Archive Engagement Manager, The British Library

27 November 2018

World Digital Preservation Day 2018

 

Z1

World Digital Preservation Day (formerly International Digital Preservation Day) is held on the last Thursday of November each year and is organised by the Digital Preservation Coalition (DPC). Lots of events organised around the world will take place on 29 November 2018.

The screenshot below shows how popular the term “Digital Preservation” was on the archived .uk web space from 1996 to April 2013. Follow this link and see who is talking about digital preservation by clicking on a point in the graph.

Z2

 

Web archiving is an important part of the digital preservation process, as web content is constantly changing over time, some of it more quickly than the rest. The UK Web Archive aims to archive, preserve and give access to the UK web space. The image below shows how the British Library website has changed over time.

  Z3

 

The UK Web Archive has been building curated collections since 2005 and has a dedicated collection on the subject of IT. The IT Collection has eleven subsections, one of which is dedicated to Web Archives and Digital Preservation. This is a small subsection, so if you see that it is missing something important to the UK field of web archiving and digital preservation, you can nominate content by filling in the public nomination form.

As part of your World Digital Preservation Day, why not nominate your favourite UK published website on any subject to the UK Web Archive by filling in the public nomination form?

28 September 2018

Sports Collections in the UK Web Archive

By Helena Byrne, Web Archive Curator, The British Library

The 30th September is National Sporting Heritage Day in the UK and to celebrate the event in 2018 we will give you a quick overview of our sports collections. 

Introduction
Sport studies give us a real insight into popular culture and political issues of the time; however, it is a subject area that has often been underrepresented in many traditional libraries and archives. The UK Web Archive works across the six UK Legal Deposit Libraries and with other external partners to try and bridge gaps in our subject expertise.

UKWA Sports Collections
We currently have three collections that focus on sport:

  1. Sport: Football
  2. Sports Collection
  3. Sports: International Events

Shine - Football Graph

Trend graph on SHINE

Sport: Football
Football in all its varieties is probably the most popular sport in the UK, which is why there is a collection dedicated exclusively to football and related activities. There are many subsections to the Rugby and Soccer strand of the collection which can be viewed by clicking on the information box.

Sport Football Collection


Sports Collection
The general collection on sports has been broken down into subsections based on the type of sport rather than a specific sport title like tennis or snooker. These subject headings were based on the Universal Decimal Classification page about sport (from PD 1000 – 2003 UDC Abridged Edition). We used this general taxonomy of sports so that the collection can easily adapt to new sporting trends that emerge in the future. The Ball Sports section excludes football, as there is already a dedicated collection on this subject. Ball Sports is probably the most versatile section and has an additional five subsections:

  1. By Hand
  2. On a Table
  3. With Club
  4. With Racket (Racquet)
  5. With a Stick

Sports Collection

Sports: International Events
Our third main collection covers international sporting events. Currently there are six subsections in this collection:

  1. Olympic & Paralympic Games 2012
  2. Commonwealth Games Glasgow 2014
  3. Tour De France (Yorkshire)
  4. Winter Olympics Sochi 2014
  5. Rugby World Cup 2015
  6. Rio Olympics 2016


The decision to build collections on international sporting events is dependent on staff resource and staff subject knowledge of these events. Going forward we would like to build collections around the major sporting events hosted in the UK, but this is not always easy or possible. A major challenge around collecting on international events is that many of the web publishers are not based in the UK and do not always set up a UK website for the event. We archive content under the Non-Print Legal Deposit Regulations 2013, which means we are not able to automatically scope in content published outside the UK.

Access and Reuse
Under the Non-Print Legal Deposit Regulations 2013, access to archived content is restricted to UK Legal Deposit Library reading rooms. However, if we have permission from the website owner we can make the archived version of their content open access, along with government publications under the Open Government Licence. This is why, if you browse through the collections on the beta version of our website, most of the links to archived content will direct you to one of the UK Legal Deposit Libraries for access, while some of the content can be viewed from your personal device.

The UK Web Archive can be used just like many other primary resources whether it be a magazine or a newsletter and the same copyright regulations apply. The web has been in use for nearly 30 years and the publication The Web as History gives an outline of how researchers from different disciplines interact with web and web archive content. Some of the datasets used in this publication are available for reuse from: data.webarchive.org.uk/opendata/

International Internet Preservation Consortium (IIPC)
As individual institutions, the British Library and the National Library of Scotland are members of the International Internet Preservation Consortium (IIPC) and have worked on building collaborative collections covering international events such as the Summer and Winter Olympic/Paralympic Games. Since the formation of the IIPC Content Development Group (CDG) in 2015, there has been a concerted effort to build collections both on and off the playing field. The British Library took the lead curatorial role for the 2016 Summer Olympic and Paralympic Games and the 2018 Winter Olympic and Paralympic Games; all of the IIPC collections are open access.

Get Involved
The UK Web Archive aims to archive, preserve and give access to the entire UK web space. 

If you see content that should be included in one of our sports collections then please fill in our online nomination form.
Alternatively, if you would like to get more hands-on with curating a collection then get in touch.

 

27 September 2018

Web Archives: A Tool for Geographical Research?

By Emmanouil Tranos and Christoph Stich, University of Birmingham

Introduction
If you are a quantitative social scientist, there are few things more fascinating than free, under-utilised, quirky and easy-to-download data that also fit well with the narrative of 'big data'.

Combine the above characteristics with data that have the potential to support researchers in answering interesting research questions, and you will make a researcher happy! And this is exactly what the JISC UK Web Domain Dataset held by the UK Web Archive is all about.

A detailed description of the data can be found here, but briefly, this is a subset of the Internet Archive that includes all the archived webpages under the .uk Top Level Domain (TLD), together with their archival timestamps, for the period January 1996 to March 2013. The UK Web Archive partnered with the Internet Archive and JISC to create this unique dataset, which enables researchers to easily access probably the largest national archive of webpages.

The UK web space has several unique characteristics
Apart from the fact that the UK was an early adopter of internet technologies and applications, the UK web space also includes some widely recognisable second-level domain names such as .co.uk and .ac.uk. While the former (mainly) denotes commercial activities based in the UK, similar to the .com top-level domain, the latter is used for UK universities. Moreover, the English language makes the UK web space more accessible to the rest of the world.

How is this dataset useful?
The JISC UK Web Domain Dataset is an easy way to access the Internet Archive data. It is, in essence, a long list of strings (i.e. groups of characters) that include the archival timestamp and the original URL of each archived webpage.

For instance, the first numerical part of the line below indicates when the contact page of the uk.eurogate.co.uk website was archived (9 May 2008 at 16:21:38).

20080509162138/http://uk.eurogate.co.uk/contact_us IG8 8HD

With the use of these strings a researcher can retrieve the HTML documents of the archived webpages from the Internet Archive API. The UK Web Archive further processed this data and created a subset of the archived UK webpages that includes all the .uk webpages that contain a UK postcode.

In the above example, the last element indicates that this specific webpage contains the postcode IG8 8HD.

This dataset, which is known as the Geoindex and can be downloaded from here, is probably one of the largest open data sets of georeferenced digital content.
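As a rough sketch of how one line of the Geoindex might be used, the snippet below pulls the sample line apart and builds a replay URL. The assumptions here are that the line is parsed by hand and that the archived page is fetched through the standard Wayback Machine replay URL pattern (https://web.archive.org/web/<timestamp>/<url>) rather than through a dedicated API client.

import urllib.request

def parse_geoindex_line(line):
    """Split a 'timestamp/url postcode' line into its three parts."""
    timestamp, rest = line.split("/", 1)
    url, postcode = rest.split(" ", 1)   # the URL itself contains no spaces
    return timestamp, url, postcode

def wayback_url(timestamp, url):
    """Build a Wayback Machine replay URL for the snapshot."""
    return f"https://web.archive.org/web/{timestamp}/{url}"

def fetch_html(replay_url):
    """Download the archived HTML (requires network access)."""
    return urllib.request.urlopen(replay_url).read().decode("utf-8", "replace")

timestamp, url, postcode = parse_geoindex_line(
    "20080509162138/http://uk.eurogate.co.uk/contact_us IG8 8HD")
print(postcode, wayback_url(timestamp, url))
# fetch_html(wayback_url(timestamp, url)) would then download the archived page.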

Challenges
There are, however, a number of technical and conceptual challenges attached to the usage of these data. For instance, there is a debate in the literature regarding how much of the web is currently archived (e.g. Hale et al., 2017). Although there is some critique regarding the depth of the archival process (i.e. how many webpages from each website are archived), the Internet Archive is the most extensive digital archive (Holzmann et al., 2016; Ainsworth et al., 2011).

Moreover, the volume of the data requires some upfront investment regarding data analysis skills, but is still doable with some standard off-the-shelf libraries and tools (e.g. Python or R).

Results
After filtering out invalid postcodes, we are left with a dataset that contains about 5.8 million pairs of British postcodes and domain names.
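The filtering step might look something like the sketch below; the regular expression is a commonly used simplified pattern for UK postcodes rather than the full official specification, and the sample rows are invented for illustration.

import re
from urllib.parse import urlparse

# Simplified UK postcode pattern: outward code, optional space, inward code.
POSTCODE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]?\s*[0-9][A-Z]{2}$")

rows = [
    ("20080509162138", "http://uk.eurogate.co.uk/contact_us", "IG8 8HD"),
    ("20120101000000", "http://example.co.uk/about", "NOT A CODE"),
]

pairs = {
    (urlparse(url).netloc, postcode)
    for _, url, postcode in rows
    if POSTCODE.match(postcode.strip().upper())
}
print(pairs)   # {('uk.eurogate.co.uk', 'IG8 8HD')}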

As one can see in the plot, the number of domains that reference a postcode grows relatively rapidly in the decade between 1995 and 2005 before growth levels off. The distribution of domains also more or less aligns with the population density of the UK. This is a good indicator that the collected data capture actual activity in the UK.

Domains_2012

Unsurprisingly the data also reveal a difference between London and the rest of the country. The number of domains that reference a postcode per inhabitant grew faster in London than in other places, but eventually the rest of the country caught up with London. There are, however, quite significant differences in how the domains are distributed within London as well.

London_dpt


So, what research questions can these data help us answer? Utilising funding from the ESRC and the Consumer Data Research Centre (CDRC) we employed this data to explore the evolution of the digital economy in the UK. Firstly, we are utilising this data in order to understand whether the availability of online content attracts individuals online. We do that by employing unique survey data available from CDRC.

Hypothesis
Our underlying hypothesis is that the availability of internet content of local interest can attract people online in order to access and take advantage of the potential on-line opportunities such as accessing local products and services. The first results seem to support our hypothesis.

Secondly, we are using this data to explore the economic activities (e.g. products and services offered by firms) that take place in some of the UK digital clusters. By filtering the data to focus only on archived web pages from specific clusters in the UK, and by utilising the textual data available from the archived HTML documents, we are building topic models to reveal what type of economic activities exist in these clusters and how these activities have evolved over time.
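A minimal sketch of that topic-modelling step is given below; the four short documents stand in for text extracted from archived HTML pages of a cluster, and the choice of scikit-learn and of two topics is illustrative only, not a description of our actual models.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-ins for boilerplate-stripped text from archived pages in one cluster.
docs = [
    "software development web design consultancy services",
    "digital marketing agency search engine optimisation",
    "bespoke software engineering and data analytics",
    "branding campaigns social media advertising",
]

vectoriser = CountVectorizer(stop_words="english")
dtm = vectoriser.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

terms = vectoriser.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}:", ", ".join(top_terms))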

We are testing how this archived web data can help us learn more about economic activities and how they have evolved over time. We are also comparing the outputs of this analysis with official industrial classifications from various sources, including freely available data from the CDRC.

Lastly, together with colleagues from City-REDI, we are using the archived web data as a proxy to understand the early adoption of web technologies in the UK. Building upon arguments developed in evolutionary economics, the early adoption of web technologies may signify innovative regions which developed 'digital capacity' early enough, something which may affect their future growth trajectories. The first results indicate that indeed the early adoption of web technologies is related to positive future growth trajectories.

To close, we believe that our on-going research, apart from answering substantive geographical research questions, will also illustrate the value of archived web data for geographical research. It is one of the few available data sources that can provide longitudinal georeferenced data, which also includes a wealth of unstructured textual data.

The latter can also reveal patterns and activities that other more 'conventional' data sources would not have been able to uncover.

References
Ainsworth, S. G., Alsum, A., SalahEldeen, H., Weigle, M. C., & Nelson, M. L. (2011). How much of the web is archived? Paper presented at the Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries.

Hale, S. A., Blank, G., & Alexander, V. D. (2017). Live versus archive: Comparing a web archive to a population of web pages. In N. Brügger & R. Schroeder (Eds.), Web as History: Using Web Archives to Understand the Past and the Present (pp. 45-61). London: UCL Press.

Holzmann, H., Nejdl, W., & Anand, A. (2016). The Dawn of today's popular domains: A study of the archived German Web over 18 years. Paper presented at the Digital Libraries (JCDL), 2016 IEEE/ACM Joint Conference.