Using open data to visualise the early web
[Andy Jackson, web archiving technical lead at the British Library, on what the UK web looked like in 1996, and on teaching machines to classify websites.]
At the end of May, I attended the BL Labs hackathon event, and was able to spend some time talking to students and researchers who are interested in exploring our collections. Those conversations were just the prompt I needed to improve the UK Web Archive Open Data website: it became clear that the documentation needed improvement, but also that we had even more data to offer than I had first realised.
On the documentation front, I was finally able to spend some time documenting the UK Host-Level Link Graph (1996-2010) dataset, released earlier this year. After we publicised the updated dataset, it attracted immediate interest from a developer of large-scale graph visualisation tools, which led to this excellent visualisation of the 1996 portion of the dataset:
Although further analysis is required to identify all of the clusters and relationships, even this unlabelled overview illustrates an important aspect of the web archive. The dots around the edge of the graph are individual hosts in the UK domain that link to few other UK hosts, and are consequently completely disconnected from the main graph in the centre. This implies that, to archive the UK web domain completely, we cannot limit ourselves to crawling known UK hosts. This data from the Internet Archive's global crawls shows that there is a significant number of sites we will only find if we venture out into the global web.
It would be wonderful to see more detailed analysis of this network, and of how the network changes over time. However, even this 1996 slice of the dataset contained some 58,842 hosts (nodes) and 184,433 host-to-host links (edges). The later years contain even more hosts and links, and analysing and visualising such a large link graph remains challenging.
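As a rough sketch of the kind of analysis involved, assuming the link graph is distributed as a plain edge list of `source target` host pairs (the file format and all hostnames below are invented for illustration), a standard graph library can already pick out the disconnected hosts described above:

```python
import networkx as nx

# Hypothetical miniature edge list in the assumed "source target" format;
# the real 1996 slice has some 58,842 hosts and 184,433 host-to-host links.
edges = [
    ("www.bl.uk", "portico.bl.uk"),
    ("portico.bl.uk", "www.bl.uk"),
    ("www.example.ac.uk", "www.bl.uk"),
]
# A host seen in the crawl that has no links to or from other UK hosts.
isolated_hosts = ["lonely.example.co.uk"]

g = nx.DiGraph()
g.add_edges_from(edges)
g.add_nodes_from(isolated_hosts)

# Weakly connected components separate the big central cluster from the
# disconnected dots around the edge of the visualisation.
components = sorted(nx.weakly_connected_components(g), key=len, reverse=True)
print(len(components))         # number of separate clusters: 2
print(sorted(components[-1]))  # smallest cluster: ['lonely.example.co.uk']
```

In-memory tools like this cope with tens of thousands of nodes, but scaling the same analysis to the later, larger years of the dataset is exactly where it becomes challenging.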
A number of machine learning students also attended the BL Labs event, and talking to them revealed a particular interest in our selective UK Web Archive. We have been building up this permissions-based archive since 2004, manually classifying each web resource into a two-level subject hierarchy. We have long been aware that, in principle, this manually curated dataset could be used to 'train' a machine learning system to classify resources automatically. This could help us explore large-scale domain crawls, where we no longer have enough manual effort available to classify millions of sites by hand.
At present, we have neither the time nor the expertise to exploit this approach to web archive analysis. That said, talking to the BL Labs attendees made me realise that they might be able to help, needing only a relatively simple dataset to get started. Based on their suggestions, we created a simple Website Classification Dataset for the Selective Archive, listing the subject classification and title for each URL in the set. Early indications were that even this very limited amount of information might be enough to distinguish which top-level classification(s) a site belongs to. By providing a little more information drawn from the text of each site's pages (say, the 100 most popular keywords from each), it might well be possible to build a very useful ground-truth training set for powerful machine classification systems.
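To make the idea concrete, here is a minimal sketch of the kind of classifier the students might build from titles alone, using scikit-learn. The titles, labels, and dataset shape are all invented for illustration; they are not the real Selective Archive data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented (title, top-level classification) pairs standing in for the
# real Website Classification Dataset.
titles = [
    "Fixtures and results for the county cricket season",
    "Premier League football news and match reports",
    "General election candidates and constituency results",
    "Parliamentary debates and government policy updates",
]
labels = ["Sport", "Sport", "Politics", "Politics"]

# TF-IDF features over title words, fed into a simple linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(titles, labels)

# Predict the top-level classification for an unseen title.
prediction = model.predict(["Live football scores and league tables"])[0]
print(prediction)
```

A real system would need far more data and richer features (the page-text keywords mentioned above, for instance), but even this toy pipeline shows how little machinery is needed to start testing whether titles carry a useful classification signal.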
We're always keen to investigate more options for exploiting the data and metadata in our archives. If you have any requests for datasets you'd like us to make available, please comment below, or get in touch.