UK Web Archive blog: July 2013

[A guest post from Ian Cooke, lead curator for international studies and politics at the British Library, and curator of the current exhibition Propaganda: Power and Persuasion]

If you’ve visited our summer exhibition, Propaganda: Power and Persuasion, you will have seen our “Chorus” installation. Positioned on a large wall at the end of our exhibition, it displays a huge set of archived tweets that relate to three recent events (the Olympics opening ceremony, the debate on gun control in the United States, and President Obama’s ‘Four more years’, which became the most re-tweeted message). In our exhibition, we’re interested in the impact which social media is having on communicating and challenging influence from state and other powerful institutions.

There are different ways of looking at this. A simplification of one argument runs something like this: social media, through enabling access on an equal footing to the same shared public space, is a democratising tool that allows challenge to other forms of influence. People can respond to and question statements that appear dubious, and put across their own point of view. If propaganda is about narrowing the space for debate, then social media provides a powerful means to open it up. Additionally, the new technologies provide freely-available tools by which communities and grass-roots campaigns can network and co-ordinate action to powerful effect. I attended last year’s Netroots UK conference, where Sue Marsh gave an inspirational talk on digital activism and challenges to perceptions and prejudices used in the debate on cuts to welfare benefits for long-term sick and disabled people.

However, some would offer a challenge to the view of social media as always empowering. The vast proliferation of information produced, and the speed at which it is received – so that events or messages are commented on immediately – mean that it becomes very hard to check sources and accuracy. Misleading information, or just a point of view put strongly, can be repeated and run unchallenged. In some cases, authority and authorship can be hard to trace. Further, some would argue that new communications technologies allow new opportunities for misdirection in political campaigning. One example is so-called “astro-turfing”, where an apparently local and popular campaign has in fact been set up and co-ordinated by a centralised and well-resourced body. Such activities existed long before social media, but these new technologies create powerful new ways to both disguise and professionalise the role of the campaigner.

Over the past year, I had the opportunity to create a small collection of websites for the UK Web Archive as part of the Library’s ‘Curators Choice’ programme. This was a great opportunity to start exploring some of these issues, under the heading Political Action and Communication. The collection is more concerned with exploring the interpretation of new media as empowering and democratising, although some of the sites included, such as WhoFundsYou?, are concerned with issues of transparency on the web.

In the collection you’ll find examples of websites set up to support specific campaigns, or organised around specific issues, such as the national and local Frack-Off campaigns against the use of hydraulic fracturing (“fracking”) to extract shale gas from rock. The Occupy protests in London early in 2012 are represented through the Bank of Ideas, which was hosted in disused UBS offices in Hackney, and the Occupied Times of London.

There are also examples of charities and companies that support other organisations in online campaigning. These include FairSay, Social Spark, and Hands Up. All these offer advice, web design and other new media support for charities and campaigning organisations. The Sheila McKechnie Foundation uses its own website and Campaign Central directory to offer support and resources for grass-roots campaigning (on and off-line) around Britain.

I was also interested in the way that blogging is used in campaigning and political commentary. There are examples of individual blogs including Guido Fawkes and Never Seconds. Co-authored blogs can change the style of discussion by bringing in a wider range of viewpoints. Some present views from one political perspective, such as Left Foot Forward. Others attempt to represent a wider spectrum of debate, such as Speaker’s Chair. The latter is particularly interesting in light of criticisms of political communication on the web, which argue that debate quickly polarises as people essentially only read and follow people with whom they already agree.

One area of campaigning that I specifically left out in this collection was party political campaigning during general elections. This is of course a huge area and presents its own challenges for web archiving, as sites are often live for only a short period. The UK Web Archive has however collected websites for the 2010 general election and 2005 general election, as well as the 2009 European parliamentary elections. You can also see more examples of campaign websites and political communication in our collection on the impact of the 2010 public spending cuts.

My thanks go to everyone who supported the Political Action and Communication collection: to those who suggested sites, and to those who agreed to have their websites archived. All the archived websites included here can be viewed from anywhere, and that of course requires permission from owners of websites – who are often busy running or supporting campaigns. As you’ll see, this is a collection that I’m just getting started with, so I need to find more examples to explore further. If you have a site to suggest, would like to comment on the collection, or have found the collection useful, then I’d love to hear from you.

[Propaganda: Power and Persuasion runs at the British Library in St Pancras until 17 September. Ian may be contacted by email at ian dot cooke at bl dot uk]

[Andy Jackson, web archiving technical lead at the British Library, on what the UK web looked like in 1996, and on teaching machines to classify websites.]

At the end of May, I attended the BL Labs hackathon event, and was able to spend some time talking to students and researchers who are interested in exploring our collections. Those conversations were just the prompt I needed to improve the UK Web Archive Open Data website, as it became clear not only that the documentation needed improvement, but also that we had more data to offer than I had first realised.

In terms of documentation, I was finally able to spend some time documenting the UK Host-Level Link Graph (1996-2010) dataset, released earlier this year. After publicising this updated dataset, there was some immediate interest from someone developing large-scale graph visualisation tools, which led to this excellent visualisation of the 1996 portion of the dataset:

Although further analysis is required to identify all of the clusters and relationships, this unlabelled overview immediately illustrates an important aspect of the web archive. The dots around the edge of the graph indicate individual hosts that are in the UK domain, but are not connected to many other hosts in the UK domain, and are completely disconnected from the main graph in the centre as a result. This implies that, in order to completely archive the UK web domain, we cannot limit ourselves only to the exploration of known UK hosts. This data from the Internet Archive's global crawls shows that there are a significant number of sites that we will only find if we venture out into the global web.
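For readers who want to try this kind of analysis themselves, the idea of hosts that are "disconnected from the main graph" can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used to produce the visualisation: it assumes the link graph has already been reduced to a list of (source host, target host) pairs, ignores link direction, and uses made-up host names under example.co.uk rather than anything from the real dataset.

```python
from collections import defaultdict, deque


def connected_components(edges, hosts=()):
    """Group hosts into components, treating links as undirected.

    `edges` is an iterable of (source_host, target_host) pairs.
    `hosts` optionally lists hosts that have no links at all.
    Returns a list of sets, largest component first.
    """
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    for host in hosts:
        neighbours.setdefault(host, set())  # keep isolated hosts in the graph

    seen = set()
    components = []
    for start in list(neighbours):
        if start in seen:
            continue
        # Breadth-first search from an unvisited host collects one component.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            host = queue.popleft()
            component.add(host)
            for nxt in neighbours[host]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)

    components.sort(key=len, reverse=True)
    return components


# Hypothetical host names, invented purely for illustration.
edges = [
    ("a.example.co.uk", "b.example.co.uk"),
    ("b.example.co.uk", "c.example.co.uk"),
    ("c.example.co.uk", "a.example.co.uk"),
    ("d.example.co.uk", "e.example.co.uk"),
]
components = connected_components(edges, hosts=["f.example.co.uk"])
main, *outliers = components
print(len(main), "hosts in the main cluster;", len(outliers), "disconnected groups")
```

In this toy example the first three hosts form the central cluster, while the remaining hosts would appear as dots around the edge of the picture. A crawl that only followed links from the central cluster would never reach them.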

It would be wonderful to see more detailed analysis of this network, and of how the network changes over time. However, even this 1996 slice of the dataset contained some 58,842 hosts (nodes) and 184,433 host-to-host links (edges). The later years contain even more hosts and links, and analysing and visualising such a large link graph remains challenging.

A number of machine learning students also attended the BL Labs event, and talking to them revealed a particular interest in our selective UK Web Archive. We have been working since 2004 on building up this permissions-based archive, with manual classification of those web resources into a two-level subject hierarchy. We have been aware that it should be possible, in principle, to use this manually curated dataset to 'train' a machine learning system, so that it might be able to automatically classify resources. This might help us better to explore large-scale domain crawls, where we do not have enough staff effort available to classify millions of sites by hand.

At present, we have neither the time nor the expertise to exploit this possible approach to web archive analysis. That said, I realised while talking to the BL Labs attendees that they might be able to help us do that, needing only a relatively simple dataset to get started. Based on their suggestions, we created a simple Website Classification Dataset for the Selective Archive, listing the subject classification and title for each URL in the set. Early indications were that even this very limited amount of information may be enough to distinguish which top-level classification(s) a site belongs to. By providing a bit more information based on the text of the site's pages (the 100 most popular keywords from each, say) it might well be possible to provide a very useful ground truth training set that can be used to create powerful machine classification systems.
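To give a flavour of what such a system might look like, here is a minimal sketch of a naive Bayes classifier that guesses a top-level subject from a site title alone. It is an assumption-laden illustration rather than anything we have built: the technique is just one common choice for text classification, the rows are assumed to be (title, subject) pairs, and the titles and subject names below are invented, not taken from the Website Classification Dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class TitleClassifier:
    """Multinomial naive Bayes over title words, with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # titles seen per subject
        self.word_counts = defaultdict(Counter)  # word frequencies per subject
        self.vocabulary = set()

    def train(self, rows):
        for title, subject in rows:
            self.class_counts[subject] += 1
            for word in tokenize(title):
                self.word_counts[subject][word] += 1
                self.vocabulary.add(word)

    def predict(self, title):
        total_titles = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_subject, best_score = None, -math.inf
        for subject, title_count in self.class_counts.items():
            # Start from the log prior, then add a log likelihood per word.
            score = math.log(title_count / total_titles)
            total_words = sum(self.word_counts[subject].values())
            for word in tokenize(title):
                if word not in self.vocabulary:
                    continue  # words never seen in training carry no evidence
                count = self.word_counts[subject][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_subject, best_score = subject, score
        return best_subject


# Invented training rows: (site title, top-level subject).
rows = [
    ("Campaign for local election reform", "Politics"),
    ("Party manifesto and election news", "Politics"),
    ("Gallery of contemporary painting", "Arts"),
    ("Theatre and painting festival", "Arts"),
]
classifier = TitleClassifier()
classifier.train(rows)
print(classifier.predict("Election campaign blog"))
print(classifier.predict("Painting gallery"))
```

The same structure would extend naturally to richer input: the most frequent keywords from each site's pages could be passed in alongside the title, with no change to the scoring.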

We're always keen to investigate more options for exploiting the data and metadata in our archives. If you have any requests for datasets you'd like us to make available, please comment below, or get in touch.


Propaganda, political communication and action on the web

Using open data to visualise the early web