Introduction

The UK web is one of the most important aspects of the nation’s digital record. But the web is extremely vulnerable, and websites can and do disappear frequently. Preserving them, and providing access to those preserved versions, have become matters of urgency and strategic importance.

16 September 2013

Crawling the UK web domain

After the initial flurry of publicity surrounding the final advent of Non-Print Legal Deposit in April, we in the web archiving team at the British Library began the job of actually getting on with part of that new responsibility: that is, routinely archiving the whole of the UK web domain. This is happening in partnership with the other five legal deposit libraries for the UK: the National Library of Wales, the National Library of Scotland, Cambridge University Library, the Bodleian Libraries of the University of Oxford, and Trinity College Dublin.

We blogged back in April about how we were getting on, having captured 3.6TB of compressed data from some 191 million URIs in the first week alone.

Now, we're finished. After a staggered start on April 8th, the crawl ended on June 21st, just short of eleven weeks later. Having started off with a list of 3.8 million seeds, we eventually captured over 31TB of compressed data. At its fastest, a single crawler was visiting 857 URIs per second.
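
As a rough check on these figures, the arithmetic can be sketched in a few lines of Python. The dates and the 31TB total are taken from the post; treating 1TB as 10^12 bytes, and assuming the crawl ran continuously between the two dates, are simplifications made purely for illustration.

```python
from datetime import date

# Dates and total volume as reported in the post.
start = date(2013, 4, 8)
end = date(2013, 6, 21)
total_bytes = 31 * 10**12  # assumption: 1 TB = 10**12 bytes

days = (end - start).days
weeks = days / 7

# Assumption: the crawl ran continuously. The staggered start means
# the true average rate would differ somewhat.
avg_bytes_per_second = total_bytes / (days * 24 * 60 * 60)

print(f"{days} days, about {weeks:.1f} weeks")
print(f"average of about {avg_bytes_per_second / 10**6:.1f} MB/s of compressed data")
```

The result is an average over the whole period, so it says nothing about peak rates such as the 857 URIs per second quoted above.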

There is of course a great deal of fascinating research that could be done on this dataset, and we'd be interested in suggestions of the kinds of questions we ought to ask of it. For now, there are some interesting views we can take of the data. For example, here is the number of hosts plotted against the total volume of data.

2013 domain crawl: data volumes and hosts

This initial graphing would suggest there are a great many domains that are very small in size indeed; more than 200,000 domains yield only 64B, a minuscule amount of data. These could be sites that return no content at all, or that are redirections to elsewhere, or that "park" domains. At the other end of the scale, there are perhaps c.50,000 domains that return 256MB of data or more.
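
A distribution of this kind could be produced along the following lines. This is a minimal sketch, assuming that each host's total volume is rounded up to the next power of two; the actual bucketing used for the graph is not stated here, and the host names and byte counts are invented sample data.

```python
from collections import Counter


def size_bucket(num_bytes: int) -> int:
    """Return the smallest power of two that is >= num_bytes (minimum 1)."""
    if num_bytes <= 1:
        return 1
    return 1 << (num_bytes - 1).bit_length()


def volume_histogram(bytes_per_host: dict[str, int]) -> Counter:
    """Count how many hosts fall into each power-of-two size bucket."""
    return Counter(size_bucket(n) for n in bytes_per_host.values())


# Hypothetical sample data, not figures from the crawl.
sample = {
    "parked.example.uk": 64,
    "redirect.example.uk": 60,
    "small.example.uk": 5_000,
    "large.example.uk": 300 * 2**20,
}

for bucket, hosts in sorted(volume_histogram(sample).items()):
    print(f"up to {bucket} bytes: {hosts} host(s)")
```

Power-of-two buckets are convenient when volumes range from a few bytes to hundreds of megabytes, since each step on the axis then represents a doubling.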

It's worth remembering that this only represents those sites which we can know (in a scaleable way) are from the UK, which for the most part means sites with domains ending in .uk. There are various means of determining whether a .com, .org, or .net site falls within the scope of the regulations, none of which are yet scaleable; and so best estimates suggest that there may be half as many sites again from the UK which we are not yet capturing.

The next stages are to index all the data and then to ingest it into our Digital Library System, tasks which themselves take several weeks. We anticipate the data being available in the reading rooms of the legal deposit libraries at the very end of 2013. We plan a domain crawl at least once a year, and possibly twice if resources allow.

Posted by Peter Webster at 10:05 AM

04 September 2013

Scaling up to archive the UK web

The non-print legal deposit legislation became effective on 6 April 2013, which has fundamentally changed the way we archive the UK web. We are now allowed to collect many more websites, enabling us to preserve the nation’s digital heritage at scale, in partnership with the other five legal deposit libraries for the UK (LDLs).

You may have noticed that not much new content has been added to the UK Web Archive recently. But we have been busy behind the scenes - crawling billions of URLs, establishing new workflows and adapting our tools. The archived websites are being made available in LDL reading rooms and some of them will also be added to the open UK Web Archive as we progress.

Our strategy consists of a mixed collection model, allowing periodical crawls of the UK web in its entirety coupled with prioritisation of the parts which are deemed curatorially important by the six LDLs. These will then receive greater attention in curation and quality checking. The components of the collection model are:

the annual / biannual domain crawl, intended to capture the UK domain as comprehensively as possible, providing the overview and the “big picture”;
key sites - those representing UK organisations and individuals which are of general interest in a particular sector of the life of the UK and/or its constituent nations;
news websites, containing news published frequently on the web by journalistic organisations; and
events-based collections, which will capture political, cultural, social and economic events of national interest.

Broad collection framework under non-print legal deposit

The legal deposit regulations allow us to archive in this way on the proviso that users may only access the archived material itself from premises controlled by one of the six LDLs. However, we are also working to provide greater access to high-level data and analytics about the archive, and we will also be seeking permission from website owners to provide online access to selected websites in the UK Web Archive.

Look out for blog posts about the collection based on the reform of the NHS in England and Wales, and our first broad UK domain crawl.

Helen Hockx-Yu is head of web archiving at the British Library

Posted by Peter Webster at 4:21 PM

31 July 2013

Propaganda, political communication and action on the web

[A guest post from Ian Cooke, lead curator for international studies and politics at the British Library, and curator of the current exhibition Propaganda: Power and Persuasion]

If you’ve visited our summer exhibition, Propaganda: Power and Persuasion, you will have seen our “Chorus” installation. Positioned on a large wall at the end of our exhibition, it displays a huge set of archived tweets that relate to three recent events (the Olympics opening ceremony, the debate on gun control in the United States, and President Obama’s ‘Four more years’, which became the most re-tweeted message). In our exhibition, we’re interested in the impact which social media is having on communicating and challenging influence from state and other powerful institutions.

There are different ways of looking at this. A simplification of one argument runs something like this: social media, through enabling access on an equal footing to the same shared public space, is a democratising tool that allows challenge to other forms of influence. People can respond to and question statements that appear dubious, and put across their own point of view. If propaganda is about narrowing the space for debate, then social media provides a powerful means to open it up. Additionally, the new technologies provide freely-available tools by which communities and grass-roots campaigns can network and co-ordinate action to powerful effect. I attended last year’s Netroots UK conference, where Sue Marsh gave an inspirational talk on digital activism and challenges to perceptions and prejudices used in the debate on cuts to welfare benefits for long-term sick and disabled people.

However, some would offer a challenge to the view of social media as always empowering. The vast proliferation of information produced, and the speed by which it is received – so that events or messages are commented on immediately – means that it becomes very hard to check sources and accuracy. Misleading information, or just a point of view put strongly, can be repeated and run unchallenged. In some cases, authority and authorship can be hard to trace. Further, some would argue that new communications technologies allow new opportunities for misdirection in political campaigning. One example is so-called “astro-turfing”, where an apparently local and popular campaign has in fact been set up and co-ordinated by a centralised and well-resourced body. Such activities have existed long before social media, but these new technologies create powerful new ways to both disguise and professionalise the role of the campaigner.

Over the past year, I had the opportunity to create a small collection of websites for the UK Web Archive as part of the Library’s ‘Curators Choice’ programme. This was a great opportunity to start exploring some of these issues, under the heading Political Action and Communication. The collection is more concerned with exploring the interpretation of new media as empowering and democratising, although some sites included, such as WhoFundsYou? are concerned with issues of transparency on the web.

In the collection you’ll find examples of websites set up to support specific campaigns, or organised around specific issues, such as the national and local Frack-Off campaigns against the use of hydraulic fracturing (“fracking”) to extract shale gas from rock. The Occupy protests in London early in 2012 are represented through the Bank of Ideas, which was hosted in disused UBS offices in Hackney, and the Occupied Times of London.

There are also examples of charities and companies that support other organisations in online campaigning. These include FairSay, Social Spark, and Hands Up. All these offer advice, web design and other new media support for charities and campaigning organisations. The Sheila McKechnie Foundation uses its own website and Campaign Central directory to offer support and resources for grass-roots campaigning (on and off-line) around Britain.

I was also interested in the way that blogging is used in campaigning and political commentary. There are examples of individual blogs including Guido Fawkes and Never Seconds. Co-authored blogs can change the style of discussion by bringing in a wider range of viewpoints. Some present views from one political perspective, such as Left Foot Forward. Others attempt to represent a wider spectrum of debate, such as Speaker’s Chair. The latter is particularly interesting in light of criticisms of political communication on the web, which argue that debate quickly polarises as people essentially only read and follow people with whom they already agree.

One area of campaigning that I specifically left out in this collection was party political campaigning during general elections. This is of course a huge area and presents its own challenges for web archiving, as sites are often live for only a short period. The UK Web Archive has however collected websites for the 2010 general election and 2005 general election, as well as the 2009 European parliamentary elections. You can also see more examples of campaign websites and political communication in our collection on the impact of the 2010 public spending cuts.

My thanks go to everyone who supported the Political Action and Communication collection, those who suggested sites and to those who agreed to have their websites archived. All the archived websites included here can be viewed from anywhere, and that of course requires permission from owners of websites – who are often busy running or supporting campaigns. As you’ll see this is a collection that I’m just getting started with, so I need to find more examples to explore further. If you have a site to suggest, would like to comment on the collection, or have found the collection useful, then I’d love to hear from you.

[Propaganda: Power and Persuasion runs at the British Library in St Pancras until 17 September. Ian may be contacted by email at ian dot cooke at bl dot uk ]

Posted by Peter Webster at 10:27 AM

Tags

Collections, Weblogs

03 July 2013

Using open data to visualise the early web

[Andy Jackson, web archiving technical lead at the British Library, on what the UK web looked like in 1996, and on teaching machines to classify websites.]

At the end of May, I attended the BL Labs hackathon event, and was able to spend some time talking to students and researchers who are interested in exploring our collections. Those conversations were just the prompt I needed to improve the UK Web Archive Open Data website, as it became clear that the documentation needed some improvement, but also that we had even more data to offer than I understood at first.

In terms of documentation, I was finally able to spend some time documenting the UK Host-Level Link Graph (1996-2010) dataset, released earlier this year. After publicising this updated dataset, there was some immediate interest from someone developing large scale graph visualisation tools, which led to this excellent visualisation of the 1996 portion of the dataset:

Although further analysis is required to identify all of the clusters and relationships, this unlabelled overview immediately illustrates an important aspect of the web archive. The dots around the edge of the graph indicate individual hosts that are in the UK domain, but are not connected to many other hosts in the UK domain, and are completely disconnected from the main graph in the centre as a result. This implies that, in order to completely archive the UK web domain, we cannot limit ourselves only to the exploration of known UK hosts. This data from the Internet Archive's global crawls shows that there are a significant number of sites that we will only find if we venture out into the global web.

It would be wonderful to see more detailed analysis of this network, and of how the network changes over time. However, even this 1996 slice of the dataset contained some 58,842 hosts (nodes) and 184,433 host-to-host links (edges). The later years contain even more hosts and links, and analysing and visualising such a large link graph remains challenging.
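
One simple starting point for that kind of analysis is to separate the main cluster from the hosts that are disconnected from it. The following is a minimal sketch, assuming the link graph has been loaded as a list of hosts and a list of (source, target) pairs; that layout and the host names are hypothetical and do not reflect the published dataset's actual file format.

```python
from collections import defaultdict, deque


def connected_components(hosts, links):
    """Group hosts into components, treating links as undirected."""
    neighbours = defaultdict(set)
    for source, target in links:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen = set()
    components = []
    for host in hosts:
        if host in seen:
            continue
        # Breadth-first search from each host not yet visited.
        component = set()
        queue = deque([host])
        seen.add(host)
        while queue:
            current = queue.popleft()
            component.add(current)
            for nxt in neighbours[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    # Largest component first.
    return sorted(components, key=len, reverse=True)


# Hypothetical sample data.
hosts = ["a.example.uk", "b.example.uk", "c.example.uk",
         "d.example.uk", "e.example.uk", "f.example.uk"]
links = [("a.example.uk", "b.example.uk"),
         ("b.example.uk", "c.example.uk"),
         ("c.example.uk", "a.example.uk"),
         ("d.example.uk", "e.example.uk")]

components = connected_components(hosts, links)
main = components[0]
outside = sorted(h for h in hosts if h not in main)
print(f"main cluster: {len(main)} hosts; outside it: {outside}")
```

A breadth-first search of this kind visits each host and link once, so it should remain practical at the scale of the 1996 slice, although the later years may call for dedicated graph tooling.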

A number of machine learning students also attended the BL Labs event, and talking to them revealed a particular interest in our selective UK Web Archive. We have been working since 2004 on building up this permissions-based archive, with manual classification of those web resources into a two-level subject hierarchy. We have been aware that it should be possible, in principle, to use this manually curated dataset to 'train' a machine learning system, so that it might be able to automatically classify resources. This might help us better to explore large scale domain crawls, where we do not have enough staff effort available to classify millions of sites by hand.

At present, we have neither the time nor the expertise to exploit this possible approach to web archive analysis. That said, I realised while talking to the BL Labs attendees that they might be able to help us do that, needing only a relatively simple dataset to get started. Based on their suggestions, we created a simple Website Classification Dataset for the Selective Archive, listing the subject classification and title for each URL in the set. Early indications were that even this very limited amount of information may be enough to distinguish which top-level classification(s) a site belongs to. By providing a bit more information based on the text of the site's pages (the 100 most popular keywords from each, say) it might well be possible to provide a very useful ground truth training set that can be used to create powerful machine classification systems.
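
To give a flavour of what training on titles alone might look like, here is a minimal sketch of a naive Bayes classifier written with only the Python standard library. It is an illustration under stated assumptions and not a method used by the web archiving team: the titles and the subject labels "Science" and "Politics" are invented, and the real dataset's fields and categories may differ.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(title):
    """Split a title into lower-case alphabetic words."""
    return re.findall(r"[a-z]+", title.lower())


def train(examples):
    """Build word and class counts from (title, subject) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocabulary = set()
    for title, subject in examples:
        class_counts[subject] += 1
        for word in tokenize(title):
            word_counts[subject][word] += 1
            vocabulary.add(word)
    return word_counts, class_counts, vocabulary


def classify(title, model):
    """Return the subject with the highest log-probability for a title."""
    word_counts, class_counts, vocabulary = model
    total = sum(class_counts.values())
    best_subject, best_score = None, -math.inf
    for subject, count in class_counts.items():
        score = math.log(count / total)
        denominator = sum(word_counts[subject].values()) + len(vocabulary)
        for word in tokenize(title):
            # Add-one smoothing so unseen words do not zero out a subject.
            score += math.log((word_counts[subject][word] + 1) / denominator)
        if score > best_score:
            best_subject, best_score = subject, score
    return best_subject


# Hypothetical training data, invented for illustration.
examples = [
    ("Cambridge Photonics Research Centre", "Science"),
    ("Institute of Physics Laboratory", "Science"),
    ("Local Election Campaign Blog", "Politics"),
    ("Party Political Campaign News", "Politics"),
]
model = train(examples)
print(classify("Physics Research Group", model))
print(classify("Election Campaign Diary", model))
```

Adding keywords drawn from the page text, as suggested above, would amount to feeding longer token lists into the same counting step.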

We're always keen to investigate more options for exploiting the data and metadata in our archives. If you have any requests for datasets you'd like us to make available, please comment below, or get in touch.

Posted by Peter Webster at 1:36 PM

17 June 2013

Innovation in geographical context: the Cambridge Network collection

Most people have heard of Silicon Valley, the area of northern California famous for its concentration of technology companies, both well established and newly started. The term has in more recent years been applied to the area around Cambridge ("Silicon Fen") and perhaps most recently there has appeared "Silicon Roundabout", just a short distance from us here at the British Library. These three examples point towards the key importance of geographical proximity for economic development and for innovation in particular.

We are particularly pleased to have started to capture some of the web archival record for the cluster of companies, educational institutions and other organisations associated with the Cambridge Network, and to have been able to work with the Network itself, which exists to bring business and academia together to facilitate the sharing of ideas, and to encourage collaboration and partnership.

Not surprisingly, many parts of the University of Cambridge are represented in the collection, such as the Centre for Business Research or the Centre for Advanced Photonics and Electronics (CAPE). There are the sites of organisations whose purpose is to facilitate knowledge exchange in general, such as the UK Innovation Research Centre or the Huntingdonshire Business Network. And there are sites from local government, the law, financial services, the charitable sector and the many other parts that go to make up the rich ecology of business in a local area.

And of course there are the companies themselves. Some are well-known, the majority are not. But of the 715 organisations represented in the collection, it may be that some of them grow to become household names. This collection will hopefully be of great value in capturing a snapshot of innovation in progress in a particular geographical area. Browse the collection here.

Posted by Peter Webster at 11:54 AM

Tags

Collections, Science

05 June 2013

Scholars and web archives: a report on the IIPC General Assembly, Slovenia April 2013

[Nicola Johnson, Web Archivist at the British Library reports on the General Assembly of the International Internet Preservation Consortium (IIPC) held in Ljubljana, Slovenia in April 2013.]

The IIPC is a membership organization dedicated to improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. The British Library was a founder member of the IIPC which this year saw its tenth anniversary, providing an opportunity for members to reflect on their achievements and discuss future directions of the Consortium.

The General Assembly, hosted by the National and University Library of Slovenia, comprised three days of member meetings and a two-day public conference on the theme ‘Scholarly Access to Web Archives’. This was the part I attended, held at the Hotel Mons just outside Ljubljana surrounded by pine forests with views to mountains beyond.

The overall vibe of the General Assembly was very positive. Members share a strong commonality of purpose and the workshop allowed people to share their experiences through open and honest dialogue.

I was particularly interested in the perspective of scholars using web archives, several of whom presented at the conference. Niels Brugger from NetLab at Aarhus University discussed the various programmes NetLab is currently running in digital humanities and internet studies, such as: Danish Internet Favourites 2009; Digital Footprints; Network Analysis of the Danish Parliamentary Elections 2007-2011; Cross Media Production and Communication; and ‘Fundamental Tools for Web Archive Research (FUTARC)’.

Niels stressed the importance for researchers of making informed decisions about the completeness of archived websites and determining if there are inconsistencies between versions. Information about what is missing from the archived object enables scholars to assess resources and is something web archivists can help with by documenting the gaps.

Sophie Gebeil (University of Aix-Marseille) provided a different perspective on the researcher’s experience of using web archives when she described her doctoral studies on North African immigrant communities. During her research the Web materialised as the medium of expression for immigrants and therefore web archives were of the greatest significance for her studies. Sophie stressed that archived websites are not original documents but are to a greater or lesser extent artificially reconstructed and historians therefore need to understand the limitations of the material they are working with.

Meghan Dougherty of Loyola University, Chicago expanded further on the difficulties scholars encounter when using web archives. As a web historian and web methodologist Meghan investigates the idea of the Web as a co-authored medium and is interested in the data that users share, expose or trade when communicating through the internet. She put forward the interesting notion that methods in web history are analogous to anthropology or archaeology as researchers in this field seek to reconstruct the user’s journey through a website. To this end the ‘share this’ or ‘like’ buttons on a website ought to be preserved with as much consideration as the content of the website.

In the panel discussion that followed, the consensus was that scholars researching web archives require as much contextual information as possible about the archived objects, including curatorial data, the legal framework in which archiving took place, and which content is missing. This information is extremely helpful to the web archivist when performing quality assurance checks on harvested material.

Obstacles to collaboration between researchers and archivists were discussed. In the early years of web archiving the immediate concern was to acquire content, and the question of what to preserve was to some extent secondary. Now that there are a good number of existing web archives, scholars can start to articulate how exactly they will use them and what cultural and heritage institutions should focus on collecting.

From the web archivist’s point of view one of the big questions is how and where to select content from given the enormous size of the web. So far, the selection of curated web resources has been a largely manual, resource intensive process which is not only expensive but represents the ‘expert view’. To address this, web archiving institutions have begun to explore the benefits of crowdsourcing in selecting web content to archive.

Helen Hockx-Yu, Head of Web Archiving at the British Library, demonstrated the Twittervane tool, a prototype application designed to collect and analyse the URLs published in tweets (see previous post). Workshop participants had the opportunity to set up and run collections and to submit feedback which will be used in the further development of the tool.
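
The general idea of collecting and analysing URLs from tweets can be sketched as follows. This is not Twittervane's implementation, which is not described here; the regular expression, the function names and the sample tweets are all assumptions made for illustration, and shortened links would need resolving before counting in practice.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# A deliberately simple pattern; real tweets may need more careful handling.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text):
    """Return the URLs found in a piece of text, minus trailing punctuation."""
    return [match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(text)]


def count_hosts(tweets):
    """Count how often each host is linked to across a set of tweets."""
    hosts = Counter()
    for tweet in tweets:
        for url in extract_urls(tweet):
            host = urlparse(url).hostname
            if host:
                hosts[host] += 1
    return hosts


# Hypothetical sample tweets.
tweets = [
    "Great read on NHS reform: http://www.example.org.uk/nhs-reform.",
    "More at http://www.example.org.uk/comment and https://blog.example.co.uk/post",
    "No links here",
]

for host, count in count_hosts(tweets).most_common():
    print(host, count)
```

Ranking hosts by how often they are shared is one plausible way of turning crowd activity into candidate sites for a curator to review.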

Delegates were also impressed by the National and University Library of Slovenia’s WayBack Annotator, another prototype tool which enables members of the public to collaborate with other users on a common platform by selecting URLs of interest, forming groups or collections relevant to them, tagging individual URLs and/or whole websites, supplying additional metadata, marking important parts of individual pages and adding notes and annotations to selected pages.

The IIPC General Assembly provided an excellent forum for members to discuss the simultaneous challenges faced by web archiving institutions, from the technical challenges of harvesting, preservation and replay to the challenge of defining the future use cases for web archives and the requirements of scholars.

[Other accounts of the conference include those by Ahmed Alsum (Web Science and Digital Libraries Research Group at Old Dominion University), Rosalie Lack of the California Digital Library and Abbey Potter, Program Officer with NDIIPP at the Library of Congress and outgoing Communications Officer at the IIPC.]

Posted by Peter Webster at 2:20 PM

21 May 2013

History is arbitrary (if we let it)

[A guest post from Jim Boulton (@jim_boulton), reflecting on digital archaeology and why we preserve the history of the web. His exhibition Error 404 is on at Digital Shoreditch at Shoreditch Town Hall. Free entry from 25th to 31st May, 10am – 7pm.]

The Web was born in 1991. In its short life, it has transformed our lives. Yet, due to the transient nature of websites, evidence of the pioneering years of this new medium is virtually non-existent.

The story of the first webpage is typical. It was continually overwritten until March ’92. A record of that monumental point in history has been lost forever. This is not an isolated case. Most sites from the 90s and early 2000s, that shaped how we now work and play, can no longer be seen. Hardware has become obsolete. Media has become redundant. Files have been lost. The fact that digital content is so easy to duplicate means that copies are not valued. Worse, the original version is also often considered disposable.

Archiving websites is not the only challenge. A book displays itself; a website cannot be displayed without a browser. These too need preserving. Throw in the hardware and this makes web preservation a three-part puzzle.

But why archive websites at all?

My motivation is to tell the untold story of the Web. The story of the engineers that built the Web has been told, as has the story of the entrepreneurs that exploited it. Little is known about the designers and creatives that shaped it.

Take the work of Deepend. Founded in 1994, while their contemporaries were pushing the technical possibilities of the Web, Deepend explored its aesthetic potential. Deepend’s sites for clients including Volkswagen Beetle, Hoover and the Design Museum set the standard that the rest of the industry aspired to. In 2001, Deepend fell victim to the dot-com crash. Its groundbreaking work disappeared with it.

Archiving ensures the historical record is accurate and accessible. Without broad evidence, history is arbitrary, something I was surprised to discover first-hand. It’s frequently stated that the Shoreditch creative tech scene started with fifteen companies in 2008. This is just not true. My digital agency, Large, moved to Shoreditch in 2001 and there were plenty of creative tech companies already there. The convenient assertion is based on a playful Tweet made five years ago. To his credit, the author of the Tweet has done his best to clarify the situation but the myth remains.

My latest project, an exhibition called Error 404, does its bit to set the record straight. Currently showing at Digital Shoreditch, Error 404 showcases the work of influential Shoreditch-based agencies, including De-construct, Deepend, Digit, Hi-ReS!, Lateral and Less Rain, on the hardware and software of the day. Alongside this culturally important work is an early version of the first webpage, reunited with the first browser and shown on a NeXTCube. The show also includes artwork by pioneering iconographer Susan Kare.

Over the last 20 years we have been privileged to witness the birth of the Information Age. We have a responsibility to accurately record this artistic, commercial and social history for future generations. Long live the archive.

Posted by Peter Webster at 2:14 PM

10 May 2013

The new NHS: a reform you could see from space?

[A guest post by Jennie Grimshaw, Lead Curator for social policy and official publications at the British Library.]

The controversial Health and Social Care Act 2012 ushered in the most radical reform of the National Health Service since its launch in 1948. On April 1st 2013, the main changes set out in the Act came into force, and most parts of the NHS will be affected in some way.

Clinical commissioning groups (CCGs) replace primary care trusts (PCTs) and are the cornerstone of the new system. There are 211 CCGs in total, commissioning care for an average of 226,000 people each. Each of the 8,000 GP practices in England is now part of a CCG. These groups will commission the majority of health services, and in 2013/14 will be responsible for a budget of £65bn, about 60% of the total NHS budget. CCGs will be accountable to and supported by NHS England, formerly the NHS Commissioning Board, which will also directly commission primary care and specialist services. NHS services will be opened up to competition from providers that meet NHS standards on price, quality and safety, with a new regulator (Monitor) and an expectation that the vast majority of hospitals will become foundation trusts by 2014.

In addition, local authorities will take on a bigger role, assuming responsibility for budgets for public health. Health and wellbeing boards will have duties to encourage integrated working between commissioners of services across health, social care and children’s services, involving democratically elected representatives of local people. Local authorities are expected to work more closely with other health and care providers, community groups and agencies, using their knowledge of local communities to tackle challenges such as smoking, alcohol, drug misuse and obesity.

Finally, the Local Involvement Networks (LINks) were replaced by 152 Local Healthwatch operating under the leadership of a new consumer champion, Healthwatch England. Each local Healthwatch is part of its local community, and will work in partnership with other local organisations to ensure that the voices of consumers and those who use services reach the ears of decision makers.

The Coalition government’s radical reform of the NHS has attracted criticism on various grounds: of cost and disruption; backdoor privatisation; introduction of price competition, which risks decisions being made on the basis of price rather than clinical need; and determined opposition from health professions.

The scope of the collection

In the light of this debate, the British Library has chosen the NHS Reform of 2013 for its first themed collection of archived websites under the new Non-Print Legal Deposit Regulations which allow it to gather and preserve all sites in the UK web domain. For this collection we have hand-selected the sites of:

NHS bodies abolished under the reform (primary care trusts, LINks, strategic health authorities and some public health programmes and agencies). Many of these archived sites are already publicly available in the UK Web Archive. (See also this earlier post);
The emerging new bodies (clinical commissioning groups, health and wellbeing boards, and local healthwatch);
Groups campaigning for or (mainly) against the changes (medical royal colleges, professional associations, medical charities, trade unions, grass roots organisations);
Press and media commentators (including blogs, the BBC and national newspapers from the Sun to the Guardian);
The Government and the regulators (including legislation);
The private sector providers preparing to move into the market.

Behind the scenes, this intensive crawl of a relatively small number of sites is going on alongside the general crawl of the whole UK domain which we blogged about last week. These sites are being crawled more frequently than would be typical for the domain crawl, but for a three month period.

The collection will be available for onsite access at the six legal deposit libraries for the UK from this summer. We hope that it will present a balanced view of the impact of the reform, and the debate surrounding it. We'll be blogging again nearer the time with a review of the archive.

Posted by Peter Webster at 10:14 AM