UK Web Archive blog

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

21 May 2013

History is arbitrary (if we let it)

[A guest post from Jim Boulton (@jim_boulton), reflecting on digital archaeology and why we preserve the history of the web. His exhibition Error 404 is on at Digital Shoreditch at Shoreditch Town Hall. Free entry from 25th to 31st May, 10am – 7pm.]

The Web was born in 1991. In its short life, it has transformed our lives. Yet, due to the transient nature of websites, evidence of the pioneering years of this new medium is virtually non-existent.

The story of the first webpage is typical. It was continually overwritten until March ’92. A record of that monumental point in history has been lost forever. This is not an isolated case. Most sites from the 90s and early 2000s that shaped how we now work and play can no longer be seen. Hardware has become obsolete. Media has become redundant. Files have been lost. The fact that digital content is so easy to duplicate means that copies are not valued. Worse, the original version is also often considered disposable.

Archiving websites is not the only challenge. A book displays itself; a website cannot be displayed without a browser. Browsers too need preserving. Throw in the hardware and this makes web preservation a three-part puzzle.

But why archive websites at all?

My motivation is to tell the untold story of the Web. The story of the engineers that built the Web has been told, as has the story of the entrepreneurs that exploited it. Little is known about the designers and creatives that shaped it.

Take the work of Deepend. Founded in 1994, while their contemporaries were pushing the technical possibilities of the Web, Deepend explored its aesthetic potential. Deepend’s sites for clients including Volkswagen Beetle, Hoover and the Design Museum set the standard that the rest of the industry aspired to. In 2001, Deepend fell victim to the dot-com crash. Its groundbreaking work disappeared with it.

Archiving ensures the historical record is accurate and accessible. Without broad evidence, history is arbitrary, something I was surprised to discover first-hand. It’s frequently stated that the Shoreditch creative tech scene started with fifteen companies in 2008. This is just not true. My digital agency, Large, moved to Shoreditch in 2001 and there were plenty of creative tech companies already there. The convenient assertion is based on a playful Tweet made five years ago. To his credit, the author of the Tweet has done his best to clarify the situation but the myth remains.

My latest project, an exhibition called Error 404, does its bit to set the record straight. Currently showing at Digital Shoreditch, Error 404 showcases the work of influential Shoreditch-based agencies, including De-construct, Deepend, Digit, Hi-ReS!, Lateral and Less Rain, on the hardware and software of the day. Alongside this culturally important work is an early version of the first webpage, reunited with the first browser and shown on a NeXTCube. The show also includes artwork by pioneering iconographer Susan Kare.

Over the last 20 years we have been privileged to witness the birth of the Information Age. We have a responsibility to accurately record this artistic, commercial and social history for future generations. Long live the archive.

10 May 2013

The new NHS: a reform you could see from space?

[A guest post by Jennie Grimshaw, Lead Curator for social policy and official publications at the British Library.]

The controversial Health and Social Care Act 2012 ushered in the most radical reform of the National Health Service since its launch in 1948. On April 1st 2013, the main changes set out in the Act came into force, and most parts of the NHS will be affected in some way.

Clinical commissioning groups (CCGs) replace primary care trusts (PCTs) and are the cornerstone of the new system. There are 211 CCGs in total, commissioning care for an average of 226,000 people each. Each of the 8,000 GP practices in England is now part of a CCG. These groups will commission the majority of health services, and in 2013/14 will be responsible for a budget of £65bn, about 60% of the total NHS budget. CCGs will be accountable to and supported by NHS England, formerly the NHS Commissioning Board, which will also directly commission primary care and specialist services. NHS services will be opened up to competition from providers that meet NHS standards on price, quality and safety, with a new regulator (Monitor) and an expectation that the vast majority of hospitals will become foundation trusts by 2014.

In addition, local authorities will take on a bigger role, assuming responsibility for budgets for public health. Health and wellbeing boards will have duties to encourage integrated working between commissioners of services across health, social care and children’s services, involving democratically elected representatives of local people. Local authorities are expected to work more closely with other health and care providers, community groups and agencies, using their knowledge of local communities to tackle challenges such as smoking, alcohol, drug misuse and obesity.

Finally, the Local Involvement Networks (LINks) were replaced by 152 Local Healthwatch operating under the leadership of a new consumer champion, Healthwatch England. Each local Healthwatch is part of its local community, and will work in partnership with other local organisations to ensure that the voices of consumers and those who use services reach the ears of decision makers.

The Coalition government’s radical reform of the NHS has attracted criticism on various grounds: of cost and disruption; backdoor privatisation; introduction of price competition, which risks decisions being made on the basis of price rather than clinical need; and determined opposition from health professions.

The scope of the collection

In the light of this debate, the British Library has chosen the NHS Reform of 2013 for its first themed collection of archived websites under the new Non-Print Legal Deposit Regulations which allow it to gather and preserve all sites in the UK web domain. For this collection we have hand-selected the sites of:

  • NHS bodies abolished under the reform (primary care trusts, LINks, strategic health authorities and some public health programmes and agencies). Many of these archived sites are already publicly available in the UK Web Archive. (See also this earlier post);
  • The emerging new bodies (clinical commissioning groups, health and wellbeing boards, and local healthwatch);
  • Groups campaigning for or (mainly) against the changes (medical royal colleges, professional associations, medical charities, trade unions, grass roots organisations);
  • Press and media commentators (including blogs, the BBC and national newspapers from the Sun to the Guardian);
  • The Government and the regulators (including legislation);
  • The private sector providers preparing to move into the market. 

Behind the scenes, this intensive crawl of a relatively small number of sites is going on alongside the general crawl of the whole UK domain which we blogged about last week. These sites are being crawled more frequently than would be typical for the domain crawl, but only for a three-month period.

The collection will be available for onsite access at the six legal deposit libraries for the UK from this summer. We hope that it will present a balanced view of the impact of the reform, and the debate surrounding it. We'll be blogging again nearer the time with a review of the archive.

30 April 2013

Dispatches from the domain crawl #1

After the blaze of publicity surrounding the advent of Non-Print Legal Deposit, the web archiving team have been busy putting the regulations into practice. This is the first of a series of dispatches from the domain crawl, documenting our discoveries as we begin crawling the whole of the UK web domain for the first time.

Firstly, some numbers. In the first week, we acquired nearly 3.6TB of compressed data (in its raw, uncompressed form, the data is ~40% larger) from some 191 million URLs. Although we staggered the launch as a series of smaller crawls, by the end of the week we reached a sustained rate of 300Mb/s. The bulk of this was from the general crawl of the whole domain, which we kicked off with a list of 3.8 million hostnames.
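To put those figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes decimal terabytes and reads "300Mb/s" as megabits per second, neither of which is stated above, and the variable names are purely illustrative.

```python
# Figures quoted in the post.
compressed_tb = 3.6            # compressed data acquired in the first week (TB)
uncompressed_factor = 1.4      # raw data is ~40% larger than compressed
urls = 191_000_000             # URLs harvested in the first week
rate_mbit_per_s = 300          # ASSUMPTION: "Mb/s" means megabits per second

TB = 10**12                    # ASSUMPTION: decimal terabytes
SECONDS_PER_DAY = 86_400

# Approximate size of the first week's data once uncompressed.
uncompressed_tb = compressed_tb * uncompressed_factor

# Average compressed bytes stored per harvested URL.
avg_compressed_bytes_per_url = compressed_tb * TB / urls

# Volume transferred per day if the end-of-week rate were held constant.
tb_per_day_at_sustained_rate = rate_mbit_per_s * 10**6 / 8 * SECONDS_PER_DAY / TB

print(f"Uncompressed: ~{uncompressed_tb:.2f} TB")
print(f"Average per URL: ~{avg_compressed_bytes_per_url / 1000:.1f} kB compressed")
print(f"At the sustained rate: ~{tb_per_day_at_sustained_rate:.2f} TB per day")
```

Under these assumptions the week's data is roughly 5 TB uncompressed, or a little under 19 kB of compressed data per URL. The sustained rate was only reached at the end of the week, so the per-day figure is not an average for the week as a whole.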

At this stage it is difficult to determine what our success rate is - that is, how successful we are at harvesting each resource we target. This is partly because the Heritrix crawler has what might be described as an optimistic approach to determining what in a harvested page is actually a real link to another resource (particularly when parsing Javascript). As a result, some of the occasions on which Heritrix does not return a resource are due to the fact that there was not a real resource to be had.
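To illustrate why an optimistic approach produces links that were never real, here is a deliberately naive sketch of speculative link extraction from a script. It is not Heritrix's actual code; the function name and the rule (treat any quoted string containing a slash or a dot as a possible link) are assumptions made for the example.

```python
import re

# Match any quoted string literal without whitespace inside it.
STRING_LITERAL = re.compile(r"""["']([^"'\s]+)["']""")


def speculative_links(js: str) -> list[str]:
    """Return quoted strings that merely look like paths or URLs."""
    candidates = []
    for literal in STRING_LITERAL.findall(js):
        # Optimistic rule: a slash or a dot is enough to count as a link.
        if "/" in literal or "." in literal:
            candidates.append(literal)
    return candidates


script = '''
var logo = "/images/logo.png";
var mime = "text/javascript";
var greeting = "hello";
'''

found = speculative_links(script)
print(found)
```

Here the image path is a genuine link, but the MIME type is picked up as well. A crawler that then requests it relative to the page will get nothing back, and that failure says nothing about the health of the site.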

At this early stage it is also hard to determine reliably the difference between an erroneous response for a real linked resource that has disappeared, and an occasion on which access to a real resource was blocked. Over time, we'll learn more about how best to answer some of these questions, which will hopefully start to reveal interesting things about the UK web as a whole.

Roger Coram / Andy Jackson / Peter Webster

 

16 April 2013

Just what is the UK web domain anyway?

This sounds like a simple question. Ten seconds on most sites will tell a human viewer where a site originates from, and a little digging will produce the answer eventually. But under Non-Print Legal Deposit, we need a scalable way of settling the question without human intervention. Our remit under the new regulations extends to sites that are issued from a .uk or other UK geographic top-level domain, or where part of the publishing process takes place in the UK. (See the regulations here, and a summary here.)

We estimate that there are just short of five million sites that end in .uk - a simple, unambiguous and machine-readable way of knowing that a site originates from within the UK and so is covered by the remit we now have. However, not all UK domains end in .uk. Many .com, .org and other sites are in fact published from within the UK, and there are few reliable figures as to how many of these there are. And so to identify which of these fall within the scope of the regulations, we need other methods.
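The machine-readable test for the first group can be sketched in a few lines. This is an illustration rather than the crawler's actual scoping rule; the function name and the `suffixes` parameter are hypothetical, with the parameter there only so that other UK geographic top-level domains could be added.

```python
from urllib.parse import urlsplit


def in_uk_tld(url: str, suffixes: tuple[str, ...] = (".uk",)) -> bool:
    """Return True if the URL's hostname ends in one of the given suffixes."""
    # urlsplit only recognises a hostname after "//", so add it if missing.
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname
    if host is None:
        return False
    # Drop any trailing dot from a fully qualified name before comparing.
    return host.rstrip(".").endswith(suffixes)


print(in_uk_tld("http://www.bl.uk/"))             # hostname ends in .uk
print(in_uk_tld("https://example.com/page.uk"))   # only the path ends in .uk
```

Checking the hostname rather than the whole URL matters: a path or filename ending in .uk says nothing about where a site is published.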

One such method is to find out where the site is hosted. www.geoiptool.com provides information on where a server is located, although it is difficult to attain 100% accuracy. Another way is to look at where the domain name is registered, using a service such as www.whois.net. However, in many cases domains are registered by one company on behalf of another or of an individual, perhaps because they want their contact details to remain private. There also isn't (yet) a straightforward way of querying any of these services at scale for thousands or indeed millions of sites.

There may be sites for which we have direct knowledge, from the site owner, that their .com domain is operated from within the UK, but that could only ever be for a tiny proportion of sites. And so after all these possibilities are exhausted, the next step is to make judgements based on the presentation of the site itself. But what in a site is "enough"? A postal address in a Contact Us page is a possibility; so is a UK-domain email address (for those sites whose owners don't use anything as twentieth century as the post).
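As a rough illustration of what looking for such signals might involve, the sketch below scans page text for something shaped like a UK postcode and for an email address at a .uk domain. The patterns are simplified assumptions, not a validated postcode grammar, and the function name is hypothetical. Both will produce false positives and false negatives, so a match is a hint rather than proof.

```python
import re

# Simplified shape of a UK postcode, e.g. "NW1 2DB" (ASSUMPTION: upper case only).
POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")

# An email address whose domain ends in .uk, e.g. "info@example.co.uk".
UK_EMAIL = re.compile(r"[\w.+-]+@(?:[\w-]+\.)+uk\b", re.IGNORECASE)


def uk_signals(text: str) -> dict[str, bool]:
    """Report which UK-looking signals appear in the text of a page."""
    return {
        "postcode": POSTCODE.search(text) is not None,
        "uk_email": UK_EMAIL.search(text) is not None,
    }


contact_page = "Contact us: 96 Euston Road, London NW1 2DB. Email info@example.co.uk"
print(uk_signals(contact_page))
```

A real implementation would need to decide how many signals count as "enough", which is exactly the judgement discussed above.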

What if a site doesn't disclose the information we might like, but is self-evidently from the UK (once you look at the content)? One example is Conservative Home, a prominent political site, which nowhere explicitly states that it is published in the UK. This is a particular issue for blogs, which are often hosted on a platform service such as Wordpress (which is based in San Antonio, Texas) but would be thought by most to be "published" from wherever the author is based. There are similar issues in determining which parts of social media sites such as Twitter or Facebook should be treated as published from within the UK.

All of this of course supposes that all website owners tell the truth about where they are based. There may be cases where a site is published in another country but purports to be from the UK, perhaps to protect the author from a repressive regime. Conversely an owner might, for reasons which are hard to predict, wish that their site published within the UK did not appear to be.

It's early days for Non-Print Legal Deposit, and some of these issues will become clearer as we gain more experience with just these sorts of difficult questions. 

[Map reproduced courtesy of Showeet.com, under a Creative Commons Attribution-NoDerivs 3.0 licence.]

Peter Webster, Web Archiving Engagement and Liaison Manager

12 April 2013

Health and Social Care Act 2012: collection now available

Some weeks ago we blogged about our effort to capture some of the web estate of the NHS. There was an urgency in this, as Primary Care Trusts (PCTs), Strategic Health Authorities (SHAs) and some other organisations would cease to exist at the beginning of April, as the reforms under the Health and Social Care Act 2012 took effect. And at that point those bodies would no longer be obliged to keep those sites available.

We're now delighted to be able to announce the launch of this collection of over three hundred sites. It contains the sites of the SHAs and the PCTs, grouped by region. It also includes the Local Involvement Networks (now superseded by Healthwatch).

The collection also includes sites such as that of the National Institute for Health and Clinical Excellence (NICE), the Health Protection Agency, and information about the change from the Department of Health, and from the media.

Thanks to the tireless work of Ravish Mistry, the archive of sites from the PCTs and SHAs is comprehensive, and the coverage of the other types of sites is very full. The collection represents a highly important resource for future historians of the National Health Service, as well as being a reference point for more current discussion of the implementation of the reforms as they continue.

Peter Webster
Web Archiving Engagement and Liaison Manager, British Library

05 April 2013

Non-Print Legal Deposit: it's here!

Ten years after the Legal Deposit Libraries Act 2003 established the principle, from tomorrow we shall begin archiving the whole of the UK web domain, in partnership with the other five legal deposit libraries for the UK. The new regulations are here.

I thought it worth drawing together some key information, along with some of the media coverage that has appeared this week.

The British Library's press release is here, and there are also some useful FAQs which fill in some of the detail.

There has also been much coverage in the media, including (in roughly chronological order):

Associated Press (4 April)

The Verge (4 April)

Wired (5 April)

The Guardian (5 April) (and coverage of the launch event)

BBC News (5 April)

Daily Express (5 April)

Daily Telegraph (5 April)

International Business Times (5 April)

Paidcontent.org (5 April)

Times Higher Education Supplement (6 April)

Al Jazeera (6 April) (with video)

ZDNET (by @jackschofield) (8 April)

The Spectator (Books Blog) (11 April)

I shall keep adding to this list as more coverage appears. From outside the UK, see the New Zealand Herald, La Stampa (Italy) and Computerworld New Zealand.

Peter Webster, Web Archiving Engagement and Liaison Manager

04 April 2013

Librarianship in the 21st century: a new collection

[A guest post from Rossitza Atanassova, Digital Curator at the British Library]

What better institution to archive UK librarianship-related websites than The British Library! The evolving role of libraries in the UK collection launches with a modest number of websites worthy of preservation, and with a call to librarians, information professionals, researchers and the public to nominate many more worthwhile sites.

The collection aims to reflect developments within the UK library community in the 21st century, in response to financial, technological, political, social and other pressures and challenges. As well as some important institutional and organisation sites (CILIP, MLA, RIN), the collection showcases collaborations (Inspire, UKRR) and advocacy blogs (Public Libraries News), special interest groups (MMIT) and fora (LILAC), communities of knowledge exchange (LIKE, #UKLibChat) and of research and practice (LIS Research Coalition, Research Active). It tries to highlight the work of inspirational professional individuals (Joeyanne Libraryanne) and groups (Heart of the School); innovative services supporting learning and research (SCARLET) and the visually impaired (RNIB, Reading Sight, Speaking Volumes). One of the more dominant themes in the collection is of open access institutional repositories and the new role for librarians and information professionals in digital repositories and data management (RSP, UKCoRR, Open and Shut?).

I am most grateful for the enthusiastic response from the website owners I contacted, and a huge thank you to the Web Archiving Team for doing all the technical work behind the scenes!

22 March 2013

APIs, data services, and being generous

Traditionally, the online presence of most galleries, archives, libraries and museums has concentrated on delivering access to individual items, directly to users, one by one. This is changing. As more items are either born digital or have excellent digital facsimiles, these organisations (sometimes collectively designated as GLAM) are beginning to offer data access and services in addition to simple direct use. This allows the communities we serve to build great things.

One of the most successful examples is the National Library of Australia's Trove database. Trove provides a rich API that allows independent developers such as Tim Sherratt (@wragge) to create all sorts of new interfaces for particular needs. These, since they are fitted afresh to each community of users, can be much nearer what Dan Cohen (after Mitchell Whitelaw) has called "generous interfaces". Similarly, the British Library provides various free data services and The National Archives of the UK has started offering direct API access to its discovery systems.

Web archives have tended to focus on the playback of individual web pages, by means of the Wayback Machine, and this is what most users are used to. However, for many years now, that same playback infrastructure has been used to develop other data about and interfaces to the content. These APIs allow structured metadata about archival holdings to be retrieved programmatically, and in subsequent posts we'll explore how the Wayback queries and Memento protocols can be used to exploit web archives. (See earlier post about our web-based use of Memento here.)
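As a small taste of what such a request might look like, the sketch below builds the two pieces a client typically needs: a Wayback-style replay URL, which embeds a 14-digit timestamp before the original address, and the `Accept-Datetime` header that the Memento protocol (RFC 7089) uses to ask a TimeGate for the capture nearest a given moment. The base URL `https://archive.example/wayback` is a placeholder and the function names are hypothetical; no network request is made.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def wayback_replay_url(base: str, url: str, when: datetime) -> str:
    """Build a Wayback-style replay URL: <base>/<YYYYMMDDhhmmss>/<original URL>."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{base.rstrip('/')}/{stamp}/{url}"


def memento_headers(when: datetime) -> dict[str, str]:
    """Build the request header for Memento datetime negotiation."""
    # Memento expects an HTTP-date in GMT; `when` must be timezone-aware.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": http_date}


# Midday on the day the "it's here" post above was published.
when = datetime(2013, 4, 5, 12, 0, 0, tzinfo=timezone.utc)

replay = wayback_replay_url("https://archive.example/wayback", "http://www.bl.uk/", when)
headers = memento_headers(when)
print(replay)
print(headers)
```

The appeal of the Memento approach is that the client only needs the original URL and a date; the archive decides which capture is the closest match.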

Alongside these online services, we've also been exploring the possibilities around making metadata datasets available for research and analysis, based on an archive of the UK web for 1996-2010, secured for the nation by the JISC and which we look after. So far we've released an historical geo-index and a data format profile. We're also about to make further, even richer datasets available, based on the same archive, and drawing on the experiences of the AADDA and Big Data projects. Watch this space for more news on these in future posts.

Andy Jackson, Web Archiving Technical Lead (British Library)