UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

16 April 2013

Just what is the UK web domain anyway?

This sounds like a simple question. Ten seconds on most sites will tell a human viewer where a site originates, and a little digging will produce the answer eventually. But under Non-Print Legal Deposit, we need a scalable way of settling the question without human intervention. Our remit under the new regulations extends to sites that are issued from a .uk or other UK geographic top-level domain, or where part of the publishing process takes place in the UK. (See the regulations here, and a summary here.)

[Image: UK map]

We estimate that there are just short of five million sites that end in .uk - a simple, unambiguous and machine-readable way of knowing that a site originates from within the UK and so is covered by the remit we now have. However, not all UK domains end in .uk. Many .com, .org and other sites are in fact published from within the UK, and there are few reliable figures as to how many of these there are. And so to identify which of these fall within the scope of the regulations, we need other methods.
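
As a rough illustration of why the .uk case is the easy one, the check can be sketched in a few lines of Python. The function name and the `UK_TLDS` tuple are our own inventions for this example, and the tuple lists only .uk; any other UK geographic top-level domains would need to be added by hand.

```python
from urllib.parse import urlsplit

# Assumption: only ".uk" is listed; extend this tuple with any other
# UK geographic top-level domains that fall within the remit.
UK_TLDS = (".uk",)


def has_uk_tld(url: str) -> bool:
    """Return True if the URL's hostname ends in one of the listed TLDs."""
    # urlsplit only recognises a hostname after "//", so add it when
    # a bare hostname such as "www.bl.uk" is passed in.
    parts = urlsplit(url if "//" in url else "//" + url)
    host = (parts.hostname or "").rstrip(".")
    return host.endswith(UK_TLDS)
```

A site that fails this test is not necessarily out of scope; it simply needs one of the other methods.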

One such method is to find out where the site is hosted. www.geoiptool.com provides information on where a server is located, although it is difficult to attain 100% accuracy. Another way is to look at where the domain name is registered, using a service such as www.whois.net. However, in many cases domains are registered by one company on behalf of another or of an individual, perhaps because they want their contact details to remain private. There also isn't (yet) a straightforward way of querying any of these services at scale for thousands or indeed millions of sites.
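
To make the hosting check concrete, here is a minimal sketch of how it might be run over a list of hostnames. It assumes some IP-to-country lookup is available: `country_of_ip` is a hypothetical function supplied by the caller, not the interface of any particular service, and the result says only where the server appears to be, not where the publisher is.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def hosting_countries(
    hosts: Iterable[str],
    country_of_ip: Callable[[str], Optional[str]],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each hostname to the country its server appears to be in.

    country_of_ip is a hypothetical, caller-supplied lookup that returns
    a country code for an IP address, or None if it cannot tell.
    """
    results: Dict[str, Optional[str]] = {}
    for host in hosts:
        try:
            ip = resolve(host)  # DNS lookup; raises OSError on failure
        except OSError:
            results[host] = None  # unresolvable hosts are left undecided
            continue
        results[host] = country_of_ip(ip)
    return results
```

Passing the resolver and the lookup in as arguments keeps the sketch independent of any one provider, which matters given how uneven their accuracy is.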

There may be sites for which we have direct knowledge, from the site owner, that their .com domain is operated from within the UK, but that could only ever be for a tiny proportion of sites. And so after all these possibilities are exhausted, the next step is to make judgements based on the presentation of the site itself. But what in a site is "enough"? A postal address in a Contact Us page is a possibility; so is a UK-domain email address (for those sites whose owners don't use anything as twentieth century as the post).
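
The two signals just mentioned can be sketched as simple pattern matches over the text of a page. This is an illustration only: the postcode pattern is a simplified one of our own that covers the common shapes and is not a full validator, and neither signal settles the question by itself.

```python
import re

# Simplified pattern for common UK postcode shapes, e.g. "NW1 2DB".
# Assumption: useful as a hint only; it will miss some valid postcodes
# and may match strings that are not postcodes at all.
POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")

# Email addresses whose domain ends in .uk
UK_EMAIL = re.compile(r"[\w.+-]+@(?:[\w-]+\.)+uk\b", re.IGNORECASE)


def uk_signals(page_text: str) -> dict:
    """Collect postcode-like strings and .uk email addresses from a page."""
    return {
        "postcodes": POSTCODE.findall(page_text),
        "uk_emails": UK_EMAIL.findall(page_text),
    }
```

How many such signals count as "enough" remains a matter of judgement rather than something the code can decide.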

What if a site doesn't disclose the information we might like, but is self-evidently from the UK (once you look at the content)? One example is Conservative Home, a prominent political site, which nowhere explicitly states that it is published in the UK. This is a particular issue for blogs, which are often hosted on a platform service such as WordPress (which is based in San Antonio, Texas) but would be thought by most to be "published" from wherever the author is based. There are similar issues in determining which parts of social media sites such as Twitter or Facebook should be treated as published from within the UK.

All of this of course supposes that all website owners tell the truth about where they are based. There may be cases where a site is published in another country but purports to be from the UK, perhaps to protect the author from a repressive regime. Conversely an owner might, for reasons which are hard to predict, wish that their site published within the UK did not appear to be.

It's early days for Non-Print Legal Deposit, and some of these issues will become clearer as we gain more experience with just these sorts of difficult questions. 

[Map reproduced courtesy of Showeet.com, under a Creative Commons Attribution-NoDerivs 3.0 licence.]

Peter Webster, Web Archiving Engagement and Liaison Manager

12 April 2013

Health and Social Care Act 2012: collection now available

Some weeks ago we blogged about our effort to capture some of the web estate of the NHS. There was an urgency in this, as Primary Care Trusts (PCTs), Strategic Health Authorities (SHAs) and some other organisations would cease to exist at the beginning of April, as the reforms under the Health and Social Care Act 2012 took effect. And at that point those bodies would no longer be obliged to keep those sites available.

We're now delighted to be able to announce the launch of this collection of over three hundred sites. It contains the sites of the SHAs and the PCTs, grouped by region. It also includes the Local Involvement Networks (now superseded by Healthwatch).

The collection also includes sites such as that of the National Institute for Health and Clinical Excellence (NICE), the Health Protection Agency, and information about the change from the Department of Health, and from the media.

Thanks to the tireless work of Ravish Mistry, the archive of sites from the PCTs and SHAs is comprehensive, and the coverage of the other types of sites is very full. The collection represents a highly important resource for future historians of the National Health Service, as well as being a reference point for more current discussion of the implementation of the reforms as they continue.

Peter Webster
Web Archiving Engagement and Liaison Manager, British Library

05 April 2013

Non-Print Legal Deposit: it's here!

Ten years after the Legal Deposit Libraries Act 2003 established the principle, from tomorrow we shall begin archiving the whole of the UK web domain, in partnership with the other five legal deposit libraries for the UK. The new regulations are here.

I thought it worth drawing together some key information, along with some of the media coverage that has appeared this week.

The British Library's press release is here, and there are also some useful FAQs which fill in some of the detail. These cover:

There has also been much coverage in the media, including (in roughly chronological order):

Associated Press (4 April)

The Verge (4 April)

Wired (5 April)

The Guardian (5 April) (and coverage of the launch event)

BBC News (5 April)

Daily Express (5 April)

Daily Telegraph (5 April)

International Business Times (5 April)

Paidcontent.org (5 April)

Times Higher Education Supplement (6 April)

Al Jazeera (6 April) (with video)

ZDNET (by @jackschofield) (8 April)

The Spectator (Books Blog) (11 April)

I shall keep adding to this list as more coverage appears. From outside the UK, see the New Zealand Herald, La Stampa (Italy) and Computerworld New Zealand.

Peter Webster, Web Archiving Engagement and Liaison Manager

04 April 2013

Librarianship in the 21st century: a new collection

[A guest post from Rossitza Atanassova, Digital Curator at the British Library]

What better institution to archive UK librarianship-related websites than The British Library! The evolving role of libraries in the UK collection launches with a modest number of websites worthy of preservation, and with a call to librarians, information professionals, researchers and the public to nominate many more worthwhile sites.

The collection aims to reflect developments within the UK library community in the 21st century, in response to financial, technological, political, social and other pressures and challenges. As well as some important institutional and organisation sites (CILIP, MLA, RIN), the collection showcases collaborations (Inspire, UKRR) and advocacy blogs (Public Libraries News), special interest groups (MMIT) and fora (LILAC), communities of knowledge exchange (LIKE, #UKLibChat) and of research and practice (LIS Research Coalition, Research Active). It tries to highlight the work of inspirational professional individuals (Joeyanne Libraryanne) and groups (Heart of the School); innovative services supporting learning and research (SCARLET) and the visually impaired (RNIB, Reading Sight, Speaking Volumes). One of the more dominant themes in the collection is of open access institutional repositories and the new role for librarians and information professionals in digital repositories and data management (RSP, UKCoRR, Open and Shut?).

I am most grateful for the enthusiastic response from the website owners I contacted, and a huge thank you to the Web Archiving Team for doing all the technical work behind the scenes!

22 March 2013

APIs, data services, and being generous

Traditionally, the online presence of most galleries, archives, libraries and museums has concentrated on delivering access to individual items, directly to users, one by one. This is changing. As more items are either born digital or have excellent digital facsimiles, these organisations (sometimes collectively designated as GLAM) are beginning to offer data access and services in addition to simple direct use. This allows the communities we serve to build great things.

One of the most successful examples is the National Library of Australia's Trove database. Trove provides a rich API that allows independent developers such as Tim Sherratt (@wragge) to create all sorts of new interfaces for particular needs. These, since they are fitted afresh to each community of users, can be much nearer what Dan Cohen (after Mitchell Whitelaw) has called "generous interfaces". Similarly, the British Library provides various free data services and The National Archives of the UK has started offering direct API access to its discovery systems.

Web archives have tended to focus on the playback of individual web pages, by means of the Wayback Machine, and this is what most users are used to. However, for many years now, that same playback infrastructure has been used to develop other data about and interfaces to the content. These APIs allow structured metadata about archival holdings to be retrieved programmatically, and in subsequent posts we'll explore how Wayback queries and the Memento protocol can be used to exploit web archives. (See earlier post about our web-based use of Memento here.)
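
As a small taste of what that looks like, Memento lets a client ask for a page as it was at a given moment by sending an Accept-Datetime header to a "TimeGate". The sketch below only builds such a request and sends nothing. The function name and the TimeGate address in the usage note are placeholders of our own, and appending the target URL to the TimeGate prefix is a common convention that we assume here rather than something every archive guarantees.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def timegate_request(timegate: str, target_url: str, when: datetime):
    """Return (url, headers) for a Memento datetime-negotiation request.

    `when` should be timezone-aware; it is converted to UTC because the
    Accept-Datetime header is expressed in GMT.
    """
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    # Assumption: the TimeGate takes the original URL appended to its prefix.
    return timegate + target_url, headers
```

For example, `timegate_request("http://timegate.example/", "http://www.bl.uk/", datetime(2013, 4, 16, tzinfo=timezone.utc))` yields a URL and a header dictionary that can be handed to any HTTP client.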

Alongside these online services, we've also been exploring the possibilities around making metadata datasets available for research and analysis, based on an archive of the UK web for 1996-2010, secured for the nation by the JISC and which we look after. So far we've released an historical geo-index and a data format profile. We're also about to make further, even richer datasets available, based on the same archive, and drawing on the experiences of the AADDA and Big Data projects. Watch this space for more news on these in future posts.

Andy Jackson, Web Archiving Technical Lead (British Library)

06 March 2013

NHS Reform: capturing the change

Last week we blogged about our Governing the Police collection, in which we managed to capture a complete set of the sites of the police authorities, due to be abolished in November 2012 and superseded by elected Police and Crime Commissioners. It was a major change in public administration, which we could see coming, and could plan for.

And there is an even bigger change coming very soon: the reorganisation of the National Health Service, under the terms of the Health and Social Care Act 2012. The change is due at the end of this month.

At that point in time, several organisations within the NHS will cease to exist. These include the regional Strategic Health Authorities, approximately 150 Primary Care Trusts, the Health Protection Agency, and c.150 Local Involvement Networks (LINks).

We are currently working with these bodies to secure permission to archive their sites before the beginning of April, when they will no longer be obliged to keep them live, thus saving a wealth of vital documentary material. We hope to make the collection publicly available later in the year.

Peter Webster

26 February 2013

Governing the Police: a special collection

In an earlier post, I wrote about our efforts to capture the c.41 police authority websites, due to go offline with the abolition of the authorities themselves following the new Police Reform and Social Responsibility Act 2011. Saving web-based content from disappearing forever like this is a key part of the mission of the Web Archive.

However, our objectives also include collecting comprehensively around current issues, in order to capture the full range of debate. It was therefore important to capture the sites of the newly appointed Police and Crime Commissioners as well, to set alongside the sites of the defunct police authorities, so that researchers will be able to track changes in the way the police are governed over time. Other websites in this collection include some relating to the first elections of Police and Crime Commissioners in November 2012 and a sample of news coverage.

For websites for the selective archive, we use a permissions approach, which is resource-intensive. Each police authority was contacted on average between four and seven times to secure the permission and for us to answer questions. Nevertheless we achieved a 100% success rate with the PA sites, as the fact that the sites were going offline gave the website publishers an added impetus. With the Commissioners' websites the process was a little easier as the staff were (by and large) the same people we had contacted at the police authorities. It helped that we were able to articulate the benefits of having a corporate archive from the very beginning that would be accessible by both the commissioners and their staff and also by the public, capturing content that may be taken off the live website but may be needed in future.

At the time of writing there were 80 titles in the Governing the Police collection although we are still adding titles as and when we receive permission to archive them. We will be taking regular snapshots every six months to capture developments over time, and so the collection promises to be a fundamental resource to scholars of policing in Britain in the years ahead.

Nicola Johnson and Ravish Mistry

19 February 2013

Nineteenth century English literature: a new special collection

[A guest post from Andrea Lloyd, Curator of Printed Literary Sources, 1801-1914 at the British Library]

After almost a year of gathering I’m pleased to announce that my ‘Curator’s Choice’ collection of websites relating to 19th century English literature has now been published on the UK Web Archive.

As a curator of printed literary sources for the period 1801-1914 it doesn’t require a great leap of imagination to discover why I chose this particular topic. The collection is intended to reflect the diverse interests in the genre that are substantiated on the web. Opinions about, and interpretations of 19th century literature and its authors are constantly evolving and I hope that this resource contextualises these important scholarly and cultural changes.

The sites included so far display a broad and eclectic array of subject matters – ranging from author societies to museums; from literary adaptations to academic syllabi. 19th century literature is still hugely popular and attracts a wide audience. Given the massive interest in the likes of Jane Austen and Charles Dickens, I initially thought I would concentrate on lesser-known authors, and on literature that has grown somewhat obscure in the intervening years. This ultimately isn’t how the collection has evolved – sometimes because many of the more niche sites are published without giving any administrator contact details (so permission cannot be sought to archive the site). In other cases, the owners have not responded to permission requests – often because they have cast the sites off into the vast ‘webosphere’ to fend for themselves.

As someone who works with 19th century printed ephemera on a regular basis I found this exercise particularly fascinating. Pertinent comparisons can be drawn between the ephemeral items that are published on the web and those that were printed in the 19th century. A great deal of the ephemeral literature produced in the 19th century has survived to this day (albeit in a fragile state) – either through luck or thanks to collectors with foresight. Given its transient and contributory nature there is a great danger that similar items produced in electronic formats may not be so lucky – hence the reason the Web Archive is so vital. Hopefully my 22nd century counterpart will thank me for choosing to preserve for posterity some of the more marginal, fleeting and subjective sites available relating to the genre!

Now it’s available for all to see, I hope that others will recommend sites that they think would complement the theme and help to create a lasting snapshot of 19th century literary scholarship in the 21st century. Do get in touch via this blog, or @UKWebArchive on Twitter.

[Image by anna_t, Creative Commons BY-NC-SA]