UK Web Archive blog

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

06 March 2013

NHS Reform: capturing the change

Last week we blogged about our Governing the Police collection, in which we managed to capture a complete set of the sites of the police authorities, due to be abolished in November 2012 and superseded by elected Police and Crime Commissioners. It was a major change in public administration, which we could see coming, and could plan for.

And there is an even bigger change coming very soon: the reorganisation of the National Health Service, under the terms of the Health and Social Care Act 2012. The change is due at the end of this month.

At that point, several organisations within the NHS will cease to exist. These include the regional Strategic Health Authorities, approximately 150 Primary Care Trusts, the Health Protection Agency, and c.150 Local Involvement Networks (LINks).

We are currently working with these bodies to secure permission to archive their sites before the beginning of April, when they will no longer be obliged to keep them live. Doing so will save a wealth of vital documentary material. We hope to make the collection publicly available later in the year.

Peter Webster

26 February 2013

Governing the Police: a special collection

In an earlier post, I wrote about our efforts to capture the c.41 police authority websites, due to go offline with the abolition of the authorities themselves following the new Police Reform and Social Responsibility Act 2011. Saving web-based content from disappearing forever like this is a key part of the mission of the Web Archive.

However, our objectives also include collecting comprehensively around current issues, so as to capture the full range of debate. It was therefore important to capture the sites of the newly appointed Police and Crime Commissioners as well, to set alongside the sites of the defunct police authorities, so that researchers will be able to track changes in the way the police are governed over time. Other websites in this collection include some relating to the first elections of Police and Crime Commissioners in November 2012 and a sample of news coverage.

For the selective archive we use a permissions-based approach, which is resource-intensive. Each police authority was contacted on average between four and seven times, both to secure permission and to answer questions. Nevertheless we achieved a 100% success rate with the police authority sites, helped by the added impetus for the publishers that their sites were about to go offline. With the Commissioners’ websites the process was a little easier, as the staff were (by and large) the same people we had contacted at the police authorities. It helped that we were able to articulate from the very beginning the benefits of a corporate archive that would be accessible to the commissioners and their staff as well as to the public, capturing content that may be taken off the live website but may be needed in future.

At the time of writing there were 80 titles in the Governing the Police collection although we are still adding titles as and when we receive permission to archive them. We will be taking regular snapshots every six months to capture developments over time, and so the collection promises to be a fundamental resource to scholars of policing in Britain in the years ahead.

Nicola Johnson and Ravish Mistry

19 February 2013

Nineteenth century English literature: a new special collection

[A guest post from Andrea Lloyd, Curator of Printed Literary Sources, 1801-1914 at the British Library]

After almost a year of gathering I’m pleased to announce that my ‘Curator’s Choice’ collection of websites relating to 19th century English literature has now been published on the UK Web Archive.

As a curator of printed literary sources for the period 1801-1914 it doesn’t require a great leap of imagination to discover why I chose this particular topic. The collection is intended to reflect the diverse interests in the genre that are substantiated on the web. Opinions about, and interpretations of 19th century literature and its authors are constantly evolving and I hope that this resource contextualises these important scholarly and cultural changes.

The sites included so far display a broad and eclectic array of subject matters – ranging from author societies to museums; from literary adaptations to academic syllabi. 19th century literature is still hugely popular and attracts a wide audience. Given the massive interest in the likes of Jane Austen and Charles Dickens, I initially thought I would concentrate on lesser-known authors, and on literature that has grown somewhat obscure in the intervening years. This ultimately isn’t how the collection has evolved – sometimes because many of the more niche sites are published without giving any administrator contact details (so permission cannot be sought to archive the site). In other cases, the owners have not responded to permission requests – often because they have cast the sites off into the vast ‘webosphere’ to fend for themselves.

As someone who works with 19th century printed ephemera on a regular basis I found this exercise particularly fascinating. Pertinent comparisons can be drawn between the ephemeral items that are published on the web and those that were printed in the 19th century. A great deal of the ephemeral literature produced in the 19th century has survived to this day (albeit in a fragile state) – either through luck or thanks to collectors with foresight. Given the transient and contributory nature of the web, there is a great danger that similar items produced in electronic formats may not be so lucky – hence the reason the Web Archive is so vital. Hopefully my 22nd century counterpart will thank me for choosing to preserve for posterity some of the more marginal, fleeting and subjective sites available relating to the genre!

Now it’s available for all to see, I hope that others will recommend sites that they think would complement the theme and help to create a lasting snapshot of 19th century literary scholarship in the 21st century. Do get in touch via this blog, or @UKWebArchive on Twitter.

[Image by anna_t, Creative Commons BY-NC-SA]

12 February 2013

What’s in a name ? Domain names and website longevity

I wrote about how to make websites more archivable in a previous post. Having websites archived and making an effort to make them “archive-friendly” are both good steps which can help increase their longevity. This blog post is about domain names: the name by which your website is known and the address which identifies it on the Web.

To obtain a domain name, you need to pay an annual fee to a registrar for the right to use it. The rented nature of domain names means that they are not permanent, and the same domain name could host completely different content at different times if it changes hands.

When planning the take-down or replacement of a website, the question of what to do with the domain name requires some thought. As well as being relevant to record-keeping, it is an important part of (business) continuity.

In most cases the existing domain name is used to host the new version of the website. This is usually the right thing to do – users expect it and (if you chose the right one) a domain name often becomes a part of the identity of the website and/or the brand. Unless there are good reasons to switch to a new one, most domain names are kept when changing websites. Many websites also provide users with the option to view historical versions of the website by linking to a web archive or putting in place a landing page which points to old versions as well as new.

When a website is taken out of service, keeping the domain name and redirecting it to the archival version is also an option. Retaining the domain name incurs a small charge, but this is much less than the cost of hosting and technical support needed to keep a website live. The advantage of this approach is seamless continuity: users are automatically referred to an archival version of the website without having to be aware of the existence of the web archive. For example, www.oneandother.co.uk, the domain name of the One and Other Project, featuring artist Antony Gormley’s commission for Trafalgar Square’s ‘empty’ fourth plinth in July 2009, points directly to the archival version in the UK Web Archive. Users can type the same web address or click on a link as they used to do and get to the website, despite the fact that it disappeared from the live web years ago.
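For site owners wondering what such a redirect involves, here is a minimal sketch in Python, using only the standard library. It answers every request to the retired domain with a permanent (301) redirect to the corresponding archived page. The archive prefix and the site address are placeholders of my own, not the UK Web Archive's actual address scheme, and in practice the same effect is often achieved with a single rule in the web server configuration rather than a dedicated program.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder values (assumptions): substitute the address your web archive
# gives you and the address of the site being retired.
ARCHIVE_PREFIX = "https://archive.example.org/snapshot/"
RETIRED_SITE = "http://www.example.co.uk"


def archive_url(path: str) -> str:
    """Map a path on the retired site to the address of its archived copy."""
    return ARCHIVE_PREFIX + RETIRED_SITE + path


class RedirectToArchive(BaseHTTPRequestHandler):
    """Answer every GET or HEAD request with a permanent redirect."""

    def do_GET(self):
        self.send_response(301)  # 301 = "moved permanently"
        self.send_header("Location", archive_url(self.path))
        self.end_headers()

    do_HEAD = do_GET


def serve(port: int = 8080) -> None:
    """Run the redirector until interrupted (not called automatically)."""
    HTTPServer(("", port), RedirectToArchive).serve_forever()
```

Calling `serve()` starts the redirector; a visitor asking for `/about` on the old domain would then be sent to `archive_url("/about")`.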

Keeping the domain name may not be the right solution for everyone but it’s a possibility well worth considering.

Helen Hockx-Yu

[Image courtesy of Roberto Zingales, Creative Commons CC-BY 2.0, via Flickr]

07 February 2013

Archiving social media: a workshop report

I was very pleased to be invited to a recent workshop on social media archiving. It was organised by Laura Lannin and colleagues at the Museum of London, to whom many thanks for a wide-ranging and stimulating afternoon.

The day saw a cluster of diverse and useful presentations. Among them was our very own Helen Hockx-Yu, on the potential and problems relating to social media archiving on a national scale, as we experience them at the UK Web Archive. Web archiving is always a technological arms race, with the archiving technologies having to adapt constantly as the way the web works continues to change.

The other presentations between them showed the wide variety of perspectives from which the whole issue needs to be approached. Two projects examined the way in which Twitter can be used as a means of identifying content on the wider web that should be preserved, as well as an archive resource in itself. Both projects came from within specialist museums, and both were concerned with the Olympics. The Victoria and Albert Museum (represented by Catherine Flood) had monitored Twitter to identify graphically significant visual resources, shared on Flickr as the Collect London 2012 collection. The Museum of London (in partnership with Peter Ride of the University of Westminster) had gone a step further, bringing together a team of Citizen Curators to keep eyes and ears open during the Games for important resources, and to identify them by means of the Twitter hashtag #citizencurators for later harvesting.

In contrast, Ruth Page (University of Leicester, or @ruthtweetpage) gave us the perspective of a linguist interested in the analysis of large corpora of tweets, for the patterns of language usage within them. And although there was not a presentation from this perspective, several of those present were responsible for social media engagement between museums and their users, and are faced with working out how best to archive their own social media output.

In a previous post, Nicola Johnson reported on the difficulties of implementing web archiving activity in national libraries charged with archiving the web outside their own walls. This workshop neatly showed the different concerns of a wider group of interested parties. Whether it is national libraries, museums or users; whether it is social media content itself or the other resources they link to, there is much to think about when it comes to social media archiving. 

Peter Webster (@pj_webster)

30 January 2013

Surfing the web in time: Mementos

Have you ever needed to see a copy of a now-lost website, and didn't know where to start? Help is at hand, with Mementos.

The Memento protocol has been around for a while (since 2009). It's a way of adding a time dimension to our common HTTP-based way of browsing the web, and has been available as a plug-in for Firefox. (See mementoweb.org for details.)
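To give a flavour of how that time dimension works: as I understand the protocol, a Memento-aware client asks a so-called TimeGate for a page and adds an `Accept-Datetime` header stating the date it is interested in; the TimeGate then points the client to the archived version closest to that date. The short Python sketch below only builds that header, and makes no network request. The function name and the TimeGate address in the comment are illustrative assumptions, not part of any real service.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Build the request header asking for a page as it was at `when`.

    `when` should be a timezone-aware datetime; it is converted to GMT
    and written in the usual HTTP date format.
    """
    in_gmt = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(in_gmt, usegmt=True)}


# Example: ask for a page as it stood at noon on 6 July 2009.
headers = accept_datetime_header(datetime(2009, 7, 6, 12, 0, tzinfo=timezone.utc))

# These headers would then accompany a request to a TimeGate, for instance
# (hypothetical address):
#   https://timegate.example.org/http://www.bbc.co.uk/
```

The browser plug-in and the web-based Mementos service spare users this step, but the same header is what a script would send.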

On the UKWA site, we have launched an alternative web-based way of delivering Memento, without needing to amend your browser. Mementos allows you to search across multiple web archives around the world at once, which is particularly helpful if you don't know which territorial web archive is most likely to hold a site. It gives a breakdown of how many versions each archive holds, and from when, and leads users through to the archived versions themselves.

Get started with the search page; or, see it in action for the Google homepage (over 4,000 snapshots in four archives since 1999) and the BBC homepage (more than 5,000, in five archives, since 1996).

For those interested in the detailed workings and in reusing the web client, the source code is hosted on GitHub.

Peter Webster

 

24 January 2013

Web archiving: how to fit it in ? A workshop report

[A report by Nicola Johnson, Web Archivist at the British Library]

I attended a workshop, “How to fit in – integrating a web archiving program in your organization”, at the Bibliothèque nationale de France (BnF) in Paris, 26th – 30th November 2012. It was sponsored by the International Internet Preservation Consortium (IIPC).

The workshop was intended for curators, archivists and managers involved in (or about to embark on) web archiving at their institutions. The BnF has been archiving websites since late 1999 and has a vast amount of expertise. France was an early adopter of legal deposit for websites, with legislation in August 2006 meaning that websites from the French national domain can be collected by the BnF for preservation and public use. I was particularly interested in the transition that they have made to this large-scale operation, as Legal Deposit legislation is expected in the UK this April and we will have the task of integrating large scale archiving with our current selective undertaking.

Several IIPC member organisations attended the workshop, hosted in one of the four ‘towers of open books’ at the BnF’s main site. The François Mitterrand building was one of the grands projets of the former president and is one of the largest and most modern libraries in the world. Participants included the British Library and the national libraries of Germany, Slovenia, Estonia, Spain and the Netherlands. Also represented were the Bavarian State Library, the California Digital Library, the National Library and Archives of Quebec, the Bibliotheca Alexandrina and the Library of Congress. Participants represented a range of experience in web archiving and were at different stages of national legal deposit legislation.

A wide range of topics were covered, including the integration of web archiving in acquisition practices; the role of subject librarians in selecting websites; and how web collections should align with general collection development policies. As the business of web archiving involves several parts of a library, we also heard representatives of various departments at the BnF speak of their role, including IT, conservation, legal deposit, collections co-ordination and digital and bibliographic information. There were subject specialists from the music, literature and art departments, who spoke about their collection development policies and how to incentivise staff to select websites when they have a multitude of other duties to perform. Given my role as Web Archivist I was particularly interested in the role of the 70 or so curators or “recommending officers” who select websites for the focussed crawls undertaken by the BnF.

A presentation was also made by the Internet Memory Foundation, a non-profit institution based in Amsterdam and Paris. The foundation provides a shared platform for institutions to collect websites and is archiving dozens of terabytes of data every month. They are also involved in various research projects with institutions and are developing a new crawler and architecture for web-scale crawling. Later in the week we also had the opportunity to visit the National Audiovisual Institute (INA), a repository containing 70 years of French radio programmes and 60 years of TV. The INA shares responsibility for collecting legal deposit online content with the BnF and began collecting broadcast-related websites in February 2009. It holds approximately 10,000 websites, employing multiple crawlers for different types of content. Access is available at six sites in France, and some material under open licence is also available online.

Our hosts succeeded in creating an atmosphere that was relaxed and stimulating (see the pictures); a great many ideas were exchanged and the commonality of purpose among the participants was encouraging. I have returned to work with a renewed vigour and positivity towards web archiving and, judging by their messages since the event, so have the other participants. Positive changes are being made in our respective institutions as a result of the workshop.

[Image of the BnF (Creative Commons BY-NC-SA) from Images et Voyages]

21 January 2013

What could you do with an archive of the UK web, 1996-2010 ?

The Analytical Access to the Domain Dark Archive (AADDA) project has brought together a group of scholars to help us formulate which analytical tools users will need to make the most of the JISC UK Web Domain Dataset, a dataset of all the holdings of the Internet Archive for the UK from 1996 to 2010.

A (very large) geo-index of the data is already available for download, and the dataset can also be visualised using the Ngram. But this group of scholars of the humanities and social sciences are beginning to imagine the projects they would like to pursue using the data. I myself began to sketch an answer in a previous post on the AADDA blog.

Since then, summaries of those projects have been appearing on the project blog. Here are some of them.

(i) Dr Richard Deswarte will be exploring and uncovering Euroscepticism in the Dark Archive.

(ii) Saskia Huc-Hepher (University of Westminster) will be exploring the spatial dimensions of the French community in London.

(iii) Professor Gemma Moss (Institute of Education) will be examining the use of statistical data in setting agendas for education change, and the PISA rankings in particular.

(iv) Carole Taylor is investigating the decline of parliamentary political engagement and its implications.

(v) Helen Taylor (Royal Holloway, University of London) will be examining the reception of the Liverpool poets.

Watch out for more posts here on this project as it unfolds. It is a collaboration between ourselves at the British Library, the Institute of Historical Research (University of London) and the University of Cambridge, and is funded by the JISC.

Creative Commons image courtesy of Wikimedia Commons.