Ten years of archiving the web - A reflective blog post by Nicola Bingham, Web Archivist
It is ten years since the UK Web Archiving Consortium, the precursor to the UK Web Archive, launched one of the world’s first openly accessible web archives. It is therefore a fitting time to look up from the crawl logs and reflect on what we have achieved over the past decade.
Early Years
In the late 1990s and early 2000s, memory institutions around the world started to turn their attention to preserving the enormous amount of highly ephemeral material on the emerging World Wide Web. The earliest exploration of web archiving at the British Library was in 2001-02 with the “UK Domain project”, a pilot study to explore the feasibility of archiving around 100 websites. The passing of the Legal Deposit Libraries Act in 2003 meant that the Library could plan to scale up its operations to archive the whole UK web domain, once further enabling legislation had been put in place. We did not realise at the time that this process would take a further ten years!
The Web Archiving Programme was launched in 2003 to put in place the systems, people and policies needed to carry out web archiving. The Programme’s aims were to “enable the British Library to become the first point of resort for anyone who wants to access a comprehensive archive of material from the UK web domain.”
To realise these ambitious goals the Library joined the UK Web Archiving Consortium in 2003 along with five other partners (the National Archives, the National Library of Wales, the National Library of Scotland, JISC and the Wellcome Trust). To the best of the partners’ knowledge, no other UK institutions were working in this way to archive the UK web. The achievements of the Consortium were summarised in the final project report:
On “…strategic and operational levels, the Consortium has been successful in addressing, in a shared and collaborative manner, a range of legal, technical, operational, collection development and management issues relating to web archiving. UKWAC has laid the foundations of a national web archiving strategy and a shared technical infrastructure for the United Kingdom (UK) and has prepared the ground for future development, taking into account the need to prepare for forthcoming secondary legislation associated with the Legal Deposit Libraries Act 2003 and the extension of legal deposit to non-print materials including websites.”
The author of this post joined the web archiving team in January 2005, just a few days after the Indian Ocean Tsunami. Our web archiving operations were on a much smaller scale than they are today: websites were archived selectively, with the express permission of site publishers, and the permissions process took up much of the team’s time and resources. Web archiving tools were still very much in development. Crawling was carried out with the PANDORA web archiving system, developed by the National Library of Australia, using infrastructure shared by the UKWAC. One of the many challenges facing the Consortium partners in these early days was manually balancing load on the system during periods of high-intensity crawling, using a traffic-light system: green meant a crawl could be initiated; red meant go away and have a cup of tea while the crawl backlog cleared.
New forms of content
In addition to the technical aspects of web crawling, the Library was getting to grips with cataloguing, describing and providing access to completely new forms of publication. We noted at the time that “many new ways of communicating are entirely web based; Blogs, Wikis, MySpace and YouTube”. We established a special collection of blogs to reflect the fact that this new and exciting format coincided with the advent of web publishing tools that allowed non-technical users to post content to the web.
In terms of access, the focus was initially on presenting website snapshots as documents categorised according to traditional library subject taxonomies or as special collections. One of the first collections we published was in response to the Indian Ocean Tsunami of Boxing Day 2004.
First harvests
The UK Web Archive went live on 9 May 2005. Figures from 20 March 2006 show that 1,172 titles and 3,641 instances were accessible. Some of the first websites we archived were:
- Pathways to the Past (TNA): http://www.webarchive.org.uk/ukwa/target/99634/source/search
- Y Gwleidydd (The Politician) (NLW): http://www.webarchive.org.uk/ukwa/target/101915/source/search
- Listening Room (BL): http://www.webarchive.org.uk/ukwa/target/101989/source/search (typical of the time, with frames, coloured text on a black background and a message board)
- Arthur Lloyd music hall (BL): http://www.webarchive.org.uk/ukwa/target/102127/source/search
- trAce online writing centre (BL): http://www.webarchive.org.uk/ukwa/target/102190/source/search
- Menna Elfyn, Welsh poet (NLW): http://www.webarchive.org.uk/ukwa/target/103764/source/search (contemporary copies are also held)
- BioCentre – Centre for Bioethics and Public Policy (Wellcome): http://www.webarchive.org.uk/ukwa/target/103792/source/search
- Social Care Institute for Excellence (BL): http://www.webarchive.org.uk/ukwa/target/101868/source/search
- http://www.webarchive.org.uk/ukwa/target/102148/source/search
- Glasgow Effective Records Management Project (JISC): http://www.webarchive.org.uk/ukwa/target/99707/source/search
- Churches Together in England (BL): http://www.webarchive.org.uk/ukwa/target/101767/source/search (a good spread of instances)
The earliest material we hold is the first version of the BL’s own website, “Portico”, which was reconstructed from files stored on the BL’s servers.
Big Data
By 2015 the scale of our web archiving activity has grown from thousands of websites to millions of websites and billions of URLs. As researchers look beyond text as the object of study, we no longer take a purely document-focussed approach. The kind of distant reading that we hope to support will allow researchers to explore patterns of change, geolocation, link networks and entities: the so-called Big Data approach.
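By way of illustration, the sketch below counts archived captures per host and year from a CDX-style index, the common index format for WARC collections, which is one simple way to surface patterns of change across a domain. It is a minimal example under stated assumptions (a space-separated CDX file named index.cdx, with timestamp and original URL in the usual second and third positions), not a description of our production tooling.

```python
from collections import Counter
from urllib.parse import urlparse

def captures_per_host_per_year(cdx_path):
    """Count archived captures by host and year to surface patterns of change.

    Assumes a CDX-style index whose space-separated fields start with:
    urlkey, timestamp (YYYYMMDDhhmmss...), original URL.
    """
    counts = Counter()
    with open(cdx_path, encoding="utf-8", errors="replace") as cdx:
        for line in cdx:
            fields = line.split()
            # Skip the header line and any malformed rows.
            if len(fields) < 3 or not fields[1][:4].isdigit():
                continue
            year = fields[1][:4]
            host = urlparse(fields[2]).netloc
            counts[(host, year)] += 1
    return counts

if __name__ == "__main__":
    # index.cdx is a hypothetical local file for this example.
    for (host, year), n in sorted(captures_per_host_per_year("index.cdx").items()):
        print(f"{year}  {host}  {n} captures")
```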
Collaboration
In 2008 we began engaging seriously with researchers, initially inviting them to curate collections within their own areas of expertise and later focussing on presenting the archive as a dataset for study. The UK domain dataset (1996-2013), acquired by JISC from the Internet Archive, has been made available to researchers to experiment with query building, corpus formation and handling. Projects carried out with this data include the Analytical Access to the Domain Dark Archive project, led by the Institute of Historical Research in partnership with the British Library and the University of Cambridge. The JISC dataset, along with the Open UK Web Archive, was used by Jules Mataly (University of Amsterdam) for his thesis, The Three Truths of Mrs Thatcher, completed in 2013. Dr Rainer Simon explored how the UK web was linked in 1996, using the 1996 portion of the JISC dataset. Going forward, we will use our experience of working with researchers to influence how we archive, store and present web archives so that they fully integrate with scholars’ workflows.
Having been launched as a pilot Programme little more than a decade ago, web archiving is now a key aspect of the British Library’s Strategic Priorities and is considered in corporate terms a ‘business as usual’ activity. In ‘Living Knowledge’, the publication which lays out the key strategic priorities for the Library on its journey to its 50th anniversary in 2023, Roly Keating, Chief Executive, states that our partnership with the National Libraries of Scotland and Wales, and the libraries of the Universities of Oxford and Cambridge and of Trinity College Dublin, “lies at the heart of our single greatest endeavour in digital custodianship, the comprehensive collecting under Legal Deposit of the UK and Ireland’s output of born-digital content, including the archiving of the entire UK web.”
What have we achieved?
In ten years we have come a long way.
- Legal deposit legislation, enacted in 2003 and brought into force in 2013, allowing the legal deposit libraries to comprehensively archive the UK web space.
- Tools and infrastructure: the implementation of state-of-the-art crawling and indexing technologies, enabling ingest of, and access to, archived material, together with a bespoke annotation and curation tool which allows non-technical curators to harvest and describe the web, as well as build their own collections.
- A publicly accessible web archive. The Open UK Web Archive is one of the few web archives in the world to offer full-text search (a sketch of such a query appears after this list). Our open, permissions-based web archive contains 15,000 websites and 68,000 snapshots, and serves as a window onto our much larger legal deposit collection, which is discoverable through the British Library’s online catalogue and searchable by keyword in the reading rooms.
- In excess of 60 special collections, including a decade’s worth of UK General Elections, researcher-led collections, rapid-response collections (e.g. the London bombings in 2005 and the 2012 Olympic Games) and collections on women’s issues.
- Over eight billion resources and over 160 TB of compressed data (comprising the Open Archive since 2004, the Legal Deposit Archive since 2013 and the JISC Historical Archive, 1996-2013).
- Successful collaborative relationships with the global web archiving community, through the International Internet Preservation Consortium (IIPC) and partners across the UK and beyond. The IIPC had 12 founding members in 2003 and, in 2015, has nearly 50.
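As promised above, here is a minimal sketch of what a keyword query against a full-text index over an archived web collection might look like. It assumes a Solr-style index of the kind commonly built over WARC collections; the endpoint, core name and field names (content, title, url, crawl_date) are illustrative assumptions, not the UK Web Archive’s actual API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical local Solr endpoint for a full-text index of archived pages.
SOLR_URL = "http://localhost:8983/solr/webarchive/select"

def keyword_search(query, rows=10):
    """Return (title, url, crawl_date) tuples for snapshots matching a keyword."""
    params = urlencode({"q": f'content:"{query}"', "rows": rows, "wt": "json"})
    with urlopen(f"{SOLR_URL}?{params}") as response:
        docs = json.load(response)["response"]["docs"]
    return [(d.get("title"), d.get("url"), d.get("crawl_date")) for d in docs]

if __name__ == "__main__":
    for title, url, date in keyword_search("general election"):
        print(f"{date}  {title}  {url}")
```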
How has my job as a web archivist changed over the past ten years? The goals remain the same: to acquire, describe and preserve web content for the benefit of future generations of researchers. Obviously, though, the task is now much bigger. One of the things I enjoy most about my job is the juxtaposition of the micro and the macro. On the one hand, the web archivist must consider the curation and development of the collection as a whole: millions of websites, billions of documents, hundreds of terabytes of data. On the other hand, a micro approach is required to ensure the collection is properly curated. This can involve a high level of forensic endeavour, for example in analysing a crawl log to determine why a particular object has not been picked up by the crawler (a small example of this kind of log forensics is sketched below). And on that note, I think it is time to go back to the crawl logs.
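For the curious, the following is a minimal sketch of that kind of crawl-log forensics. It assumes a Heritrix-style crawl.log in which each space-separated line begins with a timestamp, fetch status, size, URI, discovery path and referrer (negative status codes indicate crawler-side failures rather than HTTP responses); the script and file names are illustrative.

```python
import sys

def trace_url(log_path, target):
    """Print every crawl-log entry for a URL: was it seen, and with what result?"""
    found = False
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            # Field order assumed: timestamp, status, size, URI, path, referrer.
            if len(fields) >= 6 and fields[3] == target:
                timestamp, status, _size, _uri, path, referrer = fields[:6]
                print(f"{timestamp}  status={status}  via={path}  from={referrer}")
                found = True
    if not found:
        print(f"{target} never appears in the log: it was probably never discovered.")

if __name__ == "__main__":
    # Usage: python trace_url.py crawl.log http://example.co.uk/missing.html
    trace_url(sys.argv[1], sys.argv[2])
```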
Nicola Bingham, Web Archivist