THE BRITISH LIBRARY

UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

03 July 2015

What is a Web Archive? (in less than 3 mins)

You may have heard of the term 'Web Archiving' but what is it and why is it important that the UK Legal Deposit libraries support this? This short video is a good start:

 

What does the UK Web Archive collect?
What can you expect to find, and where might you go to access the three collections that the UK Web Archive holds?

These videos were produced as part of the AHRC-funded 'Big UK Domain Data for the Arts and Humanities' project.

26 June 2015

Ten years of archiving the web - A reflective blog post by Nicola Bingham, Web Archivist

It is ten years since the UK Web Archiving Consortium, the precursor to the UK Web Archive, launched one of the world’s first openly accessible web archives. It is therefore a fitting time to look up from the crawl logs and reflect on what we have achieved over the past decade.

Early Years

In the late 1990s and early 2000s memory institutions around the world started to turn their attention to preserving the enormous amount of highly ephemeral material on the emerging World Wide Web. The earliest exploration around web archiving at the British Library was in 2001-2 with the “UK Domain project”, a pilot study to explore the feasibility of archiving around 100 websites. When the Legal Deposit Libraries Act was passed in 2003 it meant that the Library could plan to scale up its operations to archive the whole UK web domain - once further enabling legislation had been put into place. We did not realise at the time that this process would take a further ten years!

In order to put in place the systems, people and policies to carry out web archiving, the Web Archiving Programme was launched in 2003. The Programme’s aims were to “enable the British Library to become the first point of resort for anyone who wants to access a comprehensive archive of material from the UK web domain.”

In order to realise these ambitious goals the Library joined the UK Web Archiving Consortium in 2003 along with five other partners (the National Archives, National Library of Wales, National Library of Scotland, JISC and the Wellcome Trust). To the best of the partners’ knowledge there were no other UK institutions working in this way to archive the UK Web. The achievements of the Consortium were summarised in the final project report:

On “…..strategic and operational levels, the Consortium has been successful in addressing, in a shared and collaborative manner, a range of legal, technical, operational, collection development and management issues relating to web archiving. UKWAC has laid the foundations of a national web archiving strategy and a shared technical infrastructure for the United Kingdom (UK) and has prepared the ground for future development, taking into account the need to prepare for forthcoming secondary legislation associated with the Legal Deposit Libraries Act 2003 and the extension of legal deposit to non-print materials including websites.”

The author of this post joined the web archiving team in January 2005, just a few days after the Indian Ocean Tsunami. Our web archiving operations were on a much smaller scale than they are today, with websites archived selectively and only with the express permission of site publishers. The permissions process took up much of the team’s time and resources. Web archiving tools were still very much in development. Crawling was carried out with the PANDORA web archiving system developed by the National Library of Australia, using infrastructure shared by the UKWAC. One of the many issues facing the Consortium partners in these early days was manually controlling load balancing on the system at periods of high-intensity crawling. This was done with a traffic light system: green meant a crawl could be initiated; red meant go away and have a cup of tea while the crawl backlog cleared.

New forms of content

In addition to the technical aspects of web crawling, the Library was getting to grips with cataloguing, describing and providing access to completely new forms of publication. We explored the fact that “many new ways of communicating are entirely web based; Blogs, Wikis, MySpace and YouTube”. We established a special collection of “Blogs” to reflect the fact that this new and exciting format coincided with the advent of web publishing tools that facilitated the posting of content to the web by non-technical users.

In terms of access, the focus was initially on presenting website snapshots as documents categorised according to traditional library subject taxonomies or as special collections. One of the first collections we published was in response to the Indian Ocean Tsunami of Boxing Day 2004.

First harvests

The UK Web Archive went live on 9 May 2005. Figures from 20 March 2006 reveal that 1172 titles and 3641 instances were accessible. Some of the first websites we archived were:

Pathways to the past (TNA) http://www.webarchive.org.uk/ukwa/target/99634/source/search

Y Gwleidydd (The Politician) (NLW) http://www.webarchive.org.uk/ukwa/target/101915/source/search

Listening Room (BL) http://www.webarchive.org.uk/ukwa/target/101989/source/search (typical of the time: frames, coloured text on a black background, and a message board)

Arthurlloyd music hall (BL) http://www.webarchive.org.uk/ukwa/target/102127/source/search

trAce (BL) online writing centre http://www.webarchive.org.uk/ukwa/target/102190/source/search

Menna Elfyn (NLW), Welsh poet (we also hold contemporary copies) http://www.webarchive.org.uk/ukwa/target/103764/source/search

BioCentre – Centre for Bioethics and Public Policy (Wellcome) http://www.webarchive.org.uk/ukwa/target/103792/source/search

Social Care Institute for Excellence (BL) http://www.webarchive.org.uk/ukwa/target/101868/source/search

http://www.webarchive.org.uk/ukwa/target/102148/source/search

Glasgow Effective Records Management Project (JISC) http://www.webarchive.org.uk/ukwa/target/99707/source/search

Churches Together in England (BL) http://www.webarchive.org.uk/ukwa/target/101767/source/search (has good spread of instances)

The earliest material we hold is the first version of the BL’s website “Portico” which was reconstructed from files stored on the BL’s servers.

Big Data

In 2015 the scale of our web archiving activity has grown from thousands to millions of websites and billions of URLs. As researchers look beyond text as the object of study, we no longer take a document-focussed approach. The kind of distant reading that we hope to provide will allow researchers to explore patterns of change, geolocation, linked networks and entities: the so-called Big Data approach.

Collaboration

In 2008 we began engaging seriously with researchers, initially inviting them to curate collections within their own areas of expertise and later focussing on presenting the archive as a dataset for study. The UK domain dataset (1996-2013), acquired by JISC from the Internet Archive, has been made available to researchers to experiment with query building, corpus formation and handling. Projects carried out with this data include the Analytical Access to the Domain Dark Archive, led by the Institute of Historical Research in partnership with the British Library and the University of Cambridge. The JISC dataset, along with the Open UK Web Archive, was used by Jules Mataly (University of Amsterdam) for his thesis, The Three Truths of Mrs Thatcher, completed in 2013. Dr Rainer Simon explored how the UK web was linked in 1996 using the 1996 portion of the JISC dataset. Going forward, we will use our experience of working with researchers to influence how we archive, store and present web archives so that they integrate fully with scholars’ workflows.

Having been launched as a pilot programme little more than a decade ago, web archiving is now a key aspect of the British Library’s Strategic Priorities and is considered, in corporate terms, a ‘business as usual’ activity. In ‘Living Knowledge’, the publication which lays out the key strategic priorities for the Library on its journey to its 50th anniversary in 2023, Roly Keating, Chief Executive, states that our partnership with the National Libraries of Scotland and Wales and the libraries of the Universities of Oxford and Cambridge and Trinity College Dublin “lies at the heart of our single greatest endeavour in digital custodianship, the comprehensive collecting under Legal Deposit of the UK and Ireland’s output of born-digital content, including the archiving of the entire UK web.”

What have we achieved?

In ten years we have come a long way.

  • Legal Deposit Legislation enacted in 2003 and enabled in 2013, allowing the Legal Deposit Libraries to comprehensively archive the UK web space.
  • Tools and infrastructure. The implementation of state-of-the-art crawling and indexing technologies enabling ingest of and access to archived material. A bespoke annotation and curation tool which allows non-technical curators to harvest and describe the web, as well as build their own collections.
  • A publicly accessible web archive. The Open UK Web Archive is one of the few web archives in the world offering full-text search. Our open, permissions-based web archive has 15,000 websites and 68,000 snapshots, serving as a window onto our larger legal deposit collection, which is discoverable through the British Library’s online catalogue and searchable by keyword in the reading rooms.
  • In excess of 60 special collections, including a decade’s worth of UK General Elections. Other collections include researcher-led collections on …; rapid response collections, e.g. the London bombings in 2005 and the Olympic Games 2012; and women’s issues.
  • Over eight billion resources and over 160 TB of compressed data (comprising the Open Archive since 2004, the Legal Deposit Archive since 2013 and the JISC Historical Archive 1996-2013).
  • Successful collaborative relationships with the global archiving community: the International Internet Preservation Consortium (IIPC) and partners across the UK and beyond. The IIPC had 12 founding members in 2003 and has nearly 50 members in 2015.

How has my job as a web archivist changed over the past ten years? The goals remain the same: to acquire, describe and preserve web content for the benefit of future generations of researchers. Obviously, though, the task is now much bigger. One of the things I enjoy most about my job is the juxtaposition of the micro and macro approach. On the one hand, the web archivist must consider the curation and development of the collection as a whole: millions of websites, billions of documents, hundreds of terabytes of data. On the other hand, a micro approach is required to ensure the collection is properly curated. This involves a particularly high level of forensic endeavour, for example in analysing a crawl log to determine why a particular object has not been picked up by the crawler. And on that note, I think it is time to go back to the crawl logs.

Nicola Bingham, Web Archivist

19 June 2015

RESAW Conference – showcasing research of the historical web

RESAW, a self-organising initiative aimed at building a pan-European research infrastructure for the study of web archives, has been active for a couple of years now.  A group of active researchers from Europe and North America have gathered around this network. They met last week in Aarhus, Denmark and presented their work at the RESAW Conference, entitled “Web Archives as Scholarly Sources: Issues, Practices and Perspectives”. The diverse approaches and findings I witnessed reflect the increasing awareness and understanding of the characteristics of archived web material, and the development of appropriate research methods to study it.

Packed programme

The conference had a packed programme, with parallel sessions running on all three days. In addition to 3 plenary sessions, it included 10 long papers, 12 short papers, 4 themed panels and 1 workshop. The format was refreshing and worked really well in bringing forward different perspectives: presentations were kept strictly to fixed time, while each paper received structured comments, followed by questions and discussion with the audience. The only downside was the hard choices one had to make, deciding where to go when a number of interesting papers were on at the same time.

Meghan Dougherty of Loyola University, one of the very first researchers working with web archives, called for a more inclusive approach to archiving the web in her opening keynote. Instead of preserving the web as a series of linked documents, the focus should be on its rich complexity as new media, including interactions, expectations, and how people live through and experience it. We otherwise risk excluding many features of today’s live web experience which will be valuable for future researchers. Recognising the lack of good methodology for studying the historical web, Meghan observed the relevance of archaeological methods and practices: they can help recover, document and analyse a record of information culture through virtual digging, taking into account the invisible and missing elements in the process. She also asked archivists to reach out to researchers, and researchers to collaborate more, so that specialist knowledge and skills can be joined up.

Netarkivet

Aarhus University is also the home of the Danish State and University Library, which has been collaborating with the Royal Danish Library to archive the Danish Web since 2005. The conference coincided with the ten-year anniversary of the National Danish Web Archive, which now contains 600TB of data. Ditte Laursen and Per Møldrup-Dalum presented an overview of the Danish Web and shared the various legal, curatorial, technical and access challenges the Archive had to face and address. A key one is the identification and collection of Danish material hosted on non-.dk domains, a challenge shared by many national web archives. After focusing on comprehensive data collection, access and use are now high on the agenda. The Archive launched full-text search on the anniversary and there are exciting plans to actively develop data mining and analytics, and to strengthen collaboration with researchers.

It is no surprise that historians were among the first to use web archives to study contemporary history. There was a strong presence of historians at the Conference who explored diverse aspects of the historical web. Ian Milligan of the University of Waterloo used the GeoCities Web Archive to explore the nature of virtual communities, highlighting the technical challenges and how critical overcoming these is to the historiography of the early web. Peter Webster studied British creationism in the historical UK Web Archive by analysing the creationist web estate and high-level patterning of host-to-host linkage, concluding that, in addition to being marginalised, British creationism was mostly ignored by academia, the media and the churches. Sophie Gebeil presented a history of North African immigration memories through the French Web archives. A number of the researchers attached to the Big UK Domain Data for the Arts and Humanities project also presented their work, sharing their methodological frustrations and highlighting the challenges large-scale web archives face in supporting qualitative research.
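
To give a flavour of what looking at host-to-host linkage can involve, here is a minimal Python sketch. It is not Peter Webster's method or code. It simply assumes that links have already been extracted from archived pages as (source URL, target URL) pairs, and the function name host_links and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_links(page_links):
    """Collapse page-to-page links into counts of host-to-host links."""
    counts = Counter()
    for source_url, target_url in page_links:
        source_host = urlsplit(source_url).hostname
        target_host = urlsplit(target_url).hostname
        # Skip unparseable URLs and links that stay within one host.
        if not source_host or not target_host or source_host == target_host:
            continue
        counts[(source_host, target_host)] += 1
    return counts


# Hypothetical sample data, not taken from any archive.
sample_links = [
    ("http://a.example.org.uk/page1", "http://b.example.ac.uk/"),
    ("http://a.example.org.uk/page2", "http://b.example.ac.uk/about"),
    ("http://a.example.org.uk/page2", "http://a.example.org.uk/page1"),
    ("http://c.example.co.uk/", "http://a.example.org.uk/page1"),
]

for (source, target), n in host_links(sample_links).most_common():
    print(source, "->", target, n)
```

Counting at host level rather than page level is what makes the patterning "high-level": millions of individual links reduce to a much smaller set of relationships between sites, which can then be inspected by hand or drawn as a network.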

Media scholars, social scientists, computer scientists and music and literature scholars are also using web archives. It is encouraging to see how aspects of the web other than “text” were explored by researchers, including software, programming languages, social networks and the earlier Bulletin Board System. Anne Helmond showed how to make use of the social media code snippets embedded in the archived source code to issue API calls to social network platforms and obtain the embedded content (currently not collected by web archives). Anat Ben-David presented an impressive effort in understanding and recovering the former .yu ccTLD, which has now disappeared from the web entirely. In both cases, I think there are things web archives can do to remove reliance on social networks and to surface content related to all expired ccTLDs.
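
As an illustration of the first step in that kind of work, the Python sketch below pulls tweet IDs out of embed snippets in a page's HTML. It is not Anne Helmond's code. It assumes the common embed markup in which a blockquote with the class "twitter-tweet" contains a link to the tweet's status URL; the class name EmbeddedTweetFinder and the sample HTML are hypothetical, and the later step of passing the IDs to a platform API is deliberately left out.

```python
import re
from html.parser import HTMLParser

# Matches status URLs such as https://twitter.com/<user>/status/<id>.
STATUS_URL = re.compile(r"https?://(?:www\.)?twitter\.com/\w+/status(?:es)?/(\d+)")


class EmbeddedTweetFinder(HTMLParser):
    """Collect tweet IDs from links inside <blockquote class="twitter-tweet">."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # nesting depth inside a matching blockquote
        self.tweet_ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "blockquote":
            classes = (attrs.get("class") or "").split()
            if self.depth or "twitter-tweet" in classes:
                self.depth += 1
        elif tag == "a" and self.depth:
            match = STATUS_URL.match(attrs.get("href") or "")
            if match:
                self.tweet_ids.append(match.group(1))

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth:
            self.depth -= 1


# Hypothetical archived page source.
archived_html = """
<p>Some archived page text.</p>
<blockquote class="twitter-tweet">
  <p>Hypothetical tweet text</p>
  <a href="https://twitter.com/example_user/status/1234567890">19 June 2015</a>
</blockquote>
<a href="https://twitter.com/example_user/status/999">not an embed</a>
"""

finder = EmbeddedTweetFinder()
finder.feed(archived_html)
print(finder.tweet_ids)
```

Only the link inside the embed snippet is picked up; the ordinary link further down the page is ignored, since it does not represent embedded content missing from the archived copy.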

There was so much inspiring research, covering all aspects of the web, in ways we had not envisaged. Those interested in finding out more should follow the Storified tweets, put together by the Institute of Historical Research (IHR), University of London, which was also a co-organiser of the Conference. This is a significant change from the past, when providers of web archives had to speculate about possible use scenarios. I do not think we are short of use cases now. The RESAW conference has given us much evidence and food for thought. The next step is to collate, synthesize and extract high-level requirements out of these and use them to guide our development of tools and services.

As a proud co-organiser of the Conference, it was a delight to see work produced by the British Library Web Archiving Team being used by researchers and other web archives. We should however bear in mind that it is too early to settle on fixed methods of using web archives. We must try different approaches and continue the exploration and experiments to move forward.

The absolute highlight of the Conference was the announcement of the next RESAW Conference in London, to be sponsored and organised by the IHR.  Hopefully RESAW will become an on-going platform for showcasing more research on the historical web, carried out by more researchers including those from the less privileged countries and regions. 

 

Helen Hockx-Yu, Head of Web Archiving