UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites


Introduction

News and views from the British Library's web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

11 March 2014

‘Vague, but exciting’ - #web25


When Tim Berners-Lee submitted a proposal in March 1989 for a "distributed hypertext system", his supervisor Mike Sendall commented: "Vague, but exciting". The Web is 25 years old today, no longer vague, still exciting.

We feel a sense of pride at being among those tasked with keeping a history of the Web. The British Library did not get involved in Web archiving until 2004, and our early efforts were selective, carried out under licence. Supported by the Legal Deposit Act and Regulations, we have been permitted to archive the UK Web at scale since April 2013. We completed our first domain crawl in June 2013, collecting 31TB of data from over 1.3 billion URLs. We are currently getting ready for our 2014 domain crawl, planned to take place in May.

It is interesting to pause on the 25th birthday of the Web and give some attention to the earliest instance of an archived website in the UK Web Archive. This happens to be a copy of the British Library website from 18 April 1995, not crawled from the live web at the time but recreated and reassembled in 2011 from files found on a server - I still vividly remember the day when a colleague delivered a dusty box filled with CDs to my office. The notes by the web archivist read as follows:

This is the earliest archival version of the British Library website, showing the Library's first explorations into hypertext and embedded images from collection material with links to larger images, sound files and further information. "Portico" was a brand or service name for the British Library website which was replaced a few years later. In 2011 zip files making up the website were discovered containing a testing copy of the Library's 1995 website. After decompressing the files, the resulting directory structure was used to create a representation of the original site's layout for ingest into the Web Archive. This representation does not include the complete dataset. Links to information hosted then on a Gopher server are broken. Gopher is a predecessor of and later an alternative to the World Wide Web.

To my (pleasant) surprise, the recording of a nightingale in Au file format, embedded on the page featuring John Keats' 'Ode to a Nightingale', played beautifully on my machine in Chrome, Firefox and Internet Explorer alike - I do wonder whether this qualifies as the earliest "tweet" on the Web?

[Image: the British Library's 1995 'Portico' website]

A Web archive not only contains historical copies of individual websites; viewed in its entirety, it also provides a bigger picture, supporting analysis and data mining that can reveal hitherto undiscovered patterns and trends. We blogged previously about Austrian researcher Rainer Simon's analysis and visualisation of the 1996 UK Web, using our UK Host-Level Link Graph (1996-2010) dataset. Our effort in data analytics will continue in the Big UK Domain Data for the Arts and Humanities project, funded by the Arts and Humanities Research Council to develop both a methodological framework and tools to support the analysis of the UK Web Archive by researchers in the arts and humanities. The project aims to deliver a major study of the history of the UK Web space from 1996 to 2013, covering language, file formats, the development of multimedia content, shifts in power and access, and so on.

Tim Berners-Lee, the World Wide Web Foundation and the World Wide Web Consortium are inviting everyone, everywhere to wish the Web a happy birthday using #web25. They have also joined forces to create webat25.org, a site where users can leave birthday greetings for the Web, view greetings from others and find out more about the Web’s history. 

Please join in.

Helen Hockx-Yu
Head of Web Archiving 

19 February 2014

Jorge Luis Borges and Twitter


[A guest post from writer and Museum Studies tutor Rebecca Reynolds]

When I first heard that the British Library was archiving every webpage with a .uk domain name, I immediately thought of Borges's short story Funes the Memorious, about a man who can forget nothing. 'I have more memories in myself alone than all men have had since the world was a world', Funes says; 'my memory, Sir, is like a garbage disposal'.

I spoke to Helen Hockx-Yu, Head of Web Archiving at the British Library, about this, focusing on Twitter pages. Will ephemera in such quantities be truly useful to researchers of the future?

Helen commented that this was up to researchers to decide but was clear that as many webpages as possible needed to be kept. 'When you research a person's life, or history, you don't have everything - you piece it together,' she said. 'Hopefully what we're doing would form part of those pieces.' She gave as an example Antony Gormley's 2009 One and Other art project, in which members of the public took turns to stand on the fourth plinth in Trafalgar Square and say whatever they wanted. The website recording these people is no longer available on the live web but is preserved in the UK Web Archive. For some websites, Helen said, 'being ephemeral is exactly their significance'.

And what about privacy? Would you like researchers of the future poring over one of your ill-considered blog posts or tweets? Webpages can be withdrawn only under certain circumstances such as defamation or breaches of confidentiality. Helen's advice here was simply to be careful what you put in the public domain.

I also spoke to Jonathan Fryer, Liberal Democrat Euro-candidate for London, two of whose Twitter pages have been put in a UK Web Archive collection devoted to blogs and bloggers. He thought archiving Twitter feeds was a good idea: 'Twitter has taken over from letters and other forms of exchange of information and ideas. Forms of communication such as blogs and Twitter need to be kept instead.'

[Image: Jonathan Fryer]

Back to Borges's story. The narrator doubts that Funes can think, despite his prodigious memory: 'To think is to forget a difference, to generalise, to abstract. In the overly replete world of Funes there were nothing but details, almost contiguous details.' Perhaps the Twittersphere is another 'overly replete' world. In any case, here are some 'contiguous details' from Jonathan Fryer's Twitter page in the archive. Which, if any, do you think might be worth keeping?

Just purged 8 American floozies from my followers. How do they get to latch onto one like limpets?

David Cameron is 'very relaxed' about Andy Coulson and allegations of bugging and blagging. He shouldn't be.

Went to see 'Bruno'; a real curate's egg, but two or three brilliant scenes.

Jonathan Fryer's Twitter page will appear in a book I am currently working on, exploring unusual museum objects from around the UK, using interviews with people from inside and outside museums. Other ephemera in the book are a 19th-century leaflet advertising a live mermaid from Reading University's Centre for Ephemera Studies, and toilet paper from The Land of Lost Content museum in Shropshire.

Rebecca Reynolds (Twitter: @rebrey)

07 February 2014

New research project: Big UK Domain Data for the Arts and Humanities


We are delighted to have been awarded Arts and Humanities Research Council funding for a new research project, ‘Big UK Domain Data for the Arts and Humanities’. The project, one of 21 funded under the AHRC’s Big Data Projects call, is led by the Institute of Historical Research (University of London), in collaboration with the British Library, the Oxford Internet Institute and Aarhus University.

Here are some details, from the project blog:

"The project aims to transform the way in which researchers in the arts and humanities engage with the archived web, focusing on data derived from the UK web domain crawl for the period 1996-2013. Web archives are an increasingly important resource for arts and humanities researchers, yet we have neither the expertise nor the tools to use them effectively. Both the data itself, totalling approximately 65 terabytes and constituting many billions of words, and the process of collection are poorly understood, and it is possible only to draw the broadest of conclusions from current analysis.

"A key objective of the project will be to develop a theoretical and methodological framework within which to study this data, which will be applicable to the much larger on-going UK domain crawl, as well as in other national contexts. Researchers will work with developers at the British Library to co-produce tools which will support their requirements, testing different methods and approaches. In addition, a major study of the history of UK web space from 1996 to 2013 will be complemented by a series of small research projects from a range of disciplines, for example contemporary history, literature, gender studies and material culture.


15 January 2014

RESAW: Research infrastructure for the Study of Archived Web materials


[Helen Hockx-Yu, Head of Web Archiving at the British Library, writes:]

Two scholars at Aarhus University, Denmark, Niels Brügger and Niels Ole Finnemann, organised a workshop in December for potential partners of RESAW, an initiative aimed at building a pan-European research infrastructure for the study of web archives. An important element of that infrastructure is the set of existing national web archives, often underpinned by legal frameworks such as legal deposit or copyright law but not fully available to the public. To make use of such archives, researchers have to be physically present at the archiving institutions' premises.

A research infrastructure, however, is more than a set of isolated national web archives with restricted access, often referred to as "dark archives". The goal is to find ways to link these together and offer seamless access to distributed web archives. The Mementos Service developed by the UK Web Archive, which allows discovery and delivery of archived web pages from multiple web archives, is a good example of how this could be done. Anat Ben David of the University of Amsterdam, associated with the WebArt project, presented impressive and promising search and visualisation approaches, which significantly improve access to large-scale, closed national web archives.
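For readers unfamiliar with Memento, the protocol (RFC 7089) lets a client ask a "TimeGate" for the capture of a URL closest to a desired date by sending an Accept-Datetime header. The sketch below is purely illustrative; the TimeGate address is a placeholder, not the actual endpoint of the UK Web Archive's Mementos Service.

    # Illustrative Memento (RFC 7089) lookup. TIMEGATE is a placeholder address,
    # not the real endpoint of the UK Web Archive's Mementos Service.
    import requests

    TIMEGATE = "https://example.org/timegate/"   # hypothetical aggregator
    target = "http://www.bl.uk/"
    headers = {"Accept-Datetime": "Tue, 18 Apr 1995 12:00:00 GMT"}

    # A TimeGate responds with a redirect to the memento closest to the requested date.
    resp = requests.get(TIMEGATE + target, headers=headers, allow_redirects=False)
    print(resp.status_code)                      # typically 302
    print(resp.headers.get("Location"))          # URL of the nearest archived copy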

Awareness and understanding of the characteristics of archived web material, and the development of appropriate research methods to study it, are equally indispensable elements of RESAW. It is not surprising, then, that in addition to a number of national web archives, researchers from universities and research institutions across Europe were strongly represented at the workshop. In his keynote, Niels Ole Finnemann analysed the particularities of archived web material against the context of the live web as well as in the study of other digital sources. He argued that the archived web is "re-born" digital content, and differs from the live web in many ways. RESAW does not have a particular disciplinary focus but aims to allow for all kinds of epistemological and methodological approaches, whether rooted in the sciences, the social sciences or the humanities.

I was honoured to present the perspectives of web archiving institutions, with a brief to focus on our interactions with scholars. I reported on our earlier work on scholarly feedback and highlighted the growing amount of interaction with scholars in recent years, with a number of research groups emerging that devote effort and attention to web archives. UK institutions among these include the Institute of Historical Research, based at the University of London, and the Oxford Internet Institute. Both have recently been funded by the Joint Information Systems Committee (JISC) to carry out research projects using web archives, in partnership with the British Library. A general trend with three phases can be observed with regard to scholarly interaction with web archives:

Phase 1: Building collections
Scholars are involved in scoping collections, selecting and describing websites relevant to their research interests. This effort has often resulted in the creation of specific, if sometimes narrow, topical collections.

Phase 2: Formulating research questions
This often takes the form of brainstorming sessions, workshops and projects, where researchers are made aware of web archives and asked: which research questions might web archives help you answer? This is a much more bilateral interaction and represents a shift of focus to web archives in their entirety. It suffers, however, from asking researchers to define the unknown, and is also time- and resource-intensive.

Phase 3: Independent use of web archives
This type of interaction has only just begun to emerge. It is the desired "go-to" state, where interfaces to web archives already meet the most common scholarly requirements and scholars are able to use web archives without depending on (personal) interactions with providers. This requires user interfaces to be self-explanatory, jargon-free and to carry baseline information about the archive: its scope, its coverage and lacunae, how it was collected, and how a particular website was crawled.

RESAW is aiming to apply for funding from the European Commission under the Horizon 2020 Framework. The workshop was an opportunity to identify issues and discuss a plan. It produced a list of work for RESAW to tackle, as well as the steps towards a funding application.

As one of the providers of the UK’s national web archive, we are pleased to be involved as we see RESAW as an important initiative which will help connect scholars with web archives and with each other in new ways.

11 December 2013

Political party web archives


There has been some news coverage in the last few weeks of the Conservative Party's decision to reorganise their website, removing an archive of speeches up to 2010. The original report appeared in Computer Weekly, and the story was subsequently picked up by media including The Guardian, the Financial Times and Channel 4 News. The ensuing debate contained a few factual inaccuracies, so we thought it worth blogging about archival copies of these pages, and of other UK political party content.

Firstly, the copies held by the Internet Archive (archive.org) were not erased or deleted - all that happened is that access to them was blocked. Because of the legal environment in which the Internet Archive operates, it has adopted a policy that allows websites to use robots.txt to control directly whether archived copies can be made available. The robots.txt protocol has no legal force, but observing it is part of good manners online: it asks that search engines and other web crawlers, such as those used by web archives, do not visit or index the pages concerned. The Internet Archive's policy extends the same courtesy to playback.
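As a purely illustrative sketch of how a polite crawler consults robots.txt before fetching a page, the following uses Python's standard urllib.robotparser; the site, user-agent name and path are made up.

    # Sketch of a polite crawler consulting robots.txt before fetching a page.
    # The site, user-agent name and path are invented for illustration.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.set_url("https://www.example.org.uk/robots.txt")
    parser.read()                                   # fetch and parse the rules

    url = "https://www.example.org.uk/speeches/2009/some-speech"
    if parser.can_fetch("ExampleArchiveBot", url):
        print("allowed to crawl:", url)
    else:
        print("disallowed by robots.txt:", url)     # a compliant crawler skips it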

At some point after the content in question was removed from the original website, the party added it to their robots.txt file. As the practice of the Internet Archive is to observe robots.txt retrospectively, it began to withhold its copies, even though they had been made before the party applied robots.txt to the archive of speeches. The party has since reversed that decision, and the Internet Archive copies are available once again.

Whatever the details of this particular case, it is worth noting that the Internet Archive's playback policy is not widely known. Most webmasters consider only search engine crawlers when they configure their robot rules. For example, it is not uncommon to use this mechanism to prevent crawlers from generating lots of 'Not Found' errors as they follow incoming links to content that is no longer available.

For our own part, we have been archiving the whole Conservative Party site since 2004, with the express permission of the party, and those archived copies are available in the public UK Web Archive (UKWA). We have also archived the sites of the Labour Party and the Liberal Democrats since around the same time. In contrast with the Internet Archive, we do not use recent changes to robots.txt to determine access to archived sites.

There are many other sites for which we do not have the same permission. However, since the advent of Non-Print Legal Deposit in April 2013, we may archive any site from within the UK, although users must visit one of the six legal deposit libraries for the UK in order to see the archived copy.

It isn't only the sites of the main political parties that we archive. Also in UKWA are extensive collections for the 2005 and 2010 general elections and the 2009 elections to the European Parliament. As well as the sites of the main parties, these include the sites of local party organisations and individual candidates, together with news media coverage, opinion polls and the contributions of interested groups and individuals. There are also the websites of many MPs who were sitting at the time, many of which have since disappeared from the live web as the member lost their seat. Examples include Kitty Ussher, a minister in the Labour government between 2007 and 2009, and the Conservative former minister Peter Bottomley.

We have also archived materials relating to major changes in public administration, such as the abolition of the police authorities in England and Wales in 2012, and the reorganisation of the NHS (also in England and Wales) in April 2013.

23 October 2013

The three truths of Margaret Thatcher


[In this guest post, Jules Mataly describes his research at the University of Amsterdam, making comparative use of three different web archives, including the UK Web Archive. His thesis, The Three Truths of Mrs Thatcher, was completed earlier this year.]

As a Master's student of New Media and Digital Culture at the University of Amsterdam, I made extensive use of the UK Web Archive in my final thesis. The goal of the thesis was a comparative analysis of different archives on a given topic. I wanted to compare, and to find a way to quantify, the impact of different curating approaches on archived materials - not in terms of the gigabytes collected (all the collections are huge anyway), but in terms of the sources and origins of the archived pages. After some deliberation, the choice of research topic fell on Margaret Thatcher.

[Image: Margaret Thatcher. Work provided by Chris Collins of the Margaret Thatcher Foundation, CC BY-SA 3.0, via Wikimedia Commons]

At that point in time, Mrs Thatcher had just passed away, and given her status a great number of online articles appeared seeking to establish what her impact on politics had been. And so I wondered: what is an historian ten or twenty years from now likely to find when researching the online publications of today? Which parts of the seemingly significant material of our time will be successfully archived and preserved, and what is likely to be lost? Here I discuss the methods used for my thesis, entitled "The Three Truths of Margaret Thatcher".

When compiling the research, I needed to find web archives against which to compare the UK Web Archive. After being introduced to my research project, the head of the UK Web Archive, Helen Hockx-Yu, very kindly offered access to a brand new research interface. This yet-to-be-officially-released interface was built by the UK Web Archive team upon what the Internet Archive had collected of the UK web domain (the JISC UK Web Domain Dataset). Finally, and thanks to the help of Erik Borra from the Digital Methods Initiative in Amsterdam, I created a list of URLs that were curated by Google and accessible through the Internet Archive. Google is the front door of the Internet for most people, and it also allows querying pages within a time range, giving access to pages of the past that would not appear in today's results. Studying pages retrieved through the Google search engine, I tried to find which ones the Internet Archive had saved. These pages were then used to create a third corpus.
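As a rough sketch of the kind of check involved, one might test whether the Internet Archive holds a capture of each URL on such a list using the public Wayback Machine availability API; the URLs below are invented, and the API may not have existed in this form when the research was carried out.

    # Rough sketch: does the Internet Archive hold a capture of each URL?
    # Uses the public Wayback Machine availability API; the URL list is invented.
    import requests

    urls = ["http://www.example.co.uk/thatcher-obituary",
            "http://news.example.com/2013/04/08/thatcher-dies"]

    for url in urls:
        resp = requests.get("https://archive.org/wayback/available",
                            params={"url": url, "timestamp": "20130408"})
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        if closest:
            print(url, "->", closest["url"])    # a capture exists near that date
        else:
            print(url, "-> no capture found")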

The UK Web Archive is – unlike numerous other national web archive initiatives – online, available to all and without restrictions. This makes it a great prospect for research. As I was researching a topic that had not been purposely archived by the British Library (i.e. not as part of a special collection), I used the text-based search to query the archive's databases, thus entering the web archive in a "Google fashion", by keywords.

Users of the Internet Archive previously only had the option to search by URL; it is now also possible to extract archived material from other archives through full-text search. For the thesis research in question, it was necessary to generate lists of URLs, which text search can readily provide. When creating such lists, it is possible to group the results by domain, but I purposely ignored that option: my interest was primarily in complete URLs, and only secondarily in web domains and top-level domains. Had it been possible to display more than ten results per page, the workflow would have been significantly improved. The optimal situation would have been the ability to download complete lists of the resulting URLs, perhaps in .csv format, preferably with corresponding metadata (e.g. date of crawl, number of times visited by crawlers).
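Purely by way of illustration, once a result list has been copied out as plain URLs, a few lines of Python can do the kind of grouping and CSV export described above; the input file and output columns here are hypothetical.

    # Group a plain list of result URLs by host and write a small CSV summary.
    # "results.txt" and the output columns are hypothetical.
    import csv
    from collections import defaultdict
    from urllib.parse import urlparse

    by_host = defaultdict(list)
    with open("results.txt") as f:
        for line in f:
            url = line.strip()
            if url:
                by_host[urlparse(url).netloc].append(url)

    with open("results_by_host.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["host", "url_count", "example_url"])
        for host, urls in sorted(by_host.items(), key=lambda kv: -len(kv[1])):
            writer.writerow([host, len(urls), urls[0]])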

By querying the UK Web Archive as one would query the live web through search engines, I obtained a list of websites, sorted by unknown criteria. Had it been possible to sort search results by various parameters – occurrences of the search terms, the number of times the specific page had been visited by the crawler, crawl date – the resulting list would have opened up greater research possibilities. It would then be possible not only to study the archived materials (or a sample of them, as I did in my research), but also to study what users are confronted with when browsing the archives. Given the sheer volume of available data, knowledge of which sites users actually access is of great importance.

In conclusion, I would like to use this opportunity to sincerely thank Helen Hockx-Yu for sharing interesting thoughts and providing access to the user interface prototype built upon the Internet Archive data. My thesis “The Three Truths of Margaret Thatcher” would not have been complete without it.

30 September 2013

Watching the UK domain crawl with Monitrix


We at the UK Web Archive have been archiving selected websites since 2004, and throughout we have worked to ensure that the quality of those archived sites is acceptably high. This involves a lot of manual effort; it means inspecting the web pages on each site, tracking down display issues, and re-configuring and re-crawling as necessary. On this basis, we have to date archived over 60,000 individual snapshots of websites over nearly a decade.

Now that the Legal Deposit legislation is in place, we are presented with a formidable challenge. As we move from thousands of sites to millions, what can we do to ensure the quality is high enough? We have the resources to manually inspect a few thousand sites a year, but that's now a drop in the ocean.

At large scale, even fairly basic checks become difficult. When there are only a few crawls running at once, it is easy to spot when the crawl of a single site fails for some unexpected reason. When we have very large numbers of sites being crawled simultaneously, and at varying frequencies, simply keeping track of what is going on at any given moment is not easy, and failed crawls can go unnoticed.

This is also particularly important for those rare occasions when a web publisher has contacted us with an issue about our crawling activity. We need to be able to work out straight away what's been going on, in which crawler process, and to modify its behaviour. This is why we began to develop Monitrix, a crawl monitoring component to complement our crawler.

The core idea is quite simple: Monitrix consumes the crawl log files produced by Heritrix3 and, in real time, derives statistics and metrics from that stream of crawl events. That critical information is then made available via a web-based interface.
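To make the idea concrete, here is a minimal sketch of the kind of processing involved, assuming the standard whitespace-delimited Heritrix3 crawl.log layout in which the fetch status, document size and URI are the second, third and fourth fields. It is not Monitrix's actual code, which also streams its results into a database and a web interface.

    # Minimal sketch of deriving statistics from a Heritrix3 crawl.log.
    # Assumes the standard layout: timestamp, status, size, URI, discovery path, ...
    from collections import Counter
    from urllib.parse import urlparse

    urls_per_host = Counter()
    bytes_per_host = Counter()
    status_counts = Counter()

    with open("crawl.log") as log:
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue                            # skip malformed lines
            status, size, uri = fields[1], fields[2], fields[3]
            host = urlparse(uri).netloc
            status_counts[status] += 1
            urls_per_host[host] += 1
            if size.isdigit():
                bytes_per_host[host] += int(size)

    print("busiest hosts:", urls_per_host.most_common(5))
    print("failed fetches:", sum(n for code, n in status_counts.items()
                                 if code.startswith("-")))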


[Screenshot: Monitrix in action, showing graphs of data volume over time and other key indicators.]


We initially trialled Monitrix during our first Legal Deposit crawl, relating to the reorganisation of the NHS in England and Wales in April. This worked very well, and the interface allowed us to track and explore the crawler activity as it happened. Simple things, like being able to flip back quickly through the chain of links that brought the crawlers to a particular site, proved very helpful in understanding the crawl's progress.

But then came the real challenge: using Monitrix during the domain crawl. The NHS collection contained only 5,500 sites, amounting to just 1.8TB of archived data. In contrast, the domain crawl would eventually include millions of sites and over 30TB of data. Initially, Monitrix worked quite well, but as the crawl went on it became clear that it could not keep up with the sheer volume of data being pushed into it. The total number of URLs climbed into the millions, at one point being collected at a rate of 857 per second. Under this bombardment, Monitrix became slower and slower.

What was the problem? With the twenty-twenty vision that comes only with hindsight, it became abundantly clear that the architecture of the MongoDB database (on which Monitrix is based) was not well suited to this, our largest-scale use case. However, we now believe we have found at least one appropriate alternative technology, Apache Cassandra, and we are in the process of moving Monitrix over to that database system.
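For the technically curious, one attraction of Cassandra for this kind of write-heavy workload is the pattern of partitioning events into time buckets, so that inserts spread across the cluster while recent data stays cheap to query. The sketch below, using the DataStax Python driver, is purely illustrative of that pattern; the keyspace, table and columns are invented and bear no relation to Monitrix's actual schema.

    # Illustrative only: time-bucketed writes of crawl events into Cassandra.
    # Uses the DataStax Python driver; keyspace, table and columns are invented.
    from datetime import datetime, timezone
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS crawl_stats
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""")
    session.execute("""
        CREATE TABLE IF NOT EXISTS crawl_stats.events (
            hour_bucket text, logged_at timestamp, uri text, status int, size bigint,
            PRIMARY KEY (hour_bucket, logged_at, uri))""")

    now = datetime.now(timezone.utc)
    session.execute(
        "INSERT INTO crawl_stats.events (hour_bucket, logged_at, uri, status, size) "
        "VALUES (%s, %s, %s, %s, %s)",
        (now.strftime("%Y-%m-%d-%H"), now, "http://www.example.uk/", 200, 4096))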

Andy Jackson, Web Archiving Technical Lead, British Library

16 September 2013

Crawling the UK web domain


After the initial flurry of publicity surrounding the final advent of Non-Print Legal Deposit in April, we in the web archiving team at the British Library began the job of actually getting on with part of that new responsibility: that is, routinely archiving the whole of the UK web domain. This is happening in partnership with the other five legal deposit libraries for the UK: the National Library of Wales, the National Library of Scotland, Cambridge University Library, the Bodleian Libraries of the University of Oxford, and Trinity College Dublin.

We blogged back in April about how we were getting on, having captured 3.6TB of compressed data from some 191 million URIs in the first week alone.

Now, we're finished. After a staggered start on April 8th, the crawl ended on June 21st, just short of eleven weeks later. Having started off with a list of 3.8 million seeds, we eventually captured over 31TB of compressed data. At its fastest, a single crawler was visiting 857 URIs per second.

There is of course a great deal of fascinating research that could be done on this dataset, and we'd be interested in suggestions of the kinds of questions we ought to ask of it. For now, there are some interesting views we can take of the data. For example, here is the number of hosts plotted against the total volume of data.

[Figure: 2013 domain crawl - data volumes and hosts]

This initial graphing suggests that a great many domains are very small indeed: more than 200,000 domains yield only 64 bytes, a minuscule amount of data. These could be sites that return no content at all, that redirect elsewhere, or that are 'parked' domains. At the other end of the scale, there are perhaps c.50,000 domains that return 256MB of data or more.
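For readers who want to reproduce this kind of view, a toy version of the underlying bucketing - grouping hosts into power-of-two bins by the total bytes collected for each - might look like the following; the input data is invented for illustration.

    # Toy version of the bucketing behind the distribution above: group hosts
    # into power-of-two bins by total bytes collected. The input dict is invented.
    import math
    from collections import Counter

    bytes_per_host = {"tiny.example.uk": 64,
                      "small.example.uk": 900,
                      "big.example.uk": 300 * 1024 * 1024}

    bins = Counter()
    for host, total in bytes_per_host.items():
        bins[2 ** math.ceil(math.log2(max(total, 1)))] += 1

    for upper_bound, host_count in sorted(bins.items()):
        print(f"<= {upper_bound} bytes: {host_count} hosts")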

It is worth remembering that this only represents those sites which we can know (in a scalable way) are from the UK, which for the most part means sites with domains ending in .uk. There are various means of determining whether a .com, .org or .net site falls within the scope of the regulations, none of which is yet scalable; best estimates suggest that there may be half as many sites again from the UK which we are not yet capturing.

The next stages are to index all the data and then to ingest it into our Digital Library System, tasks which themselves take several weeks. We anticipate the data being available in the reading rooms of the legal deposit libraries at the very end of 2013. We plan a domain crawl at least once a year, and possibly twice if resources allow.