UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites


News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Peter Webster (Engagement and Liaison Manager).

11 March 2014

‘Vague, but exciting’ - #web25


When Tim Berners-Lee submitted a proposal in March 1989 for a "distributed hypertext system", his supervisor Mike Sendall commented: "Vague, but exciting". The Web is 25 years old today, no longer vague, still exciting.

We feel a sense of pride in being one of those tasked with the mission of keeping a history of the Web. The British Library did not get involved in Web archiving until 2004, and our early efforts were done selectively, under licence. Supported by the Legal Deposit Act and Regulations, we have been permitted to archive the UK Web at scale since April 2013. We completed our first domain crawl in June 2013, collecting 31TB of data from over 1.3 billion URLs. We are currently getting ready for our 2014 domain crawl, planned to take place in May.

It is interesting to pause on the 25th birthday of the Web and give some attention to the earliest archived website in the UK Web Archive. This happens to be a copy of the British Library website from 18 April 1995, not crawled from the live web at that time but recreated and reassembled in 2011 using files found on a server. I still vividly remember the day when a colleague delivered a dusty box filled with CDs to my office. The notes by the web archivist read as follows:

This is the earliest archival version of the British Library website, showing the Library's first explorations into hypertext and embedded images from collection material with links to larger images, sound files and further information. "Portico" was a brand or service name for the British Library website which was replaced a few years later. In 2011 zip files making up the website were discovered containing a testing copy of the Library's 1995 website. After decompressing the files, the resulting directory structure was used to create a representation of the original site's layout for ingest into the Web Archive. This representation does not include the complete dataset. Links to information hosted then on a Gopher server are broken. Gopher is a predecessor of and later an alternative to the World Wide Web.

To my (pleasant) surprise, the recording of a nightingale, embedded in Au file format on the page which features John Keats' 'Ode to a Nightingale', played beautifully on my machine in Chrome, Firefox and Internet Explorer. I do wonder if this qualifies as the earliest "tweet" on the Web?


A Web archive not only contains historical copies of individual websites. Viewed in its entirety, it also provides a bigger picture and allows analysis and data mining which can reveal previously undiscovered patterns and trends. We blogged previously about Austrian researcher Rainer Simon's analysis and visualisation of the 1996 UK Web, using our UK Host-Level Link Graph (1996-2010) dataset. Our effort in data analytics will continue in the Big UK Domain Data for the Arts and Humanities project, funded by the Arts and Humanities Research Council to develop both a methodological framework and tools to support the analysis of the UK Web Archive by researchers in the arts and humanities. The project aims to deliver a major study of the history of UK Web space from 1996 to 2013, including language, file formats, the development of multimedia content, shifts in power and access, and so on.

Tim Berners-Lee, the World Wide Web Foundation and the World Wide Web Consortium are inviting everyone, everywhere to wish the Web a happy birthday using #web25. They have also joined forces to create a site where users can leave birthday greetings for the Web, view greetings from others and find out more about the Web’s history.

Please join in.

Helen Hockx-Yu
Head of Web Archiving 

19 February 2014

Jorge Luis Borges and Twitter


[A guest post from writer and Museum Studies tutor Rebecca Reynolds]

When I first heard that the British Library was archiving every webpage with a .uk domain name, I immediately thought of Borges's short story Funes the Memorious, about a man who can forget nothing. 'I have more memories in myself alone than all men have had since the world was a world', Funes says; 'my memory, Sir, is like a garbage disposal'.

I spoke to Helen Hockx-Yu, Head of Web Archiving at the British Library, about this, focusing on Twitter pages. Will ephemera in such quantities be truly useful to researchers of the future?

Helen commented that this was up to researchers to decide, but was clear that as many webpages as possible needed to be kept. 'When you research a person's life, or history, you don't have everything - you piece it together,' she said. 'Hopefully what we're doing would form part of those pieces.' She gave as an example Antony Gormley's 2009 One and Other art project, in which members of the public took turns to stand on the fourth plinth in Trafalgar Square and say whatever they wanted. The website recording these people is no longer available but is in the UK Web Archive. For some websites, Helen said, 'being ephemeral is exactly their significance'.

And what about privacy? Would you like researchers of the future poring over one of your ill-considered blog posts or tweets? Webpages can be withdrawn only under certain circumstances such as defamation or breaches of confidentiality. Helen's advice here was simply to be careful what you put in the public domain.

I also spoke to Jonathan Fryer, Liberal Democrat Euro-candidate for London, two of whose Twitter pages have been put in a UK Web Archive collection devoted to blogs and bloggers. He thought archiving Twitter feeds was a good idea: 'Twitter has taken over from letters and other forms of exchange of information and ideas. Forms of communication such as blogs and Twitter need to be kept instead.'

Jonathan Fryer

Back to Borges's story. The narrator doubts that Funes can think, despite his prodigious memory: 'To think is to forget a difference, to generalise, to abstract. In the overly replete world of Funes there were nothing but details, almost contiguous details.' Perhaps the Twittersphere is another 'overly replete' world. In any case, here are some 'contiguous details' from Jonathan Fryer's Twitter page in the archive. Which, if any, do you think might be worth keeping?

Just purged 8 American floozies from my followers. How do they get to latch onto one like limpets?

David Cameron is 'very relaxed' about Andy Coulson and allegations of bugging and blagging. He shouldn't be.

Went to see 'Bruno'; a real curate's egg, but two or three brilliant scenes.

Jonathan Fryer's Twitter page will appear in a book I am currently working on, exploring unusual museum objects from around the UK, using interviews with people from inside and outside museums. Other ephemera in the book are a 19th-century leaflet advertising a live mermaid from Reading University's Centre for Ephemera Studies, and toilet paper from The Land of Lost Content museum in Shropshire.

Rebecca Reynolds (Twitter: @rebrey)

07 February 2014

New research project: Big UK Domain Data for the Arts and Humanities


We are delighted to have been awarded Arts and Humanities Research Council funding for a new research project, ‘Big UK Domain Data for the Arts and Humanities’. The project, one of 21 to be funded as part of the AHRC’s Big Data Projects call, is led by the Institute of Historical Research (University of London), in collaboration with ourselves at the British Library, the Oxford Internet Institute and Aarhus University.

Here are some details, from the project blog:

"The project aims to transform the way in which researchers in the arts and humanities engage with the archived web, focusing on data derived from the UK web domain crawl for the period 1996-2013. Web archives are an increasingly important resource for arts and humanities researchers, yet we have neither the expertise nor the tools to use them effectively. Both the data itself, totalling approximately 65 terabytes and constituting many billions of words, and the process of collection are poorly understood, and it is possible only to draw the broadest of conclusions from current analysis.

"A key objective of the project will be to develop a theoretical and methodological framework within which to study this data, which will be applicable to the much larger on-going UK domain crawl, as well as in other national contexts. Researchers will work with developers at the British Library to co-produce tools which will support their requirements, testing different methods and approaches. In addition, a major study of the history of UK web space from 1996 to 2013 will be complemented by a series of small research projects from a range of disciplines, for example contemporary history, literature, gender studies and material culture.