When Tim Berners-Lee submitted a proposal in March 1989 for a "distributed hypertext system", his supervisor Mike Sendall commented: "Vague, but exciting". The Web is 25 years old today, no longer vague, still exciting.
We are proud to be among those tasked with keeping a history of the Web. The British Library did not get involved in Web archiving until 2004, and our early efforts were selective and carried out under licence. Supported by the Legal Deposit Act and Regulations, we have been permitted to archive the UK Web at scale since April 2013. We completed our first domain crawl in June 2013, collecting 31TB of data from over 1.3 billion URLs. We are currently getting ready for our 2014 domain crawl, planned to take place in May.
It is interesting to pause on the 25th birthday of the Web and give some attention to the earliest instance of an archived website in the UK Web Archive. This happens to be a copy of the British Library website from 18 April 1995. It was not crawled from the live web at the time, but recreated and reassembled in 2011 using files found on a server. I still vividly remember the day when a colleague delivered a dusty box filled with CDs to my office. The notes by the web archivist read as follows:
This is the earliest archival version of the British Library website, showing the Library's first explorations into hypertext and embedded images from collection material with links to larger images, sound files and further information. "Portico" was a brand or service name for the British Library website which was replaced a few years later. In 2011 zip files making up the website were discovered containing a testing copy of the Library's 1995 website. After decompressing the files, the resulting directory structure was used to create a representation of the original site's layout for ingest into the Web Archive. This representation does not include the complete dataset. Links to information hosted then on a Gopher server are broken. Gopher is a predecessor of and later an alternative to the World Wide Web.
To my (pleasant) surprise, the recording of a nightingale in Au file format, embedded on the page which features John Keats' "Ode to a Nightingale", played beautifully on my machine in Chrome, Firefox and Internet Explorer. I do wonder if this qualifies as the earliest "tweet" on the Web?
A Web archive not only contains historical copies of individual websites. When viewed in its entirety, it also provides a bigger picture and allows analysis and data mining, which can reveal previously undiscovered patterns and trends. We blogged previously about Austrian researcher Rainer Simon's analysis and visualisation of the 1996 UK Web, using our UK Host-Level Link Graph (1996-2010) dataset. Our work in data analytics will continue in the Big UK Domain Data for Arts and Humanities project, funded by the Arts and Humanities Research Council to develop both a methodological framework and tools to support the analysis of the UK Web Archive by researchers in the arts and humanities. The project aims to deliver a major study of the history of UK Web space from 1996 to 2013, covering language, file formats, the development of multimedia content, shifts in power and access, and so on.
Tim Berners-Lee, the World Wide Web Foundation and the World Wide Web Consortium are inviting everyone, everywhere to wish the Web a happy birthday using #web25. They have also joined forces to create webat25.org, a site where users can leave birthday greetings for the Web, view greetings from others and find out more about the Web’s history.
Please join in.
Head of Web Archiving