UK Web Archive blog

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

24 June 2014

Your Web Archive Needs You!

With the centenary of the outbreak of World War One taking place this summer, the British Library’s Web Archiving team has been working with colleagues across the Library and beyond to initiate a ‘First World War Centenary Special Collection’ of websites.

The collection is part of a wide range of centenary projects under way at the Library including:

These projects will enable thousands of people to engage with the centenary and to showcase the many significant items held by the Library relating to the war.

The Special Collection
The web archive collection will include a wide variety of websites related to the centenary, including sites for the various events which will be taking place; resources about the history of the war; academic sites on the meaning of the conflict in modern memory and on patterns of memorialisation; and critical reflections on British involvement in armed conflict more generally.

The collection will help researchers find out how the First World War shaped our society and continues to touch our lives at a personal level, in our local communities and as a nation.

Archiving began in April 2014 and will continue until 2019. Some examples of websites archived so far include:

We need your help!
Do you know of a website which may be suitable for the First World War Centenary Collection? If so, we would love to hear from you, particularly if you edit or publish a WW1-themed website yourself.

Websites could include those created by museums, archives, libraries, special interest groups, universities, performing arts groups, schools and community groups, family and local history societies or individual publications. It does not cost anything to have your website archived by the British Library and involves no work on your part once nominated.

Please nominate UK-based, WW1-related websites through our nomination form.

If you have HLF funding for a First World War Centenary project, please send the URL (web address) to [email protected] with your project reference number.

See what we have in the WW1 special collection so far.

Written by Nicola Bingham, Web Archivist, British Library

23 June 2014

Researcher in focus: Paul Thomas - UK and Canadian Parliamentary Archives

At the UK Web Archive, we’re always delighted to learn about specific uses that researchers have been able to make of our data. One such case is from the work of Paul Thomas, a doctoral student in political science at the University of Toronto.

Paul writes:

‘The UK Web Archive has been a huge asset to my dissertation. My research examines how backbench parliamentarians in Canada, the UK and Scotland are increasingly cooperating across party lines through a series of informal organizations known as All-Party Groups (APGs). For the UK, the most important source for my research is the registry of APGs that is regularly produced by the House of Commons. The document, which is published in both web and PDF formats, provides details on the more than 500 groups that are in operation, including which MPs and Peers are involved, and what funding groups have received from outside bodies like lobbyists or charities.

‘A key part of the study involved using the registries to construct a dataset that tracked membership patterns across the various groups, and how they changed over time. Unfortunately, each time a new version of the registry is produced, the previous web copy is taken down.’

While the Parliamentary Archives keep old copies of the registry on file, they only do so in PDF – a format that is not so conducive to the extraction of information into a dataset. Paul was able to find and use successive versions from the UK Web Archive going back to 2006, including a number that were missing from the Internet Archive, and to obtain pre-2006 versions from the Internet Archive. ‘Without the UK Web Archive, I would have first needed to purchase the past registries in PDF from the Parliamentary Archives and then painstakingly copy the details on each group into a dataset.’ Overall, Paul writes, ‘the UK Web Archive saved me an enormous amount of time in compiling my data’.

Paul recently gave a paper drawing on this data at the Annual Conference of the Canadian Political Science Association:
http://pauledwinjames.files.wordpress.com/2014/05/paul-thomas-cpsa2014v2.pdf

12 June 2014

How big is the UK web?

The British Library is about to embark on its annual task of archiving the entire UK web space. We will be pushing the button and sending out our ‘bots to crawl every British domain for storage in the UK Legal Deposit web archive. How much will we capture? Even our experts can only make an educated guess.

You’ve probably played the time-honoured village fete game of guessing how many jelly beans are in the jar, with a prize for the winner. Well, perhaps we can ask you to guess the size of the UK internet, and the nearest gets… the glory of being right. Some facts from last year might help.

2013 Web Crawl
In 2013 the Library conducted the first crawl of all .uk websites. We started with 3.86 million seeds (websites), which led to the capture of 1.9 billion URLs (web pages, docs, images). All this resulted in 30.84 terabytes (TB) of data! It took the library robots 70 days to collect.
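
To put those figures in proportion, here is a rough back-of-the-envelope sketch of the averages they imply. It assumes decimal terabytes (10**12 bytes), which the post does not specify, and the function name `crawl_averages` is ours, purely for illustration.

```python
# Figures quoted above for the 2013 domain crawl.
SEEDS = 3.86e6          # starting websites
URLS = 1.9e9            # captured URLs (pages, docs, images)
TOTAL_BYTES = 30.84e12  # 30.84 TB, ASSUMING decimal terabytes (10**12 bytes)
DAYS = 70               # duration of the crawl


def crawl_averages(seeds, urls, total_bytes, days):
    """Return rough per-seed, per-URL and per-day averages for a crawl."""
    return {
        "urls_per_seed": urls / seeds,
        "kilobytes_per_url": total_bytes / urls / 1e3,
        "urls_per_day": urls / days,
        "terabytes_per_day": total_bytes / 1e12 / days,
    }


averages = crawl_averages(SEEDS, URLS, TOTAL_BYTES, DAYS)
for name, value in averages.items():
    print(f"{name}: {value:,.2f}")
```

On these assumptions the crawl averaged roughly 490 URLs per seed, about 16 KB per URL and around 27 million URLs a day. These are only averages: a single large site can account for far more than its share.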

Geolocation
In addition to the .uk domains, the Library has the scope to collect websites that are hosted in the UK, so we will attempt to geolocate IP addresses within the geographical confines of the UK. This means that we will be pulling in many .com, .net, .info and other Top Level Domains (TLDs). How many extra websites? How much data? We just don’t know at this time.
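
The scoping rule described here (a .uk name, or a server that geolocates to the UK) could be sketched roughly as follows. This is not the crawler's actual logic: the network list is a hypothetical stand-in for a real geolocation database and uses reserved documentation address ranges, and the function names are assumptions made for the example.

```python
import ipaddress

# HYPOTHETICAL stand-in for a geolocation database. These are reserved
# documentation ranges, NOT real UK address blocks.
ASSUMED_UK_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_uk_hosted(ip: str) -> bool:
    """True if the address falls inside one of the assumed UK networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in ASSUMED_UK_NETWORKS)


def in_scope(hostname: str, ip: str) -> bool:
    """In scope if the name is under .uk or the server appears UK-hosted."""
    if hostname.lower().rstrip(".").endswith(".uk"):
        return True
    return is_uk_hosted(ip)


print(in_scope("www.example.co.uk", "198.51.100.7"))  # .uk name
print(in_scope("www.example.com", "192.0.2.10"))      # assumed UK-hosted
print(in_scope("www.example.net", "198.51.100.7"))    # neither
```

Checking the domain name first means the geolocation lookup is only needed for names outside .uk, which is where the uncertainty about extra websites comes from.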

De-duplication
A huge issue in collecting the web is the large number of duplicates that are captured and saved, something that can add a great deal to the volume collected. Of the 1.9 billion web pages etc. a significant number are probably copies and our technical team have worked hard this time to attempt to reduce this or ‘de-duplicate’. We are, however, uncertain at the moment as to how much effect this will eventually have on the total volume of data collected.
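
One common way to spot exact copies is to compare a hash of each captured payload. The sketch below illustrates that general idea only; it is an assumption on our part that a content hash is a useful model here, not a description of how our crawler de-duplicates, and the function name and sample URLs are made up.

```python
import hashlib


def deduplicate(captures):
    """Keep the first capture of each distinct payload.

    `captures` is an iterable of (url, payload_bytes) pairs. Returns the
    unique captures and a list of (duplicate_url, original_url) pairs.
    """
    first_seen = {}  # digest -> URL where the payload was first stored
    unique, duplicates = [], []
    for url, payload in captures:
        digest = hashlib.sha256(payload).hexdigest()
        if digest in first_seen:
            # Same bytes already stored: record a pointer, not a copy.
            duplicates.append((url, first_seen[digest]))
        else:
            first_seen[digest] = url
            unique.append((url, payload))
    return unique, duplicates


# Illustrative sample: the first two URLs serve identical content.
sample = [
    ("http://example.co.uk/", b"<html>home</html>"),
    ("http://example.co.uk/index.html", b"<html>home</html>"),
    ("http://example.co.uk/about", b"<html>about</html>"),
]
unique, duplicates = deduplicate(sample)
print(len(unique), "stored;", len(duplicates), "recorded as duplicates")
```

A scheme like this only catches byte-for-byte copies. Pages that differ by a timestamp or an advert would still be stored twice, which is one reason the eventual saving is hard to predict.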

Predictions
In summary then, in 2014 we will be looking to collect all of the .uk domain names plus all the websites that we can find that are hosted in the UK (.com, .net, .info etc.), overall a big increase in the number of ‘seeds’ (websites). It is hard, however, to predict what effect these changes will have compared to last year. What the final numbers might be is anyone’s guess. What do you think?

Let us know YOUR predictions for 2014 in the comments below or on Twitter (@UKWebArchive): the number of URLs, the size in terabytes (TB) and, if you are feeling very brave, the number of hosts (organisations like the BBC and NHS, for example, consist of lots of websites each but count as one ‘host’).

We want:

  • URLs (in billions)
  • Size (in terabytes)
  • Hosts (in millions) 

#UKWebCrawl2014

We will announce the winner when all the data is safely on our servers sometime in the summer. Good luck.

11 March 2014

‘Vague, but exciting’ - #web25

When Tim Berners-Lee submitted a proposal in March 1989 for a "distributed hypertext system", his supervisor Mike Sendall commented: "Vague, but exciting". The Web is 25 years old today, no longer vague, still exciting.

We feel a sense of pride being one of those tasked with the mission of keeping a history of the Web. The British Library did not get involved in Web archiving until 2004, and our early efforts were done selectively, under licence. Supported by the Legal Deposit Act and Regulations we have been permitted to archive the UK Web at scale since April 2013. We completed our first domain crawl in June 2013, collecting 31TB of data from over 1.3 billion URLs. We are currently getting ready for our 2014 domain crawl, planned to take place in May.

It is interesting to pause on the 25th birthday of the Web and give some attention to the earliest instance of an archived website in the UK Web Archive. This happens to be a copy of the British Library website from 18 April 1995, not crawled from the live web at that time using a web crawler but recreated and reassembled in 2011 using files found on a server. I still vividly remember the day when a colleague delivered a dusty box filled with CDs to my office. The notes by the web archivist read as follows:

This is the earliest archival version of the British Library website, showing the Library's first explorations into hypertext and embedded images from collection material with links to larger images, sound files and further information. "Portico" was a brand or service name for the British Library website which was replaced a few years later. In 2011 zip files making up the website were discovered containing a testing copy of the Library's 1995 website. After decompressing the files, the resulting directory structure was used to create a representation of the original site's layout for ingest into the Web Archive. This representation does not include the complete dataset. Links to information hosted then on a Gopher server are broken. Gopher is a predecessor of and later an alternative to the World Wide Web.

To my (pleasant) surprise, the recording of a nightingale, embedded in Au file format on the page which features John Keats' ‘Ode to a Nightingale’, played beautifully on my machine in Chrome, Firefox and Internet Explorer. I do wonder if this qualifies as the earliest "tweet" on the Web?

A Web archive not only contains historical copies of individual websites. Viewed in its entirety, it also provides a bigger picture and allows analysis and data mining which can reveal previously undiscovered patterns and trends. We blogged previously about Austrian researcher Rainer Simon's analysis and visualisation of the 1996 UK Web, using our UK Host-Level Link Graph (1996-2010) dataset. Our effort in data analytics will continue in the Big UK Domain Data for Arts and Humanities project, funded by the Arts and Humanities Research Council to develop both a methodological framework and tools to support the analysis of the UK Web Archive by researchers in the arts and humanities. The project aims to deliver a major study of the history of UK Web space from 1996 to 2013, including language, file formats, the development of multimedia content, shifts in power and access, and so on.

Tim Berners-Lee, the World Wide Web Foundation and the World Wide Web Consortium are inviting everyone, everywhere to wish the Web a happy birthday using #web25. They have also joined forces to create webat25.org, a site where users can leave birthday greetings for the Web, view greetings from others and find out more about the Web’s history. 

Please join in.

Helen Hockx-Yu
Head of Web Archiving 

19 February 2014

Jorge Luis Borges and Twitter

[A guest post from writer and Museum Studies tutor Rebecca Reynolds]

When I first heard that the British Library was archiving every webpage with a .uk domain name, I immediately thought of Borges's short story Funes the Memorious, about a man who can forget nothing. 'I have more memories in myself alone than all men have had since the world was a world', Funes says; 'my memory, Sir, is like a garbage disposal'.

I spoke to Helen Hockx-Yu, Head of Web Archiving at the British Library, about this, focusing on Twitter pages. Will ephemera in such quantities be truly useful to researchers of the future?

Helen commented that this was up to researchers to decide but was clear that as many webpages as possible needed to be kept. 'When you research a person's life, or history, you don't have everything - you piece it together,' she said. 'Hopefully what we're doing would form part of those pieces.' She gave as an example Antony Gormley's 2009 One and Other art project in which members of the public took turns to stand on the fourth plinth in Trafalgar Square and say whatever they wanted. The website recording these people is no longer available but is in the UK Web Archive. For some websites, Helen said, 'being ephemeral is exactly their significance'.

And what about privacy? Would you like researchers of the future poring over one of your ill-considered blog posts or tweets? Webpages can be withdrawn only under certain circumstances such as defamation or breaches of confidentiality. Helen's advice here was simply to be careful what you put in the public domain.

I also spoke to Jonathan Fryer, Liberal Democrat Euro-candidate for London, two of whose Twitter pages have been put in a UK Web Archive collection devoted to blogs and bloggers. He thought archiving Twitter feeds was a good idea: 'Twitter has taken over from letters and other forms of exchange of information and ideas. Forms of communication such as blogs and Twitter need to be kept instead.'

Back to Borges's story. The narrator doubts that Funes can think, despite his prodigious memory: 'To think is to forget a difference, to generalise, to abstract. In the overly replete world of Funes there were nothing but details, almost contiguous details.' Perhaps the Twittersphere is another 'overly replete' world. In any case, here are some 'contiguous details' from Jonathan Fryer's Twitter page in the archive. Which, if any, do you think might be worth keeping?

Just purged 8 American floozies from my followers. How do they get to latch onto one like limpets?

David Cameron is 'very relaxed' about Andy Coulson and allegations of bugging and blagging. He shouldn't be.

Went to see 'Bruno'; a real curate's egg, but two or three brilliant scenes.

Jonathan Fryer's Twitter page will appear in a book I am currently working on, exploring unusual museum objects from around the UK, using interviews with people from inside and outside museums. Other ephemera in the book are a 19th-century leaflet advertising a live mermaid from Reading University's Centre for Ephemera Studies, and toilet paper from The Land of Lost Content museum in Shropshire.

Rebecca Reynolds (Twitter: @rebrey)

07 February 2014

New research project: Big UK Domain Data for the Arts and Humanities

We are delighted to have been awarded Arts and Humanities Research Council funding for a new research project, ‘Big UK Domain Data for the Arts and Humanities’. The project, one of 21 to be funded as part of the AHRC’s Big Data Projects call, is led by the Institute of Historical Research (University of London), in collaboration with ourselves at the British Library, the Oxford Internet Institute and Aarhus University.

Here are some details, from the project blog:

"The project aims to transform the way in which researchers in the arts and humanities engage with the archived web, focusing on data derived from the UK web domain crawl for the period 1996-2013. Web archives are an increasingly important resource for arts and humanities researchers, yet we have neither the expertise nor the tools to use them effectively. Both the data itself, totalling approximately 65 terabytes and constituting many billions of words, and the process of collection are poorly understood, and it is possible only to draw the broadest of conclusions from current analysis.

"A key objective of the project will be to develop a theoretical and methodological framework within which to study this data, which will be applicable to the much larger on-going UK domain crawl, as well as in other national contexts. Researchers will work with developers at the British Library to co-produce tools which will support their requirements, testing different methods and approaches. In addition, a major study of the history of UK web space from 1996 to 2013 will be complemented by a series of small research projects from a range of disciplines, for example contemporary history, literature, gender studies and material culture.

 

15 January 2014

RESAW: Research infrastructure for the Study of Archived Web materials

[Helen Hockx-Yu, Head of Web Archiving at the British Library, writes:]

Two scholars at Aarhus University, Denmark, Niels Brugger and Niels Ole Finneman, organised a workshop in December for potential partners of RESAW, an initiative aimed at building a pan-European research infrastructure for the study of web archives. An important element of the infrastructure is existing national web archives, often underpinned by legal frameworks such as legal deposit or copyright law but not fully available publicly. To make use of such archives, researchers have to be present physically at archiving institutions’ premises.

A research infrastructure, however, is more than isolated national web archives with restricted access, often referred to as “dark archives”. The goal is to find ways to link these together and offer seamless access to distributed web archives. The Mementos Service developed by the UK Web Archive, which allows discovery and delivery of archived web pages from multiple web archives, is a good example of how this could be done. Anat Ben David of the University of Amsterdam, associated with the WebArt project, presented impressive and promising search and visualisation approaches, which significantly improve access to large scale, closed national web archives.

Awareness and understanding of the characteristics of archived web material, and the development of appropriate research methods to study it, are equally indispensable elements of RESAW. It is not surprising that, in addition to a number of national web archives, there was strong representation of researchers at the workshop, from universities and research institutions across Europe. In his keynote, Niels Ole Finneman analysed the particularities of archived web material against the context of the live web as well as in the study of other digital sources. He argued that the archived web is “re-born” digital content, and differs from the live web in many ways. RESAW does not have a particular disciplinary focus but aims to allow for all kinds of epistemological and methodological approaches, whether rooted within the sciences, the social sciences or the humanities.

I was honoured to be able to present the perspectives of web archiving institutions, and was given a brief to focus on our interactions with scholars. I reported on our earlier work on scholarly feedback and highlighted an increasing number of interactions with scholars in recent years, with a number of research groups emerging which devote effort and attention to web archives. UK institutions among these include the Institute of Historical Research based at the University of London, and the Oxford Internet Institute. Both have recently been funded by the Joint Information Systems Committee (JISC) to carry out research projects using web archives, in partnership with the British Library. A general trend with three phases can be observed with regard to scholarly interaction with web archives:

Phase 1: Building collections
Scholars are involved in scoping collections, selecting and describing websites relevant to research interests. This effort often ends with the creation of specific, if sometimes narrow, topical collections.

Phase 2: Formulating research questions
This often takes the form of brainstorming sessions, workshops and projects, where researchers are made aware of web archives and asked the question: which research questions might web archives help you answer? This is a much more bilateral interaction and represents a shift of focus to web archives in their entirety. It does, however, suffer from being required to define the unknown, and is also time- and resource-consuming.

Phase 3: Independent use of web archives
This type of interaction has just begun to emerge. It is the desired “go-to” state, where interfaces to web archives already meet the most common scholarly requirements. Scholars are able to use web archives without having to depend on (personal) interactions with providers. This requires user interfaces to be self-explanatory, jargon-free and to contain base-line information about the archive. This includes information on the scope of the archive, its coverage and lacunae, how it was collected, and how a particular website was crawled.

RESAW is aiming to apply for funding from the European Commission under the Horizon 2020 Framework. The workshop was an opportunity to identify issues and discuss a plan. It produced a list of work items that RESAW will address, as well as the steps towards a funding application.

As one of the providers of the UK’s national web archive, we are pleased to be involved as we see RESAW as an important initiative which will help connect scholars with web archives and with each other in new ways.

11 December 2013

Political party web archives

There's been some news coverage in the last few weeks of the decision of the Conservative Party to reorganise their website, removing an archive of speeches up to 2010. The original report appeared in Computer Weekly and the story was subsequently picked up by media including The Guardian, the Financial Times and Channel 4 News. In the debate that followed there were a few factual inaccuracies, and so we thought it worth blogging about archival copies of these pages, and of other UK political party content.

Firstly, the copies held by the Internet Archive (archive.org) were not erased or deleted - all that happened is that access to the resources was blocked. Due to the legal environment in which the Internet Archive operates, they have adopted a policy that allows web sites to use robots.txt to directly control whether the archived copies can be made available. The robots.txt protocol has no legal force but the observance of it is part of good manners in interaction online. It requests that search engines and other web crawlers such as those used by web archives do not visit or index the page.  The Internet Archive policy extends the same courtesy to playback.

At some point after the content in question was removed from the original website, the party listed it in their robots.txt file. As the practice of the Internet Archive is to observe robots.txt retrospectively, it began to withhold its copies, which had been made before the party implemented robots.txt on the archive of speeches. Since then, the party has reversed that decision, and the Internet Archive copies are live once again.
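
To make the mechanism concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the URLs, the capture dates and the `example-archive-bot` user agent are all hypothetical; the sketch simply models a playback policy that applies today's robots.txt rules to copies made earlier, and is not the Internet Archive's actual code.

```python
from urllib.robotparser import RobotFileParser

# HYPOTHETICAL robots.txt as it might read after a site reorganisation.
ROBOTS_TXT = """\
User-agent: *
Disallow: /speeches/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# HYPOTHETICAL captures made before the rule above existed.
ARCHIVED_CAPTURES = [
    ("2009-06-01", "http://www.example.org.uk/speeches/2009/some-speech"),
    ("2009-06-01", "http://www.example.org.uk/news/"),
]


def playable(captures, user_agent="example-archive-bot"):
    """Apply the current rules retrospectively to existing captures."""
    return [
        (date, url)
        for date, url in captures
        if parser.can_fetch(user_agent, url)
    ]


for date, url in playable(ARCHIVED_CAPTURES):
    print(date, url)
```

Only the /news/ capture is still offered for playback; the speech is withheld even though it was captured before the rule was written. Removing the Disallow line would make it visible again, which matches the reversal described above.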

Whatever the details of this particular case, it's worth noting that the Internet Archive's playback policy is not widely known. Most webmasters only consider search engine crawlers when they configure their robot rules. For example, it is not uncommon to use this mechanism in order to prevent crawlers from creating lots of 'Not Found' errors as they follow incoming links to content that is no longer available.

For our own part, we had been archiving the whole Conservative Party site since 2004, by the express permission of the party, and those archived copies are available in the public UK Web Archive (UKWA). We have also archived the sites of the Labour party and the Liberal Democrats since around the same time. In contrast with the Internet Archive, we do not use recent changes to robots.txt to determine access to archived sites.

There are many other sites for which we do not have the same permission. However, since the advent of Non-Print Legal Deposit in April 2013, we may archive any site from within the UK, although users must visit one of the six legal deposit libraries for the UK in order to see the archived copy.

It isn't only the sites of the main political parties that we archive. Also in UKWA are extensive collections for the 2005 and 2010 general elections and the 2009 elections to the European Parliament. As well as the sites of the main parties, these include the sites of local party organisations and individual candidates, as well as news media coverage, opinion polls and the contributions of interested groups and individuals. There are also many websites of sitting MPs, many of which have since disappeared from the live web as the member lost their seat. Examples of these include Kitty Ussher, minister in the Labour government between 2007 and 2009, and the Conservative former minister Peter Bottomley.

We have also archived materials relating to major changes in public administration, such as the abolition of the police authorities in England and Wales in 2012, and the reorganisation of the NHS (also in England and Wales) in April 2013.