UK Web Archive blog

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

29 May 2015

Beginner’s Guide to Web Archives, Part 1


Arriving at the British Library as an intern, one of the tasks laid out before me was to create and curate a special collection for the UK Web Archive. To some readers of this blog this activity may seem fairly self-explanatory. However, before arriving at the Library I had never even heard of web archiving, let alone considered why we do it and who it could be useful for. In a short series of blog posts I will explore these questions from the novice’s point of view, both my own and that of academic researchers hoping to use the resource. I hope to convey the new user’s perceptions of the challenges and opportunities of the archive, as well as providing an introduction for interested beginners.

Spiders spinning furiously

The web is a vast resource. In 2008 Google reported that it had found 10¹² (one trillion) URLs online. It has been suggested that the web represents a rapid expansion in human knowledge; certainly it enables greater access to human knowledge for billions of people. It is also a place where a huge range of opinions are openly expressed. However, the content of the web has a very rapid turnover, with around 40% of websites changing their content within a week. Without web archiving (the practice of collecting and storing websites), much human writing is inevitably, and often accidentally, lost.

The UK Web Archive now collects almost the entire UK web space. One of the problems facing users of the archive is the astounding amount of data through which to sift. One way of getting around this problem is to create so-called ‘special collections’: groups of websites that fall under a particular theme. This enables the curator to provide the user with a set of data that is easier to sort and search.

My special collection

[Image: ‘Walk Against Warming’, https://www.flickr.com/photos/erlandh/270904893/]

As a science PhD student, I felt my special collection should be built with the aim of answering research questions related to a scientific topic. I specialise in oceanography and past climate change, and I am aware of the almost constant debate that occurs on hundreds of climate-related websites about climate science, the social impacts of climate change and the policies that should be enforced. A special collection on these issues might be useful for answering questions such as: How has the web influenced public opinion on climate change? As new science rolls in, how do viewpoints expressed on the web change? How do different organisations use the web as a platform for promoting their beliefs?

[Image: ‘Global warming in perspective’, https://www.flickr.com/photos/wheatfields/4688140998/in/photolist-2XsBdQ-92Bik-7u74nu-a55CZL-s6TSND-89gWZd-8FJLyQ]

To provide a resource for answering these questions, I plan to select webpages from organisations including environmental charities, climate-sceptic think tanks, energy companies and government, as well as pages of blogs, articles and discussion. I hope that this collection will become a useful resource for anyone interested in the climate change issue. But would this resource be something researchers might actually use? And how might they go about using it? Find out in my next post.

Peter Spooner, Science Policy Intern

 

23 April 2015

Web archiving as a challenging business


My internship in the British Library’s Web Archiving team is coming to an end, and I will try to sum up my impressions. I would say I have been struck by what a daunting task web archiving is, and how many challenges it creates for professionals.

Displaying an open collection

The British Library provides the public with an open collection of websites, accessible from anywhere. These open collections are resource-intensive, being enriched with metadata and descriptions. This work is done by web curators and web archivists. The latter are also in charge of quality assurance: they check whether the harvest was done properly by the web-crawling software. Giving open access means asking permission from the website owners. This is a very labour-intensive and slow process, which could easily absorb two or three times the resources currently available. To cope with urgent events, such as the next General Election, the selection is done now, while the permission requests have to be postponed to a less busy time. For some resources, open access is not an option at all; some news websites, for example, charge for access to their own archives.

Providing searching tools

You’d think things would have got easier since the 2013 Legal Deposit Libraries (Non-Print Works) Regulations allowed the British Library to collect and preserve UK websites without asking permission. But new issues arise: collecting a huge quantity of data, indexing it, preserving it for the long term, and dealing with the fact that an archived website may not look the same as its live version. And then all this content must be made available to users (restricted to the reading rooms for websites collected without permission).


But how does one search a web archive? Anyone who has tried probably came away with the nagging sense that there is simply too much data to deal with. One of the challenges is therefore to provide users with efficient tools that enable them to find their way through this maze of data. Users, in turn, need to learn how to use these tools, bearing in mind that their expectations may be shaped by the habit of using Google. Using a web archive for scholarly purposes is a completely different approach: a historical search engine must meet specific requirements. There is no Google-like relevance ranking here, but a chronological ordering enhanced with powerful refinement features such as event detection and timelines. This research project from the L3S Research Centre in Germany is one among several involving web archives, and it shows that tool building goes hand in hand with the researchers who use web archives as material for their work.


Being involved in web archiving today is really fascinating. It means observing, and being part of, an emerging field. This was also discussed in the opening presentation of the 2014 IIPC General Assembly.

A new job?

Web archiving is not yet really part of librarians’ training, and professionals have to learn by doing. At the moment web archiving concerns only a handful of people, mostly based in national libraries (though this is becoming less true over time, as can be seen in the composition of the IIPC).


But the issues arising with web archiving are in line with general trends for libraries, such as managing electronic journals, which are mostly bought and displayed as packages, or mass digitisation projects. The new challenge consists in dealing with matters of scale. The core business of librarians is seemingly shifting from selecting resources to highlighting them. Social media channels are one of the new librarian’s tools for doing so: most digital libraries have a Twitter account (see the often humorous @GallicaBnF), as do the web archives (@internetarchive, @UKWebArchive, @DLWebBnF).


Apart from the archiving work these teams of specialists are doing, another task is the promotion of web archives inside the libraries themselves. Reference staff may not yet be comfortable with this new material, and still very few readers use the web archive. Another challenge to come!

Clémence Agostini (intern at the BL Web Archiving team from ENSSIB)

25 March 2015

Political parties in the UK Web Archive


With only six weeks to go until the General Election, it is a good time to look back at the web sphere of previous elections, through the 2005 and 2010 General Election websites collected by the UK Web Archive.

Websites of political parties currently represented in Parliament

The Conservative Party: http://www.webarchive.org.uk/ukwa/target/101940/source/search
The Labour Party: http://www.webarchive.org.uk/ukwa/target/101311/source/search
Liberal Democrats: http://www.webarchive.org.uk/ukwa/target/102621/source/search
UKIP: http://www.webarchive.org.uk/ukwa/target/109998/source/search
Green Party: http://www.webarchive.org.uk/ukwa/target/108088/source/search


Scottish parties

Scottish National Party (SNP): http://www.webarchive.org.uk/ukwa/target/30441472/source/search
Scottish Socialist Party: http://www.webarchive.org.uk/ukwa/target/99112/source/search

Welsh parties

Plaid Cymru - The Party of Wales: http://www.webarchive.org.uk/ukwa/target/102036/source/search

Northern Ireland parties

Democratic Unionist Party (DUP): http://www.webarchive.org.uk/ukwa/target/106592/source/search
Sinn Féin: http://www.webarchive.org.uk/ukwa/target/106020/source/search
Ulster Unionist Party (UUP): http://www.webarchive.org.uk/ukwa/target/105944/source/search
Social Democratic and Labour Party (SDLP): http://www.webarchive.org.uk/ukwa/target/107880/source/search
Alliance Party of Northern Ireland: http://www.webarchive.org.uk/ukwa/target/106002/source/search


Other parties

Respect Party: http://www.webarchive.org.uk/ukwa/target/40632374/source/search
British National Party (BNP): http://www.webarchive.org.uk/ukwa/target/106040/source/search
The Liberal Party: http://www.webarchive.org.uk/ukwa/target/40632386/source/search
Socialist Labour Party: http://www.webarchive.org.uk/ukwa/target/107243/source/search


English Democrats: http://www.webarchive.org.uk/ukwa/target/29261833/source/search
The Christian Party: http://www.webarchive.org.uk/ukwa/target/43810817/source/search
Health Concern (Independent Community & Health Concern): http://www.webarchive.org.uk/ukwa/target/37617688/source/search
Monster Raving Loony Party: http://www.webarchive.org.uk/ukwa/target/110017/source/search

Candidates

You can also find former candidates’ websites in the UK Web Archive, which might be interesting for checking whether old promises have been fulfilled. Below are some examples, but you can also try any other candidate by typing their name in the quick search box: http://www.webarchive.org.uk/ukwa/subject/89/page/1

David Miliband (2010): http://www.webarchive.org.uk/ukwa/target/49905672/source/search
Nick Clegg (2010): http://www.webarchive.org.uk/ukwa/target/43188235/source/search
Nigel Farage (2010): http://www.webarchive.org.uk/ukwa/target/44695591/source/search
Caroline Lucas (2010): http://www.webarchive.org.uk/ukwa/target/44695599/source/search
David Cameron (2005): http://www.webarchive.org.uk/wayback/archive/20050524120000/http://www.votedavidcameron.com/index.html


Enjoy!

Clémence Agostini (intern at the BL Web Archiving team from ENSSIB)

 

13 March 2015

France - UK: complementary views on web archiving



Considering the nature of the web, it is practically impossible to archive all of it, and choices have to be made. Usually two strategies are combined. The first aims at being representative, collecting a sample of everything without discrimination. The second selects websites in order to build a collection, as libraries have long done with more traditional material. The UK and France both combine the two methods.

The UK recently changed its legislation (on 6 April 2013) to bring non-print resources, including websites, into the scope of legal deposit. France had already made that shift in 2006.

Both national libraries use robots to broadly crawl the national web every year. In the UK the crawling is done by the British Library. The National Archives also collects websites related to government (the UK Government Web Archive), but this is done under separate legislation, the Public Records Act. In France, INA (Institut National de l’Audiovisuel) archives all the websites related to radio and television, while the BnF (Bibliothèque nationale de France) is in charge of all the rest.

To complement this broad harvesting, both countries create collections on specific topics, made up of websites selected by curators in their areas of expertise. In doing so, the national libraries may be helped by partners: researchers, associations, but mostly other libraries. In the UK, five other legal deposit libraries participate in web archiving. In France, a similar partnership operates with the network of regional libraries, which also contribute to legal deposit.

At the BnF, the Digital Legal Deposit Department coordinates a network of correspondents in each department, where specific policies have been developed over the years. The BnF’s overall selection policy is now being updated to include websites, on the basis that they are no different from any other material, which makes sense.

Breadth vs openness

The websites collected for legal deposit purposes can only be consulted in the libraries’ reading rooms, for copyright reasons. But while all the websites collected by the BnF are accessible only in the reading rooms dedicated to researchers, the British Library gives access to part of its collections through the UK Web Archive. This showcases websites for which permission has been obtained. The process is of course very time-consuming and frustrating, as only 30% of permission requests receive a positive answer and the vast majority receive no answer at all.

Exploring the collections

The BnF offers search by URL and a guided approach through specific topics, in order to give an overview of the collections. For example, one of its remarkable selections relates to private diaries on the web. Others concern elections, sustainable development, science and many other themes.


It’s similar in the open UK Web Archive, where you can browse the archive by special collection (the Queen’s Diamond Jubilee, Northern Ireland…). As in France, the choice of topic is often related to current affairs. At the moment, a collection about Magna Carta is being developed ahead of the forthcoming exhibition, as well as one on the next General Election.

Openness seems to be a good way of highlighting the collection. The Open UK Web Archive is promoted via the British Library’s website, this blog, Twitter… It provides fine visualisation tools and, most importantly, pretty good search functionality, based on title, URL and dates. There is also a full-text index for the massive legal deposit crawl, which is quite remarkable (to give an idea of the magnitude of the task, it will take about six months to generate the index for the 2014 crawl). Then, when you run a search, you sometimes get a very large number of results, and it can be far from easy to work through them, but that is another issue.


6 March 2015, Clémence Agostini (intern at the BL Web Archiving team from ENSSIB)

06 March 2015

2015 UK General Election Web Archive Special Collections


With just over 9 weeks to go until the UK General Election, the Web Archiving team together with curators in four Legal Deposit Libraries (the British Library, The National Library of Wales, the National Library of Scotland and the Bodleian Library) have been busy archiving websites for a special collection about this significant national event. 

It is a daunting task, but we are fairly experienced in this area having put together similar collections for the two past general elections, 2005 and 2010.


Sampling approach

We cannot predict the size of the UK political web sphere; however, there are 650 parliamentary constituencies, 422 registered political parties (Electoral Commission, December 2014) and several thousand prospective parliamentary candidates standing for election in 2015.

The vast majority of parties and candidates are likely to have social media channels in addition to their ‘official’ websites. Therefore, rather than attempting the impossible task of identifying every single political website, a sampling approach has been applied. All major and minor UK parties will be collected, along with a representative sample of c. 120 candidates taken from one urban conurbation and one shire county per region. For London, we have selected constituencies covering six boroughs: three inner London and three outer London. As we covered the same constituencies for the 2005 and 2010 elections, we will have a time series that will give future researchers a sense of how the web was used by politicians across the decade.

Political landscape

In addition, the collection will comprise a large number of news, commentary, opinion poll, research centre, think tank and interest group websites, as well as some more entertaining sites such as the Bus Pass Elvis Party, aka the Church of the Militant Elvis.

Inevitably, the political landscape, as well as the world of web archiving, has changed in the ten years since we started archiving UK general elections. Firstly, the date of the 2015 General Election was fixed in advance under the Fixed-term Parliaments Act 2011, meaning that campaigning started much earlier than in previous elections. This year we started collecting in January, whereas in previous years it was a little later in the year.

Of even more significance from the web archivist’s point of view, legal deposit legislation introduced in April 2013 enables us to archive pretty much everything we want within the UK web sphere, although permission must still be sought to make content publicly accessible.

One million tweets

In terms of the content of the collection, we are certainly archiving much more social media than in previous elections. Much has been written about the uptake of social media among politicians as they increasingly try to reach voters over the internet.


MPs sent almost one million tweets in 2013, up 28 per cent on the previous year and 230 per cent on 2011. It is crucial that we work to overcome the technical and legal challenges involved in archiving social media, as it is one of the most important channels for scholars studying our times and one of the types of content most in demand among researchers.

Visual tools

The resulting collection will be available online through the UK Web Archive in the case of content for which we have permission from website publishers, and in the reading rooms of the Legal Deposit Libraries for all other material we have collected. We also hope to continue improving access to our collections by way of data-based visual tools for exploring the archive’s content, as alternatives to the standard search and browse functions.

In 2005, for example, we implemented a word cloud generator for websites belonging to key political parties, showing the most frequently used words on those websites during the 2005 election campaign.
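The counting behind such a word cloud is simple in outline. Here is a toy sketch of the idea; the tokeniser and stop-word list are illustrative placeholders, not what the 2005 tool actually used:

```python
import re
from collections import Counter

# Placeholder stop-words; the real generator's list is not documented here.
STOPWORDS = {"the", "and", "for", "that", "with", "our", "will", "are"}

def top_words(text, n=25):
    """Tally the most frequent words in a page's extracted text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return counts.most_common(n)

print(top_words("We will cut taxes and protect the NHS. Taxes must fall."))
```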

Nominate

We would be delighted to hear about websites related to the UK general election and would encourage readers to submit suggestions on our nomination form at http://www.webarchive.org.uk/ukwa/info/nominate

 

Nicola Bingham, Web Archivist

02/03/15


05 March 2015

Happy Birthday Magna Carta! All the best from the Web Archive xxx



With the opening of the great exhibition here at the British Library just days away, I have been working on the Magna Carta special collection for the Web Archive.

Media Coverage

By coincidence, I started a couple of days after the magnificent discovery of a copy found in Sandwich in a bundle of Victorian documents. The media coverage was enormous, from the leading broadsheets to the satirical Daily Mash, whose headline claimed ‘Magna Carta gives England back to France’. Just looking through the headlines, it is quite interesting to see how the media use Magna Carta. The term, familiar from schooldays, is used in every possible way: the actual coverage of the 800th anniversary, the auction of a copy in the US in 2007, political analysis, legal impact, British values, TV reviews of David Starkey’s programme, and criticism of the Prime Minister, David Cameron, for his performance on David Letterman’s talk show, when he could not remember what Magna Carta actually means. If that was not bad enough, the media ‘re-printed’ Boris Johnson’s defence that the PM had ‘feigned ignorance’ on American TV.


Digital Magna Carta

More recently the media picked up on Tim Berners-Lee’s idea of a Magna Carta for the internet, and on the political idea of a new Magna Carta devolving power to the regions. The online newspapers (and other websites, including Salisbury Cathedral’s) also wrote about Jay-Z’s album ‘Magna Carta Holy Grail’. As a selector I am not sure whether to include the last three Magna Cartas (the internet, devolution and the album) in the collection. Is it going too far? If not, where to stop?

Searching for a fairly popular term always brings a sigh of relief (soooo many results – great!) and, at the same time, a sigh of worry (soooo many results – what am I going to do with all this material?!). It is also interesting to see the number of results: some publishers use the term ‘Magna Carta’ in many contexts, hoping to attract readers; some, on the contrary, just report the facts. The numbers of URLs vary, not only because of the type of audience, but simply because the open online archives of the newspapers cover different time periods. It is also good to see how much reporting is done at the local level, particularly in the cities that own copies of the historic document.

[Screenshot: a Google search for ‘Magna Carta’ – soooo many results]

The selections for the collection cover not only the media, but also social media coverage, arts and humanities, the involvement of the church and local authorities in the celebrations, higher education events, school and research programmes, the underpinning organisation Magna Carta 800th, civil rights groups, and tourist information and attractions, including the Magna Carta pub and the Magna Carta barge hotel.

There is also coverage of the Magna Carta cake, Magna Carta chutney, Magna Carta ale, a Magna Carta-inspired garden for a flower show, and celebration of the 800th anniversary with a #jelfie!

Surely there is more to come, and I am quite curious what else the online world will say about Magna Carta.


If you know of an event near you (no matter how low-key), or you have read something interesting, or just think something should be included in the collection, please nominate a site here: http://www.webarchive.org.uk/ukwa/info/nominate

Dorota Walker, Assistant Web Archivist

19 February 2015

Building a 'Historical Search Engine' is no easy thing


Over the last year the UK Web Archive has been part of the Big UK Domain Data for the Arts and Humanities project, with the ambitious goal of building a ‘historical search engine’ covering the early history of the UK web. This continues the work of the Analytical Access to the Domain Dark Archive project but at a greater scale, and moreover, with a much more challenging range of use cases. We presented the current prototype at the International Digital Curation Conference last week (written up by the DCC), and received largely positive feedback, at least in terms of how we have so far handled the scale of the collection.

What the researchers found
However, we are eagerly awaiting the results of the real test of this system, from the project’s bursary holders. Ten researchers have been funded as ‘expert users’ of the system, each with a genuine historical research question in mind. Their feedback will be critical in helping us understand the successes and failures of the system, and how it might be improved.

One of those bursary holders, Gareth Millward, has already talked about his experience, including this (somewhat mis-titled but otherwise excellent) Washington Post article “I tried to use the Internet to do historical research. It was nearly impossible.” Based on that, it seems like the results are something of a mixed bag (and from our informal conversations with the other bursary holders, we suspect that Gareth’s experiences are representative of the overall outcome). But digging deeper, it seems that this situation arises not simply because of problems with the technical solution, but because of conflicting expectations of how the search should behave.

For example, as Gareth states, if you search for RNIB using Google, the RNIB site and information about it is delivered right at the top of the results.

But does this reflect what our search engine should do?

Is a historical search engine like Google?
When Google ranks its results, it makes many assumptions: about the most important meanings of terms, the current needs of its users, and the information interests of specific users (also known as the filter bubble). What assumptions should we make? Are we even playing the same game?

One of the most important things we have learned so far is that we are not playing the same game, and that the information needs of our researchers might be very different from those of a normal search (and indeed differ between users). When a user searches for ‘iphone’, Google might guess that you care about the popular one, but a historian of technology might mean the late-1990s Internet Phone by VocalTec. Terms change their meaning over time, and we must enable our researchers to discover and distinguish the different usages. As Gareth says, “what is ‘relevant’ is completely in the eye of the beholder.”

Moreover, in a very fundamental way, the historians we have worked with are not searching for the one top document, or a small set of documents about a specific topic. They look to the web archive as a refracting lens onto the society that built it, and are using these documents as intermediaries, carrying messages from the past and about the past. In this sense, caring about the first few hits makes no sense. Every result is equally important.

How results are sorted
To help understand these whole sets of results, we have endeavoured to add appropriate filtering and sorting options that can be used to ‘slice and dice’ the data into more manageable chunks. At the most basic level (and contrary to the Washington Post article), the results are sorted, and the default is to sort by ascending harvest date. The contrast with a normal search engine is perhaps nowhere more stark than here: where Bing or Google will generally seek to bring you the most recent hits, we focus on the past, something that is very difficult to achieve using a normal search engine.
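To make this concrete, here is a minimal sketch of what a date-first query might look like against a Solr-style full-text index of the kind that backs such a prototype. The endpoint URL and the field names (`crawl_date`, `content`, `url`) are illustrative assumptions, not the actual UK Web Archive schema:

```python
import requests

# Hypothetical Solr endpoint and field names, for illustration only.
SOLR_SELECT = "http://localhost:8983/solr/webarchive/select"

params = {
    "q": 'content:"RNIB"',
    "sort": "crawl_date asc",  # oldest captures first, unlike Bing or Google
    "rows": 20,
    "wt": "json",
}
response = requests.get(SOLR_SELECT, params=params)
for doc in response.json()["response"]["docs"]:
    print(doc.get("crawl_date"), doc.get("url"))
```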

With so many search options, perhaps the biggest challenge has been to present them to our users in a comprehensible way. For example, the problem of RNIB advertisements for a talking watch polluting the search results can easily be remedied by combining the right search terms. The text of the advert is highly consistent, so it is possible to identify those advertisements precisely by searching for the phrase “in associate with the RNIB”. This means a search for RNIB can be refined to exclude those results (as you can see below).

[Screenshot: a SHINE search for ‘RNIB’ with the talking-watch adverts excluded]
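In query terms, the refinement amounts to subtracting the advert’s boilerplate phrase from the original search. A self-contained sketch in standard Lucene/Solr syntax, with the same illustrative endpoint and field names as before:

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/webarchive/select"  # illustrative

params = {
    # Match RNIB, but exclude documents containing the advert boilerplate.
    "q": 'content:"RNIB" AND NOT content:"in associate with the RNIB"',
    "sort": "crawl_date asc",
    "rows": 0,  # we only want the hit count here
    "wt": "json",
}
response = requests.get(SOLR_SELECT, params=params)
print(response.json()["response"]["numFound"], "matches after refinement")
```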


The problems are even more marked when it comes to allowing network analysis to be exploited. We already extract links from the documents, so it is possible to show how the number of sites linking to the RNIB has changed over time, but it is not yet clear how best to expose and utilise that information. At the moment, the best solution we have found is to present these network links as additional search facets. For example, here are the results for the sites that linked to rnib.org.uk in 2000, which you can contrast with those for 2010.
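A sketch of how that ‘links as facets’ idea might be queried: ask for everything crawled in a given year that links to rnib.org.uk, then facet on the linking site’s domain. The field names (`links_domains`, `crawl_year`, `domain`) are hypothetical stand-ins for whatever link-extraction fields the real index exposes:

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/webarchive/select"  # illustrative

params = {
    "q": "links_domains:rnib.org.uk",  # pages that link out to the RNIB
    "fq": "crawl_year:2000",           # swap in 2010 to compare the two years
    "rows": 0,
    "facet": "true",
    "facet.field": "domain",           # facet on the linking site's domain
    "facet.limit": 20,
    "wt": "json",
}
response = requests.get(SOLR_SELECT, params=params)
facets = response.json()["facet_counts"]["facet_fields"]["domain"]
# Solr returns facets as a flat [value, count, value, count, ...] list.
for name, count in zip(facets[::2], facets[1::2]):
    print(f"{name}: {count}")
```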

Refining searches further
Currently, we expect that refining a search on the web archive will involve a lot of this kind of operation, combining new search terms and clauses to help focus in on the documents of interest. Looking further ahead, we envisage that future iterations of this kind of service might take the research queries and curatorial annotations we collect and use that information to semi-automatically classify resources and better predict user needs.

A ‘Macroscope’ rather than a search engine
Despite the fact that it helps get the overall idea across, calling this system a ‘historical search engine’ turns out to be rather misleading. The actual experience and ‘information needs’ of our researchers are very different from that case. This is why we tend to refer to this system as a Macroscope (see here for more on macroscopes), or as a Web Observatory. Sometimes a new tool needs a new term.

Throughout all of this, the most crucial part has been to find ways of working closely with our users, so we can all work together to understand what a ‘Macroscope’ might mean. We can build prototypes, and use our users’ feedback to guide us, but at the same time those researchers have had to learn how to approach such a complex, messy dataset. Both the questions and the answers have changed over time, and all parties have had their expectations challenged. We look forward to continuing to build a better Macroscope, in partnership with that research community.

By Dr Andrew Jackson, Web Archiving Technical Lead, The British Library

30 January 2015

Collecting Data To Improve Tools


Like many other institutions, we are heavily dependent on a number of open source tools. We couldn’t function without them, and so we like to find ways to give back to those communities. We don’t have a lot of spare time or development capacity to contribute, but recently we have found another way to provide useful feedback.


Large-scale extraction

At the heart of our discovery stack lies Apache Tika, the software we use to parse the myriad data formats in our collection in order to extract the textual representation (along with any useful metadata) that goes into our search indexes. We have now run Apache Tika over many billions of distinct resources, dating from 1995 to the present day. Owing to the age and variability of the content, this often tests Tika to its limits. As well as failing to identify many formats, it sometimes simply fails, throwing an unexpected error or getting locked in an infinite loop.
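As a rough illustration of the extract-and-log pattern, here is a minimal sketch using the tika-python client. The real pipeline runs Tika at scale over archived web content, and also needs hard timeouts to defend against the infinite-loop case, which a simple try/except cannot catch:

```python
from tika import parser  # tika-python client talking to an Apache Tika server

def extract_with_logging(paths):
    """Extract text from each file, remembering which ones Tika fails on."""
    texts, failures = {}, {}
    for path in paths:
        try:
            parsed = parser.from_file(path)
            texts[path] = parsed.get("content") or ""
        except Exception as exc:
            # Record the error class alongside the resource identity,
            # rather than silently dropping the resource.
            failures[path] = type(exc).__name__
    return texts, failures
```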

Logging losses

Each of those failures represents a loss: a resource that may never be discovered because we cannot understand it. This may be because it is malformed, perhaps even damaged during download. It may also be a sign of obsolescence, in that it may indicate the presence of data formats that are poorly understood and are therefore likely to present a challenge to our discovery and access systems. So, instead of ignoring these errors, we decided to remember them. Specifically, each is logged as a facet of our full-text index, alongside the identity of the resource that caused the problem.
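A sketch of what ‘remembering’ a failure might look like in practice: index a stub document carrying a parse-error field, which can then be offered as a search facet. Again, the endpoint and the field names (`parse_error` in particular) are assumptions for illustration, not our actual schema:

```python
import requests

SOLR_UPDATE = "http://localhost:8983/solr/webarchive/update?commit=true"

# A stub record for a resource Tika could not parse: the identity of the
# resource plus the error class, stored as an indexed, facetable field.
doc = {
    "id": "20000101120000/http://example.org/broken.doc",
    "url": "http://example.org/broken.doc",
    "crawl_date": "2000-01-01T12:00:00Z",
    "parse_error": "TikaException",  # hypothetical value captured earlier
}
requests.post(SOLR_UPDATE, json=[doc])
```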

Sharing the results

We’ve been collecting this data for a while, in order to help us tell a broken bitstream from a forgotten format. However, in a recent discussion with the Apache Tika developers, they indicated that they would also find this data useful as a way of improving the coverage and robustness of their software.

This turns out to be a win-win situation. We store the data we were intending to store anyway, but also share it with the tool developers, who get to improve their software in ways we will be able to take direct advantage of as we run later versions of the tool over our archives in the future.

And it feels good to give a little something back.

– by Andy Jackson

@anjacks0n