UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library's web archiving team and guests: posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

13 March 2015

France - UK: complementary views on web archiving


[Image: flags of France and the UK]

Given the nature of the web, archiving all of it is practically impossible, so choices have to be made. Usually two strategies are combined. The first aims at representativeness, collecting a sample of everything without discrimination. The second selects websites in order to build a collection, as libraries have always done with more traditional material. The UK and France both combine the two methods.

The UK recently changed its legislation (on 6 April 2013) to bring non-print resources, including websites, within the scope of legal deposit. France made that shift back in 2006.

Both national libraries use robots to crawl the national web broadly every year. In the UK the crawling is done by the British Library. The National Archives also collects government websites (the UK Government Web Archive), but under a different regulation, the Public Records Act. In France, INA (Institut National de l'Audiovisuel) archives all the websites related to radio and television, while the BnF (Bibliothèque nationale de France) is in charge of the rest.

To complement this broad harvesting, both countries create collections on specific topics, made up of websites selected by curators in their areas of expertise. The national libraries may be helped by partners: researchers, associations, but mostly other libraries. In the UK five other libraries share responsibility for legal deposit and participate in web archiving. In France a similar partnership operates with the network of regional libraries, which also contribute to legal deposit.

At the BnF, the Digital Legal Deposit Department coordinates a network of correspondents in each curatorial department, where specific policies have been developed over the years. The BnF's overall selection policy is now being updated to include websites, on the sensible principle that they are no different from any other material.

Breadth vs openness

The websites collected for legal deposit purposes can only be consulted in the libraries' reading rooms, for copyright reasons. But while all the websites collected by the BnF are accessible only in the reading rooms dedicated to researchers, the British Library gives access to part of its collections through the UK Web Archive, which showcases websites for which permission has been obtained. That process is of course very time-consuming and frustrating: only 30% of permission requests receive a positive answer, and the vast majority receive no answer at all.

Exploring the collections

The BnF offers search by URL and a guided approach through specific topics, in order to give an overview of the collections. For example, one of its remarkable selections relates to private diaries on the web. Others concern elections, sustainable development, science and many other themes.

[Image: BnF web archive screenshot]

It's similar in the open UK Web Archive, where you can browse by special collection (the Queen's Jubilee, Northern Ireland…). As in France, the choice of a topic is often related to current affairs. At the moment a collection about Magna Carta is being developed ahead of the forthcoming exhibition, as is one on the next General Election.

Openness seems a good goal for highlighting the collection. The open UK Web Archive is promoted via the British Library's website, this blog, Twitter… It provides fine visualisation tools and, most importantly, pretty good search functionality, based on title, URL and dates. There is also a full-text index for the massive legal deposit crawl, which is quite remarkable (to give an idea of the magnitude of the task, it will take about six months to generate the index of the 2014 crawl). Admittedly, when you run a search you sometimes get a very large number of results, and it can be far from easy to go through them, but that is another issue.

[Image: UK Web Archive special collections screenshot]

6-03-2015 Clémence Agostini (intern at the BL Web Archiving team from ENSSIB)

06 March 2015

2015 UK General Election Web Archive Special Collections


With just over 9 weeks to go until the UK General Election, the Web Archiving team together with curators in four Legal Deposit Libraries (the British Library, The National Library of Wales, the National Library of Scotland and the Bodleian Library) have been busy archiving websites for a special collection about this significant national event. 

It is a daunting task, but we are fairly experienced in this area, having put together similar collections for the past two general elections, in 2005 and 2010.

[Image: UK General Election]

Sampling approach

We cannot predict the size of the UK political web sphere; however, there are 650 parliamentary constituencies, 422 registered political parties (Electoral Commission, December 2014) and several thousand prospective parliamentary candidates standing for election in 2015.

The vast majority of parties and candidates are likely to have social media channels in addition to their 'official' websites. Rather than attempting the impossible task of identifying every single political website, therefore, a sampling approach has been applied. All major and minor UK parties will be collected, along with a representative sample of c. 120 candidates taken from one urban conurbation and one shire county per region. For London, we have selected constituencies covering six boroughs: three inner London and three outer London. As we covered the same constituencies for the 2005 and 2010 elections, we will have a time series that will give future researchers a sense of how the web was used by politicians across the decade.
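
For illustration only, here is a minimal Python sketch of what drawing such a sample might look like: one urban and one shire-county constituency per region, with a fixed seed so the draw is repeatable. The constituency lists are invented placeholders, and the real selection was made by curators rather than by script.

```python
import random

# Hypothetical constituency lists; the real sample was drawn by curators
# from one urban conurbation and one shire county in each region.
regions = {
    "North West": {
        "urban": ["Manchester Central", "Salford and Eccles", "Wythenshawe"],
        "shire": ["Lancaster and Fleetwood", "Ribble Valley", "Fylde"],
    },
    "East Midlands": {
        "urban": ["Nottingham North", "Nottingham East", "Nottingham South"],
        "shire": ["Rutland and Melton", "Harborough", "Charnwood"],
    },
}

def sample_constituencies(regions, seed=2015):
    """Pick one urban and one shire constituency per region."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    picks = []
    for region, groups in regions.items():
        picks.append((region, "urban", rng.choice(groups["urban"])))
        picks.append((region, "shire", rng.choice(groups["shire"])))
    return picks

for region, kind, name in sample_constituencies(regions):
    print(f"{region}: {name} ({kind})")
```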

Political landscape

In addition, the collection will comprise a large number of news, commentary, opinion poll, research centre, think tank and interest group websites, as well as some more entertaining sites such as the Bus Pass Elvis Party, aka the Church of the Militant Elvis.

Inevitably, both the political landscape and the world of web archiving have changed in the ten years since we started archiving UK general elections. Firstly, the date of the 2015 General Election was fixed in advance under the Fixed-term Parliaments Act 2011, meaning that campaigning started much earlier than in previous elections. This year we started collecting in January, whereas in previous years it has been somewhat later in the year.

Of even more significance from the web archivist's point of view, legal deposit legislation introduced in April 2013 enables us to archive pretty much everything we want within the UK web sphere, although permission must still be sought to make content publicly accessible.

One million tweets

In terms of the content of the collection, we are certainly archiving much more social media than in previous elections. Much has been written about the uptake of social media among politicians as they increasingly try to reach voters over the internet.

[Image: Twitter word cloud]

MPs sent almost one million tweets in 2013, up 28 per cent on the previous year and 230 per cent on 2011. It is crucial that we work to overcome the technical and legal challenges involved in archiving social media: it is one of the most important channels for scholars studying our times, and one of the types of content most in demand from researchers.

Visual tools

[Image: UK Web Archive visualisation]
The resulting collection will be available online through the UK Web Archive for content where we have permission from the website publishers, and in the reading rooms of the Legal Deposit Libraries for all other material we have collected. We also hope to continue improving access to our collections through data-based visual tools, as alternatives to the standard search and browse functions.

In 2005, for example, we implemented a word cloud generator for the websites of key political parties, showing the most frequently used words on those websites during the 2005 election campaign.
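
The counting step behind such a generator is straightforward. Here is a toy sketch of that step (our illustration, not the original 2005 code; the stop-word list is deliberately tiny):

```python
import re
from collections import Counter

# A tiny stop-word list for the example; a real generator would use a fuller one
STOPWORDS = {"the", "and", "of", "to", "a", "in", "for", "our", "is"}

def top_terms(text, n=10):
    """Return the n most frequent words, ignoring case and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

campaign_text = "Our plan for the economy. Our plan for schools. Vote for change."
print(top_terms(campaign_text))
# [('plan', 2), ('economy', 1), ('schools', 1), ('vote', 1), ('change', 1)]
```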

Nominate

We would be delighted to hear about websites related to the UK General Election, and would encourage readers to submit suggestions via our nomination form at http://www.webarchive.org.uk/ukwa/info/nominate

 

Nicola Bingham, Web Archivist

02/03/15

[Image: UK Web Archive nomination form]

05 March 2015

Happy Birthday Magna Carta! All the best from the Web Archive xxx


[Image: Magna Carta]

With the opening of the great exhibition here at the British Library just days away, I have been working on the Magna Carta special collection for the Web Archive.

Media Coverage

By coincidence, I started a couple of days after the magnificent discovery of a copy in Sandwich, found in a bundle of Victorian documents. The media coverage was enormous, from the leading broadsheets to the satirical Daily Mash, which claimed that 'Magna Carta gives England back to France'. Just looking through the headlines, it is quite interesting to see how the media are using Magna Carta. The term, familiar from schooldays, is used in every possible way: the actual coverage of the 800th anniversary, the auction of a copy in the US in 2007, political analysis, legal impact, British values, TV reviews of David Starkey's programme, and criticism of the Prime Minister, David Cameron, for his performance on David Letterman's talk show, when he could not remember what 'Magna Carta' actually means. If that was not bad enough, the media 're-printed' Boris Johnson's defence that the PM had 'feigned ignorance' on American TV.

[Image: My Digital Rights]

Digital Magna Carta

More recently the media picked up on Tim Berners-Lee's idea of a Magna Carta for the internet, and on the political idea of a new Magna Carta devolving power to the regions. The online newspapers (and other websites, including Salisbury Cathedral's) also wrote about Jay-Z's album 'Magna Carta - Holy Grail'. As a selector I am not sure whether to include the last three Magna Cartas (the internet, devolution and the album) in the collection. Is it going too far? If not, where to stop?

Searching for a fairly popular term always brings a sigh of relief: soooo many results – great! And at the same time a sigh of worry: soooo many results – what am I going to do with all the material?! It is also interesting to compare the numbers of results: some publishers use the term 'Magna Carta' in many contexts hoping to attract readers; some, on the contrary, just report the facts. The numbers of URLs vary, not only because of the type of audience, but simply because the newspapers' open online archives cover different time periods. It is also good to see how much reporting is done at the local level, particularly in the cities owning copies of the historic document.

[Image: Google search for 'Magna Carta']

Soooo many results

The selections for the collection cover not only the media but also social media coverage, arts and humanities, the involvement of the church and local authorities in the celebrations, higher education events, school and research programmes, the underpinning organisation Magna Carta 800th, civil rights groups, and tourist information and attractions, including the Magna Carta pub and the Magna Carta barge hotel.

There is also coverage of the Magna Carta cake, Magna Carta chutney, Magna Carta ale, a Magna Carta-inspired garden for a flower show, and celebrating the 800th anniversary with a #jelfie!

[Image: UK Web Archive nomination form]

Surely there is more to come, and I am quite curious what else the online world will say about Magna Carta.


If you know of an event near you (no matter how low-key), have read something interesting, or just think something should be included in the collection, please nominate a site here: http://www.webarchive.org.uk/ukwa/info/nominate

Dorota Walker, Assistant Web Archivist

19 February 2015

Building a 'Historical Search Engine' is no easy thing


Over the last year the UK Web Archive has been part of the Big UK Domain Data for the Arts and Humanities project, with the ambitious goal of building a ‘historical search engine’ covering the early history of the UK web. This continues the work of the Analytical Access to the Domain Dark Archive project but at a greater scale, and moreover, with a much more challenging range of use cases. We presented the current prototype at the International Digital Curation Conference last week (written up by the DCC), and received largely positive feedback, at least in terms of how we have so far handled the scale of the collection.

What the researchers found
However, we are eagerly awaiting the results of the real test of this system, from the project’s bursary holders. Ten researchers have been funded as ‘expert users’ of the system, each with a genuine historical research question in mind. Their feedback will be critical in helping us understand the successes and failures of the system, and how it might be improved.

One of those bursary holders, Gareth Millward, has already talked about his experience, including this (somewhat mis-titled but otherwise excellent) Washington Post article “I tried to use the Internet to do historical research. It was nearly impossible.” Based on that, it seems like the results are something of a mixed bag (and from our informal conversations with the other bursary holders, we suspect that Gareth’s experiences are representative of the overall outcome). But digging deeper, it seems that this situation arises not simply because of problems with the technical solution, but because of conflicting expectations of how the search should behave.

For example, as Gareth states, if you search for RNIB using Google, the RNIB site and information about it is delivered right at the top of the results.

But does this reflect what our search engine should do?

Is a historical search engine like Google?
When Google ranks its results, it is making many assumptions: about the most important meanings of terms, about the current needs of its users, and about the information interests of specific users (also known as the filter bubble). What assumptions should we make? Are we even playing the same game?

One of the most important things we have learned so far is that we are not playing the same game: the information needs of our researchers might be very different from those of a normal search (and indeed differ between users). When a user searches for 'iphone', Google might guess that you care about the popular one, but a historian of technology might mean the late-1990s Internet Phone by VocalTec. Terms change their meaning over time, and we must enable our researchers to discover and distinguish the different usages. As Gareth says, "what is 'relevant' is completely in the eye of the beholder."

Moreover, in a very fundamental way, the historians we have worked with are not searching for the one top document, or a small set of documents about a specific topic. They look to the web archive as a refracting lens onto the society that built it, and are using these documents as intermediaries, carrying messages from the past and about the past. In this sense, caring about the first few hits makes no sense. Every result is equally important.

How results are sorted
To help users make sense of these whole sets of results, we have endeavoured to add appropriate filtering and sorting options that can be used to 'slice and dice' the data into more manageable chunks. At the most basic level (and contrary to the Washington Post article), the results are sorted, and the default is to sort by ascending harvest date. The contrast with a normal search engine is perhaps nowhere more stark: where Bing or Google will generally seek to bring you the most recent hits, we focus on the past, something that is very difficult to achieve using a normal search engine.

With so many search options, perhaps the biggest challenge has been to present them to our users in a comprehensible way. For example, the problem of RNIB advertisements for a talking watch polluting the search results can easily be remedied by combining the right search terms. The text of the advert is highly consistent, so those advertisements can be identified precisely by searching for the phrase "in associate with the RNIB", and a search for RNIB can then be refined to exclude them (as you can see below).
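
In Solr's query syntax, that refinement is simply a positive term combined with a negated phrase. A minimal sketch of running such a query over HTTP follows; the endpoint URL and field names (crawl_date, url) are assumptions for illustration, not Shine's actual configuration:

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/ukwa/select"  # hypothetical endpoint

params = {
    # keep RNIB matches, but drop the boilerplate talking-watch adverts
    "q": 'RNIB AND NOT "in associate with the RNIB"',
    "sort": "crawl_date asc",  # the default: earliest captures first
    "rows": 20,
    "wt": "json",
}

response = requests.get(SOLR_SELECT, params=params).json()
for doc in response["response"]["docs"]:
    print(doc.get("crawl_date"), doc.get("url"))
```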

[Image: Shine search results for RNIB, excluding the talking-watch adverts]


The problems are even more marked when it comes to allowing network analysis to be exploited. We already extract links from the documents, so it is possible to show how the number of sites linking to the RNIB has changed over time, but it is not yet clear how best to expose and utilise that information. At the moment, the best solution we have found is to present these network links as additional search facets. For example, here are the results for the sites that linked to rnib.org.uk in 2000, which you can contrast with those for 2010.
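
To sketch how links can serve as facets, suppose the index stored each document's outbound link hosts in a dedicated field; a query like the following would then count the sites linking to rnib.org.uk in a given year (all field and endpoint names here are hypothetical):

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/ukwa/select"  # hypothetical endpoint

params = {
    "q": 'links_hosts:"rnib.org.uk"',  # documents that link to the RNIB
    "fq": "crawl_year:2000",           # restrict to captures from 2000
    "rows": 0,                         # only the facet counts are wanted
    "facet": "true",
    "facet.field": "host",             # facet on the linking site's host
    "wt": "json",
}

response = requests.get(SOLR_SELECT, params=params).json()
print(response["facet_counts"]["facet_fields"]["host"])
```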

Refining searches further
Currently, we expect that refining a search on the web archive will involve a lot of this kind of operation, combining new search terms and clauses to help focus in on the documents of interest. Looking further ahead, therefore, we envisage that future iterations of this kind of service might take the research queries and curatorial annotations we collect and use that information to semi-automatically classify resources and better predict user needs.

A ‘Macroscope’ rather than a search engine
Despite the fact that it helps get the overall idea across, calling this system a 'historical search engine' turns out to be rather misleading: the actual experience and 'information needs' of our researchers are very different. This is why we tend to refer to it as a Macroscope (see here for more on macroscopes), or as a Web Observatory. Sometimes a new tool needs a new term.

Throughout all of this, the most crucial part has been finding ways of working closely with our users, so that together we can work out what a 'Macroscope' might mean. We can build prototypes and use our users' feedback to guide us, but at the same time those researchers have had to learn how to approach such a complex, messy dataset. Both the questions and the answers have changed over time, and all parties have had their expectations challenged. We look forward to continuing to build a better Macroscope, in partnership with that research community.

By Dr Andrew Jackson, Web Archiving Technical Lead, The British Library

30 January 2015

Collecting Data To Improve Tools


Like many other institutions, we are heavily dependent on a number of open source tools. We couldn’t function without them, and so we like to find ways to give back to those communities. We don’t have a lot of spare time or development capacity to contribute, but recently we have found another way to provide useful feedback.

[Image: Apache Tika logo]

Large-scale extraction

At the heart of our discovery stack lies Apache Tika, the piece of software we use to parse the myriad data formats in our collection, in order to extract the textual representation (along with any useful metadata) that goes into our search indexes. Consequently, we have now run Apache Tika over many billions of distinct resources, dating from 1995 to the present day. Owing to the age and variability of the content, this often tests Tika to its limits. As well as failing to identify many formats, it sometimes simply fails, throwing an unexpected error or getting locked in an infinite loop.
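
Surviving those failure modes means wrapping every Tika invocation so that one bad resource cannot crash or stall the indexing of billions of others. Here is a minimal sketch using the Tika command-line jar; the jar path and timeout are assumptions, and our production stack drives Tika on the JVM at a much larger scale:

```python
import subprocess

TIKA_JAR = "tika-app.jar"  # path to the Tika CLI jar (assumption)

def extract_text(path, timeout=60):
    """Extract text from one resource, surviving crashes and hangs."""
    try:
        result = subprocess.run(
            ["java", "-jar", TIKA_JAR, "--text", path],
            capture_output=True, text=True, timeout=timeout,
        )
        if result.returncode != 0:
            # Tika failed outright: keep the error instead of discarding it
            lines = result.stderr.strip().splitlines() or ["unknown error"]
            return None, lines[-1]
        return result.stdout, None
    except subprocess.TimeoutExpired:
        # A runaway parse (e.g. an infinite loop) is also worth recording
        return None, "TimeoutExpired"

text, error = extract_text("example.pdf")
print(error or f"extracted {len(text)} characters")
```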

Logging losses

Each of those failures represents a loss: a resource that may never be discovered because we can't understand it. This may be because the resource is malformed, perhaps even damaged during download. It may also be a sign of obsolescence, indicating the presence of data formats that are poorly understood and therefore likely to present a challenge to our discovery and access systems. So, instead of ignoring these errors, we decided to remember them. Specifically, each is logged as a facet of our full-text index, alongside the identity of the resource that caused the problem.
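
On the indexing side, remembering a failure can be as simple as attaching the error string to the Solr document as an extra field, which then becomes facetable alongside the content. A sketch, with illustrative endpoint and field names rather than our production schema:

```python
import requests

SOLR_UPDATE = "http://localhost:8983/solr/ukwa/update?commit=true"  # hypothetical

def index_resource(url, text, error):
    """Index a resource, remembering any parse failure as a field."""
    doc = {"url": url, "content": text or ""}
    if error:
        doc["parse_error"] = error  # the failure itself becomes searchable
    requests.post(SOLR_UPDATE, json=[doc]).raise_for_status()

# Faceting on parse_error then shows which failures are most common:
#   .../select?q=*:*&rows=0&facet=true&facet.field=parse_error
index_resource("http://example.org/old.doc", None, "TimeoutExpired")
```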

Sharing the results

We've been collecting this data for a while, to help us tell a broken bitstream from a forgotten format. In a recent discussion, however, the Apache Tika developers indicated that they would also find this data useful as a way of improving the coverage and robustness of their software.

This turns out to be a win-win situation. We store the data we were intending to store anyway, but also share it with the tool developers, who get to improve their software in ways we will be able to take direct advantage of as we run later versions of the tool over our archives in the future.

And it feels good to give a little something back.

– by Andy Jackson

@anjacks0n 

 

28 January 2015

Spam as a very ephemeral (and annoying) genre…


Spam is a part of modern life. Whoever hasn't received any recently is a lucky person indeed. But just try putting your email address out there in the open and you'll be blessed with endless messages you don't want, from people you don't know, from places you've never heard of! And then it's delete, de-le-te, the block-sender command…

Imagine, though, someone researching our web lives in, say, 50 years, and finding that this part of our daily existence is nowhere to be found. Spam is the ugly sister of the web archive: it is unlikely we'll keep spam messages in our inboxes, and almost certainly no institution will keep them for posterity. And yet they are such great research material. They vary in topic, they can be funny, they can be dangerous (especially to your wallet), and they make you shake your head in disbelief…

We all know the spam emails about people who got stuck somewhere, can't pay the bill and ask for a modest sum of £2,500 or so. These always make me think: if I had a spare £2,500, it'd be Bora Bora here I come, but that's just selfish me! Now these have been taken to a new level: it's about giving us money that is inconveniently placed in a bank somewhere far, far away:

Charity spree

From Mrs A.J., the widow of a Kuwait embassy worker in Ivory Coast, with a very English surname:

…Currently, this money is still in the bank. Recently, my doctor told me I would not last for the next eight months due to cancer problem. What disturbs me most is my stroke sickness. Having known my condition I decided to donate this fund to a charity or the man or woman who will utilize this money the way I am going to instruct here godly.

Strangely, two weeks later a Libyan lady, also a widow, wrote to me that she too had suffered a stroke, and that all she wants is to shower me with money as part of her charity spree:

Having donated to several individuals and charity organization from our savings, I have decided to anonymously donate the last of our family savings to you. Irrespective of your previous financial status, please do accept this kind and peaceful offer on behalf of my beloved family.

[Image: spam email]


Mr P.N., 'an accountant with the ministry of Energy and natural resources South Africa', was straight to the point:

… presently we discovered the sum of 8.6 million British pounds sterling, floating in our suspense Account. This money as a matter of fact was an over invoiced Contract payment which has been approved for payment Since 2006, now we want to secretly transfer This money out for our personal use into an overseas Account if you will allow us to use your account to Receive this fund, we shall give you 30% for all your Effort and expenses you will incure if you agree to Help.

My favourite is quite light-hearted. I got it from a 32-year-old Swedish girl:

My aim of writing you is for us to be friends, a distance friend and from there we can take it to the next level, I writing this with the purest of heart and I do hope that it will your attention. In terms of what I seek in a relationship, I'd like to find a balance of independence and true intimacy, two separate minds and identities forged by trust and open communication. If any of this strikes your fancy, do let me know...

So what if I'm a girl too, with a husband and a kid? You never know what may come in handy…

Blog post by Dorota Walker 
Assistant Web Archivist

@DorotaWalker 

 

Further reading: spam emails received by [email protected]. Please note that the quotations come from the emails themselves; I have left the original spelling intact.

11 November 2014

Collecting First World War Websites – November 2014 update


Earlier in 2014 we blogged about the new special collection of websites related to World War One that we have put together to mark the centenary. As today is Armistice Day, commemorating the cessation of hostilities on the Western Front, it seems fitting to look at what we have collected so far.


The collection has been growing steadily over the past few months and now totals 111 websites. A significant subset comes from the output of projects funded by the Heritage Lottery Fund. The collection also includes websites selected by subject specialists at the British Library and nominations from members of the public.

A wide variety of websites have been archived so far which can broadly be categorised into a few different types:

Critical reflections
These include critical reflections on British involvement in armed conflict more generally, for example the Arming All Sides website, which features a discussion of the arms trade around WW1, and Naval-History.net, an invaluable academic resource on the history of naval conflict in the First and Second World Wars.

Artistic and literary
The First World War inspired a wealth of artistic and literary output. One example is the website dedicated to Eugène Burnand (1850-1921), a Swiss artist who created a series of pencil and pastel portraits depicting the various 'military types' of all races and nationalities drawn into the conflict on all sides. Burnand was a man of great humanity, and his subjects included typical men and women who served in the War as well as those of more significant military rank.

The collection also includes the websites of contemporary artists who, in connection with the centenary, are creating work reflecting on the history of the conflict. One such artist is Dawn Cole, whose WW1 work has focused on the archive of WW1 VAD nurse Clarice Spratling's diaries, in a project of live performance, spoken word and art installations.

Similar creative reflections from the worlds of theatre, film and radio can be seen in the archive. See, for example, Med Theatre: Dartmoor in WW1, an eighteen-month project investigating the effect the First World War had on Dartmoor and its communities. Pals for Life is a project based in the north-west aiming to create short films that enable local communities to learn about World War One. Subterranean Sepoys is a radio play resulting from the work of volunteers researching the forgotten stories of Indian soldiers and their British officers in the trenches of the Western Front in the first year of the Great War.

Community stories
The largest number of websites archived so far comprises projects produced by individuals or local groups telling stories of the War at a community level across the UK. The Bottesford Parish 1st World War Centenary Project focuses on 220 local recruits who served in the War, using wartime biographies, memorabilia and memories still in the community to tell their stories.

The Wylye Valley 1914 project was set up by a Wiltshire-based local history group researching the Great War and the sudden, dramatic social and practical effects it had on the local population. In 1914, 24,000 troops descended on the Wylye Valley villages, the largest of which had a population of 500, in response to Kitchener's appeals for recruits. These men arrived without uniforms, accommodation or any experience of organisation. The project explores the effects of the War on these men and the impact on the local communities.

An important outcome of the commemorations of the centenary of WW1 has been the restoration and transcription of war memorials across the UK. Many local projects have used the opportunity to tell the stories of those who were lost in the conflict. Examples include the Dover War Memorial Project; the Flintshire War Memorials Project; the Leicester City, County and Rutland War Memorials project; and the St James Toxteth War Memorials project.

Collecting continues
This shows just some of the many ways people are choosing to commemorate the First World War and demonstrates the continued fascination with it.

We will continue collecting First World War websites throughout the centenary period, to 2018 and beyond. If you own a website, or know of one, about WW1 and would like to nominate it for archiving, we would love to hear from you. Please submit the details on our nomination form.

By Nicola Bingham, Web Archivist, The British Library

03 November 2014

Powering the UK Web Archive search with Solr


When there are hundreds of millions of webpages to search, what technologies do we use at the UK Web Archive to ensure the best service?

Solr
At the core of the UK Web Archive is the open source tool Apache Solr. To quote from their own website, ‘Solr is a very popular open source enterprise search platform that provides full-text and faceted searching’.

It is built using scalable and fault-tolerant technologies, providing distributed indexing, automated failover and recovery, and centralised configuration management, and lots more besides. Put simply, Solr is proactively pushing forward on all aspects of big-data search indexing and querying.

Open UK Web Archive
The UKWA website provides public access to more than 200 million selected UK webpages. (The selection process includes gaining permission from the website owner to publish the archived site; you can nominate a website for archiving via our Nominate a Site page.)

Once a site is harvested it is stored internally on several systems to ensure the safekeeping of the data. From these stores the data is ingested into the Solr service, which analyses the metadata and content, primarily to enable fast querying of the service. Much of Solr's speed comes from the way it indexes this data, using what is known as an inverted index.
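
The principle of an inverted index fits in a few lines of Python: instead of storing pages and scanning each one for a term, you store terms and the pages in which they occur, so a query becomes a direct lookup rather than a scan. A toy sketch:

```python
from collections import defaultdict

pages = {
    "page1": "the british library archives the uk web",
    "page2": "the uk web archive uses apache solr",
}

# Invert: map each term to the set of pages containing it
index = defaultdict(set)
for page_id, text in pages.items():
    for term in text.split():
        index[term].add(page_id)

print(index["solr"])               # {'page2'}
print(index["uk"] & index["web"])  # pages containing both 'uk' and 'web'
```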

Capable servers
To support these archived websites and provide the UK Web Archive search, we run the service on two dual-Xeon servers: an HP ProLiant DL580 G5 with 96GB of RAM and an HP ProLiant DL380 G5 with 64GB of RAM. The data is stored on a Storage Area Network (SAN) accessed over Fibre Channel connections.


The Solr service itself runs inside the Apache Tomcat servlet container and is split across the two physical servers in a master and slave setup: one provides the data ingest mechanism, the other the data querying mechanism for the public website.

Scalability
One of the benefits of using Apache Solr is that it is fairly simple to grow a system, in terms of both speed and data capacity. As the amount of web content increases, we can add more hardware to handle the extra load, as Solr is designed from the outset as a distributed service.

By Gil Hoggarth, Web Archiving Technical Services Engineer, The British Library