THE BRITISH LIBRARY

UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

19 February 2015

Building a 'Historical Search Engine' is no easy thing

Over the last year the UK Web Archive has been part of the Big UK Domain Data for the Arts and Humanities project, with the ambitious goal of building a ‘historical search engine’ covering the early history of the UK web. This continues the work of the Analytical Access to the Domain Dark Archive project but at a greater scale, and moreover, with a much more challenging range of use cases. We presented the current prototype at the International Digital Curation Conference last week (written up by the DCC), and received largely positive feedback, at least in terms of how we have so far handled the scale of the collection.

What the researchers found
However, we are eagerly awaiting the results of the real test of this system, from the project’s bursary holders. Ten researchers have been funded as ‘expert users’ of the system, each with a genuine historical research question in mind. Their feedback will be critical in helping us understand the successes and failures of the system, and how it might be improved.

One of those bursary holders, Gareth Millward, has already talked about his experience, including this (somewhat mis-titled but otherwise excellent) Washington Post article “I tried to use the Internet to do historical research. It was nearly impossible.” Based on that, it seems like the results are something of a mixed bag (and from our informal conversations with the other bursary holders, we suspect that Gareth’s experiences are representative of the overall outcome). But digging deeper, it seems that this situation arises not simply because of problems with the technical solution, but because of conflicting expectations of how the search should behave.

For example, as Gareth states, if you search for RNIB using Google, the RNIB site and information about it is delivered right at the top of the results.

But does this reflect what our search engine should do?

Is a historical search engine like Google?
When Google ranks its results, it is making many assumptions: about the most important meanings of terms, the current needs of its users, and the information interests of specific users (also known as the filter bubble). What assumptions should we make? Are we even playing the same game?

One of the most important things we have learned so far is that we are not playing the same game, and the information needs of our researchers might be very different to those of a normal search (and indeed different between different users). When a user searches for ‘iphone’, Google might guess that they care about the popular one, but a historian of technology might mean the late 1990s Internet Phone by VocalTec. Terms change their meaning over time, and we must enable our researchers to discover and distinguish the different usages. As Gareth says, “what is ‘relevant’ is completely in the eye of the beholder.”

Moreover, in a very fundamental way, the historians we have worked with are not searching for the one top document, or a small set of documents about a specific topic. They look to the web archive as a refracting lens onto the society that built it, and are using these documents as intermediaries, carrying messages from the past and about the past. In this sense, caring about the first few hits makes no sense. Every result is equally important.

How results are sorted
To help understand these whole sets of results, we have endeavoured to add appropriate filtering and sorting options that can be used to ‘slice and dice’ the data down into more manageable chunks. At the most basic level (and contrary to the Washington Post article), the results are sorted, and the default is to sort by ascending harvest date. The contrast with a normal search engine is perhaps no more stark than here – where Bing or Google will generally seek to bring you the most recent hits, we focus on the past, something that is very difficult to achieve using a normal search engine.

With so many search options, perhaps the biggest challenge has been to present them to our users in a comprehensible way. For example, the problem where the RNIB advertisements for a talking watch were polluting the search results can be easily remedied if you combine the right search terms. The text of the advert is highly consistent, and therefore it is possible to precisely identify those advertisements by searching for the text “in associate with the RNIB”. This means it is possible to refine a search for RNIB to make sure we exclude those results (as you can see below).

[Image: Shine-rnib-no-watch (search results for RNIB with the talking watch advertisements excluded)]
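The kind of refinement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the archive's actual implementation: the `Hit` record, its field names, the `refine` helper and the example URLs are all hypothetical, and a real service would express this in the search index's own query syntax rather than filter in memory.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit; the field names are assumptions for illustration."""
    url: str
    harvest_date: date
    text: str


def refine(hits, must_contain, exclude_phrase):
    """Keep hits mentioning `must_contain` but not `exclude_phrase`,
    sorted by ascending harvest date (oldest first)."""
    term = must_contain.lower()
    phrase = exclude_phrase.lower()
    kept = [
        h for h in hits
        if term in h.text.lower() and phrase not in h.text.lower()
    ]
    return sorted(kept, key=lambda h: h.harvest_date)


# Made-up sample data: one advert and two pages of genuine interest.
hits = [
    Hit("http://example.org/shop", date(2004, 5, 1),
        "Talking watch, produced in associate with the RNIB"),
    Hit("http://example.org/news", date(2009, 3, 12),
        "RNIB launches a new campaign"),
    Hit("http://example.org/about", date(1998, 11, 2),
        "The RNIB supports blind and partially sighted people"),
]

refined = refine(hits, "RNIB", "in associate with the RNIB")
for h in refined:
    print(h.harvest_date.isoformat(), h.url)
```

The advert is dropped because it contains the excluded phrase, and the remaining hits come back oldest first, mirroring the default ascending harvest date ordering.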


The problems are even more marked when it comes to trying to allow network analysis to be exploited. We do already extract links from the documents, and so it is already possible to show how the number of sites linking to the RNIB has changed over time, but it is not yet clear how best to expose and utilize that information. At the moment, the best solution we have found is to present these network links as additional search facets. For example, here are the results for the sites that linked to rnib.org.uk in 2000, which you can contrast with those for 2010.
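As a rough sketch of how extracted links could be turned into a per-year facet, the following Python groups the distinct hosts linking to a target site by harvest year. The `(year, source_url, target_url)` record shape, the `linking_hosts_by_year` helper and the sample data are assumptions made for this example; how the prototype actually stores and facets its links is not described here.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def linking_hosts_by_year(links, target_host):
    """Group the distinct hosts that link to `target_host` by harvest year.

    `links` is an iterable of (year, source_url, target_url) tuples; this
    record shape is an assumption made for the example.
    """
    facets = defaultdict(set)
    for year, source_url, target_url in links:
        target = urlsplit(target_url).hostname or ""
        # Match the bare domain and any subdomain such as www.
        if target == target_host or target.endswith("." + target_host):
            source = urlsplit(source_url).hostname
            if source:
                facets[year].add(source)
    return {year: sorted(hosts) for year, hosts in sorted(facets.items())}


# Made-up sample links for illustration only.
links = [
    (2000, "http://www.example.co.uk/links.html", "http://www.rnib.org.uk/"),
    (2000, "http://www.example.co.uk/other.html", "http://rnib.org.uk/help"),
    (2000, "http://charity.example.org.uk/", "http://www.rnib.org.uk/"),
    (2010, "http://blog.example.com/post", "http://www.rnib.org.uk/news"),
    (2010, "http://blog.example.com/post", "http://www.bl.uk/"),
]

facets = linking_hosts_by_year(links, "rnib.org.uk")
for year, hosts in facets.items():
    print(year, len(hosts), hosts)
```

Counting distinct hosts rather than individual links keeps a single site with many pages from dominating the facet, which seems closer to the question of how many sites linked to the RNIB in a given year.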

Refining searches further
Currently, we expect that refining a search on the web archive will involve a lot of this kind of operation, combining new search terms and clauses to help focus in on the documents of interest. Therefore, looking further ahead, we envisage that future iterations of this kind of service might take the research queries and curatorial annotations we collect and start to try to use that information to semi-automatically classify resources and better predict user needs.

A ‘Macroscope’ rather than a search engine
Despite the fact that it helps get the overall idea across, calling this system a ‘historical search engine’ turns out to be rather misleading. The actual experience and ‘information needs’ of our researchers are very different from that case. This is why we tend to refer to this system as a Macroscope (see here for more on macroscopes), or as a Web Observatory. Sometimes a new tool needs a new term.

Throughout all of this, the most crucial part has been to find ways of working closely with our users, so we can all work together to understand what a ‘Macroscope’ might mean. We can build prototypes, and use our users’ feedback to guide us, but at the same time those researchers have had to learn how to approach such a complex, messy dataset.
Both the questions and the answers have changed over time, and all parties have had their expectations challenged. We look forward to continuing to build a better Macroscope, in partnership with that research community.

By Dr Andrew Jackson, Web Archiving Technical Lead, The British Library

30 January 2015

Collecting Data To Improve Tools

Like many other institutions, we are heavily dependent on a number of open source tools. We couldn’t function without them, and so we like to find ways to give back to those communities. We don’t have a lot of spare time or development capacity to contribute, but recently we have found another way to provide useful feedback.

Large-scale extraction

At the heart of our discovery stack lies Apache Tika, the piece of software we use to try to parse the myriad of data formats in our collection in order to extract the textual representation (along with any useful metadata) that goes into our search indexes. Consequently, we have now executed Apache Tika on many billions of distinct resources, dating from 1995 to the present day. Due to the age and variability of the content, this often tests Tika to its limits. As well as failing to identify many formats, it sometimes simply fails, either throwing an unexpected error or getting locked in an infinite loop.

Logging losses

Each of those failures represents a loss – a resource that may never be discovered because we can’t understand it. This may be because it’s malformed, perhaps even damaged during download. It may also be a sign of obsolescence, in that it may indicate the presence of data formats that are poorly understood, and are therefore likely to present a challenge to our discovery and access systems. So, instead of ignoring these errors, we decided to remember them. Specifically, each is logged as a facet of our full-text index, alongside the identity of the resource that caused the problem.
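The idea of remembering failures rather than discarding them might look something like the sketch below. It is a simplified illustration, not our indexing code: `fake_extract_text` is a hypothetical stand-in for a real parser and is not Apache Tika's API, and the record fields `text` and `parse_error` are invented names.

```python
from collections import Counter


def fake_extract_text(payload: bytes) -> str:
    """Stand-in for a real parser (hypothetical; not Apache Tika's API)."""
    if not payload:
        raise ValueError("empty payload")
    return payload.decode("utf-8")  # raises UnicodeDecodeError on bad bytes


def index_resource(url, payload, extract=fake_extract_text):
    """Build an index record, remembering any parse failure instead of dropping it."""
    record = {"url": url, "text": None, "parse_error": None}
    try:
        record["text"] = extract(payload)
    except Exception as exc:  # deliberately broad: every failure is worth recording
        record["parse_error"] = type(exc).__name__
    return record


# Made-up resources: one that parses and two that fail in different ways.
resources = [
    ("http://example.org/ok.txt", b"hello archive"),
    ("http://example.org/broken.bin", b"\xff\xfe\x00bad"),
    ("http://example.org/empty", b""),
]

records = [index_resource(url, payload) for url, payload in resources]

# Facet counts: how many resources failed with each kind of error.
error_facet = Counter(r["parse_error"] for r in records if r["parse_error"])
print(error_facet)
```

Because each record keeps the URL alongside the error name, the failures can be counted by type and traced back to the individual resources that caused them. A parser that hangs in an infinite loop would never raise an exception, so it would need a timeout around the call, which this sketch does not attempt.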

Sharing the results

We’ve been collecting this data for a while, in order to help us tell a broken bitstream from a forgotten format. However, in a recent discussion, the Apache Tika developers indicated that they would also find this data useful as a way of improving the coverage and robustness of their software.

This turns out to be a win-win situation. We store the data we were intending to store anyway, but also share it with the tool developers, who get to improve their software in ways we will be able to take direct advantage of as we run later versions of the tool over our archives in the future.

And it feels good to give a little something back.

– by Andy Jackson

@anjacks0n

28 January 2015

Spam as a very ephemeral (and annoying) genre…

Spam is a part of modern life. Anyone who hasn’t received any recently is a lucky person indeed. But just try putting your email address out there in the open and you’ll be blessed with endless messages you don’t want, from people you don’t know, from places you’ve never heard of! And then it’s just delete, de-le-te, block sender…

Imagine, though, someone researching our web lives in, say, 50 years and finding that this part of our daily existence is nowhere to be found. Spam is the ugly sister of the Web Archive: it is unlikely we’ll keep spam messages in our inboxes, and almost certainly no institution will keep them for posterity. And yet they are such great research material. They vary in topic, they can be funny, they can be dangerous (especially to your wallet), and they make you shake your head in disbelief…

We all know the spam emails about people who got stuck somewhere, can’t pay the bill and ask for a modest sum of £2,500 or so. These always make me think: if I had a spare £2,500, it’d be Bora Bora here I come, but that’s just selfish me! Now these are taken to a new level. It’s about giving us the money that is inconveniently placed in a bank somewhere far, far away:

Charity spree

From Mrs A.J., a widow of a Kuwait embassy worker in Ivory Coast with a very English surname:

…Currently, this money is still in the bank. Recently, my doctor told me I would not last for the next eight months due to cancer problem. What disturbs me most is my stroke sickness. Having known my condition I decided to donate this fund to a charity or the man or woman who will utilize this money the way I am going to instruct here godly.

Strangely, two weeks later a Libyan lady, who is also a widow, wrote to me that she had also suffered a stroke and that all she wants is to shower me with money as part of her charity spree:

Having donated to several individuals and charity organization from our savings, I have decided to anonymously donate the last of our family savings to you. Irrespective of your previous financial status, please do accept this kind and peaceful offer on behalf of my beloved family.

Mr. P. N. ‘an accountant with the ministry of Energy and natural resources South Africa’ was straight to the point:

… presently we discovered the sum of 8.6 million British pounds sterling, floating in our suspense Account. This money as a matter of fact was an over invoiced Contract payment which has been approved for payment Since 2006, now we want to secretly transfer This money out for our personal use into an overseas Account if you will allow us to use your account to Receive this fund, we shall give you 30% for all your Effort and expenses you will incure if you agree to Help.

My favourite is quite light-hearted. I got it from a 32-year-old Swedish girl:

My aim of writing you is for us to be friends, a distance friend and from there we can take it to the next level, I writing this with the purest of heart and I do hope that it will your attention. In terms of what I seek in a relationship, I'd like to find a balance of independence and true intimacy, two separate minds and identities forged by trust and open communication. If any of this strikes your fancy, do let me know...

So what if I’m a girl too, with a husband and a kid? You never know what may come in handy…

Blog post by Dorota Walker 
Assistant Web Archivist

@DorotaWalker

Further reading: Spam emails received by web-archivist@bl.uk. Please note that the quotations come from the emails and I left the original spelling intact.