UK Web Archive blog

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

23 September 2015

British Stand-Up Comedy Archive Special Collection


[Image: British Stand-Up Comedy Archive (BSUCA) logo]

The British Stand-Up Comedy Archive was established at the University of Kent in 2013, following the deposit of the personal archive of the stand-up comedian, writer and broadcaster Linda Smith (1958-2006). Even prior to this deposit the University already had a longstanding interest in stand-up comedy and comic performance through teaching (at both BA and MA levels) and research (at PhD level and through the research interests of School of Arts staff). After this initial deposit other comedians were approached to gauge whether there was a demand from comedians, agents and venues to archive their material; those who deposited material early in the life of the archive included the comedian and political activist Mark Thomas, and Tony Allen, one of the pioneers of the alternative cabaret/comedy movement in the late 1970s and 1980s.

[Image: promotional publication for 'Tuff Lovers', a four-page pamphlet including photographs, achievements, reviews and contact details. The other side of this pamphlet is shown in image BSUCA/LS/3/2/1/010(1). (c) Linda Smith estate. Photos by Pat McCarthy, design by Stephen Houfe.]

In 2014 the University of Kent offered funding for a number of projects to celebrate its 50th anniversary, and one of these became the British Stand-Up Comedy Archive (BSUCA). The BSUCA has a number of aims: to ensure that the archives and records of stand-up comedy in the UK are cared for in order to permanently preserve them; to ensure that these archives are universally accessible, discoverable and available; that the archives are actually used, and used in a variety of ways (popular culture, academic research, teaching, journalism, general enjoyment); and to acquire more offers of appropriate deposits. We also have an internal goal, which is to establish standards, workflows and policies (with regard to digitisation, digital preservation and deposit negotiations) which aim to inform the future collecting activities of the University’s Special Collections & Archives department.

[Screenshot of www.beyondthejoke.co.uk]

One of the things I was keen to do when I was appointed as Archivist in January 2015 was to ensure that websites and social media relating to stand-up comedy were being archived. So much of how comedians promote and publicise themselves today, and interact with their audience, happens through social media and websites, and I have already noticed that websites referenced in material in the BSUCA collections have disappeared without being captured. So I was delighted that the UK Web Archive team were happy for me to curate a special ‘British Stand-Up Comedy Archive’ collection for the UK Web Archive! My approach so far has been two-fold.

Approach 1: filling the gaps

One focus has been on nominating websites for inclusion which relate to collections that we already have within the British Stand-Up Comedy Archive. For example, I have been nominating the websites and social media accounts of those whose work we have been physically and digitally archiving at the University, such as Attila the Stockbroker’s website and Twitter account. I have also been nominating sites which complement the collections we have. For example, within The Mark Thomas Collection we have copies of articles he has written, but only those which he collected himself; there are many more which are only available online. The idea behind this approach is that we can ‘fill the gaps’ for researchers interested in those whose archives we hold, by ensuring that other material relevant to that comedian/performer is being archived. These websites are provided in sub-categories named after the collection they relate to (e.g. Linda Smith Collection, Mark Thomas Collection).

[Screenshot of www.markthomasinfo.co.uk]

Approach 2: providing an overview of stand-up comedy in the UK today

As we are trying to collect material related to stand-up comedy in the UK I think that it is really important to try to capture as much information as possible about current comedians and the current comedy scene, nationally and locally. So my second focus has been on nominating websites which provide an overview of stand-up comedy in the UK today. Rather than initially focussing on nominating the websites of individual comedians (which would be an enormous task!) I have instead been nominating websites which are dedicated to comedy in the UK, both at a national level, such as Chortle and Beyond the Joke, and those at a regional level such as Giggle Beats (for comedy in the north of England) and London is Funny. I’ve also nominated the comedy sections in national news outlets like the Guardian and The Huffington Post (UK), as well as in regional news outlets such as The Skinny (Scotland and the north west of England), The Manchester Evening News, and The York Press. These websites include news, interviews with comedians and others involved in comedy, as well as reviews and listings of upcoming shows. The idea was that capturing these sorts of websites would help to demonstrate which comedians were performing, where they were performing, and perhaps some of the themes discussed by comedians in their shows. These websites have been categorised into the sub-category 'Stand-up news, listings and reviews'. 

I’ve also been focusing on the websites of comedy venues in order to document the variety of comedy clubs there are, to provide an overview of the comedians who are performing, as well as to document other issues like the cost of attending a comedy club night. Many of the clubs whose websites have been nominated are quite longstanding venues, such as Downstairs at the Kings Head (founded in 1981), the Banana Cabaret Club in Balham (established 1983), and The Stand Comedy Club (established in Edinburgh in 1995). And of course I’ve also been focusing on comedy festivals around the UK. Much material relating to the Edinburgh Festival Fringe had already been included in the UK Web Archive, but websites for Free Fringe events (which many see as important for the Edinburgh Festival Fringe*), such as the Free Festival and PBH’s Free Fringe, have now been nominated. I’ve also been nominating websites for comedy festivals around the UK, ranging from large established festivals such as the (Dave) Leicester Comedy Festival and the Machynlleth Comedy Festival, to smaller festivals such as the Croydon Comedy Festival and Argcomfest (Actually Rather Good Comedy Festival). The sub-category of 'Venues and festivals' is by far the largest sub-category so far!

[Screenshot of www.downstairsatthekingshead.com]

Other features of current stand-up comedy that have been captured include organisations such as the Comedy Support Act (a charity funded by benefit shows which aims to provide emergency funds and assistance to professional comedians who find themselves in financial hardship through serious illness or accident) and organisations and events which celebrate and promote women in comedy such as What The Frock!, Laughing Cows Comedy, and the Women in Comedy Festival.

Next steps:

For me, the idea behind the special collection has been to begin to ensure that websites and social media relating to stand-up comedy in the UK are being archived for current and future researchers (and others) interested in stand-up comedy. There are so many more websites that I haven't yet been able to nominate, particularly those of individual comedians and performers. However, the UK Web Archive is open to nominations from anyone (as long as the website is part of the UK web domain), so if there are websites relating to UK stand-up comedy that you would like to be archived in the UK Web Archive, please nominate them here: http://www.webarchive.org.uk/ukwa/info/nominate

* Luke Toulson, 'Why free is the future of the fringe...and 7 more ways to improve the festival', http://www.chortle.co.uk/correspondents/2013/08/04/18425/why_free_is_the_future_of_the_fringe; and Nick Awde, 'Free shows are ringing the Edinburgh Fringe changes', https://www.thestage.co.uk/opinion/2015/setting-theatre-free-edinburgh/ 

 

For further information about the British Stand-Up Comedy Archive, see these links:

Blog http://blogs.kent.ac.uk/standupcomedyarchive/

Twitter https://twitter.com/unikentstandup

Flickr https://www.flickr.com/photos/britishstandupcomedyarchive/albums

Soundcloud https://soundcloud.com/stand-up-comedy-archive

 

by Elspeth Millar, Project Archivist, British Stand-Up Comedy Archive at the University of Kent

18 September 2015

Ten years of the UK web archive: what have we saved?


I gave the following presentation at the 2015 IIPC GA. If you prefer, you can read the rough script with slides below rather than watch the video.

[Slide 01]

 

[Slide 02]

We started archiving websites by permission towards the end of 2004 (e.g. the Hutton Inquiry), building up what we now call the Open UK Web Archive. In part, this was considered a long-term investment, helping us build up the skills and infrastructure we need to support large-scale domain crawls under non-print Legal Deposit legislation, which we’ve been performing since 2013.

Furthermore, to ensure we have as complete a record as possible, we also hold a copy of the Internet Archive’s collection of .uk domain web material up until the Legal Deposit regulations were enacted. However, these regulations are going to be reviewed, and could, in principle, be withdrawn. So, what should we do?

To ensure the future of these collections, and to reach our goals, we need to be able to articulate the value of what we’ve saved. And to do this, we need a better understanding of our collections and how they can be used.

Understanding Our Collections

So, if we step right back and just look at those 8 billion resources, what do we see? Well, the WARCs themselves are just great big bundles of crawled resources. They reflect the harvester and the baler, not the need.

[Slide 03]

So, at the most basic level, we need to be able to find things and look at them, and we use OpenWayback to do that. This example shows our earliest archived site, reconstructed from the server-side files of the British Library’s first web server. But you can only find it if you know that the British Library web site used to be hosted at “portico.bl.uk”.

[Slide 04]

But that mode of access requires you to know what URLs you are interested in, so we have also built up various themed collections of resources, making the archive browsable.

[Slide 05]

However, we’re keenly aware that we can’t catalog everything.

To tackle this problem, we have also built full-text indexes of our collections. In effect, we’ve built an historical search engine, and having invested in that level of complexity, it has opened up a number of different ways of exploring our archives. The “Big Data Research” panel later today will explore this in more detail, but for now here’s a very basic example.

[Slide 06]

This graph shows the fraction of URLs from ac.uk hosts and co.uk hosts over time. We can see that back in 1996, about half the UK domain was hosted on academic servers, but since then co.uk has come to dominate the picture. Overall, in absolute terms, both have grown massively during that period, but as a fraction of the whole, ac.uk is much diminished. This is exactly the kind of overall trend that we need to be aware of when we are trying to infer something from a more specific trend, such as the prevalence of medical terms on the UK web.
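(As a rough illustration only: the sketch below shows how such a per-year breakdown could be derived from a CDX-style index, where each line carries a 14-digit capture timestamp and the original URL. The file name and field positions are assumptions for the example; the real figures come from our full-text index rather than a flat file.)

from collections import Counter, defaultdict
from urllib.parse import urlparse

per_year = defaultdict(Counter)

# Assumed CDX-style line: "urlkey timestamp original mimetype status digest ..."
with open("index.cdx") as cdx:                 # hypothetical index file
    for line in cdx:
        fields = line.split()
        if len(fields) < 3:
            continue
        year = fields[1][:4]                   # first four digits of the 14-digit timestamp
        host = urlparse(fields[2]).hostname or ""
        if host.endswith(".ac.uk"):
            per_year[year]["ac.uk"] += 1
        elif host.endswith(".co.uk"):
            per_year[year]["co.uk"] += 1
        per_year[year]["total"] += 1

for year in sorted(per_year):
    counts = per_year[year]
    print(year,
          round(counts["ac.uk"] / counts["total"], 3),
          round(counts["co.uk"] / counts["total"], 3))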

However, these kinds of user interfaces are hard to build and are forced to make fairly strong assumptions about what the user wants to know. So, to complement our search tools, we also generate various secondary datasets from the content so more technically-adept users can explore our data using their own tools. This provides a way of handing rich and interesting data to researchers without handing over the actual copyrighted content, and has generated a reasonable handful of publications so far.

[Slide 07]

This process also pays dividends directly to us, in that the way researchers have attempted to exploit our collections has helped us understand how to do a better job when we crawl the web. As a simple example, one researcher used the 1996 link graph to test his new graph layout algorithm, and came up with this visualization.

[Slide 08]

For researchers, the clusters of connectivity are probably the most interesting part, but for us, we actually learned the most from this ‘halo’ around the edge. This halo represents hosts that are part of the UK domain, but are only linked to from outside the UK domain. Therefore, we cannot build a truly representative picture of the UK domain unless we allow ourselves to stray outside it.

The full-text indexing process also presents an opportunity to perform deeper characterization of our content, such as format and feature identification and scanning for preservation risks. This has confirmed that the vast majority of the content (by volume) is not at risk of obsolescence at the format level, but has also illustrated how poorly we understand the tail of the format distribution and the details of formats and features that are in use.

[Slide 09]

For example, we also build an index that shows which tags are in use on each HTML page. This means we can track the death and birth of specific features like HTML elements. Here, we can see the death of the <applet>, <blink> and <font> tags, and the massive explosion in the usage of the <script> tag. This helps us understand the scale of the preservation problems we face.
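For the curious, here is a minimal sketch of how per-page tag usage could be gathered, using Python's standard html.parser module. The real index is built at scale in our indexing pipeline, so treat this purely as an illustration of the idea.

from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collects the set of element names used in an HTML document."""
    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)

def tags_used(html_text):
    collector = TagCollector()
    collector.feed(html_text)
    return collector.tags

# Aggregating these sets across pages, grouped by crawl year, gives the
# rise-and-fall curves for individual elements.
print(tags_used("<html><body><blink>Hi!</blink><script>var x;</script></body></html>"))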

Putting Our Archives In Context

But all this is rather inward looking, and we wanted to find ways of complementing these approaches by comparing our collections with others, and especially with the live web. This is perhaps the most fundamental way of stating the value of what we’ve collected, as it addresses the basic quality of the web that we need to understand: its volatility.

[Slide 10]

How has our archival sliver of the web changed? Are the URLs we’ve archived still available on the live web? Or are they long since gone? If those URLs are still working, is the content the same as it was?

[Slide 11]

One option would be to go through our archives and exhaustively examine every single URL to work out what has happened to it. However, the Open UK Web Archive contains many millions of archived resources, and even just checking their basic status would be very time-consuming, never mind performing any kind of detailed comparison of the content of those resources.

Sampling The URLs

Fortunately, to get a good idea of what has happened, we don’t need to visit every single item. We can use our index to randomly sample 1,000 URLs from each year the archive has been in operation. We can then try to download those URLs again, and use the results to build up a picture that compares our archival holdings to the current web.

[Slide 12]

As we download each URL, if the host has disappeared, or the server is unreachable, we say it is GONE. If the server responds with an ERROR, we record that. If the server responds but does not recognize the URL, we classify it as MISSING, but if the server does recognize the URL, we classify it as MOVED or OK depending on whether a chain of redirects was involved. Note that we did look for “soft 404s” at the same time, but found that these are surprisingly rare on the .uk domain.
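A minimal sketch of that classification logic is shown below, using the Python requests library. The mapping of status codes to categories follows the description above, but the soft-404 detection and other refinements used in the real experiment are omitted.

import requests

def classify(url, timeout=30):
    """Classify a previously archived URL against the live web."""
    try:
        response = requests.get(url, allow_redirects=True, timeout=timeout)
    except (requests.ConnectionError, requests.Timeout):
        return "GONE"      # host has disappeared or the server is unreachable
    if response.status_code in (404, 410):
        return "MISSING"   # server responds but does not recognise the URL
    if response.status_code >= 400:
        return "ERROR"     # server responds with an error
    if response.history:   # a chain of redirects was involved
        return "MOVED"
    return "OK"

print(classify("http://www.bl.uk/"))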

Plotting the outcome by year, we find this result:

[Slide 13]

The overall trend clearly shows how the items we have archived have disappeared from the live web, with individual URLs being forgotten as time passes. Looking at 2013, even after just two years, 40% of the URLs are GONE or MISSING.

Is OK okay?

However, so far, this only tells us what URLs are still active - the content of those resources could have changed completely. To explore this issue, we have to dig a little deeper by downloading the content and trying to compare what’s inside.

[Slide 14]

We start by looking at a simple example - this page from the National Institute for Health and Care Excellence. If we want to compare this page with an archived version, one simple option is to ignore the images and tags, and just extract all the text.

[Slide 15]

However, comparing these big text chunks is still rather clumsy and difficult to scale, so we go one step further and reduce the text to a fingerprint [1].

A fingerprint is conceptually similar to the hashes and digests that most of us are familiar with, like MD5 or SHA-256, but with one crucial difference. When you change the input to a cryptographic hash, the output changes completely - there’s no way to infer any relationship between the two, and indeed it is that very fact that makes these algorithms suitable for cryptography.

[Slide 16]

For a fingerprint, however, if the input changes a little, then the output only changes a little, and so it can be used to bring similar inputs together. As an example, here are the fingerprints for our test page – one from earlier this year and another from the archive. As you can see, this produces two values that are quite similar, with the differences highlighted in red. More precisely, they are 50% similar, as you’d have to edit half of the characters to get from one to the other.
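The comparison step looks roughly like the sketch below. Here the ssdeep fuzzy-hashing library stands in for whatever fingerprinting the production pipeline actually uses (it is one implementation of the forensics technique mentioned in the notes); ssdeep.compare returns a 0-100 similarity score rather than the edit-distance percentage quoted above.

import ssdeep   # fuzzy-hashing library, standing in for our real fingerprinter

def fingerprint(text):
    """Reduce a page's extracted text to a short, similarity-preserving fingerprint."""
    return ssdeep.hash(text)

# Placeholder page texts; in practice these would be the text extracted from
# the archived and live versions of the same URL.
archived_text = "Example guidance page body text. " * 50 + "Original footer."
live_text = "Example guidance page body text. " * 50 + "Updated footer and new logo."

archived_fp = fingerprint(archived_text)
live_fp = fingerprint(live_text)

# Unlike a cryptographic digest, similar inputs give similar fingerprints, so the
# two values can be meaningfully compared (0 = dissimilar, 100 = identical).
print(archived_fp)
print(live_fp)
print(ssdeep.compare(archived_fp, live_fp))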

To understand what these differences mean, we need to look at the pages themselves. If we compare the two, we can see two small changes, one to the logo and one to the text in the body of the page.

[Slide 17]

[Slide 18]

But what about all the differences at the end of the fingerprint? Well, if we look at the whole page, we can see that there are major differences in the footer. In fact, it seems the original server was slightly mis-configured when we archived it in 2013, and had accidentally injected a copy of part of the page inside the overall page HTML.

[Slide 19]

So, this relatively simple text fingerprint does seem to reliably reflect both the degree of changes between versions of pages, and also where in the pages those changes lie.

Processing all of the ‘MOVED’ or ‘OK’ URLs in this way, we find:

[Slide 20]

We can quickly see that for those URLs that appeared to be okay, the vast majority have actually changed. Very few are binary identical, and while about half of the pages remain broadly similar after two years, that fraction tails off as we go back in time.

We can also use this tactic to compare the OK and MOVED resources.

[Slide 21]

For resources that are two years old, we find that URLs that appear to be OK are only identical to the archived versions one third of the time, similar another third of the time, but the remaining third are entirely dissimilar. Not surprisingly, the picture is much worse for MOVED URLs, which are largely dissimilar, with less than a quarter being similar or identical.

The URLs Ain’t Cool

Combining the similarity data with the original graph, we get this result:

[Slide 22]

Shown in this way, it is clear that very few archived resources are still available, unchanged, on the current web. After just two years, 60% have gone or have changed into something unrecognizable. [2]

This rot rate is significantly higher than I expected, so I began to wonder whether this reflects a kind of collection bias. The Open UK Web Archive often prioritized sites known to be at risk, and that selection criterion seems likely to affect the overall trends. So, to explore this issue, I also ran the same analysis over a randomly sampled subset of our full, domain-scale Legal Deposit collection.

[Slide 23]

However, the results came out almost exactly the same. After two years, about 60% of the content has GONE or is unrecognizable. Furthermore, looking at the 2014 data, we can see that after just one year, although only 20% of the URLs themselves have rotted, a further 30% of the URLs are unrecognizable. We’ve lost half the UK web in just one year.

This raised the question of whether this instability can be traced to specific parts of the UK web. Is ac.uk more stable than co.uk, for example?

[Slide 24]

Looking at those results showed that, in fact, there’s not a great deal to choose between them. The changes to the NHS during 2013 seem to have had an impact on the number of identical resources, with perhaps a similar story for the restructuring of gov.uk, but there’s not that much between all of them.

What We’ve Saved (2004-2014)

Pulling the Open and Legal Deposit data together, we can get an overview of the situation across the whole decade. For me, this big, black hole of content lost from the live web is a powerful way of visualizing the value of what we’ve saved over those ten years.

[Slide 25]

Summary

I expected the rot rate to be high, but I was shocked by how quickly link rot and content drift come to dominate the scene. 50% of the content is lost after just one year, with more being lost each subsequent year. However, it’s worth noting that the loss rate is not maintained at 50%/year: if it were, the loss after two years would be 75% (losing half of what remains each year leaves only 25%) rather than the 60% we observe. This indicates there are some islands of stability, and that any broad ‘average lifetime’ for web resources is likely to be a little misleading.

We’ve also found that this relatively simple text fingerprint provides some useful insight. It does ignore a lot, and is perhaps overly sensitive to changes in the ‘furniture’ of a web site, but it’s useful and, importantly, scalable.

There are a number of ways we might take this work forward, but I’m particularly interested in looking for migrated content. These fingerprints and hashes are in our full-text index, which means we can search for similar content that has moved from one URL to another, even if there was never any redirect between them. Studying content migration in this way would allow us to explore how popular content moves around the web.

I’d also like to extend the same sampling analysis in order to compare our archives with those of other institutions via the Memento protocol.

[Slide 26]

Thank you, and are there any questions?

Addendum

If you’re interested in this work you can find:

Notes:

  1. This technique has been used for many years in computer forensics applications, such as helping to identify ‘bad’ software; here we adapt the approach in order to find similar web pages.
  2. Or, in other words, very few of our archived URLs are ‘cool’.

02 September 2015

2015 UK Domain Crawl has started


 

We are proud to announce that the 2015 UK Domain Crawl has started!

Over the coming weeks our web crawler will visit every website in the UK, downloading it and keeping it safe on the British Library’s archive servers.

[Image: robot icon, https://commons.wikimedia.org/wiki/File%3ARobot_icon.svg, by Bilboq (own work), public domain, via Wikimedia Commons]

Previous crawls

The first ever UK Domain Crawl was run in 2013; it resulted in:

  • 3.8 million seeds (starting URLs)
  • 31TB data
  • 1.9 billion web pages and other assets

The 2014 crawl built on that experience and yielded:

  • 20 million seeds
  • Geo IP check of UK hosted websites (2.5 million seeds)
  • 56TB data
  • 2.5 billion webpages and other assets
  • including: 4.7GB of viruses and 3.2TB of screenshots

Guesswork

What will the 2015 crawl be like? Will we find more URLs? Surely the web grows every day, but by how much? Will there be more data? Will we have more virus content?

Tweet your suggestions and thoughts about the UK Domain to @UKWebArchive or use the hashtag #UKWebCrawl2015.

 

[Crawl Log Flypast © Andy Jackson]

 

 


13 August 2015

Characterisations of Climate Change


If you have read any of my previous blogs (Beginner’s Guide to Web Archives 1,2,3) you will know that as part of my work at the British Library I have been curating a special web archive collection on climate change. But why did I choose this subject?

World-changing issue

Having begun as a topic of scientific interest, the threat of climate change has developed into a potentially world-changing issue with major implications for how we live our lives. The projected impacts of climate change have profound consequences for things like food, water and human health, and therefore for national and international policy and the ‘business as usual’ world economy. Naturally, therefore, the topic is heavily debated in the public arena, from the science of global warming and its associated effects to the policies designed to mitigate or adapt to it.

[Screenshot of www.eci.ox.ac.uk]

We might expect different individuals and organisations – as for any topic – to portray the issue in different ways. But how exactly is climate change characterised on the internet? For instance, while there are many websites that accept the current understanding of climate science and actively promote action to limit global warming, there are many others that partially or completely deny the science. How is the issue portrayed by these different groups? Or another example: how is the issue portrayed by renewable energy companies compared to fossil fuel companies, two groups with very conflicting interests? As climate change progresses, how will its online characterisation change? I wanted to build a collection that could help to answer some of these questions.

Special interest groups

The collection consists of websites from different societal groups that have an active interest in the subject: for example academics, the energy sector, policy makers, special interest groups, the media and some members of the public. Websites generally fall into one of the following categories: personal blog pages/Twitter feeds, non-governmental organisations/coalitions, news, government, energy companies, religious organisations, educational websites, learned societies and university institutions. The proportion of each website devoted to climate change ranges from almost 100% (some blogs/specialist websites) to more limited coverage. Some websites may be notable for the complete absence of climate change references. For example, after discussions in Cardiff, I have included each of the main UK energy companies, even when their websites do not mention climate change. Such information was considered to be useful in terms of the questions posed above.

[Screenshot of twitter.com/ClimateCabaret]

The collection is an evolving beast, so if you have any suggestions regarding extra websites we could include, please fill in the online form here. We are hoping to make as many of the websites openly available as possible, but don’t forget that if you want to view the whole collection, you will need to head to your nearest legal deposit library to do so.

 Peter Spooner, Science Policy Intern


10 August 2015

Beginner’s Guide to Web Archives Part 3


Coming to the end of his short time working on web archives at the British Library, science-policy intern Peter Spooner reflects on the process of creating a web archive special collection.

Some issues with ‘Special Collections’

In my previous blog entry, I covered why we might want to create special collections. Here, I would like to examine the pros and cons of these collections in more detail.

In order for an archivist to create a special collection, he/she must come up with a subject, refine the scope of the topic to prevent the collection from becoming too large, and then collect websites. In my case – climate change – I decided to collect websites to show how climate change is portrayed across society (by charities, the energy sector, interested individuals, learned societies etc.) with a focus on the portrayal of climate science and policy. Whilst I hope such a collection will be interesting and useful, problems do exist.

Cardiff

In July, the British Library team headed to meet some environmental psychologists from Cardiff University. The major success of the meeting was to inform the researchers about web archiving and our climate change special collection. The resource was well received and was seen as being potentially useful. However, a number of issues came up before and during the discussion:

  1. Each of the five researchers who attended had slightly different research interests;
  2. How can we integrate these interests when creating archive resources?
  3. How can the climate change collection be kept relevant as the subject evolves?
  4. Who should be responsible for sustaining and updating the special collection?
  5. What kinds of research question can be asked?

Widening the net

The last of these points I addressed in a previous blog entry, but the remainder are worth commenting on here. As I highlighted above, special collections are designed to be small and easy to use. However, such limited scope may not meet the needs of different researchers. There are several approaches one could take in order to try and resolve this issue. In some cases, collections may focus on a particular event, such as a general election. The web content associated with these collections is often short-lived, and after the event the collection would not need much updating. However, for collections on long-lasting themes, more involvement is required.

One option is for thematic special collections to remain under the control of dedicated archivists. In this case, collection users could send in suggestions of websites to include when important events occur or new web material is created. Collections could be slightly expanded to be broad enough for a variety of user interests. However, the number of collections is necessarily limited by the time commitment of the web archivists.

Another possibility is that the archivists act as technical support whilst researchers create their own collections. This approach requires a greater input on the part of the researcher, but allows more collections to be created and maintained. Since they are designed by the users, each collection should be exactly fit for purpose. However, since each researcher is likely to have slightly different interests or questions in mind, the number of collections may be very large and some collections may closely mirror one another.


Listening to talks by academics involved in the British Library’s BUDDAH project, I noticed that a common starting point for research was to create a corpus: a collection of written texts – in this case websites – of interest that could then be used to inform the research question. This approach is just what I have described above. A large number of corpora created by researchers could be stored by housing different groups of collections under common themes; so the theme of climate change could contain a number of collections on different aspects of the issue.

Moving forward

Perhaps the ideal model that the British Library could adopt is something of a combination of the above ideas. The Library may want to preserve the integrity of its existing special collections, which are carefully curated and designed for a wide range of users. These ‘Special Collections’ could remain under archivist control as described above, with contributions from user feedback. Alongside this core set of special collections could exist the more specific and numerous ‘Research Collections’ - those collections created by researchers. In this way the Library could make available a variety of resources that may be of interest to different users, combining the work of researchers and archivists to accommodate the limited time of both.

One thing we need to do in order to ensure the success of this combined approach is to get more and more researchers involved with creating collections. More projects like BUDDAH and further visits to interested academics will help to increase awareness of the web archive as a research resource, to grow it and turn it into an invaluable tool.

Peter Spooner, Science Policy Intern


05 August 2015

Viral Content in the UK Domain


[Image: https://commons.wikimedia.org/wiki/File:Virus_ordinateur.jpg]

Why?

"The term 'malware' is commonly used as a catch-all phrase for unwanted software designed to infiltrate a computer...without the owner's informed consent. It includes but is not limited to viruses, Trojan horses, malware."

"Whilst highly undesirable for most contemporary web users, malware is a pervasive feature of the Internet. Many archives choose to scan harvests and identify malware but prefer not to exclude or delete them from ingest into their repositories, as exclusion threatens the integrity of a site and their prevalence across the web is a valid research interest for future users." 
DPC Technology Watch Report, March 2013

The above hopefully goes some way to illustrating our concerns regarding 'viral' content in the data we archive. If overlooked or ignored, such content has the potential to prove hazardous in the future; at the same time, it forms an integral part of the Web as we know it (Professor Stephen Hawking famously stated that he thought that "computer viruses should count as life", and who are we to argue?).

How?

Faced with such considerations, there were several options available:

  1. We could simply not store any content flagged as containing a virus. The problem here is the effect is unpredictable—what if the content in question is the front-page of a website? It effectively means that site cannot be navigated as intended.
  2. We could store the content but make it inaccessible. 
  3. We could postpone the scan for viruses until after the crawl. However, this would require amending the output files to either remove or alter infected records.
  4. We could 'nullify' the content, making it unreadable but potentially reversible such that the original data can be read if required.

The last of these options was chosen. The specific implementation was that of an XOR cipher, wherein the individual bytes of the viral content are logically XOR'd with a known single-byte key. Applying the same cipher using the same key reverses the operation. Essentially this turns any record flagged as containing viral content into (theoretically safe) pseudo-gibberish.

To quickly illustrate that in Python:

key = "X"

message = "This is a secret message. Shhhhh!"

 encoded = [ord(m)^ord(key) for m in message]

print(encoded)

 """

The value of 'encoded' here is just a list of numbers; attempting to convert

it to a string actually broke my Putty session.

"""

 decoded = "".join([chr(e^ord(key)) for e in encoded])

print(decoded)

[Image: https://commons.wikimedia.org/wiki/File:Virus_Blaster.jpg]

Heritrix & ClamAV

For all our crawling activities we use the Internet Archive's Heritrix crawler. Part of the ethos behind Heritrix's functionality is that content is processed and written to disk as quickly as possible; ideally you should be utilising all available bandwidth. With that in mind, the options for virus-scanners were few. While there are many scanners available, few offer any kind of API, and fewer still can scan streamed content; most must instead scan content on disk. Given that disk-writes are often the slowest part of the process this was not ideal, and left us with only one obvious choice: ClamAV.

We created a ViralContentProcessor module which interacts with ClamAV, streaming every downloaded resource to the running daemon and receiving the result. Anything which is found to contain a virus:

  1. ...is annotated with the output from ClamAV (this then appears in the log file).
  2. ...is bytewise XOR'd as previously mentioned, and the amended content is written to a different set of WARC files than non-viral content.

It is worth noting that ClamAV does, in addition to scanning for various types of malware, have the option to identify phishing attempts. However, we disabled this early on in our crawls when we discovered that it was identifying various examples of phishing emails provided by banks and similar websites to better educate their customers.

During the crawl the resources—memory usage, CPU, etc.—necessary for ClamAV are similar to those required by the crawler itself. That said, the virus-scanning is seldom the slowest part of the crawl.
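The actual ViralContentProcessor is a Heritrix (Java) module, but the exchange with the daemon is simple enough to sketch in Python using clamd's INSTREAM command: connect, send the content in length-prefixed chunks, and read back a single 'stream: ...' verdict. The socket path and chunk size here are assumptions.

import socket
import struct

def scan_bytes(content, socket_path="/var/run/clamav/clamd.ctl", chunk_size=8192):
    """Stream a downloaded resource to a running clamd daemon and return its verdict."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(socket_path)
    try:
        sock.sendall(b"zINSTREAM\0")
        for i in range(0, len(content), chunk_size):
            chunk = content[i:i + chunk_size]
            sock.sendall(struct.pack("!L", len(chunk)) + chunk)
        sock.sendall(struct.pack("!L", 0))   # a zero-length chunk ends the stream
        return sock.recv(4096).decode().strip("\0").strip()
    finally:
        sock.close()

# e.g. "stream: OK" for clean content, or "stream: Eicar-Test-Signature FOUND";
# the latter kind of output is what ends up as the crawl-log annotation.
print(scan_bytes(b"<html><body>Hello, world.</body></html>"))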

 WARCs

All web content archived by the British Library is stored in WARC format (ISO 28500). A WARC file is essentially a series of concatenated records, each of a specific type. For instance an average HTML page might look like this:

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://www.gov.uk/licence-finder/activities?activities=158_45_196_63&sectors=183
WARC-Date: 2015-07-05T08:54:13Z
WARC-Payload-Digest: sha1:ENRWKIHIXHDHI5VLOBACVIBZIOZWSZ5L
WARC-IP-Address: 185.31.19.144
WARC-Record-ID: <urn:uuid:2b437331-684e-44a8-b9cd-9830634b292e>
Content-Type: application/http; msgtype=response
Content-Length: 23174

HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html; charset=utf-8
Cache-Control: max-age=1800, public
...

<!DOCTYPE html>
...

The above essentially contains the raw HTTP transaction plus additional metadata. There is also another type of record: a conversion:

A 'conversion' record shall contain an alternative version of another record's content that was created as the result of an archival process.
ISO 28500

It's this type of record we use to store our processed viral content. A record converted as per the above might appear thusly:

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: https://www.gov.uk/licence-finder/activities?activities=158_45_196_63&sectors=183
WARC-Date: 2015-04-20T11:03:11Z
WARC-Payload-Digest: sha1:CWZQY7WV4BJZRG3XHDXNKSD3WEFNBDJD
WARC-IP-Address: 185.31.19.144
WARC-Record-ID: <urn:uuid:e21f098e-18e4-45b9-b192-388239150e76>
Content-Type: application/http; encoding=bytewise_xor_with_118
Content-Length: 23174

>""&YGXGVDFFV9={
...

The two records' metadata do not differ drastically—the main differences being the specified WARC-Type and the Content-Type. In this latter field we include the encoding as part of the MIME type. The two records' contents, however, are drastically different: the former record contains valid HTML while the latter contains a seemingly random series of bytes.

Access

In order to access content stored in WARC files we typically create an index, identifying the various URLs and recording their particular offset within a given WARC file. As mentioned earlier, content identified as containing a virus is stored in a different series of files to that of 'clean' content. Currently we do not provide access to viral content, but by doing the aforementioned separation we can, firstly, easily index the regular content and omit the viral material, and secondly, should the demand arise, easily identify and index the viral content.

The software used to replay our WARC content—OpenWayback—is capable of replaying WARCs of all types. While there would be an additional step wherein we reverse the XOR cipher, access to the content should not prove problematic.
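Reversing the cipher is just the same XOR applied again: the sketch below decodes a conversion record's payload using the key recorded in its Content-Type (bytewise_xor_with_118 in the example above). Extracting the record from the WARC is left to whatever replay tooling is in use.

def unxor(payload, key=118):
    """Reverse the bytewise XOR applied to a viral-content conversion record."""
    return bytes(b ^ key for b in payload)

# Stand-in for the payload of a conversion record; XOR-ing twice with the same
# key returns the original bytes.
encoded = bytes(b ^ 118 for b in b"<!DOCTYPE html> ...")
print(unxor(encoded))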

Results

Frequent Crawls 

In addition to the annual crawl of the UK domain, we also undertake more frequent crawls of a smaller set of sites. These sites are crawled on a daily, weekly, etc. basis to capture more frequently-changing content. In the course of roughly 9,000 frequent crawls since April 2013, only 42 have encountered viral content.

2013 Domain Crawl

  • 30TB regular content.
  • 4GB viral content.

2014 Domain Crawl

  • 57TB regular content.
  • 4.7GB viral content.

Looking at the logs from the 2014 Domain Crawl (which, as mentioned earlier, contain the results from the ClamAV scan), there were 494 distinct viruses flagged. The ten most common were:

  1. Html.Exploit.CVE_2014_6342
  2. JS.Obfus-210
  3. PHP.C99-7
  4. JS.Crypt-1
  5. Exploit.URLSpoof.gen
  6. HTML.Iframe-6
  7. JS.Trojan.Iframe-6
  8. Heuristics.Broken.Executable
  9. JS.Obfus-186
  10. Html.Exploit.CVE_2014_0274-4

In total there were 40,203 positive results from ClamAV, with Html.Exploit.CVE_2014_6342, in top spot above, accounting for over a quarter of them.

Roger G. Coram, Web Crawl Engineer, The British Library

24 July 2015

Geo-location in the 2014 UK Domain Crawl


In April 2013 the Legal Deposit Libraries (Non-Print Works) Regulations 2013 came into force. Of particular relevance is the section which specifies which parts of that ephemeral place we call the Web are considered to be part of "the UK":

  • 18 (1) “…a work published on line shall be treated as published in the United Kingdom if:
    • “(b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom.”

In more practical terms, resources are to be considered as being published in the United Kingdom if the server which serves said resources is physically located in the UK. Here we enter the realm of Geolocation.

[Image: "Comparison satellite navigation orbits" by Cmglee and Geo Swan (own work), licensed under CC BY-SA 3.0 via Wikimedia Commons]

Heritrix & Geolocation

Geolocation is the practice of determining the "real world" location of something—in our case the whereabouts of a server, given its IP address.

The web-crawler we use, Heritrix, already has many of the features necessary to accomplish this. Among its many DecideRules (a series of ACCEPT/REJECT rules which determine whether a URL is to be downloaded) is the ExternalGeoLocationDecideRule. This requires:

  • A list of ISO 3166-1 country-codes to be permitted in the crawl
    • GB, FR, DE, etc.
  • An implementation of ExternalGeoLookupInterface.

This latter ExternalGeoLookupInterface is where our own work lies. This is essentially a basic framework on which you must hang your own implementation. In our case, our implementation is based on MaxMind’s GeoLite2 database. Freely available under the Creative Commons Attribution-ShareAlike 3.0 Unported License, this is a small database which translates IP addresses (or, more specifically, IP address ranges) into country (or even specific city) locations.

Taken from our Heritrix configuration, the below shows how this is included in the crawl:

<!-- GEO-LOOKUP: specifying location of external database. -->
<bean id="externalGeoLookup" class="uk.bl.wap.modules.deciderules.ExternalGeoLookup">
  <property name="database" value="/dev/shm/geoip-city.mmdb"/>
</bean>
<!-- ...  ACCEPT those in the UK... -->
<bean id="externalGeoLookupRule" class="org.archive.crawler.modules.deciderules.ExternalGeoLocationDecideRule">
  <property name="lookup">
    <ref bean="externalGeoLookup"/>
  </property>
  <property name="countryCodes">
    <list>
      <value>GB</value>
    </list>
  </property>
</bean>

The GeoLite2 database itself is, at around only 30MB, very small. Part of the beauty of this implementation is that the entire database can be held comfortably in memory. The above shows that we keep the database in Linux's shared memory, avoiding any disk IO when reading from the database.
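Outside Heritrix, the same lookup can be reproduced with MaxMind's geoip2 Python bindings against the .mmdb file referenced in the configuration above; the IP address below is just a placeholder.

import geoip2.database
import geoip2.errors

reader = geoip2.database.Reader("/dev/shm/geoip-city.mmdb")   # same file as in the Heritrix config

def country_code(ip_address):
    """Return the ISO 3166-1 country code for an IP address, or None if unknown."""
    try:
        return reader.city(ip_address).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        return None

# A host would be ruled in scope by the DecideRule above if this returns "GB".
print(country_code("81.2.69.142"))   # placeholder address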

Testing

To test the above we performed a short, shallow test crawl of 1,000,000 seeds. A relatively recent addition to Heritrix's DecideRules is this property:

<property name="logToFile" value="true" />

During a crawl, this will create a file, scope.log, containing the final decision for every URI along with the specific rule which made that decision. For example:

2014-11-05T10:17:39.790Z 4 ExternalGeoLocationDecideRule ACCEPT http://www.jaymoy.com/
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT https://t.co/Sz15mxnvtQ
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT http://twitter.com/2017Hull7

So, of the above, the last two URLs were rejected outright, while the first was ruled in scope by the ExternalGeoLocationDecideRule.

Parsing the full output from our test crawl, we find:

  • 89,500,755 URLs downloaded in total.
  • 26,072 URLs which were not on .uk domains (and therefore would, ordinarily, not be in scope).
    • 137 distinct hosts.
[Image: "British Isles Euler diagram 15" by TWCarlson (own work), licensed under CC0 via Wikimedia Commons]

2014 Domain Crawl

The process for examining the output of our first Domain Crawl is largely unchanged from the above. The only real difference is the size: the scope.log file gets very large when dealing with domain-scale data. It logs not only the decision for every URL downloaded but also for every URL not downloaded (and the reason why).

Here we can use a simple sed command (admittedly implemented slightly differently, distributed via Hadoop Streaming to cope with the scale) to parse the logs' output:

sed -rn 's@^.+ ExternalGeoLocationDecideRule ACCEPT https?://([^/]+)/.*$@\1@p' scope.log | grep -Ev "\.uk$" | sort -u

This will produce a list of all the distinct hosts which have been ruled in-scope by the ExternalGeoLocationDecideRule (excluding, of course, any .uk hosts which are considered in scope by virtue of a different part of the legislation).

This produced a list of 2,544,426 hosts ruled in scope by the geolocation process.

By Roger G. Coram, Web Crawl Engineer, The British Library