UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites


Introduction

News and views from the British Library's web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

26 April 2016

Easter Rising 1916 Centenary in Print and Digital

Ireland has been gripped by commemorations of the Easter Rising over the last month. The Rising took place from 24 to 29 April 1916 in Dublin. A packed programme of events and activities took place across Ireland and in Irish communities further afield to commemorate this centenary.

In March 2016, addressing a colloquium at the Bodleian Library, Oxford, the Irish Ambassador to the United Kingdom, His Excellency Daniel Mulhall, emphasised the transnational and inclusive nature of the commemoration programme in his opening remarks. The 1916 Rising had a global impact, with ripples felt as far afield as India and elsewhere in Asia. This is reflected in the range of events taking place in the United Kingdom, supported by the Irish Embassy.

In military terms the Rising was a failure, and it had grave consequences for the people of Dublin: 415 people were killed, the majority of them civilians.

Print
Turning to the documentation of the Rising, the Library's collections contain a number of interesting related documents. The British Library does not hold an original broadside of the Proclamation of an Irish Republic. Nevertheless, later examples of the document were acquired retrospectively.

The earliest example of a version of the proclamation in the British Library's collections can be found at C.S.A.24/3.(1.). This is interesting from a bibliographical standpoint because it is the first entry under the new heading in the British Library Printed Catalogue to 1975:

[Image: entry in the British Library Printed Catalogue]

Provisional Government of the Irish Republic 1916. Miscellaneous Public documents. 

That the Library classified this proclamation as a public document, and gave it the C.S.A. pressmark prefix for official publications (which originates from the 1890s), is of particular interest. A third point of interest is that this version of the proclamation is the only item in the green-bound guard-book which is embossed on the spine in gold.

[Image: Poblacht na hÉireann proclamation, 1916]

IRELAND. PROCLAMATIONS, ETC.

The red (purchase) stamp appears on the reverse of the document, but because of the way it has been mounted in the volume it is difficult to say exactly when the item was acquired: the stamp appears to read 15 May '59. The volume itself bears the British Museum binder's stamp B.M.1961 on the inside of the rear board. These dates indicate that this item, as with other ephemera relating to the 1916 Rebellion, was acquired retrospectively.

[Image: Poblacht na hÉireann proclamation reproduction, 1941]

The second example of the proclamation is a more ornate affair. It is a single sheet dating from 1941, measuring approximately 325mm x 255mm. The text of the document is laid out in the same fashion as the original, but the typeface has been standardised, removing the anomalies of the original, and the list of signatories has been centred rather than justified to the right as in the original. What is most striking about this item are the portraits of the seven signatories surrounding the text, connected by the decorative border. At the bottom centre of the border, in a circle, is the Irish Army sunburst emblem, designed by Eoin MacNeill; interestingly, it is reproduced without the inscription "Óglaigh na hÉireann" (Irish Volunteers).

[Images: Irish War News, front page and page 4]

The third document is a piece of contemporary ephemera which traces its lineage to the focal point of the rebellion. Dated Tuesday, April 25 1916, the last page of the first issue of Irish War News carries an article headed:

“Stop Press (Irish) ‘War News’ is published to-day because a momentous thing has happened. The Irish Republic has been declared in Dublin and a Provisional Government has been appointed to administer its affairs.”

The article goes on to name the signatories of the proclamation as the Provisional Government while outlining the situation in Dublin from the rebel perspective.

Digital
The Rising, or more particularly the centenary of the events in Dublin a hundred years ago, is being explored and represented in new ways thanks to technology and the work of colleagues at Trinity College Dublin and the Bodleian Library Oxford. In the last year they have built and curated a collection of websites related to the commemoration.

These have been archived as part of the open UK Web Archive. To have the opportunity to build this collection of Irish and UK websites is an exciting prospect for the future of web-published content. This endeavour illustrates how the internet is not confined by national boundaries. The work on the Easter Rising collection exemplifies how archivists working together can build a contemporary collection which provides a range of perspectives from all corners of the .uk and .ie domains.

Archiving websites about anniversaries and centenaries such as Easter 1916 is of prime importance because such sites can be transient and are soon overwritten or taken down. Archiving them creates a research resource for the future which offers scholars and anyone interested the opportunity to explore and examine the response to this centenary on the published web.

The Easter Rising collection is currently a growing part of the UK Web Archive special collections where it can be freely consulted online.

By Jeremy Jenkins, Curator Emerging Media, The British Library
@_jerryjenkins

 

Further Reading

Bouch, Joseph J. “The Republican Proclamation of Easter Monday, 1916”, Bibliographical Society of Ireland, Publications, vol. 5, no. 3 (1936). General Reference Collection: Ac.9708/2 [a reissue].

The Easter Proclamation of the Irish Republic, MCMXVI. Dublin: Dolmen Press, 1960. General Reference Collection: Cup.510.ak.37

The Easter Proclamation of the Irish Republic 1916. [S.l.]: Dolmen Press, 1976. Document Supply Shelfmark: D76/23312

 

 

15 February 2016

Introducing SHINE 2.0 - A Historical Search Engine


In 2015, as part of the Big UK Domain Data for the Arts and Humanities project, we released our first ‘historical search engine’ service. We’ve publicised it at IDCC15, the 2015 IIPC GA and at the first RESAW conference, and so far it has been very well received. Not only has it led to some excellent case studies that we can use to improve our services, but other web archives have shown interest in re-using the underlying open source code. In particular, some of our Canadian colleagues have successfully launched webarchives.ca, which lets users search ten years’ worth of archived websites from Canadian political parties and political interest groups (see here for more details).

Even bigger data!
But we remained frustrated for two reasons. Firstly, when we built that first service, we could not cope with the full scale of the 1996-2013 dataset, and we only managed to index the two billion resources up to 2010. Secondly, we had not yet learned how to cope with more than one or two users at a time, so we were loath to publicise the website too widely in case it crashed. So, over the last six months, and with the guidance of Toke Eskildsen and Thomas Egense at the State Library of Denmark, we’ve been working on resolving these scaling issues (their tech blog is definitely worth a look if you’re into this kind of thing).

Thanks to their input, I’m happy to announce that our historical search prototype now spans the whole period from 1996 to 6 April 2013, and contains 3,520,628,647 distinct records.

[Chart: total indexed resources over time]

Broken down by year, you can see there’s a lot of variation, depending on the timings of the global crawls from which this collection was drawn. This is why our trends visualisation plots query results as a percentage of all the resources crawled in each year rather than absolute figures. However, the overall variation and the fact that the 2013 chunk only covers the first three months should be kept in mind when interpreting the results.
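By way of illustration, here is a minimal sketch of that normalisation step, assuming per-year totals and per-year query hit counts exported as simple CSV files (the filenames and column names are examples, not part of the service):

    import csv
    from collections import defaultdict

    totals_by_year = {}               # year -> total resources indexed for that year
    hits_by_year = defaultdict(int)   # year -> resources matching the query

    with open("totals.csv") as f:     # assumed columns: year,total
        for row in csv.DictReader(f):
            totals_by_year[int(row["year"])] = int(row["total"])

    with open("query_hits.csv") as f: # assumed columns: year,hits
        for row in csv.DictReader(f):
            hits_by_year[int(row["year"])] = int(row["hits"])

    # The plotted value is the share of everything crawled that year that matches,
    # which irons out the year-to-year variation in crawl sizes.
    for year in sorted(totals_by_year):
        total = totals_by_year[year]
        pct = 100.0 * hits_by_year[year] / total if total else 0.0
        print(f"{year}\t{pct:.3f}%")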

Time travel?
You might also notice there seem to be a few data points from as early as 1938, and even from 2072! This tiny proportion of results corresponds to malformed or erroneous records, although currently it’s not clear if the 1,714 results from 1995 are genuine or not. No one ever said Big Data would be Clean Data.

De-duplication of records
Furthermore, we’ve decided to change the way we handle web archiving records that have been ‘de-duplicated’. When the crawler visits a page and finds precisely the same item as before, instead of storing another copy, we can store a so-called “revisit record” that refers to the earlier copy rather than duplicating it. This crude form of data compression can save a lot of disk space for frequently crawled material, and its use has grown over time. For example, looking at the historical dataset, you can see that 30% of the 2013 results were duplicates.

[Chart: proportion of revisit records by year]
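If you want to see how common revisit records are in your own data, here is a minimal sketch using the open-source warcio library (the input filename is just an example):

    from collections import Counter
    from warcio.archiveiterator import ArchiveIterator

    counts = Counter()
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            # 'response' records carry the actual payload; 'revisit' records just
            # point back at an earlier, identical capture instead of storing it again.
            counts[record.rec_type] += 1

    captures = counts["response"] + counts["revisit"]
    if captures:
        share = 100.0 * counts["revisit"] / captures
        print(f"{counts['revisit']} revisit records ({share:.1f}% of captures)")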

However, as these records don’t hold the actual item, our indexing process was not able to index these items properly. Over the next few weeks, we shall scan through these 65 million revisit records and ‘reduplicate’ them. This does mean that, for now, the results from 2013 might be a bit misleading in some cases. We also failed to index the last 11,031 of the 515,031 WARC files that make up this dataset (about 2% of the total, likely affecting the 2010-2013 results only), simply because we ran out of disk space. The index is using up 18.7TB of SSD storage, and if we can find more space, we’ll fill in the rest.

Do try it at home
In the meantime, please explore our historical archive and tell us what you find! It might be slow sometimes (maybe 10-20 seconds), so please be patient, but we’re pretty confident that it will be stable from now on.

[Chart: early social media trend]

[Chart: later social media trend]

[Chart: austerity trend]

https://www.webarchive.org.uk/shine

By Andy Jackson, British Library Web Archiving Technical Lead

20 November 2015

The Provenance of Web Archives


Over the last few years, it’s been wonderful to see more and more researchers taking an interest in web archives. Perhaps we are even teetering into the mainstream when a publication like Forbes carries an article digging into the gory details of how we should document our crawls in How Much Of The Internet Does The Wayback Machine Really Archive?

Even before the data-mining BUDDAH project raised these issues, we’d spent a long time thinking about this, and we’ve tried our best to capture as much of our own crawl context as we can. We don’t just store the WARC request and response records (which themselves are much better at storing crawl context than the older ARC format); we also store:

  • The list of links that the crawler found when it analysed each resource (this is a standard Heritrix3 feature).
  • The full crawl log, which records DNS results and other situations that may not be reflected in the WARCs.
  • The crawler configuration, including seed lists, scope rules, exclusions etc.
  • The versions of the software we used (in WARC Info records and in the PREMIS/METS packaging).
  • Rendered versions of original seeds and home pages, as PNG and as HTML, and associated metadata.

In principle, we believe that the vast majority of questions about how and why a particular resource has been archived can be answered by studying this additional information. However, it’s not clear how this would really work in practice. Even assuming we have caught the most important crawl information, reconstructing the history behind any particular URL is going to be highly technical and challenging work because you can’t really understand the crawl without understanding the software (to some degree at least).
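As a small example of what that detective work looks like in practice, here is a sketch that tallies fetch statuses from a Heritrix3 crawl log; the whitespace-separated column layout assumed here (timestamp, status, size, URL, ...) should be checked against your own logs:

    from collections import Counter

    status_counts = Counter()
    with open("crawl.log") as log:                # example path
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue
            status, url = fields[1], fields[3]    # assumed positions: status, then URL
            status_counts[status] += 1
            # Negative codes are the crawler's own 'not fetched' statuses
            # (out of scope, robots exclusions, DNS failures and so on).
            if status.startswith("-"):
                print(f"not fetched ({status}): {url}")

    print(status_counts.most_common(10))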

But there are definitely gaps that remain - in particular, we don’t document absences well. We don’t explicitly document precisely why certain URLs were rejected from the crawl, and if we make a mistake and miss a daily crawl, or mis-classify a site, it’s hard to tell the difference between accident and intent from the data. Similarly, we don’t document every aspect of our curatorial decisions, e.g. precisely why we choose to pursue permissions to crawl specific sites that are not in the UK domain. Capturing every mistake, decision or rationale simply isn’t possible, and realistically we’re only going to record information when the process of doing so can be largely or completely automated (as above, see also You get what you get and you don’t get upset).

And this is all just at the level of individual URLs. When performing corpus analysis, things get even more complex because crawl configurations vary within the crawls and change over time. Right now, it’s not at all clear how best to combine or summarize fine-grained provenance information in order to support data-mining and things like trend analysis. But, in the context of working on the BUDDAH project, we did start to explore how this might work.

For example, the Forbes article brings up the fact that crawl schedules vary, and so not every site has been crawled consistently, e.g. every day. Of course, we found exactly the same kind of thing when building the Shine search interface, and this is precisely why our trend graphs currently summarize the trends by year. In other words, if you average the crawled pages by year, you can wash out the short-lived variations. Of course, large crawls can last months, so really you want to be able to switch between different sampling parameters (quarterly, six-monthly, or annual, starting at any point in the year, etc.), so that you can check whether any perceptible trend may be a consequence of the sampling strategy (not that we got as far as implementing that, yet).
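A rough sketch of that kind of re-sampling, assuming you can export one row per matching capture with its crawl date (the CSV layout and column names here are hypothetical), might look like this with pandas:

    import pandas as pd

    # Assumed export: one row per matching capture, columns crawl_date,url.
    hits = pd.read_csv("query_hits.csv", parse_dates=["crawl_date"]).set_index("crawl_date")

    # Count matching captures per quarter, per six months and per year, so that a
    # perceived trend can be checked against different sampling windows.
    for label, rule in [("quarterly", "QS"), ("six-monthly", "2QS"), ("annual", "YS")]:
        print(label)
        print(hits["url"].resample(rule).count())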

"Global Financial Crisis"

Similarly, notice that Shine shows you the percentage of matching resources by year, rather than the absolute number of matching documents. This is because the fraction of the crawled web that matches your query is generally more useful than the raw number of matching resources, where the crawl scheduling tends to obscure what’s going on (again, it would be even better to be able to switch between the two so you can better understand what any given trend means; if you download the data for the graph you get the absolute figures as well as the relative ones).

More useful still would be the ability to pick any other arbitrary query to be the normalization baseline, so you could plot matching words against total number of words per year, or matching links per total number of links, and so on. The crucial point is that if your trend is genuine, you can use sampling and normalization techniques to test that, and to find or rule out particular kinds of biases within the data set.
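In code terms the idea is simply to divide one query’s yearly counts by another’s rather than by the whole crawl; the numbers below are made up purely for illustration:

    def normalised_trend(query_counts, baseline_counts):
        """Express one query's yearly hit counts as a fraction of a baseline query's."""
        return {year: query_counts.get(year, 0) / baseline
                for year, baseline in baseline_counts.items() if baseline}

    # Illustrative figures only, not real values from the archive.
    austerity = {2008: 1200, 2009: 5400, 2010: 9800}
    economy = {2008: 90000, 2009: 110000, 2010: 105000}
    print(normalised_trend(austerity, economy))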

This is also why the trend interface offers to show you a random sample of the results underlying a trend. For example, it makes it much easier to quickly ascertain whether the apparent trend is due to a large number of false-positive hits coming from a small number of hosts, thus skewing the data.

I believe there will be practical ways of summarizing provenance information in order to describe the systematic biases within web archive collections, but it’s going to take a while to work out how to do this, particularly if we want this to be something we can compare across different web archives. My suspicion is that this will start from the top and work down - i.e. we will start by trying different sampling and normalization techniques, and discover what seems to work, then later on we’ll be able to work out how this arises from the fine details of the crawling and curation processes involved.

So, while I hope it is clear that I agree with the main thrust of the article, I must admit I am a little disappointed by its tone.

If the Archive simply opens its doors and releases tools to allow data mining of its web archive without conducting this kind of research into the collection’s biases, it is clear that the findings that result will be highly skewed and in many cases fail to accurately reflect the phenomena being studied.

Kalev Leetaru, How Much Of The Internet Does The Wayback Machine Really Archive?

The implication that we should not enable access to our collections until we have deduced its every bias is not at all constructive (and if it inhibits other organisations from making their data available, potentially quite damaging).

No corpus, digital or otherwise, is perfect. Every archival sliver can only come to be understood through use, and we must open up to and engage with researchers in order to discover what provenance we need and how our crawls and curation can be improved.

There are problems we need to document, certainly. Our BUDDAH project is using Internet Archive data, so none of the provenance I listed above was there to help us. And yes, when providing access to the data we do need to explain the crawl dynamics and parameters - you need to know that most of the Internet Archive crawls omit items over 10MB in size (see e.g. here), that they largely obey robots.txt (which is often why mainstream sites are missing), and that right now everyone’s harvesting processes are falling behind the development of the web.

But researchers can’t expect the archives to already know what they need to know, or to know exactly how these factors will influence their research questions. You should expect to have to learn why the dynamics of a web crawler mean that any data-mined ranking is highly unlikely to match up with popularity as defined by Alexa (which is based on web visitors rather than site-to-site links). You should expect to have to explore the data to test for biases, to confirm the known data issues and to help find the unknown ones.

“Know your data” applies to both of us. Meet us half way.

What we do lack, perhaps, is an adequate way of aggregating these experiences so that new researchers do not have to waste time re-discovering and re-learning these things. I don’t know exactly what this would look like, but the IIPC Web Archiving Conferences provide a strong starting point and a forum to take these issues forward.

By Andy Jackson, Web Archive Technical Lead, The British Library

30 October 2015

Who is best - Cats or Dogs?


Thursday 29 October was #NationalCatDay, so the UK Web Archive has taken the opportunity to answer the BIG question that everyone is asking – are cats better than dogs? It is a rivalry as old as time itself, and whilst it might be tricky to say empirically who is ‘best’, we can prove who is the most popular in the UK web space.

Using the SHINE interface we can look at trends across all of the .uk websites, based on the number of pages in which a certain term is used over the years 1996-2013.

We want to be sure to capture as many cat and dog references as possible, so the following queries are a good start: ‘cat OR kitten OR moggy OR kitty’ versus ‘dog OR puppy OR mutt’.

And the winner is [drumroll]…….

[Chart: cats vs dogs trend, 1996-2013]

CATS!

That casual air of superiority that cats have appears to be fully justified.

Also, in 2005, in what we are now calling ‘Peak Cat’, pages with a mention of cats accounted for 4.5% of the ENTIRE .uk domain, as captured by the Internet Archive. Yes indeed, the humble moggy is popular with humans.

Try your own trend analysis: https://www.webarchive.org.uk/shine

By Jason Webber, Web Archiving Engagement and Liaison Manager

16 October 2015

Playing at Web Archiving


A few months ago, a colleague suggested that we should come up with ways of helping people learn about the main stages of web archiving, and to help them understand some of the more common technical terminology.

I got a bit carried away…

…because at the same time, I’d been hearing a lot about Twine and about the interactive fiction that people can build using it. So, I thought, why not use an interactive fiction engine to build a ‘web archiving simulator’ that takes you through the core web archiving life-cycle? A way to ‘learn by doing’ without having all the baggage involved in doing it for real?

Well, because it’ll suck up a tonne of time learning about Twine and twinery.org and the two different versions and fiddling about with the structure and with the prose…

Editing the Twine

After a few evenings I ran out of steam, and the experiment has been sitting in a browser tab since then, unfinished.

I enjoyed building it, but it’s really not going to get finished any time soon. I’m not even sure what ‘finished’ would look like any more. So I may as well publish it as it is. If you want to play the game of web archiving, click the link below…

Understanding Web Archiving

I’ve also made the source export available, which you should be able to upload at twinery.org if you want to extend it or just see how it works.

Let me know what you think!

Andy Jackson, British Library Web Archiving Technical Lead

x-post from http://anjackson.net/2015/08/19/web-archiving-twine/

 

23 September 2015

British Stand-Up Comedy Archive Special Collection


BSUCA logo

The British Stand-Up Comedy Archive was established at the University of Kent in 2013, following the deposit of the personal archive of the stand-up comedian, writer and broadcaster Linda Smith (1958-2006). Even prior to this deposit the University already had a longstanding interest in stand-up comedy and comic performance through teaching (at both BA and MA levels) and research (at PhD level and through the research interests of School of Arts staff). After this initial deposit, other comedians were approached to see whether there was a demand from comedians, agents and venues to archive their material; those who deposited material early in the life of the archive included the comedian and political activist Mark Thomas, and Tony Allen, one of the pioneers of the alternative cabaret/comedy movement in the late 1970s and 1980s.

[Image: 'Tuff Lovers' promotional pamphlet]
Promotional publication for 'Tuff Lovers'. This is a four-page pamphlet including photographs, achievements, reviews and contact details. The other side of this pamphlet is shown in image BSUCA/LS/3/2/1/010(1). (c) Linda Smith estate. Photos by Pat McCarthy, design by Stephen Houfe.

In 2014 the University of Kent offered funding for a number of projects to celebrate its 50th anniversary and one of these became the British Stand-Up Comedy Archive (BSUCA). The BSUCA has a number of aims: to ensure that the archives and records of stand-up comedy in the UK are cared for in order to permanently preserve them; to ensure that these archives are universally accessible, discoverable and available; that the archives are actually used, and used in a variety of ways (popular culture, academic research, teaching, journalism, general enjoyment); and to acquire more offers of appropriate deposits. We also have an internal goal, which is to establish standards, workflows, and policies (with regards to digitisation, digital preservation and deposit negotiations) which aim to inform the future collecting activities of the University’s Special Collections & Archives department.

[Screenshot: beyondthejoke.co.uk]

One of the things I was keen to do when I was appointed as Archivist in January 2015 was to ensure that websites and social media relating to stand-up comedy were being archived. So much of how comedians promote and publicise themselves today, and interact with their audience, is done through social media and websites, and I’ve already noticed that websites referenced in material in the BSUCA collections have disappeared without being captured. So I was delighted that the UK Web Archive team were happy for me to curate a special ‘British Stand-Up Comedy Archive’ collection for the UK Web Archive! My approach so far has been two-fold.

Approach 1: filling the gaps

One focus has been on nominating websites for inclusion which relate to collections that we already have within the British Stand-Up Comedy Archive. For example, I have been nominating the websites and social media accounts of those whose work we have been physically and digitally archiving at the University, such as Attila the Stockbroker’s website and Twitter account. I have also been nominating sites which complement the collections we have. For example, within The Mark Thomas Collection we have copies of articles he has written, but only those which he collected himself; in fact there are many more which he has written which are only available online. The idea behind this approach is that we can ‘fill the gaps’ for researchers interested in those whose archives we have, by ensuring that other material relevant to that comedian/performer is being archived. These websites are provided in sub-categories with the name of the collection they relate to (i.e. Linda Smith Collection, Mark Thomas Collection).

[Screenshot: markthomasinfo.co.uk]

Approach 2: providing an overview of stand-up comedy in the UK today

As we are trying to collect material related to stand-up comedy in the UK I think that it is really important to try to capture as much information as possible about current comedians and the current comedy scene, nationally and locally. So my second focus has been on nominating websites which provide an overview of stand-up comedy in the UK today. Rather than initially focussing on nominating the websites of individual comedians (which would be an enormous task!) I have instead been nominating websites which are dedicated to comedy in the UK, both at a national level, such as Chortle and Beyond the Joke, and those at a regional level such as Giggle Beats (for comedy in the north of England) and London is Funny. I’ve also nominated the comedy sections in national news outlets like the Guardian and The Huffington Post (UK), as well as in regional news outlets such as The Skinny (Scotland and the north west of England), The Manchester Evening News, and The York Press. These websites include news, interviews with comedians and others involved in comedy, as well as reviews and listings of upcoming shows. The idea was that capturing these sorts of websites would help to demonstrate which comedians were performing, where they were performing, and perhaps some of the themes discussed by comedians in their shows. These websites have been categorised into the sub-category 'Stand-up news, listings and reviews'. 

I’ve also been focusing on the websites of comedy venues in order to document the variety of comedy clubs there are, to provide an overview of the comedians who are performing, as well as to document other issues like the cost of attending a comedy club night. Many of the clubs whose websites have been nominated are quite longstanding venues, such as Downstairs at the Kings Head (founded in 1981), the Banana Cabaret Club in Balham (established 1983), and The Stand Comedy Club (established in Edinburgh in 1995). And of course I’ve also been focusing on comedy festivals around the UK. Much material relating to the Edinburgh Festival Fringe had already been included in the UK Web Archive, but websites for Free Fringe events (which many see as important for the Edinburgh Festival Fringe*), such as the Free Festival and PBH’s Free Fringe, have now been nominated. I’ve also been nominating websites for comedy festivals around the UK, ranging from large established festivals such as the (Dave) Leicester Comedy Festival and the Machynlleth Comedy Festival, to smaller festivals such as the Croydon Comedy Festival and Argcomfest (Actually Rather Good Comedy Festival). The sub-category of 'Venues and festivals' is by far the largest sub-category so far!

[Screenshot: downstairsatthekingshead.com]

Other features of current stand-up comedy that have been captured include organisations such as the Comedy Support Act (a charity funded by benefit shows which aims to provide emergency funds and assistance to professional comedians who find themselves in financial hardship through serious illness or accident) and organisations and events which celebrate and promote women in comedy such as What The Frock!, Laughing Cows Comedy, and the Women in Comedy Festival.

Next steps:

For me, the idea behind the special collection has been to begin to ensure that websites and social media relating to stand-up comedy in the UK are being archived for current and future researchers (and others) interested in stand-up comedy. But there are so many more websites that I haven't yet been able to nominate, particularly those of individual comedians or performers. The UK Web Archive is open to all (as long as the website is part of the UK web domain), so if there are websites relating to UK stand-up comedy that you want to be archived in the UK Web Archive please nominate them here: http://www.webarchive.org.uk/ukwa/info/nominate

* Luke Toulson, 'Why free is the future of the fringe...and 7 more ways to improve the festival', http://www.chortle.co.uk/correspondents/2013/08/04/18425/why_free_is_the_future_of_the_fringe; and Nick Awde, 'Free shows are ringing the Edinburgh Fringe changes', https://www.thestage.co.uk/opinion/2015/setting-theatre-free-edinburgh/ 

 

For further information about the British Stand-Up Comedy Archive find out more at these links:

Blog http://blogs.kent.ac.uk/standupcomedyarchive/

Twitter https://twitter.com/unikentstandup

Flickr https://www.flickr.com/photos/britishstandupcomedyarchive/albums

Soundcloud https://soundcloud.com/stand-up-comedy-archive

 

by Elspeth Millar, Project Archivist, British Stand-Up Comedy Archive at University of Kent

18 September 2015

Ten years of the UK web archive: what have we saved?


I gave the following presentation at the 2015 IIPC GA. If you prefer, you can read the rough script with slides below rather than watch the video.

  01

 

02

We started archiving websites by permission towards the end of 2004 (e.g. the Hutton Inquiry), building up what we now call the Open UK Web Archive. In part, this was considered a long-term investment, helping us build up the skills and infrastructure we need to support large-scale domain crawls under non-print Legal Deposit legislation, which we’ve been performing since 2013.

Furthermore, to ensure we have as complete a record as possible, we also hold a copy of the Internet Archive’s collection of .uk domain web material up until the Legal Deposit regulations were enacted. However, these regulations are going to be reviewed, and could, in principle, be withdrawn. So, what should we do?

To ensure the future of these collections, and to reach our goals, we need to be able to articulate the value of what we’ve saved. And to do this, we need a better understanding of our collections and how they can be used.

Understanding Our Collections

So, if we step right back and just look at those 8 billion resources, what do we see? Well, the WARCs themselves are just great big bundles of crawled resources. They reflect the harvester and the baler, not the need.

  03

So, at the most basic level, we need to be able to find things and look at them, and we use OpenWayback to do that. This example shows our earliest archived site, reconstructed from the server-side files of the British Library’s first web server. But you can only find it if you know that the British Library web site used to be hosted at “portico.bl.uk”.

  04

But that mode of access requires you to know what URLs you are interested in, so we have also built up various themed collections of resources, making the archive browsable.

  05

However, we’re keenly aware that we can’t catalog everything.

To tackle this problem, we have also built full-text indexes of our collections. In effect, we’ve built an historical search engine, and having invested in that level of complexity, it has opened up a number of different ways of exploring our archives. The “Big Data Research” panel later today will explore this in more detail, but for now here’s a very basic example.

  06

This graph shows the fraction of URLs from ac.uk hosts and co.uk hosts over time. We can see that back in 1996, about half the UK domain was hosted on academic servers, but since then co.uk has come to dominate the picture. Overall, in absolute terms, both have grown massively during that period, but as a fraction of the whole, ac.uk is much diminished. This is exactly the kind of overall trend that we need to be aware of when we are trying to infer something from a more specific trend, such as the prevalence of medical terms on the uk web.

However, these kinds of user interfaces are hard to build and are forced to make fairly strong assumptions about what the user wants to know. So, to complement our search tools, we also generate various secondary datasets from the content so more technically-adept users can explore our data using their own tools. This provides a way of handing rich and interesting data to researchers without handing over the actual copyrighted content, and has generated a reasonable handful of publications so far.

  07

This process also pays dividends directly to us, in that the way researchers have attempted to exploit our collections has helped us understand how to do a better job when we crawl the web. As a simple example, one researcher used the 1996 link graph to test his new graph layout algorithm, and came up with this visualization.

  08

For researchers, the clusters of connectivity are probably the most interesting part, but for us, we actually learned the most from this ‘halo’ around the edge. This halo represents hosts that are part of the UK domain, but are only linked to from outside the UK domain. Therefore, we cannot build a truly representative picture of the UK domain unless we allow ourselves to stray outside it.

The full-text indexing process also presents an opportunity to perform deeper characterization of our content, such as format and feature identification and scanning for preservation risks. This has confirmed that the vast majority of the content (by volume) is not at risk of obsolescence at the format level, but has also illustrated how poorly we understand the tail of the format distribution and the details of formats and features that are in use.

  09

For example, we also build an index that shows which tags are in use on each HTML page. This means we can track the death and birth of specific features like HTML elements. Here, we can see the death of the <applet>, <blink> and <font> tags, and the massive explosion in the usage of the <script> tag. This helps us understand the scale of the preservation problems we face.
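The per-page signal behind that kind of graph is simple: the set of element names each document uses. A minimal sketch of extracting it (using BeautifulSoup rather than our actual indexing stack) might look like this:

    from bs4 import BeautifulSoup

    def tags_used(html):
        """Return the set of element names used in a page; aggregated over a whole
        crawl, this is what lets you chart the rise and fall of specific tags."""
        soup = BeautifulSoup(html, "html.parser")
        return {tag.name for tag in soup.find_all(True)}

    sample = "<html><body><font size='2'>Hi</font><script>var x=1;</script></body></html>"
    print(sorted(tags_used(sample)))   # ['body', 'font', 'html', 'script']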

Putting Our Archives In Context

But all this is rather inward looking, and we wanted to find ways of complementing these approaches by comparing our collections with others, and especially with the live web. This is perhaps the most fundamental way of stating the value of what we’ve collected, as it addresses the basic quality of the web that we need to understand - its volatility.

  10

How has our archival sliver of the web changed? Are the URLs we’ve archived still available on the live web? Or are they long since gone? If those URLs are still working, is the content the same as it was?

  11

One option would be to go through our archives and exhaustively examine every single URL to work out what has happened to it. However, the Open UK Web Archive contains many millions of archived resources, and even just checking their basic status would be very time-consuming, never mind performing any kind of detailed comparison of the content of those resources.

Sampling The URLs

Fortunately, to get a good idea of what has happened, we don’t need to visit every single item. We can use our index to randomly sample 1,000 URLs from each year the archive has been in operation. We can then try to download those URLs again, and use the results to build up a picture that compares our archival holdings to the current web.

  12

As we download each URL, if the host has disappeared, or the server is unreachable, we say it’s GONE. If the server responds with an ERROR, we record that. If the server responds but does not recognize the URL, we classify it as MISSING, but if the server does recognize the URL, we classify it as MOVED or OK depending on whether a chain of redirects was involved. Note that we did look for “Soft 404s” at the same time, but found that these are surprisingly rare on the .uk domain.
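A minimal sketch of that classification step, using the requests library (and leaving out the soft-404 detection), might look something like this:

    import requests

    def classify(url, timeout=30):
        """Classify a previously archived URL against the live web."""
        try:
            response = requests.get(url, timeout=timeout, allow_redirects=True)
        except requests.RequestException:
            return "GONE"                  # host vanished, unreachable, or timed out
        if response.status_code in (404, 410):
            return "MISSING"               # server is up but no longer knows the URL
        if response.status_code >= 400:
            return "ERROR"                 # some other server error
        if response.history:
            return "MOVED"                 # a chain of redirects was involved
        return "OK"

    print(classify("http://example.com/"))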

Plotting the outcome by year, we find this result:

  13

The overall trend clearly shows how the items we have archived have disappeared from the live web, with individual URLs being forgotten as time passes. Looking at 2013, even after just two years, 40% of the URLs are GONE or MISSING.

Is OK okay?

However, so far, this only tells us what URLs are still active - the content of those resources could have changed completely. To explore this issue, we have to dig a little deeper by downloading the content and trying to compare what’s inside.

  14

We start by looking at a simple example - this page from the National Institute for Health and Care Excellence. If we want to compare this page with an archived version, one simple option is to ignore the images and tags, and just extract all the text.

  15

However, comparing these big text chunks is still rather clumsy and difficult to scale, so we go one step further and reduce the text to a fingerprint [1].

A fingerprint is conceptually similar to the hashes and digests that most of us are familiar with, like MD5 or SHA-256, but with one crucial difference. When you change the input to a cryptographic hash, the output changes completely - there’s no way to infer any relationship between the two, and indeed it is that very fact that makes these algorithms suitable for cryptography.

  16

For a fingerprint, however, if the input changes a little, then the output only changes a little, and so it can be used to bring similar inputs together. As an example, here are our fingerprints for our test page – one from earlier this year and another from the archive. As you can see, this produces two values that are quite similar, with the differences highlighted in red. More precisely, they are 50% similar, as you’d have to edit half of the characters to get from one to the other.
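The post does not spell out the exact algorithm, but a fuzzy-hashing tool such as ssdeep behaves in exactly this way; a sketch of fingerprinting the extracted text of two versions of a page (the tag-stripping is deliberately crude, and the filenames are examples) might be:

    import re
    import ssdeep   # context-triggered piecewise ("fuzzy") hashing

    def fingerprint(html):
        """Strip tags, collapse whitespace and fingerprint the remaining text."""
        text = re.sub(r"<[^>]+>", " ", html)
        text = re.sub(r"\s+", " ", text).strip()
        return ssdeep.hash(text)

    archived = fingerprint(open("archived.html").read())
    live = fingerprint(open("live.html").read())
    print(archived)
    print(live)
    print("similarity:", ssdeep.compare(archived, live))   # 0 (dissimilar) to 100 (identical)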

To understand what these differences mean, we need to look at the pages themselves. If we compare the two, we can see two small changes, one to the logo and one to the text in the body of the page.

  17

  18

But what about all the differences at the end of the fingerprint? Well, if we look at the whole page, we can see that there are major differences in the footer. In fact, it seems the original server was slightly mis-configured when we archived it in 2013, and had accidentally injected a copy of part of the page inside the overall page HTML.

  19

So, this relatively simple text fingerprint does seem to reliably reflect both the degree of changes between versions of pages, and also where in the pages those changes lie.

Processing all of the ‘MOVED’ or ‘OK’ URLs in this way, we find:

  20

We can quickly see that for those URLs that appeared to be okay, the vast majority have actually changed. Very few are binary identical, and while about half of the pages remain broadly similar after two years, that fraction tails off as we go back in time.

We can also use this tactic to compare the OK and MOVED resources.

  21

For resources that are two years old, we find that URLs that appear to be OK are only identical to the archived versions one third of the time, similar another third of the time, but the remaining third are entirely dissimilar. Not surprisingly, the picture is much worse for MOVED URLs, which are largely dissimilar, with less than a quarter being similar or identical.

The URLs Ain’t Cool

Combining the similarity data with the original graph, we get this result:

  22

Shown in this way, it is clear that very few archived resources are still available, unchanged, on the current web. After just two years, 60% have gone or have changed into something unrecognizable [2].

This rot rate is significantly higher than I expected, so I began to wonder whether this is a kind of collection bias. The Open UK Web Archive often prioritized sites known to be at risk, and that selection criterion seems likely to affect the overall trends. So, to explore this issue, I also ran the same analysis over a randomly sampled subset of our full, domain-scale Legal Deposit collection.

  23

However, the results came out almost exactly the same. After two years, about 60% of the content has GONE or is unrecognizable. Furthermore, looking at the 2014 data, we can see that after just one year, although only 20% of the URLs themselves have rotted, a further 30% of the URLs are unrecognizable. We’ve lost half the UK web in just one year.

This raised the question of whether this instability can be traced to specific parts of the UK web. Is ac.uk more stable than co.uk, for example?

  24

Looking at those results showed that, in fact, there’s not a great deal to choose between them. The changes to the NHS during 2013 seem to have had an impact on the number of identical resources, with perhaps a similar story for the restructuring of gov.uk, but there’s not that much between all of them.

What We’ve Saved (2004-2014)

Pulling the Open and Legal Deposit data together, we can get an overview of the situation across the whole decade. For me, this big, black hole of content lost from the live web is a powerful way of visualizing the value of what we’ve saved over those ten years.

  25

Summary

I expected the rot rate to be high, but I was shocked by how quickly link rot and content drift come to dominate the scene. 50% of the content is lost after just one year, with more being lost each subsequent year. However, it’s worth noting that the loss rate is not maintained at 50%/year. If it was, the loss rate after two years would be 75% rather than 60%. This indicates there are some islands of stability, and that any broad ‘average lifetime’ for web resources is likely to be a little misleading.

We’ve also found that this relatively simple text fingerprint provides some useful insight. It does ignore a lot, and is perhaps overly sensitive to changes in the ‘furniture’ of a web site, but it’s useful and importantly, scalable.

There are a number of ways we might take this work forward, but I’m particularly interested in looking for migrated content. These fingerprints and hashes are in our full-text index, which means we can search for similar content that has moved from one URL to another, even if there was never any redirect between them. Studying content migration in this way would allow us to explore how popular content moves around the web.

I’d also like to extend the same sampling analysis in order to compare our archives with those of other institutions via the Memento protocol.

  26

Thank you, and are there any questions?

Addendum

If you’re interested in this work you can find:

Notes

  1. This technique has been used for many years in computer forensics applications, such as helping to identify ‘bad’ software, and here we adapt the approach in order to find similar web pages. 
  2. Or, in other words, very few of our archived URLs are cool

02 September 2015

2015 UK Domain Crawl has started


 

We are proud to announce that the 2015 UK Domain Crawl has started!

Over the next few weeks our web crawler will visit every website in the UK, downloading each one and keeping it safe on the British Library's archive servers.

[Image: robot icon]
https://commons.wikimedia.org/wiki/File%3ARobot_icon.svg By Bilboq (Own work) [Public domain], via Wikimedia Commons

Previous crawls

The first ever UK Domain crawl was run in 2013; it resulted in:

  • 3.8 million seeds (starting URLs)
  • 31TB data
  • 1.9 billion web pages and other assets

The 2014 crawl built on this experience and yielded:

  • 20 million seeds
  • Geo-IP check of UK-hosted websites (2.5 million seeds), illustrated in the sketch after this list
  • 56TB data
  • 2.5 billion webpages and other assets
  • including: 4.7GB of viruses and 3.2TB of screenshots
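The Geo-IP check mentioned above is what lets the crawl include sites that are hosted in the UK but don't sit on a .uk domain. A minimal sketch using the geoip2 library and a local country database (the database path and hostname are just examples):

    import socket
    import geoip2.database

    def hosted_in_uk(hostname, db_path="GeoLite2-Country.mmdb"):
        """Resolve a hostname and check whether its IP address geolocates to the UK."""
        ip = socket.gethostbyname(hostname)
        with geoip2.database.Reader(db_path) as reader:
            return reader.country(ip).country.iso_code == "GB"

    print(hosted_in_uk("www.bl.uk"))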

Guesswork

What will the 2015 crawl be like? Will we find more URLs? Surely the web grows every day, but by how much? Will there be more data? Will we have more virus content?

Tweet your suggestions and thoughts about the UK Domain crawl to @UKWebArchive or use the hashtag #UKWebCrawl2015.

 

Crawl Log Flypast © Andy Jackson