UK Web Archive blog

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

28 January 2015

Spam as a very ephemeral (and annoying) genre…


Spam is a part of modern life. Anyone who hasn’t received any recently is a lucky person indeed. But just try putting your email address out there in the open and you’ll be blessed with endless messages you don’t want, from people you don’t know, from places you’ve never heard of! And then it’s delete, de-le-te, block sender…

Imagine, though, someone researching our web lives in, say, 50 years’ time and finding that this part of our daily existence is nowhere to be found. Spam is the ugly sister of the web archive: we are unlikely to keep spam messages in our inboxes, and almost certainly no institution will keep them for posterity. And yet they are such great research material. They vary in topic, they can be funny, they can be dangerous (especially to your wallet), and they make you shake your head in disbelief…

We all know the spam emails about people who got stuck somewhere, can’t pay the bill and ask for a modest sum of £2,500 or so. These always make me think: if I had a spare £2,500, it’d be Bora Bora here I come, but that’s just selfish me! Now these have been taken to a new level. It’s about giving us money that is inconveniently sitting in a bank somewhere far, far away:

Charity spree

From Mrs A.J., a widow of a Kuwait embassy worker in Ivory Coast with a very English surname:

…Currently, this money is still in the bank. Recently, my doctor told me I would not last for the next eight months due to cancer problem. What disturbs me most is my stroke sickness. Having known my condition I decided to donate this fund to a charity or the man or woman who will utilize this money the way I am going to instruct here godly.

Strangely, two weeks later a Libyan lady, also a widow, wrote to me that she too had suffered a stroke, and all she wants is to shower me with money as part of her charity spree:

Having donated to several individuals and charity organization from our savings, I have decided to anonymously donate the last of our family savings to you. Irrespective of your previous financial status, please do accept this kind and peaceful offer on behalf of my beloved family.

[Image: spam]


Mr. P. N., ‘an accountant with the ministry of Energy and natural resources South Africa’, was straight to the point:

… presently we discovered the sum of 8.6 million British pounds sterling, floating in our suspense Account. This money as a matter of fact was an over invoiced Contract payment which has been approved for payment Since 2006, now we want to secretly transfer This money out for our personal use into an overseas Account if you will allow us to use your account to Receive this fund, we shall give you 30% for all your Effort and expenses you will incure if you agree to Help.

My favourite is quite light-hearted. I got it from a 32-year-old Swedish girl:

My aim of writing you is for us to be friends, a distance friend and from there we can take it to the next level, I writing this with the purest of heart and I do hope that it will your attention. In terms of what I seek in a relationship, I'd like to find a balance of independence and true intimacy, two separate minds and identities forged by trust and open communication. If any of this strikes your fancy, do let me know...

So what if I’m a girl too, with a husband and a kid? You never know what may come in handy…

Blog post by Dorota Walker 
Assistant Web Archivist

@DorotaWalker 

 

Further reading: Spam emails received by [email protected]. Please note that the quotations are taken directly from the emails; I have left the original spelling intact.

11 November 2014

Collecting First World War Websites – November 2014 update


Earlier in 2014 we blogged about the new Special Collection of websites related to World War One that we’ve put together to mark the Centenary. As today is Armistice Day, commemorating the cessation of hostilities on the Western Front, it seems fitting to see what we have collected so far.


The collection has been growing steadily over the past few months and now totals 111 websites. A significant subset of the WW1 special collection comes from the output of projects funded by the Heritage Lottery Fund. The collection also includes websites selected by subject specialists at the British Library and nominations from members of the public.

A wide variety of websites have been archived so far which can broadly be categorised into a few different types:

Critical reflections
These include critical reflections on British involvement in armed conflict more generally: for example, the Arming All Sides website, which features a discussion of the arms trade around WW1, and Naval-History.net, an invaluable academic resource on the history of naval conflict in the First and Second World Wars.

Artistic and literary
The First World War inspired a wealth of artistic and literary output. One example is the website dedicated to Eugene Burnand (1850-1921), a Swiss artist who created a series of pencil and pastel portraits depicting the various ‘military types’ of all races and nationalities drawn into the conflict on all sides. Burnand was a man of great humanity, and his subjects included ordinary men and women who served in the War as well as those of more significant military rank.

The collection also includes websites of contemporary artists who, in connection with the Centenary, are creating work reflecting on the history of the conflict. One such artist is Dawn Cole, whose WW1 work has focused on the archived diaries of VAD nurse Clarice Spratling, creating a project of live performance, spoken word and art installations.

Similar creative reflections from the world of theatre, film and radio can be seen in the archive. See, for example, Med Theatre: Dartmoor in WW1, an eighteen-month project investigating the effect the First World War had on Dartmoor and its communities. Pals for Life is a project based in the north-west that aims to create short films enabling local communities to learn about World War One. Subterranean Sepoys is a radio play resulting from the work of volunteers researching the forgotten stories of Indian soldiers and their British officers in the trenches of the Western Front in the first year of the Great War.

Community stories
The largest group of websites archived so far comprises projects produced by individuals or local groups telling stories of the War at a community level across the UK. The Bottesford Parish 1st World War Centenary Project focusses on 220 local recruits who served in the War, using wartime biographies, memorabilia and memories still held in the community to tell their stories.

The Wylye Valley 1914 project was set up by a Wiltshire-based local history group researching the Great War and the sudden, dramatic social and practical effects it had on the local population. In 1914, 24,000 troops descended on the Wylye Valley villages, the largest of which had a population of 500, in response to Kitchener’s appeals for recruits. These men arrived without uniforms, accommodation or any experience of organisation. The project explores the effects of the War on these men and its impact on the local communities.

An important outcome of the commemoration of the Centenary of WW1 has been the restoration and transcription of war memorials across the UK. Many local projects have used the opportunity to tell the stories of those who were lost in the conflict. Examples include the Dover War Memorial Project; the Flintshire War Memorials Project; the Leicester City, County and Rutland War Memorials project; and the St. James Toxteth War memorials project.

Collecting continues
This shows just some of the many ways people are choosing to commemorate the First World War and demonstrates the continued fascination with it.

We will continue collecting First World War websites through the Centenary period to 2018 and beyond. If you own a website or know of a website about WW1 and would like to nominate it for archiving then we would love to hear from you. Please submit the details on our nominate form.

By Nicola Bingham, Web Archivist, The British Library

03 November 2014

Powering the UK Web Archive search with Solr


When there are hundreds of millions of webpages to search, what technologies do we use at the UK Web Archive to ensure the best possible service?

Solr
At the core of the UK Web Archive is the open source tool Apache Solr. To quote from their own website, ‘Solr is a very popular open source enterprise search platform that provides full-text and faceted searching’.

It is built on scalable and fault-tolerant technologies, providing distributed indexing, automated failover and recovery, and centralised configuration management, among much else. Put simply, Solr is pushing forward on all aspects of big-data search indexing and querying.

Open UK Web Archive
The UKWA website provides public access to more than 200 million selected UK webpages. (The selection process includes gaining permission from the website owner to publish the archived site, and you can nominate a website for archiving via our Nominate a Site page.)

Once a site is harvested it is stored internally on several systems to ensure the safekeeping of the data. From these stores the data is ingested into the Solr service, which analyses the metadata and content, primarily to enable fast querying of the service. Much of Solr’s speed comes from the way it indexes this data, known as inverted (or reverse) indexing.
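To illustrate the idea (this is a toy sketch, not Solr’s actual implementation, which is far more sophisticated), an inverted index maps each term to the documents containing it, so a query only has to look at the documents that share its terms rather than scanning the whole collection:

```python
from collections import defaultdict

# A toy corpus of "archived pages": document id -> extracted text.
docs = {
    1: "uk web archive blog",
    2: "web archiving and legal deposit",
    3: "solr powers the uk web archive search",
}

# Build the inverted index: term -> set of document ids containing that term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    """Return ids of documents containing all of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

print(search("web", "archive"))  # -> {1, 3}
```

Because a query consults only the postings for its terms, lookups stay fast even as the collection grows into the hundreds of millions of pages.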

Capable servers
To support these archived websites and provide the UK Web Archive search, we run the service on two dual Xeon servers – an HP ProLiant DL580 G5 with 96GB of RAM and an HP ProLiant DL380 G5 with 64GB of RAM. The data is stored on a Storage Area Network (SAN) using fibre channel connections.


The Solr service itself runs under Apache Tomcat, a Java servlet container, and is split between the two physical servers in a master and slave setup: one provides the data ingest mechanism, the other provides the data querying mechanism for the public website.

Scalability
One of the benefits of using Apache Solr is that it is fairly simple to grow a system, in terms of both speed and data capacity. As the amount of web content increases, we can add more hardware to handle the extra load as Solr is designed from the outset as a distributed service.

By Gil Hoggarth, Web Archiving Technical Services Engineer, The British Library

16 October 2014

What is still on the web after 10 years of archiving?


The UK Web Archive started archiving web content towards the end of 2004 (e.g. The Hutton Enquiry). If we want to look back at the (almost) ten years that have passed since then, can we find a way to see how much we’ve achieved? Are the URLs we’ve archived still available on the live web? Or are they long since gone? If those URLs are still working, is the content the same as it was? How has our archival sliver of the web changed?

Looking Back
One option would be to go through our archives and exhaustively examine every single URL, and work out what has happened to it. However, the Open UK Web Archive contains many millions of archived resources, and even just checking their basic status would be very time-consuming, never mind performing any kind of comparison of the content of those pages.

Fortunately, to get a good idea of what has happened, we don’t need to visit every single item. Our full-text index categorizes our holdings by, among other things, the year in which the item was crawled. We can therefore use this facet of the search index to randomly sample a number of URLs from each year the archive has been in operation, and use those to build up a picture that compares those holdings to the current web.

URLs by the Thousand
Our search system has built-in support for randomizing the order of the results, so a simple script that performs a faceted search was all that was needed to build up a list of one thousand URLs for each year. A second script was used to attempt to re-download each of those URLs, and record the outcome of that process. Those results were then aggregated into an overall table showing how many URLs fell into each different class of outcome, versus crawl date, as shown below:

[Chart 1: What have we saved? – outcome of re-downloading the sampled URLs, by crawl quarter]

Here, ‘GONE’ means that not only is the URL missing, but the host that originally served that URL has disappeared from the web. ‘ERROR’, on the other hand, means that a server still responded to our request, but that our once-valid URL now causes the server to fail.

The next class, ‘MISSING’, ably illustrates the fate of the majority of our archived content - the server is there, and responds, but no longer recognizes that URL. Those early URLs have become 404 Not Found (either directly, or via redirects). The remaining two classes show URLs that end with a valid HTTP 200 OK response, either via redirects (‘MOVED’) or directly (‘OK’).
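As a rough illustration of how the second script might assign these classes (a minimal sketch using Python’s requests library; the exact rules we used differ in detail):

```python
import requests

def classify(url, timeout=30):
    """Classify a previously archived URL against the live web."""
    try:
        response = requests.get(url, allow_redirects=True, timeout=timeout)
    except requests.exceptions.ConnectionError:
        return "GONE"      # the original host no longer resolves or accepts connections
    except requests.exceptions.RequestException:
        return "ERROR"     # timeouts and other request failures
    if response.status_code >= 500:
        return "ERROR"     # a server responded, but the once-valid URL now causes it to fail
    if response.status_code == 404:
        return "MISSING"   # the host is alive but no longer recognises the URL
    if response.status_code == 200:
        return "MOVED" if response.history else "OK"   # 200 via redirects vs. directly
    return "MISSING"       # treat other client errors as no longer available

print(classify("http://www.bl.uk/"))
```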

The horizontal axis shows the results over time, since late 2004, broken down by each quarter (i.e. 2004-4 is the fourth quarter of 2004). The overall trend clearly shows how the items we have archived have disappeared from the web, with individual URLs being forgotten as time passes. This is in contrast to the fairly stable baseline of ‘GONE’ web hosts, which reflects our policy of removing dead sites from the crawl schedules promptly.

Is OK okay?
However, so far, this only tells us what URLs are still active - the content of those resources could have changed completely. To explore this issue, we have to dig a little deeper by downloading the content and trying to compare what’s inside.

This is very hard to do in a way that is both automated and highly accurate, simply because there are currently no reliable methods for automatically determining when two resources carry the same meaning, despite being written in different words. So, we have to settle for something that is less accurate, but that can be done automatically.

The easy case is when the content is exactly the same – we can just record that the resources are identical at the binary level. If not, we extract whatever text we can from the archived and live URLs, and compare them to see how much the text has changed. To do this, we compute a fingerprint from the text contained in each resource, and then compare those to determine how similar the resources are. This technique has been used for many years in computer forensics applications, such as helping to identify ‘bad’ software, and here we adapt the approach in order to find similar web pages.

Specifically, we generate ssdeep ‘fuzzy hash’ fingerprints, and compare them in order to determine the degree of overlap in the textual content of the items. If the algorithm is able to find any similarity at all, we record the result as ‘SIMILAR’. Otherwise, we record that the items are ‘DISSIMILAR’.
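The comparison itself is only a few lines; a minimal sketch, assuming the Python bindings for ssdeep are installed and the text has already been extracted from both versions:

```python
import ssdeep

def similarity_class(archived_text, live_text):
    """Compare two extracted text versions using ssdeep fuzzy hashes."""
    archived_hash = ssdeep.hash(archived_text)
    live_hash = ssdeep.hash(live_text)
    # compare() returns a 0-100 match score; fuzzy hashing works best on longer texts.
    score = ssdeep.compare(archived_hash, live_hash)
    return "SIMILAR" if score > 0 else "DISSIMILAR"
```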

Processing all of the ‘MOVED’ or ‘OK’ results in this way leads to this graph:

[Chart 2: What have we saved? – similarity of the ‘OK’ and ‘MOVED’ URLs over time]

So, for all those ‘OK’ or ‘MOVED’ URLs, the vast majority appear to have changed. Very few are binary identical (‘SAME’), and while many of the others remain ‘SIMILAR’ at first, that fraction tails off as we go back in time.

Summarising Similarity
Combining the similarity data with the original graph, we can replace the ‘OK’ and ‘MOVED’ parts of the graph with the similarity results in order to see those trends in context:

[Chart 3: What have we saved? – outcome classes combined with the similarity results]

Shown in this way, it is clear that very few archived resources are still available, unchanged, on the current web. Or, in other words, very few of our archived URLs are cool.

Local Vs Global Trends
While this analysis helps us understand the trends and value of our open archive, it’s not yet clear how much it tells us about other collections, or global trends. Historically, the UK Web Archive has focused on high-status sites and sites known to be at risk, and these selection criteria are likely to affect the overall trends. In particular, the very rapid loss of content observed here is likely due to the fact that so many of the sites we archive were known to be ‘at risk’ (such as the sites lost during the 2012 NHS reforms). We can partially address this by running the same kind of analysis over our broader, domain-scale collections. However, that would still bias things towards the UK, and it would be interesting to understand how these trends might differ across countries, and globally.

By Andy Jackson, Web Archiving Technical Lead, The British Library

07 October 2014

Thoughts on website selecting for the UK Web Archive


Hedley Sutton, Asian & African Studies Reference Team Leader at The British Library gives his thoughts and experiences of web archiving.

A Reference Team Leader spends most of their day answering queries sent in by e-mail, fax and letter, or manning Reading Room enquiry desks. Some, however, also contribute to the selection of sites for inclusion in the UK Web Archive.

The rise of digital
Digital content is of course increasingly important for researchers, and is certain to become ever more so as publishers slowly move away from print to online formats. The Library recognized this when it began to archive websites in 2004, aiming to harvest a segment of the vast national web domain and to provide free access both to live sites and to snapshots of existing and defunct sites as they developed over time.

Those which have been fully ‘accessioned’, as it were, are available to view online, and can be found alphabetically by title, or subject/keyword, or in some cases grouped in themed collections such as the 2012 London Olympics or the ‘Credit crunch’. 

Websites of interest
I volunteered to become a selector in 2008, planning initially to concentrate on tracing websites within my own specialism of Asian and African studies. I soon discovered, however, that it was more rewarding (addictive, even) to look beyond conventional subject divisions to home in on all and anything that looked of potential interest to present and future users of the archive.

Worthy, unusual and not-quite-believe-it
Over the years this has ranged from the worthy (such as the UK Web Designers’ Association and the Centre for Freudian Analysis and Research), through the unusual (step forward the Federation of Holistic Therapists, the Fellowship of Christian Magicians, and the Society for the Assistance of Ladies in Reduced Circumstances), to the I-see-it-but-do-not-quite-believe-it (yes, I mean you, British Leafy Salads Association; no, don’t try and run away, Ferret Education and Research Trust; all power to you, Campaign Against Living Miserably). Being paid to spend part of your time surfing the web – what’s not to like?

Permission required
The only mildly disappointing aspect of selecting websites is the fact that at present only about 20% of recommended sites actually make it into the Open UK Web Archive. The explanation is simple – the Library requires formal permission from website owners before it can ingest and display their sites.

This is offset in part by the amendment to the Legal Deposit legislation that (since 2013) has allowed The British Library to archive all UK websites. These, however, can only be viewed in the Reading Rooms of the UK Legal Deposit Libraries.

If you know of a website that you feel should be in the Open UK Web Archive, please nominate it.


By Hedley Sutton - Asian & African Studies Reference Team Leader, The British Library

15 September 2014

Dot Scot: A new domain identity


As all thoughts turn to Scotland and the Scottish Referendum, which is taking place on the 18th of September, it seems appropriate to highlight some recent developments in the digital sphere that will impact the Web Archiving Team over the coming months.

New top level domain (TLD) for Scotland
The Internet Corporation for Assigned Names and Numbers (ICANN) has released a suite of new top level domains (TLDs) this year. One of these is .scot (live since 15 July 2014), allowing organisations and individuals to create websites and email addresses identifying themselves as Scottish. The new TLD follows a near decade-long campaign by the Dot Scot Registry, a not-for-profit company created to apply for and operate the .scot domain as an online identity for Scots worldwide.

Pioneers
.scot is a community domain, meaning anyone can apply for it; however, for the first 60 days it was only available to launch ‘pioneers’, a cross-section of organisations based in Scotland or part of the Scottish diaspora community. The first pioneer website to go live, on 15th July, was calico.scot, a Highlands-based Internet Service Provider offering .scot domain registrations. Over 50 pioneers have signed up, including the Scottish Government, the Scouts in Scotland, Yes Scotland and Better Together.

Scotland beyond Britain
Individuals and groups outside Scotland have also taken advantage of the new domain, with the Louisiana Scots and the Clan Wallace among the first international organisations to launch websites ahead of the general launch on 23rd September.

New-borns get a domain name
The Dot Scot Registry has come up with a novel way to publicise the .scot TLD: reserving a free domain name for any baby born in Scotland on 15 July. In a press release on their website, the organisation said: ‘It’s taken nine years to get to this point – and we want to celebrate this “birth” in as many ways as possible. So, if you know someone who had a baby in Scotland on 15 July 2014, contact our press team, and we’ll secure their .scot for them … It’s our little way of saying “welcome to the world and the digital future for Scotland.”’.

In the archive
The UK Web Archiving Team are already collecting .scot websites as part of our annual domain crawl along with .london websites, another of the TLDs released by ICANN this year. A phased release of the .cymru and .wales TLDs was launched this month by the UK internet registry, Nominet, with general availability due in March 2015. These websites will also be picked up by the British Library’s annual domain crawl.

Short lived?
One final point: .scot might be superseded if the referendum on independence succeeds and Scotland leaves the United Kingdom, as it would then get its own two-letter country-code TLD. Let’s see…

Nicola Bingham, Web Archivist, The British Library

27 August 2014

User driven digital preservation with Interject


When we archive the web, we want to do our best to ensure that future generations will be able to access the content we have preserved. This isn’t only a matter of keeping the digital files safe and ensuring they don’t get damaged. We also need to worry about the software that is required in order to access those resources.

A Digital Dark Age?
For many years, the spectre of obsolescence and the ensuing digital dark age drove much of the research into digital preservation. However, even back in 2006, Chris Rusbridge was arguing that this concern was overblown, and since at least 2007, David Rosenthal has been arguing that this kind of obsolescence is no longer credible.

What are the risks?
The current consensus among those who care for content seems to have largely (but not universally) shifted away from perceiving obsolescence as the main risk we face. Many of us expect the vast majority of our content to remain accessible for at least one or two decades, and any attempt to predict the future beyond the next twenty years should be taken with a large pinch of salt. In the meantime, we are likely to face much more basic issues concerning the economics of storage, and concerning the need to adopt scalable collection management techniques to ensure the content we have remains safe, discoverable, and is accompanied by the contextual information it depends upon.

This is not to say obsolescence is no risk at all, but rather that the scale of the problem is uncertain. Therefore, in order to know how best to take care of our digital history, we need to find ways of gathering more evidence about this issue.

Understanding our collections
One aspect of this is to analyse the content we have, and try to understand how it has changed over time. Examples of this kind of work include our paper on Formats Over Time (more here), and more recent work on embedding this kind of preservation analysis in our full-text indexing processes so we can explore these issues more readily.

But studying the content can only tell you half the story - for the other half, we need to work out what we mean by obsolescence.

Understanding obsolescence
If there is an open source implementation of a given format, then we might say that format cannot be obsolete. But if 99.9% of the visitors to the web archive are not aware of that fact (and even if they were, would not be able to compile and build the software in order to access the content), is that really an accurate statement? If the members of our designated community can’t open it, then it’s obsolete, whether or not someone somewhere has all the time, skills and resources needed to make it work.

Obsolescence is a user experience problem that ends with frustration. So how can we better understand the needs and capabilities of our users, to enable them to help drive the digital preservation process?

How Interject can help
To this end, and working with the SCAPE Project, we have built a prototype service that is designed to help us find the content that users are having difficulties with, and where possible, to provide alternative ways of accessing that content. This prototype service, called Interject, demonstrates how a mixture of feedback systems and preservation actions can be smoothly integrated into the search infrastructure of the UK Web Archive, by acting as an ‘access helper’ for end users.

ZX Spectrum Software
For example, if you go to our historical search prototype and look for a specific file called ‘lostcave.z80’ you’ll see the Internet Archive has a number of copies of this old ZX Spectrum game but, unless you have an emulator to hand, you won’t be able to use them. However, if you click ‘Use our access helper’, the Interject service will inspect the resource, summarise what we understand about it, and where possible offer transformed versions of that resource. In the case of ‘lostcave.z80’, this includes a full browser-based emulation so that you can actually play the game yourself. (Note that this example was partially inspired by the excellent work on browser-based emulated access being carried out by the Internet Archive).

The Interject service can offer a range of transformation options for a given format. For example, instead of running the emulator in your browser, the service can spin up an emulator in the background, take a screenshot, and then deliver that image back to you, like this:

[Screenshot: the ‘lostcave.z80’ ZX Spectrum game rendered via Interject]

These simple screenshots are not quite as impressive as the multi-frame GIFs created by Jason Scott’s Screen Shotgun, but they do illustrate the potential of a simple web API that transforms content on demand.

Early image formats
As the available development time was relatively short, we were only able to add support for a few ‘difficult’ formats. For example, the X BitMap image format was the first image format used on the web. Despite this early and important role, this format and the related X PixMap format (for colour images) are not widely supported today, and so may require format conversion to enable access. Fortunately, there are a number of open source projects that support these formats, and Interject makes them easy to use. See, for example, image.xbm, xterm-linux.xpm and the embedded equation image shown below as a more modern PNG:

[Image: an equation originally published as an X BitMap, displayed as a PNG]

VRML
We also added support for VRML1 and VRML97, two early web-based formats for 3D environments that required a browser plugin to explore. Those plugins are not available for modern browsers, and the formats have been superseded by the X3D format. Unfortunately these formats are not backward compatible with each other, and tool support for VRML1 is quite limited. However, we were able to find suitable tools for all three formats and, using Interject, we can take a VRML1 file, chain two format conversions (VRML1-to-VRML97 and VRML97-to-X3D), and pass the result to a browser-based X3D renderer, like this.
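The chaining itself is straightforward; roughly, it could look like the Python sketch below, where vrml1-to-vrml97 and vrml97-to-x3d are placeholder names for whatever command-line converters are available (they are not the actual tools Interject wraps):

```python
import os
import subprocess
import tempfile

def vrml1_to_x3d(vrml1_path, x3d_path):
    """Chain two format conversions: VRML1 -> VRML97 -> X3D (placeholder converter names)."""
    fd, vrml97_path = tempfile.mkstemp(suffix=".wrl")
    os.close(fd)
    try:
        # Hypothetical converter commands; substitute the real tool invocations here.
        subprocess.run(["vrml1-to-vrml97", vrml1_path, vrml97_path], check=True)
        subprocess.run(["vrml97-to-x3d", vrml97_path, x3d_path], check=True)
    finally:
        os.remove(vrml97_path)
    return x3d_path  # ready to hand to a browser-based X3D renderer

# vrml1_to_x3d("model.wrl", "model.x3d")
```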

The future of Interject
Each format we decide to support adds an additional development and maintenance burden, and so it is not clear how sustainable this approach will be in the long term. This is one of the reasons why Interject is open source, and we would be very happy to receive ideas or contributions from other individuals and organisations.

Letting users lead the way
But even with a limited number of transformation services, the core of this idea is to find ways to listen to our users, so we have some chance of finding out what content is ‘obsolete’ to them. By listening when they ask for help, and by allowing our visitors to decide between the available options, the real needs of our designated communities can be expressed directly to us and so taken into account as part of the preservation planning process.

By Andy Jackson, Web Archiving Technical Lead, The British Library

15 August 2014

Archiving ‘screenshots’


Since the passing of Legal Deposit legislation in April 2013, the UK Web Archive has been generating screenshots of the front page of each visited website. The manner in which we chose to store these has changed over the course of our activities, from simple JPEG files on disk (a silly idea) to HAR-based records stored in WARC metadata records (hopefully a less silly idea). What follows is our reasoning behind these choices.

What not to do
When Legal Deposit legislation passed in April 2013, we were experimenting with the asynchronous rendering of web pages in a headless browser (specifically PhantomJS) to avoid some of the limitations we’d encountered with our crawler software. While doing so, it occurred to us that, if we were already taking the time to render a page in a browser, why not generate and store screenshots?

As we had no particular use case in mind, they were simply stored in a flat file system, the filename being the epoch time at which they were rendered, e.g.:

13699895733441.jpg

There’s an obvious flaw here: the complete lack of any detail as to the provenance of the image. Unfortunately that wasn’t quite obvious enough, and after approximately 15 weeks and 1,118,904 screenshots we changed the naming scheme to something more useful, e.g.:

http%3A%2F%2Fwww.bl.uk%2F_1365759028.jpg

The above now includes the encoded URL and the epoch timestamp. This was changed one final time, replacing the epoch timestamp with a human-readable version, e.g.:

http%3A%2F%2Fwww.bl.uk%2F_20130604121723.jpg
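For illustration, the scheme is easy to reproduce; a sketch in Python (not necessarily the language of our own tooling) that percent-encodes the URL and appends a human-readable UTC timestamp:

```python
from datetime import datetime
from urllib.parse import quote

def screenshot_filename(url, rendered_at=None):
    """Build '<percent-encoded-url>_<YYYYMMDDhhmmss>.jpg' for a rendered front page."""
    rendered_at = rendered_at or datetime.utcnow()
    return "{}_{}.jpg".format(quote(url, safe=""), rendered_at.strftime("%Y%m%d%H%M%S"))

print(screenshot_filename("http://www.bl.uk/", datetime(2013, 6, 4, 12, 17, 23)))
# -> http%3A%2F%2Fwww.bl.uk%2F_20130604121723.jpg
```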

Despite the more sensible naming convention, we were still left with a large number of files sitting in a directory on disk which could not be stored alongside our crawled content and, as a consequence, could not be accessed through the normal channels.

A bit better?
A simple solution was to store the images in WARC files (ISO 28500, the current de facto storage format for web archives). We could then permanently archive them alongside our regular web content and access them in a similar fashion. However, the WARC format is designed specifically for storing web content, i.e. a resource referenced by a URL. Our screenshots, unfortunately, didn’t really fit this pattern: each represented the WebKit rendering of a web page, which is likely to be the result of any number of web resources, not just the single HTML page referenced by the site’s root URL. We therefore did what anyone does when faced with a conundrum of sufficient brevity: we took to Twitter.

The WARC format contains several types of record which we potentially could have used: response, resource, conversion and metadata. The resource and response record types are intended to store the “output of a URI-addressable service” (thanks, @gojomo). Although our screenshots are actually rendered using a webservice, the URI used to generate them would not necessarily correspond to that of the original site, thus losing the relationship between the site and the image. We could have used a further metadata record to tie the two together but others (thanks to @tef and @junklight) had already suggested that a metadata record might be the best location for the screenshots themselves. Bowing to conventional wisdom, we opted for this latter method.

WARC-type: metadata
As mentioned earlier, screenshots are actually rendered using a PhantomJS-based webservice (the same webservice used to extract links from the rendered page). It is worth noting at this point that, when rendering a page and outputting details of all the requests and responses made and received during that rendering, PhantomJS by default returns this information in HTTP Archive (HAR) format. Although not a formal standard, HAR has become the de facto method for communicating HTTP transaction information (most browsers will export data about a page in this format).

The HAR format permits information to be communicated not only about the page, but about each of the component resources used to create that page (i.e. the stylesheets, JavaScript, images, etc.). As part of each response entry (each pertaining to one of the aforementioned resources), the HAR format allows you to store the actual content of that response in a “text” field, e.g.:

[Example: a HAR response entry with its content stored in the “text” field]

Unfortunately, this is where the HAR format doesn't quite meet our needs. Rather than storing the content of a particular resource, we need to store content for the rendered page, which is potentially the result of several resources and thus not suited to a response record.

Thankfully though, the HAR format does permit you to record information at a page level:

[Example: a HAR page-level entry]

Better still…
Given the above, we decided to leverage the HAR format by combining elements from the content section of the response with those of the page. There were two things we realised we could store:

1. As initially desired, an image of the rendered page.
2. The final DOM of the rendered page.

With this latter addition, it occurred to us that the final representation of an HTML page, thanks to client-side JavaScript altering the DOM, might differ from that which the server originally returned. As this final representation was the version PhantomJS used to render the screenshot, it made sense to attempt to record it too. In order to distinguish this final, rendered version from the content of the HAR’s corresponding response record, the name was amended to renderedContent.

Similarly, we stored the screenshot of the whole rendered page under a new element, renderedElements:

[Example: a HAR page entry including renderedContent and renderedElements]
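To give a flavour of the result, here is an approximate reconstruction of such a page entry, written as a Python dictionary mirroring the HAR JSON. The renderedContent and renderedElements elements and the “:root” selector are as described above; the field names inside each rendered element, and all the values, are illustrative assumptions rather than our precise schema:

```python
page_entry = {
    # Standard HAR page fields (values are illustrative).
    "startedDateTime": "2013-06-04T12:17:23.000+00:00",
    "id": "page_1",
    "title": "http://www.bl.uk/",
    "pageTimings": {"onLoad": 1234},
    # The final DOM after client-side JavaScript has run, base64-encoded.
    "renderedContent": {
        "text": "PGh0bWw+Li4uPC9odG1sPg==",
        "encoding": "base64",
    },
    # One entry per rendered image; ":root" denotes the whole-page screenshot.
    # Further entries could target specific elements, e.g. an embedded map's iframe.
    "renderedElements": [
        {
            "selector": ":root",
            "format": "JPEG",                # assumed field name
            "content": "/9j/4AAQSkZJRg...",  # truncated base64 image data
            "encoding": "base64",
        }
    ],
}
```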

‘renderedElements’?
Given that we were trying to store a single screenshot, storing it in an element which is, firstly, plural and, secondly, an array might seem a questionable choice. However, PhantomJS has one further ability that we decided to leverage for future use: the ability to render images of specific elements within a page.

In the course of a crawl there are some things we can't (yet) archive—videos, embedded maps, etc. However, if we can identify the presence of some non-archivable section of a page (an iframe which references a Google Map, for instance), we can at least render an image of it for later reference.
For this reason, the whole-page screenshot is simply referenced by its CSS selector (“:root”), enabling us to include a further series of images for any part of the page.

About those jpgs…
Currently, all screenshots are stored in the above format in WARCs, and will be ingested alongside regularly crawled data. The older set of JPEGs on disk was converted, using the URL and timestamp in each filename, to the above HAR-like format (obviously lacking the details of other resources and the final DOM, which had been lost) and added to a series of WARC files.
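For anyone wanting to do something similar today, a rough sketch using the Python warcio library shows how a legacy screenshot could be wrapped in a WARC metadata record with a simplified HAR-like payload (this is a present-day convenience, not the tooling we actually used):

```python
import base64
import json
from io import BytesIO

from warcio.warcwriter import WARCWriter

def write_screenshot_record(warc_path, url, warc_date, jpeg_bytes):
    """Append a WARC 'metadata' record holding a simplified HAR-like screenshot payload."""
    payload = json.dumps({
        "url": url,
        "renderedElements": [{
            "selector": ":root",
            "content": base64.b64encode(jpeg_bytes).decode("ascii"),
            "encoding": "base64",
        }],
    }).encode("utf-8")

    with open(warc_path, "ab") as output:
        writer = WARCWriter(output, gzip=True)
        record = writer.create_warc_record(
            url, "metadata",
            payload=BytesIO(payload),
            warc_content_type="application/json",
            warc_headers_dict={"WARC-Date": warc_date},
        )
        writer.write_record(record)

# write_screenshot_record("screenshots.warc.gz", "http://www.bl.uk/",
#                         "2013-06-04T12:17:23Z", open("shot.jpg", "rb").read())
```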

The very earliest set of images, those simply stored using the Epoch time, are regrettably left as an exercise for future workers in Digital Preservation.

by Roger G. Coram, Web Crawl Engineer, The British Library