THE BRITISH LIBRARY

UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber. Read more

04 May 2018

Star Wars in the Web Archive

May the fourth be with you!

It's Star Wars day and I imagine that you are curious to know which side has won the battle of the UK web space?

Using the trends feature in our SHINE dataset (.uk websites 1996-2013, collected by the Internet Archive), I first looked at the iconic match-up of Luke vs Darth.

[Image: SHINE trend graph, 'Luke Skywalker' vs 'Darth Vader']

Bad news: evil seems to have won this round, mainly due to the popularity of Darth Vader costume mentions on retail websites.

How about a more general 'Light Side' vs 'Dark Side'?

[Image: SHINE trend graph, 'light side' vs 'dark side']

It appears that discussing the 'dark side' of many aspects in life is a lot more fun and interesting than the 'light side'. 

How about just analysing the phrase 'may the force be with you'?

[Image: SHINE trend graph for 'may the force be with you']

This phrase doesn't seem to have been particularly popular on the UK web until it started to be used a lot on websites offering downloadable ringtones. Go figure.

Try using the trends feature on this dataset yourself here: www.webarchive.org.uk/shine/graph

Happy Star Wars day!

by Jason Webber, Web Archive Engagement Manager, The British Library

@UKWebArchive

 

01 February 2018

A New Playback Tool for the UK Web Archive

We are delighted to announce that the UK Web Archive will be working with Rhizome to build a version of pywb (Python Wayback) that we hope will greatly improve the quality of playback for access to our archived content.

What is playback of a web archive?

When we archive the web, just downloading the content is not enough. Data can be copied from the web into an archive in a variety of ways, but to make this archive actually accessible takes more than just opening downloaded files in a web browser. Technical details of pages and scripts coming out of the archive need to be presented in a way that enables them to work just like the originals, although they aren’t located on their actual servers anymore. Today’s web users have come to expect interactive features and dynamic layouts on all types of websites. Faithfully reproducing these behaviors in the archive has become an increasingly complex challenge, requiring web archive playback software that is on-par with the evolution of the web as a whole.

Why change?

Currently, we use the OpenWayback playback system, originally developed by the Internet Archive. But in more recent years, Rhizome have led the development of a new playback engine, called pywb (Python Wayback). This Python toolkit for accessing web archives is part of the Webrecorder project, and provides a modern and powerful alternative implementation that is being run as an open source project. This has led to rapid adoption of pywb, as the toolkit is already being used by the Portuguese Web Archive, perma.cc, the UK National Archives, the UK Parliamentary Archive, and a number of others.

Open development
To meet our needs we need to modify pywb, but as strong believers in open source development, all work will be in the open, and wherever appropriate, we will fold the improvements back into the core pywb project.

If all goes to plan, we expect to contribute a number of improvements back to pywb for others to use.

Other UKWA-specific changes, like theming, implementing our Legal Deposit restrictions, and deployment support, will be maintained separately.

Initially we will work with Rhizome to ensure our staff and curators can access our archived material via both pywb and OpenWayback. If the new playback tool performs as expected, we will move towards using pywb to support public access to all our web archives.

23 January 2018

Archiving the UK Copyright Literacy blog

By Louise Ashton, Copyright & Licensing Executive, The British Library
Re-posted (with permission) from copyrightliteracy.org/

We were excited to discover recently that copyrightliteracy.org had been selected for inclusion in the UK Web Archive as an information resource of historical interest. However, even we faced some trepidation when considering the copyright implications of allowing archiving of the site (i.e. not everything on the site is our copyright). Firstly, this allowed us to get our house in order, contact our fellow contributors and ensure we had the correct re-use terms on the site (you can now see a CC-BY-SA licence in the footer of each web page). Secondly, this provided an opportunity for another guest blog post and we are delighted that Louise Ashton, who works in the Copyright & Licensing Department at The British Library, has written the following extremely illuminating post for us. In her current role Louise provides copyright support to staff and readers of the British Library, including providing training, advising on copyright issues in digitisation projects and answering copyright queries from members of the public on any of its 150 million collection items! Prior to this, Louise began her career in academic libraries, quickly specialising in academic liaison and learning technologist roles.

[Image: screenshot of the UK Web Archive Beta homepage]

When people think of web archiving their initial response usually focuses on the sheer scale of the challenge. However, another important issue to consider is copyright; copyright plays a significant role both in shaping web archives and in determining if and how they can be accessed. Most people in the UK Library and Information Science (LIS) sector are aware that in 2013 our legal deposit legislation was extended to include non-print materials which, as well as e-books and online journal articles, also covers websites, blogs and public social media content. This is known as the snappily titled ‘The Legal Deposit Libraries (Non-Print Works) Regulations 2013’ and is enabling the British Library and the UK’s five other legal deposit libraries to collect and preserve the nation’s online life. Indeed, given that the web will often be the only place where certain information is made available, the importance of archiving the online world is clear.

[Image: UK Web Archive poster © British Library Board]

What is less well known is that, unless site owners have given their consent, the Non-Print Legal Deposit Archive is only available within the reading rooms of the legal deposit libraries themselves and even then can only be accessed if using library PCs. Although this mirrors the terms for accessing print legal deposit, because of the very nature of the non-print legal deposit collection (i.e. websites that are generally freely available to anyone with an internet connection) people naturally expect to be able to access the collection off-site. The UK Web Archive offers a solution to this by curating a separate archive of UK websites that can be freely viewed and accessed online by anyone, anywhere, and with no need to travel to a physical reading room. The purpose of the UK Web Archive is to provide permanent online access to key UK websites with regular snapshots of the included websites being taken so that a website’s evolution can be tracked. There are no political agendas governing which sites are included in the UK Web Archive, the aim is simply to represent the UK’s online life as comprehensively and faithfully as possible (inclusion of a site does not imply endorsement).

However, a website will only be added to the (openly-accessible) UK Web Archive if the website owner’s permission has been obtained and if they are willing to sign a licence granting permission for their site to be included in the Archive and allowing for all versions of it to be made publicly accessible. Furthermore, the website owner also has to confirm that nothing in their site infringes the copyright or other intellectual property rights of any third party and, if their site does contain third party copyright, that they are authorised to give permission on the rights-holders’ behalf. Although the licence has been carefully created to be as user-friendly as possible, the presence of any formal legal documentation is often perceived as intimidating. So even if a website owner is confident that their use of third party content is legitimate they may be reluctant to formally sign a licence to this effect – seeing it in black and white somehow makes it more real! Or, despite best efforts, site owners may have been unable to locate the rights-holders of third party content used in their site and, although they may have been happy with their own risk assessments, this absence of consent prevents them from signing the licence to include the site in the UK Web Archive.

For other website owners this may be the first time they have thought about copyright. Fellow librarians will not be surprised to hear that some people are bewildered to learn that they may have needed to obtain permission to borrow content from elsewhere on the internet for use in their own sites! And then of course there are the inherent difficulties in tracking down rights-holders more generally; unless sites are produced by official bodies it can be difficult to identify who the primary site owners are, and in big organisations the request may never make it to the relevant person. Others may receive the open access request but, believing it to be spam, ignore it. And of course site owners are perfectly entitled to refuse the request if they do not wish to take part. Information literacy also plays its part: for sites where it is crucial that visitors access the most recent information and advice (for example, websites giving health advice), the site owners may, for obvious reasons, not wish for their site to be included.

The reason Jane and Chris asked me to write this blog post is because the UK Copyright Literacy website has been selected for potential inclusion in the UK Web Archive. It was felt important that the Archive should contain a site that documented and discussed copyright issues, given that copyright and online ethics are such big topics at the moment, particularly with the new General Data Protection Regulation coming into force next May. Another reason why the curators wanted to include the Copyright Literacy blog is that, because the website isn’t hosted in the UK and therefore does not have a UK top level domain (for example .uk or .scot), it had never been automatically archived as part of the annual domain crawl. This is an unfortunate point which affects many websites, as it means that many de facto UK sites are not captured unless manual intervention occurs. To try and minimise the number of UK websites that unwittingly evade inclusion, the UK Web Archive team therefore welcomes site nominations from members of the public. Consequently, if you would like to nominate a site to be added to the archive, and in doing so perhaps help to play a role in preserving UK websites, you can do so via https://www.webarchive.org.uk/ukwa/info/nominate.

As a final note, we are pleased to report that Jane and Chris have happily agreed to their site being included which is great news as it means present day copyright musings will be preserved for years to come!

22 December 2017

What can you find in the (Beta) UK Web Archive?

We recently launched a new Beta interface for searching and browsing our web archive collections but what can you expect to find?

UKWA have been collecting websites since 2005 in a range of different ways, using changing software and hardware. This means that behind the scenes we can't bring all of the material collected into one place all at once. What isn't there currently will be added over the next six months (look out for notices on Twitter).

What is available now?

At launch on 5 December 2017 the Beta website includes all of the websites that have been 'frequently crawled' (collected more often than annually) 2013-2016. This includes a large number of 'Special Collections' selected over this time and a reasonable selection of news media.

[Image: screenshot of Brexit-related content in the web archive]

What is coming soon?

We are aiming to add 'frequently crawled' websites from 2005-2013 to the Beta collection in January/February 2018. This will add our earliest special collections (e.g. 2005 UK General Election) and should complete all of the websites that we have permission to publicly display.

What will be available by summer 2018?

The largest and most difficult task for us is to add all of the websites that have been collected as part of the 'Legal Deposit' since 2013. We do a vast 'crawl' once per year of 'everything' that we can identify as being a UK website. This includes all .UK (and .london, .scot, .wales and .cymru) plus any website we identify as being on a server in the UK. This amounts to tens of millions of websites (and billions of individual assets). Due to the scale of this undertaking we thank you for your patience.
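The hostname part of that scope rule can be sketched as a simple check. The TLD list below comes from the paragraph above; the "server located in the UK" test, which a hostname check cannot capture, is deliberately omitted.

```python
from urllib.parse import urlparse

# UK-associated top-level domains named above. This is only the hostname
# half of the scope rule; the real crawl also includes sites identified
# as hosted on UK servers.
UK_TLDS = (".uk", ".london", ".scot", ".wales", ".cymru")

def in_uk_scope(url: str) -> bool:
    """Return True if the URL's hostname falls under a UK-associated TLD."""
    host = (urlparse(url).hostname or "").lower()
    return host.endswith(UK_TLDS)

print(in_uk_scope("https://www.bl.uk/"))    # True
print(in_uk_scope("https://example.com/"))  # False
```

Sites like copyrightliteracy.org (discussed in an earlier post) fail both halves of the test, which is exactly why manual nomination matters.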

We would love to know your experiences of using the new Beta service, let us know here: www.surveymonkey.co.uk/r/ukwasurvey01

By Jason Webber, Web Archive Engagement Manager, The British Library

 

05 December 2017

A New (Beta) Interface for the UK Web Archive

The UK Web Archive has a new user interface! Try it now: 

beta.webarchive.org.uk/

[Image: screenshot of the UK Web Archive Beta homepage]

What's new?

  • For the first time you can search both the 'Open UK Web Archive'1 and the 'Legal Deposit Web Archive'2 from the same search box
  • We have improved the search and have included faceting so that it's easier to find what you are looking for
  • A simple, clean design that (hopefully) allows the content to be the focus
  • Easily browsable 'Special Collections' (curated groups of websites on a theme, topic or event)

What next?

This is just the start of what will be a series of improvements, but we need your help! Please use the beta site and tell us how it went (good, bad or meh) by filling in this short 2 minute survey:

www.surveymonkey.co.uk/r/ukwasurvey01

Thank you!

by Jason Webber, Web Archive Engagement Manager, The British Library

1 The Open UK Web Archive was started in 2005 and comprises approximately 15,000 websites that can be viewed anywhere

2 The Legal Deposit Web Archive was started in 2013 and comprises millions of websites, but these can only be viewed in the Reading Rooms of UK Legal Deposit Libraries

10 November 2017

Driving Crawls With Web Annotations

By Dr Andrew Jackson, Web Archive Technical Lead, The British Library

The heart of the idea was simple. Rather than our traditional linear harvesting process, we would think in terms of annotating the live web, and imagine how we might use those annotations to drive the web-archiving process. From this perspective, each Target in the Web Curator Tool is really very similar to a bookmark on a social bookmarking service (like Pinboard, Diigo or Delicious1), except that as well as describing the web site, the annotations also drive the archiving of that site2.

In this unified model, some annotations may simply highlight a specific site or URL at some point in time, using descriptive metadata to help ensure important resources are made available to our users. Others might more explicitly drive the crawling process, by describing how often the site should be re-crawled, whether robots.txt should be obeyed, and so on. Crucially, where a particular website cannot be ruled as in-scope for UK legal deposit automatically, the annotations can be used to record any additional evidence that permits us to crawl the site. Any permissions we have sought in order to make an archived web site available under open access can also be recorded in much the same way.
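As an illustration, an annotation of this kind might be modelled roughly as follows. All field names here are hypothetical, chosen to mirror the roles described above, and are not the actual ACT schema.

```python
from dataclasses import dataclass

@dataclass
class TargetAnnotation:
    """Hypothetical sketch of an annotation that both describes a site
    and drives how it is archived (field names are illustrative only)."""
    url: str
    title: str = ""
    crawl_frequency: str = "MONTHLY"    # how often to re-crawl
    ignore_robots_txt: bool = False     # whether robots.txt is obeyed
    legal_deposit_evidence: str = ""    # evidence the site is in scope
    open_access_licence: bool = False   # permission to publish openly

target = TargetAnnotation(
    url="https://www.gov.uk/government/publications",
    title="gov.uk Publications",
)
print(target.crawl_frequency)  # MONTHLY
```

The point of the model is that descriptive fields (title), crawl-control fields (frequency, robots policy) and legal fields (scope evidence, licences) all live on the same record.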

Once we have crawled the URLs and sites of interest, we can then apply the same annotation model to the captured material. In particular, we can combine one or more targets with a selection of annotated snapshots to form a collection. These ‘instance annotations’ could be quite detailed, similar to those supported by web annotation services like Hypothes.is, and indeed this may provide a way for web archives to support and interoperate with services like that.3

Thinking in terms of annotations also makes it easier to peel processes apart from their results. For example, metadata that indicates whether we have passed those instances through a QA process can be recorded as annotations on our archived web, but the actual QA process itself can be done entirely outside of the tool that records the annotations.

To test out this approach, we built a prototype Annotation & Curation Tool (ACT) based on Drupal. Drupal makes it easy to create web UIs for custom content types, and we were able to create a simple, usable interface very quickly. This allowed curators to register URLs and specify the additional metadata we needed, including the crawl permissions, schedules and frequencies. But how do we use this to drive the crawl?

Our solution was to configure Drupal so that it provided a ‘crawl feed’ in a machine readable format. This was initially a simple list of data objects (one per Target), containing all the information we held about that Target, and where the list could be filtered by crawl frequency (daily, weekly, monthly, and so on). However, as the number of entries in the system grew, having the entire set of data associated with each Target eventually became unmanageable. This led to a simplified description that just contains the information we need to run a crawl, which looks something like this:

[
    {
        "id": 1,
        "title": "gov.uk Publications",
        "seeds": [
            "https://www.gov.uk/government/publications"
        ],
        "schedules": [
            {
                "frequency": "MONTHLY",
                "startDate": 1438246800000,
                "endDate": null
            }
        ],
        "scope": "root",
        "depth": "DEEP",
        "ignoreRobotsTxt": false,
        "documentUrlScheme": null,
        "loginPageUrl": null,
        "secretId": null,
        "logoutUrl": null,
        "watched": false
    },
    ...
]



This simple data export became the first of our web archiving APIs – a set of application programming interfaces we use to try to split large services into modular components4.
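As a sketch of how a downstream component might consume such a feed, the snippet below parses a feed shaped like the sample above and collects the seed URLs for one schedule frequency. The feed is inlined here as a string; in practice it would be fetched over HTTP from the curation tool.

```python
import json

# A minimal crawl feed, shaped like the sample shown in the text.
feed_json = """
[
    {
        "id": 1,
        "title": "gov.uk Publications",
        "seeds": ["https://www.gov.uk/government/publications"],
        "schedules": [
            {"frequency": "MONTHLY", "startDate": 1438246800000, "endDate": null}
        ]
    }
]
"""

def seeds_for_frequency(feed: str, frequency: str) -> list:
    """Collect seed URLs from every Target with a matching schedule."""
    targets = json.loads(feed)
    return [seed
            for target in targets
            if any(s["frequency"] == frequency for s in target["schedules"])
            for seed in target["seeds"]]

print(seeds_for_frequency(feed_json, "MONTHLY"))
```

A crawler engine only needs this much of the feed to know what to fetch and when, which is what keeps the curation tool and the crawler decoupled.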

Of course, the output of the crawl engines also needs to meet some kind of standard so that the downstream indexing, ingesting and access tools know what to do. This works much like the API concept described above, but is even simpler, as we just rely on standard file formats in a fixed directory layout. Any crawler can be used as long as it outputs standard WARCs and logs, and puts them into the following directory layout:

/output/logs/{job-identifier}/{launch-timestamp}/*.log
/output/warcs/{job-identifier}/{launch-timestamp}/*.warc.gz

Here, the {job-identifier} specifies which crawl job (and hence which crawl configuration) is being used, and the {launch-timestamp} separates distinct jobs launched using the same overall configuration, reflecting repeated re-crawling of the same sites over time.
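The layout can be generated (or parsed) trivially, which is the point. A sketch, where the exact timestamp format is an assumption for illustration:

```python
from datetime import datetime, timezone

def output_dirs(job_id: str, launched: datetime) -> tuple:
    """Build the fixed logs/WARCs output directories for one crawl launch.

    Because the layout is fixed, downstream indexing and ingest tools can
    locate a launch's files from the job identifier and launch time alone,
    whichever crawler engine produced them.
    """
    ts = launched.strftime("%Y%m%d%H%M%S")  # assumed timestamp format
    return (f"/output/logs/{job_id}/{ts}", f"/output/warcs/{job_id}/{ts}")

logs_dir, warcs_dir = output_dirs(
    "weekly", datetime(2017, 11, 10, 9, 0, tzinfo=timezone.utc))
print(logs_dir)
print(warcs_dir)
```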

In other words, if we have two different crawler engines that can be driven by the same crawl feed data and output the same format results, we can switch between them easily. Similarly, we can make any kind of changes to our Annotation & Curation Tool, or even replace it entirely, and as long as it generates the same crawl feed data, the crawler engine doesn’t have to care. Finally, as we’ve also standardised the crawler output, the tools we use to post-process our crawl data can also be independent of the specific crawl engine in use.

This separation of components has been crucial to our recent progress. By de-coupling the different processes within the crawl lifecycle, each of the individual parts is able to move at its own pace. Each can be modified, tested and rolled out without affecting the others, if we so choose. True, making large changes that affect multiple components does require more careful management of the development process, but this is a small price to pay for the ease with which we can roll out improvements and bugfixes to individual components.

A prime example of this is how our Heritrix crawl engine itself has evolved over time, and that will be the subject of the next blog post.

  1. Although, noting that Delicious is now owned by Pinboard, I would like to make it clear that we are not attempting to compete with Pinboard. 

  2. Note that this is also a feature of some bookmarking sites. But we are not attempting to compete with Pinboard. 

  3. I’m not yet sure how this might work, but some combination of the Open Annotation Specification and Memento might be a good starting point. 

  4. For more information, see the Architecture section of this follow-up blog post 

03 November 2017

Guy Fawkes, Bonfire or Fireworks Night?

What do you call the 5th of November? As a child of the 70s and 80s, to me it was 'Guy Fawkes' night, and my friends and I might make a 'guy' to throw on the bonfire. It is interesting to see, through an analysis of the UK Web Archive SHINE service, that the popularity of the term 'Guy Fawkes' was overtaken by 'Bonfire night' in 2009. I've included 'Fireworks night' too for comparison.

[Image: SHINE trend graph of 'Guy Fawkes', 'Bonfire night' and 'Fireworks night']

Is this part of a trend away from the original anti-Catholic remembrance and celebration to a more neutral event?

Examine this (and other) trends on our SHINE service.

By Jason Webber, Web Archive Engagement Manager, The British Library

24 October 2017

Web Archiving Tools for Legal Deposit

By Andy Jackson, Web Archive Technical Lead, The British Library - re-blogged from anjackson.net

Before I revisit the ideas explored in the first post in the blog series I need to go back to the start of this story…

Between 2003 and 2013 – before the Non-Print Legal Deposit regulations came into force – the UK Web Archive could only archive websites by explicit permission. During this time, the Web Curator Tool (WCT) was used to manage almost the entire life-cycle of the material in the archive. Initial processing of nominations was done via a separate Selection & Permission Tool (SPT), and the final playback was via a separate instance of Wayback, but WCT drove the rest of the process.

Of course, selective archiving is valuable in its own right, but this was also seen as a way of building up the experience and expertise required to implement full domain crawling under Legal Deposit. However, WCT was not deemed to be a good match for a domain crawl. The old version of Heritrix embedded inside WCT was not considered very scalable, was not expected to be supported for much longer, and was difficult to re-use or replace because of the way it was baked inside WCT.1

The chosen solution was to use Heritrix 3 to perform the domain crawl separately from the selective harvesting process. While this was rather different to Heritrix 1, requiring incompatible methods of set-up and configuration, it scaled fairly effectively, allowing us to perform a full domain crawl on a single server2.

This was the proposed arrangement when I joined the UK Web Archive team, and this was retained through the onset of the Non-Print Legal Deposit regulations. The domain crawls and the WCT crawls continued side by side, but were treated as separate collections. It would be possible to move between them by following links in Wayback, but no more.

This is not necessarily a bad idea, but it seemed to be a terrible shame, largely because it made it very difficult to effectively re-use material that had been collected as part of the domain crawl. For example, what if we found we’d missed an important website that should have been in one of our high-profile collections, but which, because we didn’t know about it, had only been captured under the domain crawl? Well, we’d want to go and add those old instances to that collection, of course.

Similarly, what if we wanted to merge material collected using a range of different web archiving tools or services into our main collections? For example, for some difficult sites we may have to drive the archiving process manually. We need to be able to properly integrate that content into our systems and present it as part of a coherent whole.

But WCT makes these kinds of things really hard.

If you look at the overall architecture, the Web Curator Tool enforces what is essentially (despite the odd loop or dead-end) a linear workflow (figure taken from here). First you sort out the permissions, then you define your Target and its metadata, then you crawl it (and maybe re-crawl it for QA), then you store it, then you make it available. In that order.

[Image: Web Curator Tool workflow diagram]

But what if we’ve already crawled it? Or collected it some other way? What if we want to add metadata to existing Targets? What if we want to store something but not make it available? What if we want to make domain crawl material available even if we haven’t QA’d it?

Looking at WCT, the components we needed were there, but tightly integrated in one monolithic application and baked into the expected workflow. I could not see how to take it apart and rebuild it in a way that would make sense and enable us to do what we needed. Furthermore, we had already built up a rather complex arrangement of additional components around WCT (this includes applications like SPT but also a rather messy nest of database triggers, cronjobs and scripts). It therefore made some sense to revisit our architecture as a whole.

So, I made the decision to make a fresh start. Instead of the WCT and SPT, we would develop a new, more modular archiving architecture built around the concept of annotations…

  1. Although we have moved away from WCT it is still under active development thanks to the National Library of New Zealand, including Heritrix3 integration! ↩
  2. Not without some stability and robustness problems. I’ll return to this point in a later post. ↩