THE BRITISH LIBRARY

UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

07 August 2017

The 2016 EU Referendum Debate

 


LEAFLET

Pictured: Official EU referendum campaign leaflets – Remain (left hand side) and Leave (right hand side). Do you see any similarities?

My name is Alexandra Bulat and I am a PhD student at the School of Slavonic and East European Studies, University College London. My research is on attitudes towards EU migrants in the UK, based on fieldwork in Stratford (London) and Clacton-on-Sea.

The 2016 EU referendum campaign represents an important period when attitudes towards the topical ‘uncontrolled EU migration’ were shaped, expressed, and passionately debated. In this context, websites and social media played a key role in presenting the public with arguments about EU migrants and migration. Can we find the same campaign information today by browsing web resources? Some campaign websites have since been amended, renamed, redesigned, or simply disappeared from the visible online space. Here is where the UK Web Archive can help researchers like me who analyse particular events in history, such as the EU referendum.

In June 2017, I started a three-month placement with the British Library Contemporary British Collections. The project is titled Researching the EU Referendum through Web Archive and Leaflet Collections. I use the EU referendum web archive and 177 digitised leaflets and pamphlets (available in the LSE Digital Library ‘Brexit’ collection) to answer the following research question: Who is speaking about EU migration and how?

In the first stage of research, I created a spreadsheet for the leaflets and pamphlets, recording basic information such as title, organisation, and their position in the campaign. I also included all the content about freedom of movement, migrants, refugees, and closely linked topics. Overall, almost two thirds of the materials supported remaining in the EU, with only five categorised as ‘neutral’ and the rest arguing for leaving the EU. Just under half of these materials mentioned immigration, with more ‘Leave’ than ‘Remain’ sources. About a fourth of the items were clearly targeted at a specific region or town/city, the most common being London, Cambridge and various locations in Wales.

The second stage involved using the UK Web Archive to search for the websites and social media (in particular, Twitter handles) that were explicitly mentioned in the printed material, or that I could easily infer from the information available. Only six leaflets did not mention an online presence, and I was unable to find any evidence of one for them. However, the large majority had website(s) or social media mentioned in the printed publication. I ended up with a list of 49 main websites and social media presence for over half of them. Almost all those websites were archived, so I could see the exact information which had been live during the referendum. Most websites were available in the UK Web Archive, but some archived copies were only found in the Internet Archive. For comparison purposes, I looked at the latest record each website had before June 23rd. For some this was as close as 22 June, offering a real snapshot of the debate right before the polling day, but others were not archived in 2016 at all (but had earlier records).
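The "latest record before polling day" selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the method actually used in the project: the 14-digit YYYYMMDDhhmmss timestamp format, the function name latest_before and the sample capture lists are all assumptions made for the example.

```python
from datetime import datetime
from typing import Iterable, Optional

# Polling day for the 2016 EU referendum (date taken from the post).
POLLING_DAY = datetime(2016, 6, 23)


def latest_before(timestamps: Iterable[str],
                  cutoff: datetime = POLLING_DAY) -> Optional[str]:
    """Return the most recent capture timestamp strictly before `cutoff`.

    Timestamps are assumed (hypothetically) to be 14-digit strings in
    YYYYMMDDhhmmss form. Returns None when no capture precedes the cutoff.
    """
    best_time = None
    best_stamp = None
    for stamp in timestamps:
        captured = datetime.strptime(stamp, "%Y%m%d%H%M%S")
        if captured < cutoff and (best_time is None or captured > best_time):
            best_time, best_stamp = captured, stamp
    return best_stamp


# Made-up capture lists for three hypothetical sites.
archived_close_to_vote = ["20150301120000", "20160622093000", "20170801101500"]
archived_only_earlier = ["20140510080000", "20150915140000"]
archived_only_later = ["20170801101500"]

if __name__ == "__main__":
    for captures in (archived_close_to_vote,
                     archived_only_earlier,
                     archived_only_later):
        print(latest_before(captures))
```

The second sample list mirrors the case of a site with no 2016 capture but earlier records, and the third mirrors a site first captured only after the referendum, for which no comparable record exists.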

There is a variety of websites, from the official Vote Leave (www.voteleavetakecontrol.org) and Britain Stronger In Europe (www.strongerin.co.uk), to less familiar campaigns such as Universities for Europe (www.universitiesforeurope.com) and The Eurosceptic (www.eurosceptic.org.uk). A majority of these websites are in the Library’s ‘EU Referendum’ special collection, which brings together a range of websites such as blogs, opinion polls, interest groups, news, political parties, research centres and think tanks, social media and Government sources, all of which wrote about the Referendum. Nevertheless, some smaller campaigns, or websites that are not necessarily dedicated to the Referendum but included some content about it, were not included in the special collection.

One example of the importance of archiving the web is www.labourinforbritain.org.uk . Although this is a rather well known campaign (which even has its own Wikipedia page, where this website is quoted), its website is not ‘live’ anymore.

Screen capture 1: ‘Live website’, 1 August 2017

SORRY

The UK Web Archive only started making records of it in 2017, by which time it was already displaying an error message. However, the Internet Archive has snapshots from before it disappeared from the live web. The Labour In campaign is an important resource for my research – it is one amongst a small number of sources making a more positive case about EU migration, which is essential to compare and contrast with the less favourable arguments made by other campaigners. Although the main Labour Party website had a tab about the Referendum, it did not include the same content as this campaign website, entirely dedicated to referendum issues.

Screen capture 2: ‘Archived website’, 22 June 2016

IMMIGRATION

In addition to finding information that is not ‘live’ anymore, the web archive helps to contextualise the leaflets and complement the information provided in those printed campaign materials. The Bruges Group webpage is a good example in this sense. The digitised leaflet collection has four different leaflets from them. However, a comprehensive list of viewable leaflets is available on the archived website. In this case, the information was still on the live web when I last checked (apart from a slight change in formatting). However, no one knows how long it will remain there, particularly once ‘Brexit’ is no longer in the public debate.

By helping recover seemingly ‘lost’ information, complementing other datasets, contextualising the research and possibly many other roles, web archives are valuable resources that researchers should be encouraged to explore in greater depth. To mark the end of my PhD placement, I am helping to put together a roundtable discussion at the British Library with EU referendum collection curators and academics from a number of institutions, to create the space for conversation around future use of web archives in academic research and beyond.

Alexandra Bulat, August 2017

08 June 2017

Revitalising the UK Web Archive

By Andrew Jackson, Web Archiving Technical Lead

It’s been over a year since we made our historical search system available, and it’s proven itself to be stable and useful. Since then, we’ve been largely focussed on changes to our crawl system, but we’ve also been planning how to take what we learned in the Big UK Domain Data for the Arts and Humanities project and use it to re-develop the UK Web Archive.

Screenshot-ukwa-homepage
UKWA homepage

Our current website has not changed much since 2013, and doesn’t describe who we are and what we do now that the UK Legal Deposit regulations are in place. It only describes the sites we have crawled by permission, and does not reflect the tens of thousands of sites and URLs that we have curated and categorised under Legal Deposit, nor the billions of web pages in the full collection. To try to address these issues, we’re currently developing a new website that will open up and refresh our archives.

One of the biggest challenges is the search index. The 3.5 billion resources we’ve indexed for SHINE represent less than a third of our holdings, so now we need to scale our system up to cope with over ten billion documents, and a growth rate of 2-3 billion resources per year. We will continue working with the open source indexer we have developed, while updating our data processing platform (Apache Hadoop) and dedicating more hardware to the SolrCloud that holds our search indexes. If this all works as planned, we will be able to offer a complete search service that covers our entire archive, from 1995 to yesterday.
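To make the scale concrete, the figures quoted above can be turned into a rough back-of-the-envelope projection. The sketch below is only illustrative: the function names are hypothetical, and the assumption of steady linear growth at 2-3 billion resources per year is ours rather than a description of how capacity planning was actually done.

```python
from typing import Tuple

INDEXED_BILLIONS = 3.5    # resources indexed for SHINE (figure from the post)
SHARE_DENOMINATOR = 3     # "less than a third" of holdings
GROWTH_LOW = 2.0          # billions of new resources per year (lower figure)
GROWTH_HIGH = 3.0         # billions of new resources per year (upper figure)


def minimum_holdings(indexed: float = INDEXED_BILLIONS,
                     denominator: int = SHARE_DENOMINATOR) -> float:
    """Lower bound on total holdings, in billions.

    If the indexed part is less than 1/denominator of the whole,
    the whole must exceed indexed * denominator.
    """
    return indexed * denominator


def projected_range(start: float, years: int,
                    low: float = GROWTH_LOW,
                    high: float = GROWTH_HIGH) -> Tuple[float, float]:
    """Projected holdings after `years`, assuming linear growth (an assumption)."""
    return (start + low * years, start + high * years)


if __name__ == "__main__":
    floor = minimum_holdings()
    print(f"Holdings exceed roughly {floor} billion resources")
    for years in (1, 3, 5):
        low, high = projected_range(floor, years)
        print(f"After {years} year(s): {low} to {high} billion")
```

On these assumptions the lower bound of 10.5 billion agrees with the "over ten billion documents" quoted above, and an index sized for today's holdings would need to absorb several billion further resources within a few years.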

Shine-word-home
SHINE search results page

The first release of the new website is not expected to include all of the functionality offered by the SHINE prototype, just the core functionality we need to make our content and collections more available to a general audience. Quite how we bring together these two distinct views of the same underlying search index is an open question at this point in time. Later in the year, we will make the new website available as a public beta, and we’ll be looking for feedback from all our users, to help us decide how things should evolve from here.

As well as scaling up search, we’ve also been working to scale up our access service. While it doesn’t look all that different, our website playback service has been overhauled to cope with the scale of our full collection. This allows us to make our full holdings knowable, even if they aren’t openly accessible, so you get a more informative error message (and HTTP status code) if you attempt to access content that we can only make available on site at the present time. For example, if you look at our archive of google.co.uk, you can see that we have captured the Google U.K. homepage during our crawls but can’t make it openly available due to the legal framework we operate within.
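The idea of making holdings knowable without making them openly accessible can be sketched as a small decision function. This is a hypothetical illustration only: the post does not say which HTTP status code the playback service returns, so the use of 451 (Unavailable For Legal Reasons, defined in RFC 7725) is an assumption, as are the function name, parameters and messages.

```python
from http import HTTPStatus
from typing import Tuple


def playback_response(captured: bool,
                      open_access: bool,
                      on_site: bool) -> Tuple[HTTPStatus, str]:
    """Decide what a playback request should return (hypothetical sketch).

    captured:    the archive holds at least one capture of the URL
    open_access: the site owner has given permission for open access
    on_site:     the request comes from Legal Deposit Library premises
    """
    if not captured:
        # Nothing held, so a plain "not found" is accurate.
        return HTTPStatus.NOT_FOUND, "No capture of this URL is held."
    if open_access or on_site:
        return HTTPStatus.OK, "Serving archived copy."
    # Held, but restricted: say so rather than pretend it is missing.
    # The choice of 451 here is an assumption, not confirmed by the post.
    return (HTTPStatus.UNAVAILABLE_FOR_LEGAL_REASONS,
            "Captured, but only available on Legal Deposit Library premises.")


if __name__ == "__main__":
    status, message = playback_response(captured=True,
                                        open_access=False,
                                        on_site=False)
    print(int(status), message)
```

The point of the distinction is that a reader off site learns that a capture exists and where it can be consulted, instead of receiving the same response as for a URL that was never archived.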

The upgrades to our infrastructure will also allow us to update the tools we use to analyse our holdings. In particular, we will be attending the Archives Unleashed 4.0 Datathon and looking at the Warcbase and ArchiveSpark projects, as they provide a powerful set of open source tools and would enable us to collaborate directly with our research community. A stable data-analysis framework will also provide a platform for automated QA and report generation and make it much easier to update our datasets.

Taken together, we believe these developments will revolutionise the way readers and researchers can use the UK Web Archive. It’s going to be an interesting year.

 

28 April 2017

What websites do we collect during UK General Elections?

The UK Web Archive has been archiving websites connected to General Elections since 2005.

During the 2005 and 2010 elections, collecting was done on a permissions-cleared basis, requiring curators to contact individual website owners to request permission to archive a website before it was captured and stored. Sites whose publishers refused permission, did not respond or could not be contacted were not archived. The 2015 election was collected following the introduction of new Legal Deposit regulations in 2013 that allow any UK website to be collected without permission.

The collections are not comprehensive: owing to various factors such as the time-consuming permissions process and the ephemeral nature of websites (which often do not include contact details), there are large sections of content relating to the General Elections that could not be covered.

Collection Summary:

2005

The UK General Election 2005 was the first of our Election collections. It includes 139 different items, or ‘Targets’, which cover a wide variety of websites such as those of individual candidates, major political parties, interest groups and a selection of election manifestos. Although this collection is fairly small, it is worth highlighting that until relatively recently election campaigning was predominantly carried out through print media; in 2005 it was by no means the case that all political candidates had a website.

2010

The UK General Election 2010 collection is much bigger, totalling 770 items. This collection has eleven sub categories that cover:

Candidates (15 items)

Election Blogs (27 items)

Interest Groups (113 items)

News and Commentary (30 items)

Opinion Polls (7 items)

Other (8 items)

Political Parties - Local (191 items)

Political Parties - National (54 items)

Public and Community Engagement (13 items)

Regulation and Guidance (15 items)

Research Centres and Think Tanks (14 items)

2015

The UK General Election 2015 collection is the biggest collection of its type with 7,861 items. By 2015 we observed that much more traditionally paper-based content had moved onto the web. This shift in publishing, along with the introduction of the Non-Print Legal Deposit Regulations (NPLD) in 2013, which enabled the Legal Deposit Libraries to collect online UK content at scale without seeking explicit permissions, meant that this collection was bigger than those of previous years. This collection has eleven sub categories that cover:

Candidates (1,957 items)

Election Blogs (100 items)

Interest Groups (416 items)

News and Commentary (4,582 items)

Opinion Polls (32 items)

Other (75 items)

Political Parties - Local (442 items)

Political Parties - National (142 items)

Public & Community Engagement (45 items)

Regulation & Guidance (7 items)

Research Centres & Think Tanks (62 items)

All content archived in 2015 will be available to users later this year, either via the UK Web Archive website or through a UK Legal Deposit Library Reading Room, depending on the permission status of the individual websites.

2017

As the June 2017 general election was called at short notice, the collection will likely be much smaller than the 2015 collection. However, as a number of the websites in the 2015 collection are still live, they will be re-tagged for the 2017 collection, which will give the curators more time to focus on selecting the more ephemeral websites and social media content.

By Helena Byrne, Assistant Web Archivist, The British Library

18 April 2017

The Challenges of Web Archiving Social Media

What is the UK Web Archive?
The UK Web Archive aims to archive, preserve and give access (where permissions allow) to the UK web space. It only collects information that is publicly available online in the UK. Therefore, any web pages that require a login, such as membership-only areas, are not captured; neither are emails or private intranets. As most of the popular social media platforms are not hosted in the UK, being largely based in the US, their public interfaces are not automatically picked up in our annual domain crawl. Thus, all social media sites in the archive have to be manually selected and scoped in so that they are legitimately archived under Non-Print Legal Deposit Regulations.

What Social Media is in the UK Web Archive?
The UK Web Archive selectively collects publicly accessible Facebook and Twitter profiles related to thematic collections such as the EU Referendum, or ‘Brexit’, or those accounts of prominent individuals and organisations in the UK, such as the Prime Minister and the main political parties. In the main, social media is collected when building special collections on big events that shape society, for instance elections and referendums. We collect profiles that are related directly to political parties or interest groups campaigning on relevant issues. As we can only archive content from the UK web space, we cannot crawl individual hashtags like #BBCRecipes and #Brexit, as a lot of this content is generated outside the UK and we cannot ascertain the provenance of 3rd party comments.

Difficulties with web archiving social media
Archiving social media is technically challenging as these platforms are presented in a different way to ‘traditional’ websites. Social media platforms use Application Programming Interfaces (APIs) as a way to ‘enable controlled access to their underlying functions and data’ (Day Thomson). In the past we have tried to crawl other platforms such as Instagram and Flickr but have been unsuccessful, due to a combination of technical difficulties and restrictions that are sometimes set to prevent crawler access.

How to access the UK Web Archive
Under the 2013 Non-Print Legal Deposit Regulations the UK Legal Deposit Libraries are permitted to archive UK content published on the web. However, access to this content is limited to Legal Deposit Library premises unless explicit permission is obtained from the site owner to make content available on the Open UK Web Archive website. More information on Non-Print Legal Deposit can be found here and information on how to access the UK Web Archive can be found here.

What to expect when using this resource
The success rate of crawling Twitter and Facebook is limited and the quality of the captures varies. In the worst case scenario, what is presented to the user amounts to the date a post was made in a blank white box. There are many reasons why a crawler cannot follow links. One reason is that the user used a shortened URL that is now broken or couldn’t be read at the time of the crawl. The Internet Archive is currently working with companies that provide this service to ensure the longevity of shortened URLs. Advertisements on social media and archived websites are not always captured, resulting in either a ‘Resource Not in Archive’ message or leakage to the live web. More information on this can be found here.

Twitter

1. Unison Scotland Twitter

Unison Scotland – Twitter from April 8th 2016

2. RC of Psychiatrists

RC of Psychiatrists – Twitter from August 2nd 2016

Facebook

When we first started archiving public Facebook pages, the crawls were quite successful, albeit with the caveat around archiving external links. As you can see from the Unison Scotland example, there are white boxes where an external link was shared using a shortened URL which wasn’t captured. In spring 2015 Facebook changed its display settings and we were only able to capture a white screen. However, more recent captures have been successful.

3. Unison Facebook

Unison Scotland – Facebook from April 8th 2016

4. EU Citizens for an Independent Scotland Facebook

EU Citizens for an Independent Scotland – Facebook from 15th November 2014

Conclusion

As you can see from the few samples here, the quality of the capture can vary, but a lot of valuable information can still be gathered from these instances. In March 2017 the UK Web Archive deployed a new version of its web crawler, which takes a screenshot of the home page of each website before archiving the content. Although it will be some time before the technology is available for researchers to view these screenshots, it is hoped that it will bridge the gap between what is captured and what is not.

Internationally, more research needs to be done on archiving social media, with the assistance of the platform proprietors. No two platforms are the same, and each requires a tailored approach to ensure a successful crawl.

More information about the UK Web Archive can be found here.

20 December 2016

If Websites Could Talk

The UK Web Archive collects a wide variety of websites for future researchers. This made us think…

…IF WEBSITES COULD TALK …

… it’s surely possible that they would debate amongst themselves as to which might be regarded as the most fantastic and extraordinary site of all.

“I’d like to stake my claim,” said the 'British Interplanetary Society'.

A Walk across London - north to south

“Aren’t you just a bit too predictable?”, said the 'British Banjo, Mandolin & Guitar Federation'. “Outer space and all that. Music can be fantastic, in its way.”

“Yes indeed,” said the 'British Association of American Square Dance Clubs'. “Mind you, you could make a case for the 'British Fenestration Rating Council'.”

“Or even the 'Bamboo Bicycle Club',” interjected the 'Dorset Moths Group'. “To say nothing of the 'Association of Approved Oven Cleaners'.”

“Far too tame,” said the 'The Junglie Association'. “No-one has a clue what we’re about, so the title should surely be ours.”

“Not so fast,” countered the 'British Wing Chun Kuen Association'. “You’re overlooking us!”

“You two are both too obscure, which isn’t the same as extraordinary,” said the 'Brighton Greyhound Owners Association Trust for Retired Racing Greyhounds'. “Don’t you agree, 'Scythe Association of Great Britain & Ireland'?”

They looked more than a little put out at this, but each came round after receiving a friendly hug from the 'Cuddle Fairy'.

Suddenly 'Dangerous Women' butted in. “May we introduce our friend 'I Hate Ironing'?” There was a pause. “Who is it making all that noise?”

“Oh, that’ll be the 'Society of Sexual Health Advisers',” said the 'Teapot Trust'. “No doubt sharing a joke with 'You & Your Hormones'. Where is the 'National Poisons Information Service' when you need it?”

“Now now,” tutted the 'A Nice Cup of Tea and a Sit Down', “No need for that. Like the 'Grateful Society', we should just give thanks that they’re here.”

At this point a site which had hitherto been silent spoke up. “With the utmost respect, I reckon I am what you are looking for.”

“Really?” chorused the others. “And your name is … ?”

“The 'Eccentric Club'.”

Silence fell. They knew that, for the time being, the title had been won …

By Hedley Sutton, Asian & African Studies Reference Team Leader, The British Library

18 November 2016

Explore Your Archives Week at the UK Web Archive

The UK Web Archive is taking part in the annual Explore Your Archives week organised by The National Archives (TNA) and the Archives and Records Association (ARA). There are different hashtags to use on social media during the week, and the UK Web Archive will be tweeting throughout the week using them. There is also a chance for you to join in the conversation on Wednesday 23rd as we reflect on the work we have done in 2016.

How will the UK Web Archive Participate?

Saturday 19 November and Sunday 20 November
#ExploreArchives

This weekend we will be tweeting about the UK Web Archive’s aims and objectives as well as some FAQs that come up around copyright and preservation.

Monday 21 November 2016
#Archivepioneers

We will be tweeting about web archiving pioneers.

Tuesday 22 November 2016
#hairyarchives

We will try to uncover some of the most interesting hair-related pictures from our archive. Also, have you ever wondered how many times the words moustache and hipster appear online together? Keep an eye out for all hair-related tweets on Tuesday.

Wednesday 23 November 2016
#YearInArchives

2016 has been a very eventful year in politics and in the passing of so many celebrities. Let us know the moments that were important to you.

Tune in for a live chat 1300-1400 (GMT) with the web archivists from the British Library and National Library of Scotland to find out the latest news on the 2016 collections.

The British Library:

Nicola Bingham – Lead Curator of Web Archives – @NicolaJBingham

Jason Webber – Engagement Manager – @UKWebArchive

Helena Byrne – Assistant Web Archivist – @HBee2015

The National Library of Scotland:

Eilidh MacGlone - Web Archivist – @dalmailing

Thursday 24 November 2016
#autoarchives
A key day for transport enthusiasts: keep an eye out for polls on different types of transport and pictures of some unusual forms of transport.

Friday 25 November 2016
#ArchiveAnimals

The crucial question of cats vs. dogs on the internet will finally be answered.

Saturday 26 and Sunday 27 November 2016
#ExploreArchives

To finish off the week we will have a few more fun facts about the UK Web Archive.

Get tweeting and don’t forget to use the designated hashtags for each day. If you know of any UK based websites that cover these topics, why don’t you nominate them to the archive?

Nominate websites

More information on this event

22 September 2016

Web Archiving Rio 2016 Olympic and Paralympic Games

‘For the Olympics, the whole world is captivated, turns on its television and supports their country’

Introduction
The Olympic and Paralympic Games in Rio de Janeiro, Brazil may be over but it will be some time before they are forgotten about in the press and social media. Web archives play a vital role in preserving the narratives that have come out of these Games. The Content Development Group (CDG) at the International Internet Preservation Consortium (IIPC) has been archiving both the Winter and Summer Games since 2010 and the Rio 2016 Collection will be available in October 2016.

Rio-world-map

Rio 2016 is the first time the CDG has archived events both on and off the playing field, making this its biggest collection so far in terms of the number of nominations and geographical coverage. The CDG also enlisted the help of subject experts as well as the general public to nominate sites from countries not usually covered in IIPC collections. As the IIPC only has members in around 33 countries, public nominations played an important role in filling this void.

What’s involved?
But what’s involved in web archiving the Olympics? CDG members the British Library and the National Library of Scotland co-hosted a Twitter chat on 10th August 2016 to give an insight on what’s involved. The Twitter chat was based on set questions published in an IIPC blog post with a Q&A session and some time for live nominations. This was an international chat with participants from the USA, Ireland, England, Scotland, Serbia and even Australia. The chat was added to Storify as well as the final archived collection of the Games. Even though the chat was small it helped us to connect with a wider audience and increase the number of public nominations. You can follow updates on this project on Twitter by using the collection hashtag #Rio2016WA.

How can you get involved?
There is still time for you to get involved in web archiving the Olympics and Paralympics. The public nomination form will be open until 23rd September 2016. If you would like to make a nomination you can follow these guidelines. As Carly Lloyd stated above, the whole world is captivated by the Olympics; now is your opportunity to be part of it.

By Helena Byrne, Assistant Web Archivist, The British Library

15 September 2016

Commemorating the Battle of the Somme in the UK Web Archive

On 15 September 1916 the Battle of Flers-Courcelette (a phase of the greater Battle of the Somme) commenced. It is mostly famous for the introduction of the tank into battle (with mixed results). Less well known now is that it was the day that the Prime Minister's own son, Lt. Raymond Asquith, was killed when he went into action with his unit, the 3rd Grenadier Guards. It turned out to be the battalion's bloodiest single day of the war. Asquith's death is recorded in the battalion war diary that I transcribed while I was researching my own Great Grandfather. This website is now saved as part of the UK Web Archive and will be available for future research even if the original goes offline.

THE BATTLE OF THE SOMME, JULY-NOVEMBER 1916 © IWM (CO 802)

Commemorating the Somme and the First World War
The UK Web Archive has been collecting websites about the First World War since 2014 and will continue to do so until at least 2019. So far we have 726 individual websites in the collection, 128 of which are available to view through the public website.

There is already a great range of websites in the collection. Many of them look at memorials linked to places (e.g. Crich parish roll of honour) or individual units (e.g. 36th Ulster Division). Others commemorate individual family members such as William Thomas Clarke.

The home front is not forgotten in projects such as 'A Year in the Life of Avon Dassett' or 'Sunderland in the First World war'.

We need your help!
We welcome any suggestions for making this collection as complete as possible. If you have a UK website that relates to the First World War (or know of one), please let us know through Twitter (@ukwebarchive) or our nomination form.

Online resources often only last a few years and the UK Web Archive aims to keep copies of these First World War centenary websites in perpetuity. Help us keep these memories alive.

By Jason Webber, Web Archiving Engagement Manager, The British Library