THE BRITISH LIBRARY

UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

18 April 2017

The Challenges of Web Archiving Social Media

What is the UK Web Archive?
The UK Web Archive aims to archive, preserve and give access (where permissions allow) to the UK web space. It only collects information that is publicly available online in the UK. Therefore, any web pages that require a login, such as membership-only areas, are not captured; neither are emails or private intranets. As most of the popular social media platforms are not hosted in the UK, being largely based in the US, their public interfaces are not automatically picked up in our annual domain crawl. Thus, all social media sites in the archive have to be manually selected and scoped in so that they are legitimately archived under the Non-Print Legal Deposit Regulations.
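To make the idea of "scoping in" concrete, here is a minimal Python sketch of the kind of rule described above. It is an illustration only, not the archive's actual crawler configuration: the seed entries, the `in_scope` function and the simple ".uk" test are all assumptions made for this example, and real scoping rules are likely to be richer.

```python
from urllib.parse import urlsplit

# Hypothetical seeds: (host, path prefix) pairs a curator has manually scoped in.
MANUAL_SEEDS = {
    ("twitter.com", "/example_party"),
    ("www.facebook.com", "/example_campaign"),
}


def in_scope(url: str) -> bool:
    """Return True if a crawler following this sketch's rules would collect the URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Rule 1 (assumed): anything on a .uk domain counts as UK web space.
    if host.endswith(".uk"):
        return True
    # Rule 2 (assumed): otherwise the URL must sit under a manually selected seed.
    path = parts.path or "/"
    return any(
        host == seed_host and (path == prefix or path.startswith(prefix + "/"))
        for seed_host, prefix in MANUAL_SEEDS
    )
```

Under these assumptions, `in_scope("https://www.bl.uk/collection")` is true because of the domain, while a profile on a US-hosted platform is only collected if a curator has added it to the seed list.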

What Social Media is in the UK Web Archive?
The UK Web Archive selectively collects publicly accessible Facebook and Twitter profiles related to thematic collections such as the EU Referendum, or ‘Brexit’, or those accounts of prominent individuals and organisations in the UK, such as the Prime Minister and the main political parties. In the main, social media is collected when building special collections on big events that shape society, for instance elections and referendums. We collect profiles that are related directly to political parties or interest groups campaigning on relevant issues. As we can only archive content from the UK web space, we cannot crawl individual hashtags like #BBCRecipes and #Brexit, as a lot of this content is generated outside the UK and we cannot ascertain the provenance of third-party comments.

Difficulties with web archiving social media
Archiving social media is technically challenging as these platforms are presented in a different way to ‘traditional’ websites. Social media platforms use Application Programming Interfaces (APIs) as a way to ‘enable controlled access [to] their underlying functions and data’ (Day Thomson). In the past we have tried to crawl other platforms such as Instagram and Flickr but have been unsuccessful, due to a combination of technical difficulties and restrictions that are sometimes set to prevent crawler access.

How to access the UK Web Archive
Under the 2013 Non-Print Legal Deposit Regulations the UK Legal Deposit Libraries are permitted to archive UK content published on the web. However, access to this content is limited to Legal Deposit Library premises unless explicit permission is obtained from the site owner to make content available on the Open UK Web Archive website. More information on Non-Print Legal Deposit can be found here and information on how to access the UK Web Archive can be found here.

What to expect when using this resource
The success rate of crawling Twitter and Facebook is limited and the quality of the captures varies. In the worst-case scenario, what is presented to the user amounts to the date a post was made in a blank white box. There are many reasons why a crawler cannot follow links. One reason is that the user used a shortened URL that is now broken or couldn’t be read at the time of the crawl. The Internet Archive is currently working with companies that provide this service to ensure the longevity of shortened URLs. Advertisements on social media and archived websites are not always captured, resulting in either a ‘Resource Not in Archive’ message or leakage to the live web. More information on this can be found here.
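As a rough illustration of the shortened-URL problem, the Python sketch below picks out links in a post that point at a link-shortening service, so that they could be checked while the short link still works. This is only a sketch under stated assumptions: the list of shortener hosts is illustrative and incomplete, and the function name `shortened_links` is invented for this example rather than taken from any crawler.

```python
import re
from typing import List
from urllib.parse import urlsplit

# Illustrative, incomplete list of link-shortener hosts (an assumption for this sketch).
SHORTENER_HOSTS = {"bit.ly", "t.co", "goo.gl", "ow.ly", "tinyurl.com"}

# Simple pattern for http(s) links in free text; good enough for a sketch.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def shortened_links(post_text: str) -> List[str]:
    """Return the links in a post whose host is a known shortener."""
    found = []
    for match in URL_PATTERN.findall(post_text):
        # Drop punctuation that commonly trails a link at the end of a sentence.
        url = match.rstrip(".,;:!?)")
        host = (urlsplit(url).hostname or "").lower()
        if host in SHORTENER_HOSTS:
            found.append(url)
    return found
```

A crawler could use a check like this to flag posts whose external links may later appear as white boxes if the short link stops resolving.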

Twitter

1. Unison Scotland Twitter

Unison Scotland – Twitter from April 8th 2016

2. RC of Psychiatrists

RC of Psychiatrists – Twitter from August 2nd 2016

Facebook

When we first started archiving public Facebook pages, the crawls were quite successful, albeit with the caveat around archiving external links. As you can see from the Unison Scotland example, there are white boxes where an external link was shared using a shortened URL which wasn’t captured. In spring 2015 Facebook changed its display settings and we were only able to capture a white screen. However, more recent captures have been successful.

3. Unison Facebook

Unison Scotland – Facebook from April 8th 2016

4. EU Citizens for an Independent Scotland Facebook

EU Citizens for an Independent Scotland – Facebook from 15th November 2014

Conclusion

As you can see from the few samples here, the quality of the capture can vary, but a lot of valuable information can still be gathered from these instances. In March 2017 the UK Web Archive deployed a new version of its web crawler, which takes a screenshot of the home page of each website before archiving the content. Although it will be some time before the technology is available for researchers to view these screenshots, it is hoped that they will bridge the gap between what is captured and what is not.

Internationally, more research needs to be done on archiving social media, with the assistance of the platform proprietors. No two platforms are the same, and each requires a tailored approach to ensure a successful crawl.

More information about the UK Web Archive can be found here.

20 December 2016

If Websites Could Talk

The UK Web Archive collects a wide variety of websites for future researchers. This made us think…

…IF WEBSITES COULD TALK …

… it’s surely possible that they would debate amongst themselves as to which might be regarded as the most fantastic and extraordinary site of all.

“I’d like to stake my claim,” said the 'British Interplanetary Society'.

“Aren’t you just a bit too predictable?”, said the 'British Banjo, Mandolin & Guitar Federation'. “Outer space and all that. Music can be fantastic, in its way.”

“Yes indeed,” said the 'British Association of American Square Dance Clubs'. “Mind you, you could make a case for the 'British Fenestration Rating Council'.”

“Or even the 'Bamboo Bicycle Club',” interjected the 'Dorset Moths Group'. “To say nothing of the 'Association of Approved Oven Cleaners'.”

“Far too tame,” said 'The Junglie Association'. “No-one has a clue what we’re about, so the title should surely be ours.”

“Not so fast,” countered the 'British Wing Chun Kuen Association'. “You’re overlooking us!”

“You two are both too obscure, which isn’t the same as extraordinary,” said the 'Brighton Greyhound Owners Association Trust for Retired Racing Greyhounds'. “Don’t you agree, 'Scythe Association of Great Britain & Ireland'?”

They looked more than a little put out at this, but each came round after receiving a friendly hug from the 'Cuddle Fairy'.

Suddenly 'Dangerous Women' butted in. “May we introduce our friend 'I Hate Ironing'?” There was a pause. “Who is it making all that noise?”

“Oh, that’ll be the 'Society of Sexual Health Advisers',” said the 'Teapot Trust'. “No doubt sharing a joke with 'You & Your Hormones'. Where is the 'National Poisons Information Service' when you need it?”

“Now now,” tutted 'A Nice Cup of Tea and a Sit Down'. “No need for that. Like the 'Grateful Society', we should just give thanks that they’re here.”

At this point a site which had hitherto been silent spoke up. “With the utmost respect, I reckon I am what you are looking for.”

“Really?” chorused the others. “And your name is … ?”

“The 'Eccentric Club'.”

Silence fell. They knew that, for the time being, the title had been won …

By Hedley Sutton, Asian & African Studies Reference Team Leader, The British Library

18 November 2016

Explore Your Archives Week at the UK Web Archive

The UK Web Archive is taking part in the annual Explore Your Archives week organised by The National Archives (TNA) and the Archives and Records Association (ARA). There is a different hashtag to use on social media for each day of the week, and the UK Web Archive will be tweeting throughout the week using them. There is also a chance for you to join in the conversation on Wednesday 23rd as we reflect on the work we have done in 2016.

How will the UK Web Archive Participate?

Saturday 19 November and Sunday 20 November
#ExploreArchives

This weekend we will be tweeting about the UK Web Archive’s aims and objectives as well as some FAQs that come up around copyright and preservation.

Monday 21 November 2016
#Archivepioneers

We will be tweeting about web archiving pioneers.

Tuesday 22 November 2016
#hairyarchives

We will try to uncover some of the most interesting hair-related pictures from our archive. Also, have you ever wondered how many times the words moustache and hipster appear online together? Keep an eye out for all hair-related tweets on Tuesday.

Wednesday 23 November 2016
#YearInArchives

2016 has been a very eventful year, both in politics and in the passing of so many celebrities. Let us know which moments were important to you.

Tune in for a live chat 1300-1400 (GMT) with the web archivists from the British Library and National Library of Scotland to find out the latest news on the 2016 collections.

The British Library:

Nicola Bingham – Lead Curator of Web Archives – @NicolaJBingham

Jason Webber – Engagement Manager – @UKWebArchive

Helena Byrne – Assistant Web Archivist – @HBee2015

The National Library of Scotland:

Eilidh MacGlone – Web Archivist – @dalmailing

Thursday 24 November 2016
#autoarchives
A key day for transport enthusiasts: keep an eye out for polls on different types of transport and pictures of some unusual forms of transport.

Friday 25 November 2016
#ArchiveAnimals

The crucial question of cats vs. dogs on the internet will finally be answered.

Saturday 26 and Sunday 27 November 2016
#ExploreArchives

To finish off the week we will have a few more fun facts about the UK Web Archive.

Get tweeting and don’t forget to use the designated hashtags for each day. If you know of any UK-based websites that cover these topics, why don’t you nominate them to the archive?

Nominate websites

More information on this event