UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites


Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

04 August 2020

Twit twoo: International Owl Awareness Day 2020

By Helena Byrne, Curator of Web Archives, The British Library
 
 
 
An illustration of four owls perched on a branch with the moonlight behind them
British Library digitised image from page 271 of "Madeline Power [A novel]" https://www.flickr.com/photos/britishlibrary/11121066504

 

The 4th of August is International Owl Awareness Day. This is the perfect time to reflect on owl related content in the UK Web Archive. 

There are five native species of owl resident year-round in the UK, namely the Tawny Owl, Barn Owl, Long-eared Owl, Short-eared Owl and Little Owl. The Snowy Owl is also an occasional winter visitor to the Outer Hebrides, Shetland and the Cairngorms in Scotland.

Owls online

We were wondering, out of these six owl species, which one is the most popular on the archived .uk domain?

 

UK Owl Species Shine Trends
A graph showing how many mentions the six owl species have on the archived .uk web

 

In order to answer this question, the Shine graph may prove useful. Shine was developed as part of the Big UK Data Arts and Humanities project funded by the AHRC. The data was acquired by JISC from the Internet Archive and includes all .uk websites in the Internet Archive web collection crawled between 1996 and April 2013. The collection comprises over 3.5 billion items (URLs, images and other documents) and has been full-text indexed by the UK Web Archive. Every word of every website in the collection can be searched for and analysed.

The most popular owl species referenced in the Shine dataset is the Barn Owl. Although the curve in the graph peaks in 2011, the year with the most Barn Owl mentions was 2012. This is because the graph shows mentions as a percentage of the resources archived for each year, and some years have more resources than others. In 2011 there were 66,034 of 288,809,412 archived resources that mention Barn Owl, while in 2012 there were 94,990 of 463,367,189 resources. These numbers are too big to review manually, but by clicking on a single point on the graph, Shine will generate a random sample of up to 100 references to the search term. The sample displays a sentence where the term appears, as well as a link out to the Internet Archive so that you can review the archived website.
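The difference between the two measures can be checked with a few lines of Python. This is only an illustrative sketch using the four figures quoted above. The names `BARN_OWL` and `share_percent` are made up for this example and are not part of Shine.

```python
# Barn Owl figures quoted in this post: year -> (resources mentioning the
# term, total archived resources for that year).
BARN_OWL = {
    2011: (66_034, 288_809_412),
    2012: (94_990, 463_367_189),
}


def share_percent(mentions: int, total: int) -> float:
    """Return mentions as a percentage of all archived resources."""
    return 100.0 * mentions / total


# Percentage share per year, which is what the graph is described as plotting.
shares = {year: share_percent(*counts) for year, counts in BARN_OWL.items()}

for year, (mentions, total) in sorted(BARN_OWL.items()):
    print(f"{year}: {mentions:,} of {total:,} resources = {shares[year]:.4f}%")
```

The share works out at roughly 0.0229% for 2011 and 0.0205% for 2012, so the percentage curve is higher in 2011 even though 2012 has the larger raw count.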

 

Get creative with owls at the British Library

Video created by Carlos Lelkes-Rarugal, using Tawny Owl hoots recorded by Richard Margoschis in Gloucestershire, England (BL ref 09647). British Library digitised image from page 272 of "The Works of Alfred Tennyson, etc" 

 

Curious about what some of these owls sound like? Our Assistant Web Archivist, Carlos Lelkes-Rarugal, designed some short animated videos using recordings from the British Library Sound Archive and images from the British Library Flickr account. You can view these on the UK Web Archive, Digital Scholarship and the Sound Archive’s Wildlife Department Twitter accounts.

The title for this blog post was inspired by the sound made by the Tawny Owl. This and other sounds can be experienced in the Sound Archive at the British Library, which has over 2,500 recordings of owls from all over the world. You can hear a selection of these recordings on the British Library Sound & Vision blog.

The Digital Scholarship team have also put together a useful album of digitised illustrations of owls on the British Library Flickr account. Their latest blog post encourages you to use these images for various creative projects.

 

Get involved with preserving owls online with the UK Web Archive

The UK Web Archive aims to archive, preserve and give access to the UK web space. We endeavour to include important aspects of British culture and events that shape society. The biodiversity of the UK is an important aspect of our collective national culture and is represented in several British Library collections including the UK Web Archive.

We can’t, however, curate the whole of the UK Web on our own. We need your help to ensure that information, discussion and creative output on this subject are preserved for future generations.

Anyone can suggest UK websites to be included in the UK Web Archive by filling in our nominations form: https://www.webarchive.org.uk/en/ukwa/nominate

We already have an Online Enthusiast Communities in the UK curated collection that features some owl related websites in the Animal related hobbies subsection. Browse through what we have so far and please nominate more content!

 

31 July 2020

LGBTQ+ Lives Online

 
 A white banner with the LGBTQ+ flag colours painted on with the text - love is love
Photo by 42 North from Pexels

By Steven Dryden, British Library LGBTQ+ Staff Network & Ash Green, CILIP LGBTQ+ Network

 

When the internet first rose to prominence in the late 1990s, one of the primary modes of communicating with others was through internet chat rooms and forums. Suddenly, isolated people all over the world with a personal computer and internet access could communicate with others ‘like them’.

By using the term ‘like them’ we acknowledge that there is some form of social oppression which makes a person, perhaps alone in a rural community, feel unable to be themselves - to know anything about themselves at all. It is perhaps partly because of the need to feel more connected with other people ‘like them’ that LGBTQ+ people adapted to online community-building quickly. Now, as we have been living online for over 25 years, it seems pertinent to consider what traces of early digital lives survive, and how we can begin to make sense of them. What survives of digital campaigns to equalise the age of consent for all sexualities in the UK (2001), gain recognition and protections for members of the trans community (Gender Recognition Act 2004) or the battle for marriage equality in the UK (England and Wales 2013, Scotland 2014, Northern Ireland 2019)? As well as historical content such as this, we must also ensure we are ready and able to curate current and future online discussions and websites surrounding LGBTQ+ lives.

Part of this process has already begun. Through the UK Web Archive, the British Library, along with the other five UK Legal Deposit Libraries, has been able to run an annual domain crawl of the UK web since April 2013, after the implementation of the Non-Print Legal Deposit Regulations. Prior to this, websites had been archived on a permissions basis since January 2005. Through the Shine interface you can search the JISC UK Web Domain Dataset (1996-2013), which holds all the .uk websites archived by the Internet Archive from 1996 to April 2013. As a next step, the British Library and Chartered Institute of Library and Information Professionals (CILIP) LGBTQ+ Network are pleased to work collaboratively and develop LGBTQ+ Lives Online. This project will tag and subject categorise relevant websites in the UK Web Archive, and expand the scope of websites we collect for future generations. We look forward to sharing with you over the coming months the work that is being undertaken and how you can contribute.

CILIP LGBTQ+ Network members are pleased to be working collaboratively with the British Library and the UK Web Archive on this project, and recognise the historical value and importance of developing the LGBTQ+ Lives Online web archive.

The aim of the UK Web Archive is to collect content published on the UK web that reflects all aspects of life in the UK. This includes important aspects of British culture and events that shape society. The LGBTQ+ Lives Online collection reflects the important role this community plays in British society. The UK Web Archive is delighted to collaborate with the British Library LGBTQ+ Staff Network and the CILIP LGBTQ+ Network to build on the existing LGBTQ+ collection. Although there is a dedicated collection about the LGBTQ+ community, many of the websites tagged in this collection also intersect with other collections in the archive such as our various sports collections, Political Action and Communication and Oral History in the UK.

 

Get Involved:

CILIP LGBTQ+ Network, the British Library and the UK Web Archive welcome nominations for UK websites which should be included in the LGBTQ+ Lives Online collection.

Nominations can be made via this form: https://www.webarchive.org.uk/en/ukwa/nominate

 

Keep an eye on the CILIP LGBTQ+ Network Twitter as well as the UK Web Archive blog and Twitter account for more updates on the LGBTQ+ Lives Online collection.

 

29 July 2020

15 Years of UKWA - Looking back at our first collections

By Jason Webber, Web Archive Engagement Manager, The British Library

 

This blog follows on from ‘15 Years of the UK Web Archive - The Early Years’.

2020 marks fifteen years since the UK Web Archive (UKWA) started archiving UK published websites. In this blog I’ll be looking at the first curated collections that were made and some of the differences in web archiving between then and now.

In 2005, when the British Library (as part of the UK Web Archive Consortium (UKWAC)) started collecting websites, the techniques and procedures were still being pioneered. It was identified early on that grouping captured websites into collections would be useful for future researchers. Read about a few of our first collections.

 

Indian Ocean Tsunami 

On Boxing Day 2004, a huge earthquake and subsequent tsunami caused severe destruction and loss of life in many areas around the Indian Ocean. Almost immediately afterwards a huge international relief effort was underway that included several UK based efforts. This catastrophic event happened just at the point that UKWAC started archiving websites, and curators quickly decided that this deserved to be reflected in the archive. Selection and archiving took place between January and March 2005. It resulted in a small collection of websites representing news articles, charities and the response from travel companies.

This first collection demonstrated the ability of web archives to collect digital material around key events as they happened. Indian Ocean Tsunami collection

 

Indian Ocean

 

UK General Election 2005
In addition to ‘rapid response’ events, UKWA aims to collect important national events such as elections. 2005 was a period before fixed-term elections and the curation team had only a matter of weeks to organise a plan between the government calling the election and it taking place. The way that candidates promoted themselves in 2005 was different from how they do so now. Only some had their own websites, Facebook was not yet widespread and Twitter didn’t yet exist. There is a fascinating contrast between the 2005 UK General Election collection and the most recent one from 2019, both in number (148 v 2,234) and in the range and breadth of the collection.

 

View of Westminster Bridge and the Palace of Westminster from the opposite side of the River Thames

 

Blogs
We all now know what a blog is, right? In 2005 though, it was a relatively new way for people to self publish on the web. It was so new that when the collection was first made we felt the need to explain what one was and that it was a shortening of ‘web log’.

Since then, of course, blogs have been a widespread form of self expression and creativity. They cover every imaginable subject from politics to satire, local history to personal history and many more. This collection contains over 1000 blogs, many of which are no longer available. See what you can find in the Blogs collection.

 

Image of word tiles spelling the word blog

 

Selective curation

Since 2013, thanks to the Non-Print Legal Deposit Regulations, the UK Web Archive is able to archive any UK published website. Prior to 2013, however, curators had to obtain permission from the website owner before any archiving could take place. UKWA has always tried to collect a representative sample of the UK web which can include a very wide range of topics and opinions. We have always tried to be clear that selection is not endorsement, either of views or of quality. Each item in the collection is rich in its own way.

 

100+ curated collections and counting

Since these first collections in 2005, the number of collections has grown to over 100. See all of our curated collections here.

We have continued to respond to important events with ‘rapid response’ collections such as the Zika Virus outbreak of 2016-2017 and the death of Margaret Thatcher in 2013. We have also continued to collect political events such as general elections, Scottish and Welsh Parliamentary elections and several key referendums such as the EU referendum. We also try to represent all parts of the UK from the FTSE100 to the lives and hobbies of the nation in ‘Online enthusiasts’.

 

24 June 2020

Our new Science web archive collection

 
By Philip Eagle, Subject Librarian - Science, Technology and Medicine at The British Library
 
 
Air pump CC0
A Philosopher Shewing an Experiment on the Air Pump, 1769 by Valentine Green

 

Introduction

We have just activated our new web archive collection on science in the UK. One of the British Library's institutional objectives is to increase our profile and level of service to the science community. In pursuit of this aim we are curating a web archive collection in collaboration with the UK legal deposit libraries. We have some collections already on science related subjects such as the late Stephen Hawking and science at Cambridge University, but not science as a whole.

 

Collection scope

We have interpreted "science" widely to include engineering and communications, but not IT, as that already has a collection. Our collection is arranged according to the standard disciplines such as biology, chemistry, engineering, earth sciences and physics, and then subdivided according to their common divisions, based on the treatment of science in the Universal Decimal Classification.

The collection has a wide range of types of site. We have tried to be fairly exhaustive on active UK science-related blogs, learned societies, charities, pressure groups, and museums. Because of the sheer number of university departments in the UK, we have not been able to cover them all. Instead we have selected the departments that did best in the 2014 Research Excellence Framework, and then taken a random sample to make sure that our collection properly reflects the whole world of academic science in the UK. We are also adding science-related Twitter accounts. Social media is generally difficult to archive due to its proprietary nature, but Twitter content is more openly accessible so we can archive it more easily.

 

Access

Under the Non-Print Legal Deposit Regulations 2013 we can archive UK websites, but we are only able to make them available to people outside the Legal Deposit Libraries Reading Rooms if the website owner has given permission. Some of the sites in the collection have already had permission granted, such as the Hunterian Society, Dame Athene Donald’s blog, and the Royal College of Anaesthetists. Some others who have not given permission include Science Sparks, the Wellcome Collection, and the British Pregnancy Advisory Service. The Web Archive page will tell you whether any archived site is only viewable from a library. Anything with no statement can be viewed on the public web.


Get involved

As ever, if you have a site to nominate that has been left out, you can tell us by filling in our public nomination form: https://www.webarchive.org.uk/ukwa/info/nominate

23 June 2020

WARCnet and the UK Web Archive

By Jason Webber, Web Archiving Engagement Manager

 

We at the UK Web Archive (UKWA) have recently taken part in a new initiative called WARCnet led by the University of Aarhus in Denmark (and funded by Independent Research Fund Denmark).

“The aim of the WARCnet network is to promote high-quality national and transnational research that will help us to understand the history of (trans)national web domains and of transnational events on the web, drawing on the increasingly important digital cultural heritage held in national web archives.”

 

WARCnet logo

 

The majority of participants are researchers currently using web archives as part of their studies, many with extensive experience and others new to the field. This makes it an exciting project to be part of, as it is an excellent way for content holders such as UKWA to work closely with a group of researchers and try to understand their needs and challenges. The project had a kick-off meeting in May 2020 that was originally intended to be in person but took place virtually. All the speakers pre-recorded their talks, which means that these are now all available (including one by myself). I’d particularly recommend viewing the two keynote speakers, Matthew S. Weber and Ian Milligan.

 

Title slide for Jason Webber's WARCnet presentation

 

Working Groups
It is intended for any outcomes from WARCnet to be driven by the participants themselves and to this end four working groups have been formed:

 

  • Working Group 1 - Comparing entire web domains
  • Working Group 2 - Analysing transnational events
  • Working Group 3 - Digital research methods and tools
  • Working Group 4 - Research data management across borders

 

The UKWA team is involved with each of the first three working groups, all of which have met in recent weeks to see how we can take this project forward. You can read more about each group here.

There are at least three more small conferences planned (currently as in person), one later this year in Luxembourg and two next year in London and Aarhus.

Look out for updates on our involvement with this initiative on this blog and through the Twitter accounts @UKWebArchive and @WARC_net.

08 June 2020

Documenting the Olympics & Paralympics

 
 
Olympic Stamps
Stamps issued by Greece in 1896, the Universal Postal Union Collection, Philatelic Collections, The British Library.

 

Join our panel discussion to discover more about researchers' experiences when navigating archives, as well as the collection policies related to Olympics/Paralympics of GLAM organisations. This event is a collaboration between the British Society of Sports History (BSSH) and the British Library Web Archive team.

 

Register here to receive the joining details:

https://forms.gle/Tjzikxgjvr3FofSr8 

Date:           19 June 2020

Time:          3-4:30pm (BST) / 10-11:30am (EST)

Location:    Zoom

Twitter hashtag: #ResearchingtheGames

 

Presentations

Heather Dichter, De Montfort University - Finding Olympic history in non-sport archives

Laura Alexandra Brown, Northumbria University - The heritage of the Games: Interpreting urban change in Olympic host cities

Robert McNicol, Librarian, Wimbledon Lawn Tennis Museum - Researching the Olympics/Paralympics at Wimbledon

Helena Byrne, Curator of Web Archives, British Library - Preserving the Olympics/Paralympics online

 

What to expect

A broad mix of physical, digitised and born digital resources will be covered in the presentations. The Curator of Web Archives, Helena Byrne, will be discussing the UK Web Archive collections related to the Olympics/Paralympics as well as the collaboration with the International Internet Preservation Consortium (IIPC).

The year 2020 was originally due to be an Olympic/Paralympic year before the outbreak of the coronavirus pandemic. It is also a significant milestone for the UK Web Archive and the IIPC. It marks 15 years since the first UK Web Archive collections were published and also 10 years since the IIPC first started archiving the Olympics.

 

UKWA Sports
https://www.webarchive.org.uk/en/ukwa/collection

 

The UK Web Archive and sports

The UK Web Archive has been archiving sports related websites since it was established in 2005. However, it wasn’t until 2017 that dedicated sports collections were established. There are three broad collection groups: Sports Collection, Sports: Football and Sports: International Events. The subsections of Sports: International Events include two summer and two winter Olympic/Paralympic collections from 2010, 2012, 2014 and 2016. The largest of these collections is the Olympic & Paralympic Games 2012 collection as the Games were hosted in the UK.

 

Access and reuse

Under the Non-Print Legal Deposit Regulations 2013 (NPLD) access to archived content is restricted to a UK legal deposit library reading room. However, if we have permission from the website owner, we can make the archived version of their content open access along with government publications under the Open Government Licence. This is why if you browse through the collections on our website, most of the links to archived content will direct you to one of the UK legal deposit libraries for access but some of the content you can view from your personal device.

 

IIPC and the Olympic/Paralympics

The UK Web Archive is made up of the six UK legal deposit libraries. Two of those libraries, the British Library and the National Library of Scotland, are also members of the International Internet Preservation Consortium (IIPC), which was founded in 2003. In 2010 the IIPC started its first collaborative collection on the Winter Olympics 2010 and has covered every Olympic/Paralympic Games since. Since the formation of the IIPC Content Development Group (CDG) the collections have started to include a broader range of subjects on and off the playing field.

 

Get Involved

The UK Web Archive aims to archive, preserve and give access to the UK web space.

If you see content that should be included in one of our sports collections then please fill in our online nomination form.

29 May 2020

Using Webrecorder to archive UK political party leaders' social media after the UK General Election 2019

This blog post is by Nicola Bingham, Helena Byrne, Carlos Lelkes-Rarugal and Giulia Carla Rossi

Introduction to Webrecorder

The UK Web Archive aims to capture the whole of the UK web space at least once a year, and targeted websites at more frequent intervals. We conduct this activity under the auspices of the Legal Deposit Regulations 2013 which enable us to capture, preserve and make accessible the UK Web for the benefit of researchers now and in the future.

Along with many cultural and heritage institutions that perform at-scale web archiving, we use Heritrix 3, the state of the art crawler developed by the Internet Archive and maintained and improved by an international community of web archiving technologists.

Heritrix copes very well with large scale, bulk crawling but is not optimised for high fidelity crawling of dynamic content, and in particular does not archive social media content very well.

Researchers are increasingly turning their attention to social media as a significant witness to our times, therefore we have a requirement to capture this content, in certain circumstances and in line with our collection development policy. Usually this will be around public events such as General Elections where much of the campaigning over recent years has been played out online and increasingly on social media in particular. 

For this reason we have looked at alternative web archiving tools such as Webrecorder to complement our existing toolset. 

Webrecorder was developed by Ilya Kreymer under the auspices of Rhizome (a non-profit organisation based in New York which commissions, presents and preserves digital art), as part of its digital preservation program. It is available as a browser based version, which offers free accounts with up to 5GB of storage, and as a Desktop App.

Webrecorder was already well known to us at the UK Web Archive although we had not used it until recently. It is a web archiving service which creates an interactive copy of web pages that the user explores in their browser, including content revealed by interactions such as playing video and audio, scrolling, clicking buttons etc. This is a much more sophisticated method of acquisition than that used by Heritrix, which essentially only follows HTML links and doesn’t handle dynamic content very well.
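To illustrate why following HTML links alone misses dynamic content, here is a minimal sketch in Python. It is not Heritrix or Webrecorder code. The `LinkExtractor` class, the `extract_links` helper and the sample page (including its `/api/posts` address) are all hypothetical, and the standard library's `html.parser` stands in for a link-following crawler.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in static HTML (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list:
    """Return the links a simple link-following crawler could discover."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: one ordinary link, plus a timeline that is only filled
# in when a browser actually runs the script.
SAMPLE_PAGE = """
<html><body>
<a href="/about">About</a>
<div id="timeline"></div>
<script>
  fetch("/api/posts?page=2").then(r => r.json()).then(render);
</script>
</body></html>
"""

print(extract_links(SAMPLE_PAGE))
```

Only the ordinary link is found. The posts requested by the script never appear in the static HTML, which is the kind of content a browser based capture tool records as the page is explored.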


What we planned to do

The UK General Election Campaign ran from the 6th of November 2019 when Parliament was dissolved, until polling day on the 12th of December 2019. On the 13th of December 2019 the UK Web Archive team, based at the British Library, attempted to archive various social media accounts of the main political party leaders. Seventeen political leaders from the four home nations were identified and a selection of three social media platforms were targeted: Twitter, Facebook and Instagram. Not all leaders have accounts on all three platforms, but in total forty-four social media accounts were archived. These accounts are identified in the table below by an X.

List of UK political party leaders' social media accounts archived
Image credit: Carlos Lelkes-Rarugal

 

 

How we did it

On the 13th of December 2019 we ran the Webrecorder Desktop App across twelve office PCs. Many were running the Webrecorder Autopilot function over the accounts, but we had mixed success, in that not all accounts captured the same amount of data. As the Autopilot functionality didn’t work well on all accounts, a combination of automated and manual capture processes was used where necessary. It took the team a lot longer than expected to archive the accounts, therefore some were archived on a range of dates the following week.

 

Large political parties’ vs smaller parties’ social media accounts

The leaders of the two largest political parties, Jeremy Corbyn and Boris Johnson, have many more social media followers than the other home nations party leaders. This meant that it was more difficult to get a comprehensive capture of Corbyn’s and Johnson’s Twitter accounts than, for example, Arlene Foster’s. The more popular Twitter accounts took many hours to crawl; Corbyn’s took almost ten hours to archive thirteen days’ worth of Tweets (which only took us up to 1st December).

 

Technical Issues

We experienced several technical issues with crawling, mainly concerning IP addresses, the app crashing, and Autopilot working on some computers and not others. It was hard to get the app restarted after it crashed, so some time was lost when this happened. Different computers with the same specs ran differently. The Autopilot captures for Jeremy Corbyn’s and Boris Johnson’s Twitter accounts were started at the same time but Corbyn’s ran uninterrupted while Johnson’s crashed when it reached 475 MB. Although Corbyn’s account was crawled for nearly ten hours it only collected 93 MB of data. In contrast, Nigel Farage’s Twitter page was crawled for over four hours and produced 506 MB. It is important to check the size of crawled data, as the number of hours the Webrecorder Desktop App runs on Autopilot does not necessarily translate into a high fidelity crawl.
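A rough way to compare crawls is to look at data collected per hour rather than time spent. The sketch below uses the two sets of figures quoted above and treats "nearly ten hours" and "over four hours" as exactly 10 and 4 hours, which is an approximation. The function name `crawl_rate_mb_per_hour` is made up for this example.

```python
def crawl_rate_mb_per_hour(size_mb: float, hours: float) -> float:
    """Return the average amount of data captured per hour of crawling."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return size_mb / hours


# (size in MB, approximate hours crawled), rounded from the figures in the
# text; the exact durations were not recorded in this post.
CRAWLS = {
    "Jeremy Corbyn": (93, 10.0),
    "Nigel Farage": (506, 4.0),
}

rates = {
    name: crawl_rate_mb_per_hour(size_mb, hours)
    for name, (size_mb, hours) in CRAWLS.items()
}

for name, rate in rates.items():
    print(f"{name}: about {rate:.1f} MB per hour")
```

On these approximate figures the shorter crawl gathered data more than ten times faster than the longer one, which is why crawl duration alone is a poor guide to how much was captured.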

 

Added complications when using multiple devices with the same user profile:

Complications arose mainly from the auditing and collating of WARC files; performing QA and keeping track of which jobs were successful and those that were not. 

Initially, all participants in this project had planned to use their own work PC or work laptop and a local desktop installation of Webrecorder. However, an hour or so into the process (early in the day), it became apparent that there would not be enough time to archive all of the social media accounts within our time frame, given the volume of social media accounts and the unanticipated time it would take to archive each one. For example, it took one instance of a desktop Webrecorder application almost ten hours to archive Jeremy Corbyn’s Twitter account (only able to capture Tweets up to a month prior to the day of archiving).

It was then decided that we could potentially, and experimentally, run multiple parallel Webrecorder applications across a number of office desktop PCs; PCs that were free and available for us to use. This was possible because of the IT Architecture in place, allowing users to log into any office machine with the correct credentials and making their personal desktop load up along with all their files and user settings, regardless of the PC they log into. 

The British Library’s IT system, which incorporates a lot of the Windows ecosystem, gives each user their own dedicated central work directory where they are given a virtual hard drive and  their own storage space for all their documents and any other work related files. This allowed one user to be logged into several office PCs at the same time and therefore run a separate desktop Webrecorder application running on each machine. This was indeed very helpful as it allowed each machine to focus on one particular social media account, which in many cases took hours to archive. 

Having multiple Webrecorder jobs greatly increased our capacity to archive by removing the previous bottleneck of one Webrecorder job per user. Instead, this was increased to several Webrecorder jobs per user.

Work flow of gathering WARC files from Webrecorder
Image credit: Carlos Lelkes-Rarugal

 

 

Having multiple Webrecorder jobs added complications down the line, not necessarily impacting the archiving process, but rather complicating the auditing and collating of WARC files. When a user had several Webrecorder jobs running concurrently, each job would still be downloading to the same user work directory (the user’s virtual hard drive). So if a user had many parallel jobs running, this would create multiple WARC files in the same folder (but with different names, so no clashes), with WARC files being produced by the different desktop PCs that the user had logged in to. This was quite an elaborate setup because once a job had completed, the entire contents of the Webrecorder folder (where the WARCs were stored) was copied to a USB so that an initial Quality Assurance (QA) could be performed on the completed job on a more capable laptop. The difficulty was in finding the WARC file that corresponded to the completed job, which was somewhat convoluted as there would have been multiple WARC files with this type of file-naming convention:

 “rec-20191213100335021576-DESKTOP-AOCGH38-7B5SEXKS.warc.gz”. 

As you can imagine, taking a copy of Webrecorder’s folder contents not only has the completed job, but also the instances of other WARC files from other incomplete jobs. Coupled with multiple jobs per PC, and multiple PCs per user; keeping track of what had completed and which WARCs were either corrupted or not up to standard, was quite demanding. 
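One way to make this kind of auditing easier is to group the WARC files by the machine that produced them. The sketch below assumes the file name breaks down as `rec-`, a 20-digit timestamp (date, time and microseconds), the PC's host name and a random suffix. That reading is inferred from the single example above and is an assumption, not documented Webrecorder behaviour. The names `parse_warc_name` and `group_by_host` are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout: rec-<YYYYMMDDhhmmss><microseconds>-<host name>-<suffix>.warc.gz
WARC_NAME = re.compile(
    r"^rec-(?P<stamp>\d{14})(?P<micro>\d{6})-(?P<host>.+)-(?P<suffix>[A-Z0-9]+)\.warc\.gz$"
)


def parse_warc_name(name: str):
    """Split a WARC file name into start time, host and suffix, or return None."""
    match = WARC_NAME.match(name)
    if match is None:
        return None
    started = datetime.strptime(match["stamp"], "%Y%m%d%H%M%S")
    started = started.replace(microsecond=int(match["micro"]))
    return {"started": started, "host": match["host"], "suffix": match["suffix"]}


def group_by_host(names):
    """Group WARC file names by the PC that appears in the file name."""
    groups = defaultdict(list)
    for name in names:
        info = parse_warc_name(name)
        if info is not None:
            groups[info["host"]].append(name)
    # Names begin with the timestamp, so sorting gives chronological order.
    return {host: sorted(files) for host, files in groups.items()}


example = "rec-20191213100335021576-DESKTOP-AOCGH38-7B5SEXKS.warc.gz"
print(parse_warc_name(example))
```

Grouping by host and sorting by start time would narrow down which file belongs to which job, although a log of which account was started on which PC would still be needed to complete the match.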

 


Review of the data collected 

File size of data collected from UK political party leaders' social media accounts
Image credit: Carlos Lelkes-Rarugal

 

How to access this data

The archived social media accounts can be accessed through the UK General Election 2019 collection in a UK Legal Deposit Library Reading Room. The UK Legal Deposit Libraries are the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Library and Trinity College Dublin Library.  

The 2019 collection is part of a time series of UK General Elections dating from 2005. They can be accessed over the Internet on the Topics and Themes page of the UK Web Archive website. All the party leaders' social media accounts are tagged into the subsection UK Party Leaders Social Media Accounts (access to individual websites depends on whether we have an additional permission to allow ‘open’ access). More information about what is included in the UK General Election 2019 collection is available through the UK Web Archive blog.

 

Conclusion


Overall, undertaking this experiment was an interesting experience for our small team of British Library Web Archive Curators. Many valuable lessons were learnt on how best to utilise Webrecorder in our current practice. The major takeaway was that it was a lot more time-consuming than we expected. Instead of taking up one working day, it took nearly a whole week to archive our targeted social media accounts with Webrecorder. Our usual practice is to archive social media accounts with the Heritrix crawler, which works reasonably well with Twitter but is less suited to capturing other platforms. For a long time, we were unable to capture any Facebook content with Heritrix, mainly due to the platform’s publishing model. However, the way the platform is published has changed recently, allowing us limited success. Archiving social media will always remain challenging for the UK Web Archive, for myriad technical, ethical and legal reasons. The sheer scale of the UK’s social media output is too large for us to capture adequately (and indeed, this may not even be desirable) and certainly too large a task for us to tackle with manual, high-fidelity tools such as Webrecorder. However, our recent experience during the 2019 UK General Election has convinced us that using Webrecorder to capture significant events is a worthwhile exercise, as long as we target selected, in-scope accounts on a case-by-case basis.

 

27 May 2020

Web Archiving the UK General Election 2019

By Jennie Grimshaw, Curator for Official Publications, The British Library

The 2019 general election was a turning point in British political history. It saw the resurgence of the Conservative vote from 8.8% at the May 2019 European Parliament election to 43.6% less than seven months later. It saw the collapse of the “Red Wall” in the North and the Midlands as seats such as Sedgefield, Labour since 1935, and Workington, Labour 1918-1976 and 1979-2019, turned Conservative. It saw the breaking of the Parliamentary deadlock over Brexit with the return to power of pro-Leave Conservative Prime Minister Boris Johnson with a majority of 80, which then enabled him to “Get Brexit Done”.

Polling station sign

To help researchers trace how use of the Internet for political campaigning and communication has evolved over time, the British Library, the National Libraries of Scotland and Wales, and the Bodleian Library have collaborated to create a web archive collection for all UK general elections since 2005, using more or less the same categories – candidates’ web presence, national and local political party websites, online news and commentary, interest group manifestos, and comment and analysis by think tanks. This collection is the fifth in the time series and is complemented by the EU Referendum and Brexit collections. In 2019 Northern Ireland political party and candidate sites were selected by the Public Record Office of Northern Ireland (PRONI).

Sadly, our ability to harvest social media sites is very limited due to technical and legal issues. We can gather Twitter feeds, but not Facebook pages as the site deliberately blocks the crawls and it has proved impossible to negotiate access.  Due to the limitations of our crawl software, we cannot always gather dynamic content, Wix-based sites, documents stored in the cloud or videos. This may explain why some worthy candidate sites you might expect to see have not been archived.

However, the UK Web Archive General Election 2019 collection does offer researchers a respectable total of 2237 sites, including:

  • UK-based news and comment sites, such as Politics Home, Political UK, the Commentator, Unherd, Reaction, CAPX and CAPX 2019 Election Archive, as well as Twitter feeds of selected political journalists. We include satirical sites such as the spoof Conservative Manifesto created by a company called Concerned Citizens Ltd and the Daily Reckless, which offers satirical songs by Tommy Mackay.
  • Candidates’ campaign websites and Twitter feeds and local constituency party sites. The collection includes all candidates standing in Scottish and Welsh constituencies. Due to staff resource limitations, we can only capture a sample of the websites and Twitter feeds of candidates standing in English constituencies. We cover three inner London and three outer London boroughs, and one rural and one urban constituency from each of the English regions. We have used the same English constituencies for every election since 2005.
  • Websites and Twitter feeds of 100 national political parties, ranging from fringe groups such as the Animal Welfare Party, Arthur Horner of the Welsh Communist Party, Britain First and the anarchist Class War Party to the major national parties (Conservative, Labour, Liberal Democrats, and Greens) and political parties in the devolved administrations (Plaid Cymru, Scottish Nationalists and the Northern Ireland parties). We also capture major political party blogs such as Conservative Home, Conservative Woman, Labour List and Liberal Democrat Voice.
  • Social media sites of the main party leaders, captured in depth using Webrecorder. We can use this software only very sparingly as operating it is very resource- and time-intensive, but it can capture sites our regular crawler cannot reach.
  • Interest groups seeking to influence party policies through engagement with candidates and publication of manifestos and lists of “asks”. We have selected about 340 sites, ranging from the manifestos of campaigning charities such as Age UK and trade associations such as Airlines UK and the Association of the British Pharmaceutical Industry to unions, religious groups such as the Evangelical Alliance and Muslim Council of Britain, and pressure groups such as the Campaign to Protect Rural England, Actionaid and Anti-Slavery International. Health charities and environmental groups are particularly prominent at this election and professional associations such as the medical royal colleges are also well represented. The voices of disabled people and minority groups are also heard through manifestos and comment from Leonard Cheshire Disability, SCOPE, Disability Rights UK, MENCAP, RNIB, Operation Black Vote and the Muslim Public Affairs Committee’s Operation Muslim Vote 2019.
  • Think tanks and academic research centres providing in-depth comment and analysis. We have sought to include both right- and left-wing views, and comment from political, legal and economic viewpoints. Targets include the Centre for Labour and Social Studies (CLASS), the London School of Economics British Politics and Policy blog, the Centre for Constitutional Change, Demos’ Manifesto for consensus politics, the Democratic Audit blog, the British Future think tank, Full Fact, the Institute for Fiscal Studies, the King’s Fund and the Institute for Government.

We hope that this collection will preserve the voices and illustrate the concerns and priorities of a wide spectrum of UK society, and help to show how political parties and candidates engaged and responded at this pivotal moment in UK history.

General Election 2019 collection. Note that you can view what is in this collection but many of the actual websites can only be viewed in the reading room of a UK Legal Deposit Library.