THE BRITISH LIBRARY

UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

03 November 2017

Guy Fawkes, Bonfire or Fireworks Night?

What do you call the 5th of November? As a child of the 70s and 80s it was 'Guy Fawkes' night and my friends and I might make a 'guy' to throw on the bonfire. It is interesting to see through an analysis of the UK Web Archive SHINE service that the popularity of the term 'Guy Fawkes' was overtaken by 'Bonfire night' in 2009. I've included 'Fireworks night' too for comparison.

Bonfire-night

Is this part of a trend away from the original anti-Catholic remembrance and celebration to a more neutral event?

Examine this (and other) trends on our SHINE service.

By Jason Webber, Web Archive Engagement Manager, The British Library

24 October 2017

Web Archiving Tools for Legal Deposit

By Andy Jackson, Web Archive Technical Lead, The British Library - re-blogged from anjackson.net

Before I revisit the ideas explored in the first post in the blog series I need to go back to the start of this story…

Between 2003 and 2013 – before the Non-Print Legal Deposit regulations came into force – the UK Web Archive could only archive websites by explicit permission. During this time, the Web Curator Tool (WCT) was used to manage almost the entire life-cycle of the material in the archive. Initial processing of nominations was done via a separate Selection & Permission Tool (SPT), and the final playback was via a separate instance of Wayback, but WCT drove the rest of the process.

Of course, selective archiving is valuable in its own right, but this was also seen as a way of building up the experience and expertise required to implement full domain crawling under Legal Deposit. However, WCT was not deemed to be a good match for a domain crawl. The old version of Heritrix embedded inside WCT was not considered very scalable, was not expected to be supported for much longer, and was difficult to re-use or replace because of the way it was baked inside WCT.1

The chosen solution was to use Heritrix 3 to perform the domain crawl separately from the selective harvesting process. While this was rather different to Heritrix 1, requiring incompatible methods of set-up and configuration, it scaled fairly effectively, allowing us to perform a full domain crawl on a single server2.

This was the proposed arrangement when I joined the UK Web Archive team, and this was retained through the onset of the Non-Print Legal Deposit regulations. The domain crawls and the WCT crawls continued side by side, but were treated as separate collections. It would be possible to move between them by following links in Wayback, but no more.

This is not necessarily a bad idea, but it seemed to be a terrible shame, largely because it made it very difficult to effectively re-use material that had been collected as part of the domain crawl. For example, what if we found we’d missed an important website that should have been in one of our high-profile collections, but because we didn’t know about it had only been captured under the domain crawl? Well, we’d want to go and add those old instances to that collection, of course.

Similarly, what if we wanted to merge material collected using a range of different web archiving tools or services into our main collections? For example, for some difficult sites we may have to drive the archiving process manually. We need to be able to properly integrate that content into our systems and present them as part of a coherent whole.

But WCT makes these kinds of things really hard.

If you look at the overall architecture, the Web Curator Tool enforces what is essentially (despite the odd loop or dead-end) a linear workflow (figure taken from here). First you sort out the permissions, then you define your Target and its metadata, then you crawl it (and maybe re-crawl it for QA), then you store it, then you make it available. In that order.

WCT-workflow

But what if we’ve already crawled it? Or collected it some other way? What if we want to add metadata to existing Targets? What if we want to store something but not make it available? What if we want to make domain crawl material available even if we haven’t QA’d it?
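To make the problem concrete, here is a minimal sketch of a strictly linear workflow. It is not WCT's actual code: the `Stage` names, the `LinearWorkflow` class and the `try_add_existing_crawl` helper are all hypothetical, chosen only to mirror the five steps listed above. It shows how material that has already been crawled some other way has no legitimate entry point.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The five steps described above, in their fixed order (names are illustrative)."""
    PERMISSION = 1
    TARGET_DEFINED = 2
    CRAWLED = 3
    STORED = 4
    AVAILABLE = 5


class LinearWorkflow:
    """Hypothetical model: every item must pass through every stage, in order."""

    def __init__(self) -> None:
        self.stage = None  # nothing has happened to this item yet

    def advance(self, next_stage: Stage) -> None:
        if self.stage == Stage.AVAILABLE:
            raise ValueError("workflow already complete")
        # The only permitted move is to the stage immediately after the current one.
        expected = Stage.PERMISSION if self.stage is None else Stage(self.stage + 1)
        if next_stage != expected:
            raise ValueError(f"expected {expected.name}, got {next_stage.name}")
        self.stage = next_stage


def try_add_existing_crawl() -> str:
    """Try to register content that was already captured elsewhere (e.g. a domain crawl)."""
    workflow = LinearWorkflow()
    try:
        workflow.advance(Stage.STORED)  # skip straight to storage
    except ValueError as err:
        return str(err)
    return "accepted"
```

In this toy model, `try_add_existing_crawl()` is rejected because the item never went through the permission, target and crawl steps, which is the shape of the difficulty the questions above describe.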

Looking at WCT, the components we needed were there, but tightly integrated in one monolithic application and baked into the expected workflow. I could not see how to take it apart and rebuild it in a way that would make sense and enable us to do what we needed. Furthermore, we had already built up a rather complex arrangement of additional components around WCT (this includes applications like SPT but also a rather messy nest of database triggers, cronjobs and scripts). It therefore made some sense to revisit our architecture as a whole.

So, I made the decision to make a fresh start. Instead of the WCT and SPT, we would develop a new, more modular archiving architecture built around the concept of annotations…

  1. Although we have moved away from WCT it is still under active development thanks to the National Library of New Zealand, including Heritrix3 integration!
  2. Not without some stability and robustness problems. I’ll return to this point in a later post.

25 September 2017

Collecting Webcomics in the UK Web Archive

By Jen Aggleton, PhD candidate in Education at the University of Cambridge

As part of my PhD placement at the British Library, I was asked to establish a special collection of webcomics within the UK Web Archive. In order to do so, it was necessary to outline the scope of the collection, and therefore attempt to define what exactly is and is not a digital comic. As anyone with a background in comics will tell you, comics scholars have been debating what exactly a comic is for decades, and have entirely failed to reach a consensus on the issue. The matter only gets trickier when you add in digital components such as audio and animation.

Under-construction

Due to this lack of consensus, I felt it was important to be very transparent about exactly what criteria have been used to outline the scope of this collection. These criteria have been developed through reference to scholarship on both digital and print comics, as well as my own analysis of numerous digital comics.

The scope of this collection covers items with the following characteristics:

  • The collection item must be published in a digital format
  • The collection item must contain a single panel image or series of interdependent images
  • The collection item must have a semi-guided reading pathway1

In addition, the collection item is likely to contain the following:

  • Visible frames
  • Iconic symbols such as word balloons
  • Hand-written style lettering which may use its visual form to communicate additional meaning

The item must not be:

  • Purely moving image
  • Purely audio

For contested items, where an item meets these categories but still does not seem to be a comic, it will be judged to be a comic if it self-identifies as such (e.g. a digital picturebook may meet all of these criteria, but self-identifies as a picturebook, not a comic).

Where the item is an adaptation of a print born comic, it must be a new expression of the original, not merely a different manifestation, according to FRBR guidelines: www.loc.gov/cds/FRBR.html.

1 Definition of a semi-guided reading pathway: The reader has autonomy over the time they spend reading any particular aspect of the item, and some agency over the order in which they read the item, especially the visual elements. However reading is also guided in the progression through any language elements, and likely to be guided in the order of movement from one image to another, though this pathway may not always be clear. This excludes items that are purely pictures, as well as items which are purely animation.
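The required and excluded characteristics above can be read as a simple checklist. The following is only an illustrative sketch of that reading: the `Item` fields and the `in_scope` function are hypothetical names, the "likely to contain" features are left out because they are not requirements, and the FRBR rule for adaptations of print comics is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical record of a curator's judgements about one candidate item."""
    digital: bool
    has_panel_or_image_series: bool
    semi_guided_pathway: bool
    purely_moving_image: bool = False
    purely_audio: bool = False
    contested: bool = False  # meets the criteria but does not seem to be a comic
    self_identifies_as_comic: bool = True


def in_scope(item: Item) -> bool:
    """Apply the three required characteristics, the two exclusions, then the contested rule."""
    meets_required = (
        item.digital
        and item.has_panel_or_image_series
        and item.semi_guided_pathway
    )
    excluded = item.purely_moving_image or item.purely_audio
    if not meets_required or excluded:
        return False
    if item.contested:
        # Contested items are decided by how the work describes itself.
        return item.self_identifies_as_comic
    return True
```

For example, a digital picturebook that meets every required characteristic but is contested and self-identifies as a picturebook would be out of scope under this sketch, matching the example given above.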

Alongside being clear about what the collection guidelines are, it is also important to give users information on the item acquisition process – how items were identified to be added to the collection. An attempt has been made to be comprehensive: including well known webcomics published in the UK and Ireland by award-winning artists, but also webcomics by creators making comics in their spare time and self-publishing their work. This process has, however, been limited by issues of discoverability and staff time.

Well known webcomics were added to the collection, along with webcomics discovered through internet searches, and those nominated by individuals after calls for nominations were sent out on social media. This process yielded an initial collection of 42 webcomic sites (a coincidental but nonetheless highly pleasing number, as surely comics do indeed contain the answers to the ultimate question of life, the universe, and everything). However, there are many more webcomics published by UK and Ireland based creators out there. If you know of a webcomic that should be added to our collection, please do nominate it at www.webarchive.org.uk/ukwa/info/nominate.

Jen Aggleton, PhD candidate in Education at the University of Cambridge, has recently completed a three month placement at the British Library on the subject of digital comics. For more information about what the placement has entailed, you can read this earlier blog.

16 August 2017

If Websites Could Talk (again)

By Hedley Sutton, Team Leader, Asian & African studies Reference Services

Here we are again, eavesdropping on a conversation among UK domain websites as to which one has the best claim to be recognized as the most extraordinary…

“Happy to start the ball rolling,” said the British Fantasy Society. “Clue in the name, you know.”

“Ditto,” added the Ghost Club.

“Indeed,” came the response. “However … how shall I put this? … don’t you think we need a site that’s a bit more … well, intellectual?” said the National Brain Appeal.

“Couldn’t agree more,” chipped in the Register of Accredited Metallic Phosphide Standards in the United Kingdom.

“Come off it,” chortled the Pork Pie Appreciation Society. “That would rule out lots of sites straightaway. Nothing very intellectual about us!”

“Too right,” muttered London Skeptics in the Pub.

Before things became heated the British Button Society made a suggestion. “Perhaps we could ask the Witchcraft & Human Rights Information Network to cast a spell to find out the strangest site?”

The silence that followed was broken by Campaign Bootcamp. “Come on – look lively, you ‘orrible lot! Hup-two-three, hup-two-three!”

“Sorry,” said the Leg Ulcer Forum. “I can’t, I’ll have to sit down. I’ll just have a quiet chat with the Society of Master Shoe Repairers. Preferably out of earshot of the Society for Old Age Rational Suicide.”

“Let’s not get morbid,” said Dream It Believe It Achieve It helpfully. “It’s all in the mind. You can do it if you really try.”

There was a pause. “What about two sites applying jointly?” suggested the Anglo Nubian Goat Society. “I’m sure we could come to some sort of agreement with the English Goat Breeders Association.”

“Perhaps you could even hook up with the Animal Interfaith Alliance,” mused the World Carrot Museum.

“Boo!” yelled the British Association of Skin Camouflage suddenly. “Did I fool you? I thought I would come disguised as the Chopsticks Club.”

“Be quiet!” yelled the Mouth That Roars even louder. “We must come to a decision, and soon. We’ve wasted enough time as it is.”

The minutes of the meeting show that, almost inevitably, the site that was eventually chosen was … the Brilliant Club.

If there is a UK based website you think we should collect, suggest it here.

09 August 2017

The Proper Serious Work of Preserving Digital Comics

Jen Aggleton is a PhD candidate in Education at the University of Cambridge, and is completing a work placement at the British Library on the subject of digital comics. 

If you are a digital comics creator, publisher, or reader, we would love to hear from you. We’d like to know more about the digital comics that you create, find out links to add to our Web Archive collection, and find examples of comic apps that we could collect. Please email suggestions to Jennifer.Aggleton@BL.uk. For this initial stage of the project, we will be accepting suggestions until the end of August 2017.

I definitely didn’t apply for a three month placement at the British Library just to have an excuse to read comics every day. Having a number of research interests outside of my PhD topic of illustrated novels (including comics and library studies), I am always excited when I find opportunities which allow me to explore these strands a little more. So when I saw that the British Library were looking for PhD placement students to work in the area of 21st century British comics, I jumped at the chance.

Having convinced my supervisor that I wouldn’t just be reading comics all day but would actually be doing proper serious work, I temporarily put aside my PhD and came to London to read lots and lots of digital comics (for the purpose of proper serious work). And that’s when I quickly realised that I was already reading comics every day.

The reason I hadn’t noticed was because I hadn’t specifically picked up a printed comic or gone to a dedicated webcomic site every day (many days, sure, but not every day). I was however reading comics every day on Facebook, slipped in alongside dubiously targeted ads and cat videos. It occurred to me that lots of other people, even those who may not think of themselves as comics readers, were probably doing the same.

Forweb2-slytherinpic
(McGovern, E. My Life As A Background Slytherin, https://www.facebook.com/backgroundslytherin/photos/a.287354904946325.1073741827.287347468280402/338452443169904/?type=3&theater Reproduced with kind permission of Emily McGovern.)

This is because the ways in which we interact with comics have been vastly expanded by digital technology. Comics are now produced and circulated through a number of different platforms, including apps, websites and social media, allowing them to reach further than their traditional audience. These platforms have made digital comics simultaneously both more and less accessible than their print equivalents; many webcomics are available for free online, which means readers no longer have to pay between £8 and £25 for a graphic novel, but does require them to have already paid for a computer/tablet/smartphone and internet connection (or have access to one at their local library, provided their local library wasn’t a victim of austerity measures).

Alongside access to reading comics, access to publishing has also changed. Anyone with access to a computer and internet connection can now publish a comic online. This has opened up comics production to many whose voices may not have often been heard in mainstream print comics, including writers and characters of colour, women, members of the LGBTQ+ community, those with disabilities, and creators who simply cannot give up the stability of full-time employment to commit the time needed to chase their dream of being a comics creator. The result is a vibrant array of digital comics, enormously varying in form and having a significant social and cultural impact.

But digital comics are also far more fragile than their print companions, and this is where the proper serious work part of my placement comes in. Comics apps are frequently removed from app stores as new platform updates come in. Digital files become corrupted, or become obsolete as the technology used to host them is updated and replaced. Websites are taken down, leaving no trace (all those dire warnings that the internet is forever are not exactly true. For more details about the need for digital preservation, see an earlier post to this blog). So in order to make sure that all the fantastic work happening in digital comics now is still available for future generations (which in British Library terms could mean ten years down the line, or five hundred years down the line), we need to find ways to preserve what is being created.

One method of doing this is to establish a dedicated webcomics archive. The British Library already has a UK Web Archive, due to the extension of legal deposit in 2013 to include the collection of non-print items. I am currently working on setting up a special collection of UK webcomics within that archive. This has involved writing collections guidelines covering what will (and won’t) be included in the collection, which had me wrestling with the thorny problem of what exactly a digital comic is (comics scholars will know that nobody can agree on what a print comic is, so you can imagine the fun involved in trying to incorporate digital elements such as audio and video into the mix as well). It has also involved building the collection through web harvesting, tracking down webcomics for inclusion in the collection, and providing metadata (information about the collection item) for cataloguing purposes (this last task may happen to require reading lots of comics).

Alongside this, I am looking into ways that digital comics apps might be preserved, which is very proper serious work indeed. Not only are there many different versions of the same app, depending on what operating system you are using, but many apps are reliant not only on the software of the platform they are running on, but sometimes the hardware as well, with some apps integrating functions such as the camera of a tablet into their design. Simply downloading apps will provide you with lots of digital files that you won’t be able to open in a few years’ time (or possibly even a few months’ time, with the current pace of technology). This is not a problem that can be solved in the duration of a three month placement (or, frankly, given my total lack of technical knowledge, by me at all). What I can do, however, is find people who do have technical knowledge and ask them what they think. Preserving digital comics is a complicated and ongoing process, and it is a great experience to be in at the early stages of exploration.

And you can be involved in this fun experience too! If you are a digital comics creator, publisher, or reader, we would love to hear from you. We’d like to know more about the digital comics that you create, find out links to add to our Web Archive collection, and find examples of comic apps that we could collect. Please email suggestions to Jennifer.Aggleton@BL.uk. For this initial stage of the project, we will be accepting suggestions until the end of August 2017. In that time, we are particularly keen to receive web addresses for UK published webcomics, so that I can continue to build the web archive, and do the proper serious work of reading lots and lots of comics.

07 August 2017

The 2016 EU Referendum Debate

 


LEAFLET

Pictured: Official EU referendum campaign leaflets – Remain (left hand side) and Leave (right hand side). Do you see any similarities?

My name is Alexandra Bulat and I am a PhD student at the School of Slavonic and East European Studies, University College London. My research is on attitudes towards EU migrants in the UK, based on fieldwork in Stratford (London) and Clacton-on-Sea.

The 2016 EU referendum campaign represents an important period when attitudes towards the topical ‘uncontrolled EU migration’ were shaped, expressed, and passionately debated. In this context, websites and social media played a key role in presenting the public with arguments about EU migrants and migration. Can we find the same campaign information today by browsing web resources? Some campaign websites have since been amended, renamed, redesigned, or simply disappeared from the visible online space. Here is where the UK Web Archive can help researchers like me who analyse particular events in history, such as the EU referendum.

In June 2017, I started a three month placement with the British Library Contemporary British Collections. The project is titled Researching the EU Referendum through Web Archive and Leaflet Collections. I use the EU referendum web archive and 177 digitised leaflets and pamphlets (available in the LSE Digital Library ‘Brexit’ collection) to answer the following research question: Who is speaking about EU migration and how?

In the first stage of research, I created a spreadsheet for the leaflets and pamphlets, recording basic information such as title, organisation, and their position in the campaign. I also included all the content about freedom of movement, migrants, refugees, and closely linked topics. Overall, almost two thirds of the materials supported remaining in the EU, with only five categorised as ‘neutral’ and the rest arguing for leaving the EU. Just under half of these materials mentioned immigration, with more ‘Leave’ than ‘Remain’ sources. About a quarter of the items were clearly targeted at a specific region or town/city, the most common being London, Cambridge and various locations in Wales.

The second stage involved using the UK Web Archive to search for the websites and social media (in particular, Twitter handles) that were explicitly mentioned in the printed material, or that I could easily infer from the information available. Only six leaflets did not mention an online presence, and I was unable to find any evidence of one. The large majority, however, had website(s) or social media mentioned in the printed publication. I ended up with a list of 49 main websites, with a social media presence for over half of them. Almost all those websites were archived, so I could see the exact information which had been live during the referendum. Most websites were available in the UK Web Archive, but some archived copies were only found in the Internet Archive. For comparison purposes, I looked at the latest record each website had before June 23rd. For some this was as close as 22 June, offering a real snapshot of the debate right before polling day, but others were not archived in 2016 at all (but had earlier records).
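The "latest record before June 23rd" rule can be sketched in a few lines. This is an illustration only, not the process actually used for the research: the function name `latest_capture_before` is hypothetical, and the capture dates in the usage note are made-up examples rather than real archive records.

```python
from datetime import date


def latest_capture_before(captures, cutoff):
    """Return the most recent capture date strictly before `cutoff`, or None if there is none."""
    earlier = [captured for captured in captures if captured < cutoff]
    return max(earlier) if earlier else None


# Polling day for the 2016 EU referendum, used as the cutoff.
POLLING_DAY = date(2016, 6, 23)
```

Given hypothetical captures on 3 November 2015, 22 June 2016 and 1 August 2017, `latest_capture_before(captures, POLLING_DAY)` picks the 22 June 2016 record. A site with no 2016 captures falls back to its most recent earlier record, and a site first captured after polling day yields `None`.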

There is a variety of websites, from the official Vote Leave (www.voteleavetakecontrol.org) and Britain Stronger In Europe (www.strongerin.co.uk), to less familiar campaigns such as Universities for Europe (www.universitiesforeurope.com) and The Eurosceptic (www.eurosceptic.org.uk). A majority of these websites are in the Library’s ‘EU Referendum’ special collection, which brings together a range of websites such as blogs, opinion polls, interest groups, news, political parties, research centres and think tanks, social media and Government sources, all of which wrote about the Referendum. Nevertheless, some smaller campaigns, or websites that are not necessarily dedicated to the Referendum but included some content about it, were not included in the special collection.

One example of the importance of archiving the web is www.labourinforbritain.org.uk . Although this is a rather well known campaign (which even has its own Wikipedia page, where this website is quoted), its website is not ‘live’ anymore.

Screen capture 1: ‘Live website’, 1 August 2017

SORRY

The UK Web Archive only started making records of it in 2017, but it had already displayed an error message. However, the Internet Archive has snapshots from before it disappeared from the live web. The Labour In campaign is an important resource for my research – it is one amongst a small number of sources making a more positive case about EU migration, which is essential to compare and contrast to the less favourable arguments made by other campaigners. Although the main Labour Party website had a tab about the Referendum, it did not include the same content as this campaign website, entirely dedicated to referendum issues.

Screen capture 2: ‘Archived website’, 22 June 2016

IMMIGRATION

In addition to finding information that is not ‘live’ anymore, the web archive helps to contextualise the leaflets and complement the information provided in those printed campaign materials. The Bruges Group webpage is a good example in this sense. The digitised leaflet collection has four different leaflets from them. However, a comprehensive list of viewable leaflets is available on the archived website. In this case, the information was still on the live web when I last checked (apart from a slight change in formatting). However, no one knows how long it will remain there, particularly once ‘Brexit’ is no longer in the public debate.

By helping recover seemingly ‘lost’ information, complementing other datasets, contextualising the research and possibly many other roles, web archives are valuable resources that researchers should be encouraged to explore in greater depth. To mark the end of my PhD placement, I am helping to put together a roundtable discussion at the British Library with EU referendum collection curators and academics from a number of institutions, to create the space for conversation around future use of web archives in academic research and beyond.

Alexandra Bulat, August 2017

08 June 2017

Revitalising the UK Web Archive

By Andrew Jackson, Web Archiving Technical Lead

It’s been over a year since we made our historical search system available, and it’s proven itself to be stable and useful. Since then, we’ve been largely focussed on changes to our crawl system, but we’ve also been planning how to take what we learned in the Big UK Domain Data for the Arts and Humanities project and use it to re-develop the UK Web Archive.

Screenshot-ukwa-homepage
UKWA homepage

Our current website has not changed much since 2013, and doesn’t describe who we are and what we do now that the UK Legal Deposit regulations are in place. It only describes the sites we have crawled by permission, and does not reflect the tens of thousands of sites and URLs that we have curated and categorised under Legal Deposit, nor the billions of web pages in the full collection. To try to address these issues, we’re currently developing a new website that will open-up and refresh our archives.

One of the biggest challenges is the search index. The 3.5 billion resources we’ve indexed for SHINE represent less than a third of our holdings, so now we need to scale our system up to cope with over ten billion documents, and a growth rate of 2-3 billion resources per year. We will continue working with the open source indexer we have developed, while updating our data processing platform (Apache Hadoop) and dedicating more hardware to the SolrCloud that holds our search indexes. If this all works as planned, we will be able to offer a complete search service that covers our entire archive, from 1995 to yesterday.
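As a rough back-of-the-envelope check on those figures, the sketch below works through the arithmetic. The function names are hypothetical, and using 10.5 billion as the starting point is an assumption: it is simply the lower bound implied by "3.5 billion is less than a third", not a figure stated here.

```python
def implied_minimum_holdings(indexed_billions: float) -> float:
    """If the indexed set is under a third of the holdings, the holdings exceed 3x that set."""
    return 3 * indexed_billions


def project_holdings(start_billions: float, years: int,
                     low_growth: float = 2.0, high_growth: float = 3.0):
    """Return a (low, high) range in billions after `years` of growth at 2-3 billion per year."""
    return (start_billions + low_growth * years,
            start_billions + high_growth * years)
```

Under these assumptions, `implied_minimum_holdings(3.5)` gives 10.5 billion, consistent with "over ten billion documents", and `project_holdings(10.5, 3)` suggests an index of roughly 16.5 to 19.5 billion resources three years on.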

Shine-word-home
SHINE search results page

The first release of the new website is not expected to include all of the functionality offered by the SHINE prototype, just the core functionality we need to make our content and collections more available to a general audience. Quite how we bring together these two distinct views of the same underlying search index is an open question at this point in time. Later in the year, we will make the new website available as a public beta, and we’ll be looking for feedback from all our users, to help us decide how things should evolve from here.

As well as scaling up search, we’ve also been working to scale up our access service. While it doesn’t look all that different, our website playback service has been overhauled to cope with the scale of our full collection. This allows us to make our full holdings knowable, even if they aren’t openly accessible, so you get a more informative error message (and HTTP status code) if you attempt to access content that we can only make available on site at the present time. For example, if you look at our archive of google.co.uk, you can see that we have captured the Google U.K. homepage during our crawls but can’t make it openly available due to the legal framework we operate within.

The upgrades to our infrastructure will also allow us to update the tools we use to analyse our holdings. In particular, we will be attending the Archives Unleashed 4.0 Datathon and looking at the Warcbase and ArchiveSpark projects, as they provide a powerful set of open source tools and would enable us to collaborate directly with our research community. A stable data-analysis framework will also provide a platform for automated QA and report generation and make it much easier to update our datasets.

Taken together, we believe these developments will revolutionise the way readers and researchers can use the UK Web Archive. It’s going to be an interesting year.

 

28 April 2017

What websites do we collect during UK General Elections?

The UK Web Archive has been archiving websites connected to General elections since 2005.

During the 2005 and 2010 elections, collecting was done on a permissions-cleared basis, requiring curators to contact individual website owners to request permission to archive a website before it was captured and stored. Sites whose publishers refused permission, did not respond or could not be contacted were not archived. The 2015 election was collected following the introduction of new Legal Deposit regulations in 2013 that allow any UK website to be collected without permission.

The collections are not comprehensive: due to various factors, such as the time-consuming permissions process and the ephemeral nature of websites (which often do not include contact details), there are large sections of content relating to the General Elections that could not be covered.

Collection Summary:

2005

The UK General Election 2005 was the first of our Election collections. It includes 139 different items, or ‘Targets’ which cover a wide variety of websites such as those of individual candidates, major political parties, interest groups and a selection of election manifestos. Even though this collection is fairly small it is worth highlighting that until relatively recently election campaigning was predominantly carried out through print media; in 2005 it was by no means the case that all political candidates had a website.

2010

The UK General Election 2010 collection is much bigger, totalling 770 items. This collection has eleven subcategories that cover:

  • Candidates (15 items)
  • Election Blogs (27 items)
  • Interest Groups (113 items)
  • News and Commentary (30 items)
  • Opinion Polls (7 items)
  • Other (8 items)
  • Political Parties - Local (191 items)
  • Political Parties - National (54 items)
  • Public and Community Engagement (13 items)
  • Regulation and Guidance (15 items)
  • Research Centres and Think Tanks (14 items)

2015

The UK General Election 2015 collection is the biggest collection of its type with 7,861 items. By 2015 we observed that much more traditionally paper-based content had moved onto the web. This shift in publishing, along with the introduction of the Non-Print Legal Deposit Regulations (NPLD) in 2013, which enabled the Legal Deposit Libraries to collect online UK content at scale without seeking explicit permissions, meant that this collection was bigger than those of previous years. This collection has eleven subcategories that cover:

  • Candidates (1,957 items)
  • Election Blogs (100 items)
  • Interest Groups (416 items)
  • News and Commentary (4,582 items)
  • Opinion Polls (32 items)
  • Other (75 items)
  • Political Parties - Local (442 items)
  • Political Parties - National (142 items)
  • Public & Community Engagement (45 items)
  • Regulation & Guidance (7 items)
  • Research Centres & Think Tanks (62 items)

All content archived in 2015 will be available to users later this year, either via the UK Web Archive website or through UK Legal Deposit Library Reading Rooms, depending on the permission status of the individual websites.

2017

As the June 2017 general election was called at short notice, the collection will likely be much smaller than the 2015 collection. However, as a number of the websites in the 2015 collection are still live, they will be re-tagged for the 2017 collection, which will give the curators more time to focus on selecting the more ephemeral websites and social media content.

By Helena Byrne, Assistant Web Archivist, The British Library