UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

28 May 2019

FIFA Women’s World Cup and the UK Web Archive

The 2019 FIFA Women’s World Cup will take place in France from 7 June to 7 July 2019. Although women's world cups date back as far as the early 1970s, the FIFA Women’s World Cup was only established in 1991. This is the fifth time that England have qualified for the FIFA Women’s World Cup, but it is a first for Scotland, who join England in Group D of the competition.

Traditionally, women’s sport, and football in particular, has not been well represented in the mainstream media, but this is slowly starting to change. Coverage of events such as the FIFA Women’s World Cup is increasing, and one way to gauge this is to see how many resources on the .uk web were archived. The trend graph on the UK Web Archive Shine interface, which covers all the archived .uk websites from 1996 to April 2013, shows an increase in coverage on the .uk web space for each of the World Cup years. Clicking on a point in the graph brings up a sample of up to 100 websites below it. Four competitions (1999, 2003, 2007 and 2011) were held between 1996 and 2013, with England the only UK country to qualify, in the 2007 and 2011 competitions. It is therefore not surprising that Shine Trends finds just 11 references to “FIFA Women’s World Cup” in 1999, compared with 4,930 in 2011.

FIFA-Shine-01

Link to graph.

The UK Web Archive aims to archive the UK web space. It does this through curating collections and an annual domain crawl, which has run since the Non-Print Legal Deposit Regulations came into force in April 2013. Sport is a popular subject on the web; however, it is a subject area that is underrepresented in many traditional libraries and archives. The UK Web Archive works across the six UK Legal Deposit Libraries and with other external partners to try and bridge gaps in our subject expertise. We have three curated collections related to sport, one of which is dedicated to the many codes of football. These collections don’t differentiate by gender, but the balance between male and female representation in the collections will be skewed due to the lack of gender equality that exists in all parts of society, including the news industry. According to a UNESCO report, ‘only 12 percent of sports news is presented by women worldwide, and only four percent of media content is dedicated to women's sports’.

FIFA Women Image (1)

Mega sporting events like the FIFA Women’s World Cup generate a lot of ephemeral material, both in print and online. The average lifespan of a webpage is around 100 days, and unless it is archived it could disappear forever. Have you spotted any UK-published web content related to England, Scotland, Germany, the USA or the odds-on favourite Japan? Then fill in our Public Nomination Form and it will be added soon after:

Nominate your website.

The only criteria that nominations to the UK Web Archive have to meet are that the content is published from the UK (it doesn’t have to be in English; there are multiple languages in the archive) and that it is not on a predominantly audio-visual platform such as SoundCloud or YouTube. Although social media does fall into scope for Non-Print Legal Deposit, platforms other than Twitter are very difficult to archive, and we haven’t been able to archive Facebook since 2015.

Browse through the UK Web Archive Sports: Football collection and see if we have your local club website or Twitter account, your favourite fan sites and any other football-related content you enjoy viewing. Feel free to nominate your website.

The British Library is currently hosting the (FREE) exhibition: 'An Unsuitable Game for Ladies: A Century of Women's Football' (14 May – 1 September 2019).

by Helena Byrne, Curator of Web Archiving, The British Library

29 March 2019

Collecting Interactive Fiction

Intro
Works of interactive fiction are stories where the reader/player can guide or affect the narrative in some way. This can be through turning to a specific page, as in 'Choose Your Own Adventure' books, or through clicking a link or typing text in digital works.

Archiving Interactive Fiction
Attempts to archive UK-made interactive fiction began with an exploration of the affordances of two different tools: the British Library’s own ACT (Annotation Curation Tool) and Rhizome’s Webrecorder. ACT is a system which interfaces with the Internet Archive’s Heritrix crawl engine to provide large-scale captures of the UK web, while Webrecorder focusses on much smaller-scale, higher-fidelity captures which include video, audio and other multimedia content. All types of interactive fiction (parser, hypertext, choice-based and multimodal) were tested with both ACT and Webrecorder in order to determine which tools were best suited to which types of content. It should be noted that this project is experimental and ongoing; as a result, all assertions and suggestions made here are provisional and will not necessarily affect or influence Library collection policy or the final collection. As yet, Webrecorder files do not form part of standard Library collections.

Cat_Simulator

For most parser-based works (those made with Inform 7), Webrecorder appears to work best. Obtaining captures in Webrecorder is generally more time-consuming than in ACT, as each page element has to be clicked manually (or at least the top-level page in each branch must be visited) in order to create a fully replayable record. However, this is not the case with most Inform 7 works: for the vast majority, visiting the title page and pressing the space bar was sufficient to capture the entire work. The works are then fully replayable in the capture, with users able to type any valid commands in any order. ACT failed to capture most parser works, but there were some successes. For example, Elizabeth Smyth’s Inform 7 game 1k Cupid was fully replayable in ACT, while Robin Johnson’s custom-made Aunts and Butlers also retained full functionality. Unfortunately, games made with Quest failed to capture with either tool.

Another form which appears to be currently unarchivable is works which make use of live data such as location information, maps or other online resources. Matt Bryden’s Poetry Map failed to capture in ACT, and in Webrecorder, although the poems themselves were retained, the background maps were lost. Similarly, Kate Pullinger’s Breathe was recorded successfully with Webrecorder, but naturally only the default text, rather than the adaptive, location-based information, is present. Archiving alternative resources such as blogs describing the works may be necessary for these pieces until another solution is found. However, even where these works don’t capture as intended, running them through ACT may still have benefits. A functional version of J.R. Carpenter’s This Is A Picture of Wind, which makes use of live wind data, could not be captured, but crawling it obtained a sample thumbnail which indicates how the poems display in the live version – something which would not have been possible using Webrecorder alone.

Choice-based works made with Ink generally captured well with ACT, although Isak Grozny’s dripping with the waters of SHEOL required Webrecorder. This could be due to the dynamic menus, the use of JavaScript, or because autorun has been enabled on itch.io, all of which can prevent ACT from crawling effectively. ChoiceScript games were difficult to capture with either tool for various reasons. Firstly, those which are paywalled could not be captured. Secondly, the manner in which the files are hosted appears to affect capture: when hosted as a folder of individual files rather than as a single compiled HTML file, the works could only be captured with Webrecorder’s Firefox emulator, and even then the page crashes frequently. Those which had been compiled appeared to capture equally well with either tool.

Twine works generally capture reasonably well with ACT. ACT is probably the best choice for larger Twines in particular, as capturing a large number of branches quickly becomes extremely time-consuming in Webrecorder. Works which rely on images and video to tell their story, such as Chris Godber’s Glitch, however, retain a greater degree of their functionality if recorded in Webrecorder. As that game is somewhat sprawling, a route was planned through it which would give a good idea of the game’s flavour while avoiding excessively long capture times. Webrecorder also contains an emulator of an older version of Firefox which is compatible with older JavaScript functions and Flash. This allowed for the archiving of works which would otherwise have failed to capture, such as Emma Winston’s Cat Simulator 3000 and Daniel Goodbrey’s Icarus Needs.

As alluded to above, using the two tools in tandem is probably the best way to ensure these digital works of fiction are not lost. However, creators are advised to archive their own work too, either by nominating web pages to the UKWA, capturing content with Webrecorder, or saving pages with the Internet Archive’s Wayback Machine.

By Lynda Clark, Innovation Placement, The British Library - @notagoth

21 March 2019

Save UK Published Google + Accounts Now!

The fragility of social media data was highlighted recently when Myspace accidentally deleted users’ audio and video files without warning. This almost certainly resulted in the loss of many unique and original pieces of work. It is another example of why online social media platforms should not be seen as archives: if things are important to you, they should also be stored elsewhere. The UK Web Archive can play a role in this and we do what we can to preserve websites and selected social media. We do, however, need your help!

Google+
If you have a Google+ account you will have seen the warning that the service is shutting down on 2 April 2019 and that users should download any data they want to save by 31 March 2019.

However, it’s not easy to know how to preserve data from social media accounts, and sometimes this information, without the context of the platform it was hosted on, doesn’t give the full picture. In a previous blog post we outlined the challenges involved in archiving social media. Currently the most popular social media platform in the UK Web Archive is Twitter, followed by Facebook, which we haven’t been able to successfully capture since 2015, and a limited amount of Instagram, Weibo, WeChat and Google+.

Under the 2013 Non-Print Legal Deposit Regulations we can legally only collect digital content published in the UK. As these platforms are hosted outside the UK there is no automated way to identify UK accounts, so a person has to look through and identify the profiles to be added. In general, these are profiles of politicians, public figures, people renowned in their field of study, campaign groups and institutions.

So far, we only have a handful of Google+ profiles in the UK Web Archive but we are keen to have more.

How to save your Google+ data
If you have a Google+ profile or know of other profiles published in the UK that you think should be preserved, fill in our nomination form before 29 March 2019: https://www.webarchive.org.uk/en/ukwa/info/nominate

If the profiles you want to archive are published outside the UK, you can use the 'save a website now' function on the Internet Archive website: https://archive.org/web/

By Helena Byrne, Curator of Web Archiving, The British Library

02 January 2019

Extracting Place Names from Web Archives at Archives Unleashed Vancouver

By Gethin Rees, Lead Curator of Digital Mapping, The British Library

I recently attended the Archives Unleashed hackathon in Vancouver. The fantastic Archives Unleashed project aims to help scholars research the recent past by using big data from web archives. The project organises a series of datathons where researchers collaboratively work with web archive collections over the course of two days. The participants divide into small teams with the aim of producing a piece of research using the archives that they can present at the end of the event and compete for a prize. One of the most important tools that we used in the datathon was the Archives Unleashed Toolkit (AUT).

Archives-Unleashed-project

The team I was on chose to use a dataset documenting a series of wildfires in British Columbia in 2017 and 2018 (ubc-bc-wildfires). I came to the datathon with an interest in visualising web archive data geographically: place names or toponyms contained in the text from web pages would form the core of such a visualisation. I had little experience of natural language processing before the datathon but, keen to improve my Python skills, I decided to take on the challenge in the true spirit of unleashing archives!

My plan to produce such a visualisation consisted of several steps:

1) Pre-process the web archive data (Clean)
2) Extract named entities from the text (NER)
3) Determine which are place names (Geoparse)
4) Add coordinates to place names (Geocode)
5) Visualise the place names (Map)

This blog post is concerned primarily with steps 2 and 3.
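As a rough illustration of how these steps might fit together, here is a minimal Python sketch. The function names, the regex stand-in for the NER step and the two-entry gazetteer are purely illustrative assumptions, and step 5 (mapping) is omitted; this is not the code used at the datathon.

# A minimal, hypothetical sketch of the five-step pipeline above.
import re

GAZETTEER = {  # a real run would use a full gazetteer such as GeoNames
    "Vancouver": (49.2827, -123.1207),
    "British Columbia": (53.7267, -127.6476),
}

def clean(text):
    """Step 1: strip control characters and collapse whitespace."""
    text = re.sub(r"[\x00-\x1f]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def named_entities(text):
    """Step 2: stand-in for a real NER step (Stanford NER, NLTK, spaCy...)."""
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)

def geoparse(entities):
    """Step 3: keep only the entities that appear in the gazetteer."""
    return [e for e in entities if e in GAZETTEER]

def geocode(places):
    """Step 4: attach coordinates from the gazetteer."""
    return {p: GAZETTEER[p] for p in places}

sample = "Wildfires\x1f were reported near Vancouver and across British Columbia."
print(geocode(geoparse(named_entities(clean(sample)))))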

An important lesson from the datathon for me is that Web Archive data are very messy. In order to get decent results from steps 2 and 3 it is important to really clean the data as thoroughly as possible. Luckily, the AUT contains several methods that can help to do this (outlined here). The analyses that follow were all run on the output of the AUT ‘Plain text minus boilerplate’ method.

There is a wealth of options available to achieve steps 2 and 3; the discussion that follows does not aim to be exhaustive but to evaluate the methods that we attempted in the datathon.

AUT NER

The first method we attempted was to use the AUT NER method (discussed here). The AUT does a great job of packaging up the Stanford Named Entity Recognizer for easy use with a simple Scala command. We ran the method on the AUT derivative of the 2017 section of our Wildfires dataset (around 300 MB) using the powerful virtual machines that were helpfully provided by the organisers. However, we found it difficult to get results as the analysis took a long time and often crashed the virtual machine. These problems persisted even when running the NER method on a small subset of the Wildfires dataset, making it difficult to use even on a smallish set of WARCs.

The results came back in the following format:

    (20170809,dns:www.nytimes.com,{"PERSON":[],"ORGANIZATION":[],"LOCATION":[]})

This output required further processing with a simple Python script.
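The script itself is not reproduced here, but a rough sketch of the kind of post-processing involved might look like the following, assuming the output lines follow the format shown above (the input file name is hypothetical):

import json
from collections import Counter

def parse_ner_line(line):
    """Split a '(date,domain,{...})' line into its three parts."""
    body = line.strip().lstrip("(").rstrip(")")
    date, domain, entities_json = body.split(",", 2)
    return date, domain, json.loads(entities_json)

location_counts = Counter()
with open("ner_output.txt") as results:   # hypothetical output file
    for line in results:
        date, domain, entities = parse_ner_line(line)
        location_counts.update(entities.get("LOCATION", []))

print(location_counts.most_common(20))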

When we did obtain results, the "LOCATION" arrays seemed to contain only a fraction of the total place names that appeared in the text.

AUT
- Positives: Simple to execute, tailored to web archive data
- Negatives: Time consuming, processor intensive, output requires processing, not all locations returned

Geoparser

So we next turned our attention to the Edinburgh Geoparser and the excellent accompanying tutorial that I have used to great effect on other projects. Unfortunately the analysis resulted in several errors which prevented the Geoparser from returning results, and during the time available in the datathon we were not able to resolve them. The Geoparser appeared unable to deal with the output of AUT’s ‘Plain text minus boilerplate’ method. I attempted other methods to clean the data, including changing the encoding and removing control characters. The following Python commands:

import re

# Read the plain-text derivative, dropping any UTF-8 byte-order mark.
s = open('9196-fulltext.txt', mode='r', encoding='utf-8-sig').read()
# re.sub and rstrip return new strings, so assign the results back.
s = re.sub(r'[\x00-\x1F]+', '', s)
s = s.rstrip()

removed these errors:

Error: Input error: Illegal character <0x1f> immediately before file offset 6307408
in unnamed entity at line 2169 char 1 of <stream>
Error: Expected whitespace or tag end in start tag
in unnamed entity at line 4 char 6 of <stream>

However the following error remained which we could not fix even after breaking the text into small chunks:

Error: Document ends too soon
in unnamed entity at line 1 char 1 of <stream>

I would be grateful for any input about how to overcome this error, as I would love to use the Geoparser to extract place names from WARC files in the future.

Geoparser
- Positives: well-documented, powerful software. Fairly easy to use. Excellent results with OCR or plain text.
- Negatives: didn’t seem to deal well with the scale and/or messiness of web archive data.

NLTK

My final attempt to extract place names involved using the Python NLTK library with the 'averaged_perceptron_tagger', 'maxent_ne_chunker' and 'words' packages. The initial aim was to extract the named entities from the text. A preliminary script designed to achieve this can be found here.

This extraction does not separate place names from other named entities such as proper nouns, and therefore a second stage involved checking whether the entities returned by NLTK were present in a gazetteer. We found a suitable gazetteer with a wealth of different information, and in the final hours of the datathon I attempted to hack together something to match the NER results with the gazetteer, along the lines of the sketch below.
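In outline, the two stages might look something like this. It is a hedged sketch rather than the datathon script: the three-entry gazetteer and the sample sentence are invented for illustration, and the NLTK package names are those listed above, plus 'punkt' for tokenisation.

import nltk

# Download the resources the tokeniser, tagger and chunker rely on.
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

GAZETTEER = {"Vancouver", "Kamloops", "Williams Lake"}   # toy gazetteer

def named_entities(text):
    """Stage 1: return the named entities NLTK's chunker finds in the text."""
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree if isinstance(subtree, nltk.Tree)]

def place_names(text):
    """Stage 2: keep only the entities that appear in the gazetteer."""
    return [entity for entity in named_entities(text) if entity in GAZETTEER]

print(place_names("Fires spread from Williams Lake towards Kamloops, said John Smith."))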

Unfortunately I ran out of time both to write the necessary code and to run the script over the dataset. The script badly needs improvement using dataframes and other optimisation. Notwithstanding its preliminary nature, it is clear that this method of extracting place names is slow. The quality of results is also highly dependent on the quality and size of the gazetteer. Only place names found within the gazetteer will be extracted and therefore, if the gazetteer is biased or deficient in some way, the resulting output will be skewed. Furthermore, as the gazetteer becomes larger, the extraction of place names will become painfully slow.

The method described replicates the functionality of geoparser tools yet is a little more flexible, allowing the participant to account for the idiosyncrasies of web archive data such as unusual characters.

NLTK
- positives: flexibility, works
- negatives: slow, reliant on the gazetteer, requires python skills


Concluding Remarks

Despite the travails that I have outlined, my teammates, adopting a non-programmatic approach, came up with this brilliant map by doing some nifty things with a gazetteer, Voyant Tools and QGIS.

Voyant-map

From a programmatic perspective it appears that there is still work required to develop a method to extract place names from web archive data at scale, particularly in the hectic and fast-paced environment of a datathon. The main challenge is the messiness of the data, with many tools throwing errors that were difficult to rectify. In terms of future datathons, speed of analysis and implementation is a critical consideration, as datathons aim to deal with big data in a short amount of time. Of course, the preceding discussion has hardly considered the quality of the information output by the tools. This is another essential consideration and requires further work. Another future direction would be to examine other tools such as spaCy, Polyglot and NER-Tagger, as described in this article.

 

20 December 2018

The UK Web Archive gets a fresh look

Until recently, if you wanted to research a historic UK website you may have had to look in a number of different places. There was the 'Open' UK Web Archive that contained the 15,000 or so publicly available websites collected since 2005. If you also wanted to check the vast 'Legal Deposit' web archive (containing the whole UK web space) then you would need to travel to the reading room of a UK Legal Deposit Library to see if what you needed was there. For the first time, the new UKWA website offers:

  • The ability to search the entire collection in one place
  • The opportunity to browse over 100 curated collections on a wide range of topics

Home

www.webarchive.org.uk

Who is the UK Web Archive?
UKWA is a partnership of all the UK Legal Deposit Libraries - The British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries Oxford, Cambridge University Libraries and Trinity College Dublin. The Legal Deposit Web Archive is available in the reading rooms of all these Libraries. A reader's pass for each library is required to gain access to a reading room.

How much is available now?
At the time of writing, everything that a human (curators and collaborators) has selected since 2005 is searchable. This constitutes many thousands of websites and millions of individual web pages. We will be adding the huge yearly Legal Deposit collections over the coming year - we'll let you know as they become available.

Among the many websites available are the BBC and many newspaper websites such as The Sun, The Daily Mail and The Guardian.

Do the websites look and work as they did originally?
Yes and no. Every effort is made so that websites look how they did originally, and internal links should work. However, due to a variety of technical issues many websites will look different or some elements may be missing. As a minimum, all of the text in the collection is searchable and most images should be there. Whilst we collect a considerable amount of video, much of this will not play back.

Is every UK website available?
We aim to collect every website made or owned by a UK resident; however, in reality it is extremely difficult to be comprehensive! Our annual Legal Deposit collections include every .uk (and .london, .scot, .wales and .cymru) website, plus any website on a server located in the UK. Of course, many websites are .com, .info etc. and on servers in other countries.
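As a toy illustration of the domain-name side of that scoping rule (the hostname check only; the server-location test would need something like a GeoIP lookup, which is not shown here), a sketch in Python might be:

# Hostnames ending in one of the UK-administered suffixes are in scope.
UK_SUFFIXES = (".uk", ".london", ".scot", ".wales", ".cymru")

def in_scope_by_domain(hostname):
    return hostname.lower().rstrip(".").endswith(UK_SUFFIXES)

print(in_scope_by_domain("www.bl.uk"))      # True
print(in_scope_by_domain("example.com"))    # False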

If you have or know of a UK website that should be in the archive, we encourage you to nominate it here.

Keep in touch by following us on Twitter.

By Jason Webber, Web Archive Engagement Manager, The British Library

27 November 2018

World Digital Preservation Day 2018

 

Z1

World Digital Preservation Day (formerly International Digital Preservation Day) is held on the last Thursday of November each year and is organised by the Digital Preservation Coalition (DPC). Lots of events organised around the world will take place on 29 November 2018.

The screenshot below shows how popular the term “Digital Preservation” was on the archived .uk web space from 1996 to April 2013. Follow this link and see who is talking about digital preservation by clicking on a point in the graph.

Z2

 

Web archiving is an important part of the digital preservation process, as web content is constantly changing over time, some of it more quickly than the rest. The UK Web Archive aims to archive, preserve and give access to the UK web space. The image below shows how the British Library website has changed over time.

  Z3

 

The UK Web Archive has been building curated collections since 2005 and has a dedicated collection on the subject of IT. The IT Collection has eleven subsections, one of which is dedicated to Web Archives and Digital Preservation. This is a small subsection, so if you see that it is missing something important to the UK field of web archiving and digital preservation, you can nominate content by filling in the public nomination form.

As part of your World Digital Preservation Day, why not nominate your favourite UK published website on any subject to the UK Web Archive by filling in the public nomination form?

28 September 2018

Sports Collections in the UK Web Archive

By Helena Byrne, Web Archive Curator, The British Library

The 30th September is National Sporting Heritage Day in the UK and to celebrate the event in 2018 we will give you a quick overview of our sports collections. 

Introduction
Sport studies give us a real insight into popular culture and political issues of the time; however, it is a subject area that has often been underrepresented in many traditional libraries and archives. The UK Web Archive works across the six UK Legal Deposit Libraries and with other external partners to try and bridge gaps in our subject expertise.

UKWA Sports Collections
We currently have three collections that focus on sport:

  1. Sport: Football
  2. Sports Collection
  3. Sports: International Events

Shine - Football Graph

Trend graph on SHINE

Sport: Football
Football in all its varieties is probably the most popular sport in the UK, which is why there is a collection dedicated exclusively to football and related activities. There are many subsections to the Rugby and Soccer strand of the collection which can be viewed by clicking on the information box.

Sport Football Collection


Sports Collection
The general collection on sports has been broken down into subsections based on the type of sport rather than a specific sport title like tennis or snooker. These subject headings were based on the Universal Decimal Classification page about sport (from PD 1000 – 2003 UDC Abridged Edition). We used this general taxonomy of sports so that the collection can easily adapt to new sporting trends that emerge in the future. The Ball Sports section excludes football, as there is already a dedicated collection on this subject. Ball Sports is probably the most versatile section and has an additional five subsections:

  1. By Hand
  2. On a Table
  3. With Club
  4. With Racket (Racquet)
  5. With a Stick

Sports Collection

Sports: International Events
Our third main collection covers international sporting events. Currently there are six subsections in this collection:

  1. Olympic & Paralympic Games 2012
  2. Commonwealth Games Glasgow 2014
  3. Tour De France (Yorkshire)
  4. Winter Olympics Sochi 2014
  5. Rugby World Cup 2015
  6. Rio Olympics 2016


The decision to build collections on international sporting events is dependent on staff resource and staff subject knowledge of these events. Going forward we would like to build collections around the major sporting events hosted in the UK, but this is not always easy or possible. A major challenge around collecting on international events is that many of the web publishers are not based in the UK and do not always set up a UK website for the event. We archive content under the Non-Print Legal Deposit Regulations 2013, which means we are not able to automatically scope in content published outside the UK.

Access and Reuse
Under the Non-Print Legal Deposit Regulations 2013, access to archived content is restricted to UK Legal Deposit Library reading rooms. However, if we have permission from the website owner we can make the archived version of their content open access, along with government publications under the Open Government Licence. This is why, if you browse through the collections on the beta version of our website, most of the links to archived content will direct you to one of the UK Legal Deposit Libraries for access, while some of the content can be viewed from your personal device.

The UK Web Archive can be used just like many other primary resources whether it be a magazine or a newsletter and the same copyright regulations apply. The web has been in use for nearly 30 years and the publication The Web as History gives an outline of how researchers from different disciplines interact with web and web archive content. Some of the datasets used in this publication are available for reuse from: data.webarchive.org.uk/opendata/

International Internet Preservation Consortium (IIPC)
As individual institutions, the British Library and the National Library of Scotland are members of the International Internet Preservation Consortium (IIPC) and have worked on building collaborative collections covering international events such as the Summer and Winter Olympic/Paralympic Games. Since the formation of the IIPC Content Development Group (CDG) in 2015, there has been a concerted effort to build collections both on and off the playing field. The British Library took the lead curatorial role for the 2016 Summer Olympic and Paralympic Games and the 2018 Winter Olympic and Paralympic Games; all of the IIPC collections are open access.

Get Involved
The UK Web Archive aims to archive, preserve and give access to the entire UK web space. 

If you see content that should be included in one of our sports collections then please fill in our online nomination form.
Alternatively, if you would like to get more hands-on with curating a collection then get in touch.

 

27 September 2018

Web Archives: A Tool for Geographical Research?

By Emmanouil Tranos and Christoph Stich, University of Birmingham

Introduction
If you are a quantitative social scientist, there are few things more fascinating than free, under-utilised, quirky and easy-to-download data that also fit well with the narrative of 'big data'.

Combine the above characteristics with data that have the potential to support researchers in answering interesting research questions, and you will make a researcher happy! And this is exactly what the JISC UK Web Domain Dataset held by the UK Web Archive is all about.

A detailed description of the data can be found here, but briefly, this is a subset of the Internet Archive that includes all the archived webpages under the .uk Top Level Domain (TLD), together with their archival timestamps, for the period January 1996 to March 2013. The UK Web Archive partnered with the Internet Archive and JISC to create this unique dataset, which enables researchers to easily access probably the largest national archive of webpages.

The UK web space has several unique characteristics
Apart from the fact that the UK was an early adopter of internet technologies and applications, the UK web space also includes some widely recognisable second-level domain names such as .co.uk and .ac.uk. While the former (mainly) denotes commercial activities based in the UK, similar to the .com top-level domain, the latter is used for UK universities. Moreover, the English language makes the UK web space more accessible to the rest of the world.

How is this dataset useful?
The JISC UK Web Domain Dataset is an easy way to access the Internet Archive data. It is, in essence, a long list of strings (i.e. groups of characters) that include the archival timestamp and the original URL of each archived webpage.

For instance, the first numerical part of the line below indicates when the contact page of the uk.eurogate.co.uk website was archived (9 May 2008 at 16:21:38).

20080509162138/http://uk.eurogate.co.uk/contact_us IG8 8HD

With the use of these strings a researcher can retrieve the HTML documents of the archived webpages from the Internet Archive API. The UK Web Archive further processed this data and created a subset of the archived UK webpages that includes all the .uk webpages that contain a UK postcode.

In the above example, the last element indicates that this specific webpage contains the postcode IG8 8HD.

This dataset, which is known as the Geoindex and can be downloaded from here, is probably one of the largest open data sets of georeferenced digital content.
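As a rough sketch of how one line of the Geoindex might be used, the snippet below pulls the sample line apart and builds a replay URL. The assumptions here are that the line is parsed by hand and that the archived page is fetched through the standard Wayback Machine replay URL pattern (https://web.archive.org/web/<timestamp>/<url>) rather than through a dedicated API client.

import urllib.request

def parse_geoindex_line(line):
    """Split a 'timestamp/url postcode' line into its three parts."""
    timestamp, rest = line.split("/", 1)
    url, postcode = rest.split(" ", 1)   # the URL itself contains no spaces
    return timestamp, url, postcode

def wayback_url(timestamp, url):
    """Build a Wayback Machine replay URL for the snapshot."""
    return f"https://web.archive.org/web/{timestamp}/{url}"

def fetch_html(replay_url):
    """Download the archived HTML (requires network access)."""
    return urllib.request.urlopen(replay_url).read().decode("utf-8", "replace")

timestamp, url, postcode = parse_geoindex_line(
    "20080509162138/http://uk.eurogate.co.uk/contact_us IG8 8HD")
print(postcode, wayback_url(timestamp, url))
# fetch_html(wayback_url(timestamp, url)) would then download the archived page.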

Challenges
There are, however, a number of technical and conceptual challenges attached to the usage of these data. For instance, there is a debate in the literature regarding how much of the web is currently archived (e.g. Hale et al., 2017). Although there is some critique regarding the depth of the archival process (i.e. how many webpages from each website are archived), the Internet Archive is the most extensive digital archive (Holzmann et al., 2016; Ainsworth et al., 2011).

Moreover, the volume of the data requires some upfront investment regarding data analysis skills, but is still doable with some standard off-the-shelf libraries and tools (e.g. Python or R).

Results
After filtering out invalid postcodes, we are left with a dataset that contains about 5.8 million pairs of British postcodes and domain names.
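The filtering step might look something like the sketch below; the regular expression is a commonly used simplified pattern for UK postcodes rather than the full official specification, and the sample rows are invented for illustration.

import re
from urllib.parse import urlparse

# Simplified UK postcode pattern: outward code, optional space, inward code.
POSTCODE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]?\s*[0-9][A-Z]{2}$")

rows = [
    ("20080509162138", "http://uk.eurogate.co.uk/contact_us", "IG8 8HD"),
    ("20120101000000", "http://example.co.uk/about", "NOT A CODE"),
]

pairs = {
    (urlparse(url).netloc, postcode)
    for _, url, postcode in rows
    if POSTCODE.match(postcode.strip().upper())
}
print(pairs)   # {('uk.eurogate.co.uk', 'IG8 8HD')}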

As one can see in the plot, the number of domains that reference a postcode grows relatively rapidly in the decade between 1995 and 2005 before growth levels off. The distribution of domains also more or less aligns with the population density of the UK. This is a good indicator that the collected data capture actual activity in the UK.

Domains_2012

Unsurprisingly the data also reveal a difference between London and the rest of the country. The number of domains that reference a postcode per inhabitant grew faster in London than in other places, but eventually the rest of the country caught up with London. There are, however, quite significant differences in how the domains are distributed within London as well.

London_dpt


So, what research questions can these data help us answer? Utilising funding from the ESRC and the Consumer Data Research Centre (CDRC) we employed this data to explore the evolution of the digital economy in the UK. Firstly, we are utilising this data in order to understand whether the availability of online content attracts individuals online. We do that by employing unique survey data available from CDRC.

Hypothesis
Our underlying hypothesis is that the availability of internet content of local interest can attract people online in order to access and take advantage of the potential on-line opportunities such as accessing local products and services. The first results seem to support our hypothesis.

Secondly, we are using this data to explore the economic activities (e.g. products and services offered by firms) that take place in some of the UK digital clusters. By filtering the data to focus only on archived web pages from specific clusters in the UK, and by utilising the textual data available from the archived HTML documents, we are building topic models to reveal what type of economic activities exist in these clusters and how these activities have evolved over time.
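A minimal sketch of that topic-modelling step is given below; the four short documents stand in for text extracted from archived HTML pages of a cluster, and the choice of scikit-learn and of two topics is illustrative only, not a description of our actual models.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-ins for boilerplate-stripped text from archived pages in one cluster.
docs = [
    "software development web design consultancy services",
    "digital marketing agency search engine optimisation",
    "bespoke software engineering and data analytics",
    "branding campaigns social media advertising",
]

vectoriser = CountVectorizer(stop_words="english")
dtm = vectoriser.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

terms = vectoriser.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}:", ", ".join(top_terms))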

We are testing how this archived web data can help us learn more about economic activities and how they have evolved over time. We are also comparing the outputs of this analysis with official industrial classifications from various sources, including freely available data from the CDRC.

Lastly, together with colleagues from City-REDI, we are using the archived web data as a proxy to understand the early adoption of web technologies in the UK. Building upon arguments developed in evolutionary economics, the early adoption of web technologies may signify innovative regions which developed 'digital capacity' early enough, something which may affect their future growth trajectories. The first results indicate that indeed the early adoption of web technologies is related to positive future growth trajectories.

To close, we believe that our on-going research, apart from answering substantive geographical research questions, will also illustrate the value of archived web data for geographical research. It is one of the few available data sources that can provide longitudinal georeferenced data, which also includes a wealth of unstructured textual data.

The latter can also reveal patterns and activities that other more 'conventional' data sources would not have been able to uncover.

References
Ainsworth, S. G., Alsum, A., SalahEldeen, H., Weigle, M. C., & Nelson, M. L. (2011). How much of the web is archived? Paper presented at the Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries.

Hale, S. A., Blank, G., & Alexander, V. D. (2017). Live versus archive: Comparing a web archive to a population of web pages. In N. Brügger & R. Schroeder (Eds.), Web as History: Using Web Archives to Understand the Past and the Present (pp. 45-61). London: UCL Press.

Holzmann, H., Nejdl, W., & Anand, A. (2016). The Dawn of today's popular domains: A study of the archived German Web over 18 years. Paper presented at the Digital Libraries (JCDL), 2016 IEEE/ACM Joint Conference.