UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

11 May 2022

The Queen's Platinum Jubilee in the UK Web Archive

By Daniela Major, PhD Student, School of Advanced Studies, University of London

Whether you’re an avid monarchist, a staunch republican or simply obsessed with Netflix’s “The Crown”, there is no doubt that Elizabeth II has achieved a unique place in history. The 70 years of her reign have been witness to profound changes in world politics and in British society. When she was crowned, Churchill was her Prime Minister, Khrushchev was freshly in charge of the Soviet Union and Eisenhower had just become the President of the United States.

Queen Elizabeth II

Throughout her decades as monarch, Queen Elizabeth has worked with 14 UK Prime Ministers and met 13 American Presidents. She has received state visits from countless foreign leaders who themselves influenced the shape of 20th and 21st century history, from Charles de Gaulle to Mikhail Gorbachev.

During her reign, the United Kingdom went through dramatic changes, from the dismantling of the British Empire to referendums on Welsh devolution and Scottish independence. The Queen’s honours list depicts a country where diversity is celebrated. She has given honours to authors such as V. S. Naipaul and Salman Rushdie, singers like Paul McCartney and Bono, and artists like Paula Rego.

For many reasons, the Platinum Jubilee is a great opportunity to explore this dialogue between the present and the past. How and why we celebrate, or how and why we refuse to do so, places us in a specific historical context: in this case, the 21st century UK, in a world in constant change.

Queens Platinum Jubilee logos

So far, we have discovered that food is a favourite in every celebration. Fortnum & Mason and the Big Jubilee Lunch are celebrating the Jubilee by sponsoring a competition awarding the best pudding – following the Victoria Sponge, named after Queen Victoria, and Coronation Chicken, created in honour of Elizabeth II’s coronation. The judges include Mary Berry of Great British Bake-Off fame, food historian Regula Ysewijn and MasterChef’s Monica Galetti.

A slew of cultural celebrations are on the cards: The Reading Agency launched the Big Jubilee Read which chose ten outstanding books from the last 7 decades. The Royal Mint has created a commemorative coin and the Royal Philharmonic Concert Orchestra gave a concert at the Royal Albert Hall. Throughout the whole of the UK, Town Councils are preparing for street parties, tree planting, and jubilee lunches.

This is where you come in. The UK Web Archive wants to know how you are choosing to remember this Jubilee.

  • Are you taking part in the Jubilee’s bake-off?
  • Are you lighting a beacon or attending a street party?
  • Are you going to a protest? Have you written about how the UK cannot have 70 more years of monarchism?

Help us remember this moment in history so that future historical sources reflect the full diversity of public activity. Help us show how people across the UK celebrate important dates, how they look back on their own past and how they celebrate their present.

If you know of a website worth keeping for posterity, nominate it and make your suggestion.

23 February 2022

International Women’s Day 2022 - save your event ad now!

By Helena Byrne, Curator Web Archives, The British Library

8th March is International Women’s Day (IWD). Originally started in the trade union movement, IWD was an important day to highlight the inequalities women face and to campaign for equal rights. In recent years, IWD has had a wider remit and includes celebrating the cultural, political, and socioeconomic achievements as well as struggles of women.

British Library Votes for Women exhibition website

British Library, Votes for Women online exhibition webpages, archived 2018

Events of all kinds are held on 8th March or close to that date to mark the occasion. Most of these events are advertised online through websites, social media and online event platforms like Eventbrite. A simple search on Eventbrite for International Women’s Day brings up 500 pages of events from around the world, but mostly in the UK. A little over half of those events are advertised for London.

Are you attending or organising an IWD event this year that is advertised online? Nominate that website/online advert to the UK Web Archive by filling in our ‘Save a website’ form.

Glasgow Women's Library

Glasgow Women's Library, archived 2008.

What is the UK Web Archive?
The UK Web Archive is a collaboration of the six UK legal deposit libraries working together to preserve websites for future generations. We archive websites published in the UK on a wide variety of subjects such as politics, sports, hobbies and social issues, and have over a hundred curated collections in the UK Web Archive.

On IWD 2022 you might be interested in browsing some of our collections related to women’s rights such as Unfinished Business: The Fight for Women’s Rights (2020), Gender Equality (2018), Political Action & Communication (2015), and Women’s Issue (2005-2013).

We work with subject experts to curate our collections but also take nominations from the public so please nominate your IWD event ads or any other UK published content that you feel should be included in the UK Web Archive by filling in our ‘Save a website’ form.

07 February 2022

The Queen’s Platinum Jubilee in the UK Web Archive

By Nicola Bingham, Lead Curator, Web Archives, British Library.

6th February 2022 marks 70 years since King George VI passed away in his sleep at the royal estate at Sandringham and Princess Elizabeth, his oldest daughter and next in line to the throne (who was in Kenya at the time) became Queen. She was crowned a little over a year later as Queen Elizabeth II on June 2, 1953, at age 27.

Platinum Jubilee
2022 will see communities come together to celebrate the Queen’s Platinum Jubilee throughout the UK and Commonwealth. Celebrations will include street parties; concerts; the 'Queen's Green Canopy', an initiative to plant a tree for the jubilee; a Jubilee Pageant; 'a River of Hope', made up of flags decorated with images of hope drawn by children, which will make its way along the Mall; and the opening of the royal palaces, Sandringham and Balmoral to visitors, amongst other events. The focal point of the celebrations will be a four-day long bank holiday weekend, from Thursday 2nd to Sunday 5th June.

Queens Platinum Jubilee logos

The Queen's Platinum Jubilee logos in English and Welsh.

The UK Web Archive will be capturing a record of this momentous event in a special collection about the Platinum Jubilee.

We have a series of collections related to the Queen and other members of the royal family, dating back to 2012 when the Queen celebrated her Diamond Jubilee. This collection was co-curated by the Royal Archives at Windsor, the UK Legal Deposit Libraries and the Institute of Historical Research. It contains archived copies of websites produced by the Royal Household, together with a wide range of related material such as blogs, commentaries and news articles, as well as anti-monarchist and opposing views. Similarly, in 2016 the Legal Deposit Libraries curated a collection to mark the Queen’s 90th Birthday.

Diamond Jubilee commemorative marmite

Source: boingboing.net/2012/04/20/maamite-is-jubilee-marmite.html

In 2021 staff at the Legal Deposit Libraries curated a collection to commemorate His Royal Highness Prince Philip the Duke of Edinburgh (10th June 1921 to 9th April 2021). It includes the websites of many of the organisations that the Duke was associated with during his lifetime, either as President, Patron, Honorary Member or in another capacity. The Duke had special interests in scientific and technological research and development, the welfare of young people, education, conservation, the environment and the encouragement of sport. The collection also includes statements from Commonwealth Organisations reflecting on the life of the Duke.

British Racing Car Drivers website with Prince Philip

Source: www.webarchive.org.uk/wayback/archive/20210421115025/http://www.brdc.co.uk/HRH-The-Prince-Philip-Duke-of-Edinburgh-KG-KT

This year we will be curating a collection to mark the Queen's Platinum Jubilee. We'll announce ways to contribute to this collection in the near future. In the meantime, if you know of, or contribute to, a UK or Commonwealth website with relevant content related to the Queen, please let us know via our ‘Save a website’ form. We would be particularly pleased to hear from you if your website features personal or community stories about the Platinum Jubilee, or previous years’ Jubilee celebrations.

19 January 2022

Explore Women’s Football in the UK Web Archive

By Helena Byrne, Curator Web Archives, The British Library

On 5 December 1921, the Football Association (FA) banned women from playing football on affiliated grounds and stated that football is “quite unsuitable for females and ought not to be encouraged” (FIFA.com). It took almost fifty years to overturn this ban. With the formation of the Women’s Football Association (WFA) in 1969, the FA came under more pressure to remove the ban. It was at the FA Council Meeting on January 19th, 1970 that the FA made the decision to rescind the Council's Resolution of 1921.

To celebrate 52 years since the ban was lifted, this blog post gives a quick overview of women’s football in the UK Web Archive (UKWA). To mark National Sporting Heritage Day back in 2018 we published a blog post outlining the UKWA sports collection policies. 

History of women's football website in the UK Web Archive

History of the Women's FA, archived in 2018

Sport has been included in UKWA since its formation in 2005. In recent years we have been blogging more about these collections. Football in all its varieties is probably the most popular sport in the UK, which is why there is a collection dedicated exclusively to football and related activities. The most developed subsection of this collection is on soccer, with almost 4,000 items. These range from individual web pages and subsections of websites to full websites, blogs and some social media platforms.

Explore the extensive Soccer collection on the UK Web Archive Website.

We have collected a wide range of content from sports clubs (amateur and professional), fan sites, football research and events. There is no distinction in the collection based on gender as all content related to the sport is treated equally. 

Accessing the UK Web Archive
Under the Non-Print Legal Deposit Regulations 2013, we can archive UK published websites but are only able to make the archived version available to people outside the Legal Deposit Libraries' Reading Rooms if the website owner has given permission.

Some of the websites in UKWA that have already had permission granted include Charlton Athletic Women, Sent Her Forward and Tartan Kicks: The Magazine For Scottish Women's Football. Some examples of websites that are onsite-only access include the Crawley Old Girls (COGS), Her Game Too and Dick, Kerr Ladies FC 1917-1965: Women's Football History.

Tartan kicks website in the UK Web Archive

Tartan Kicks website, archived in 2019

As the content of UKWA has mixed access, the message ‘Viewable only on Library premises’ will appear under the title of the website if you need to visit a Legal Deposit Library to view the content. If there is no message underneath then the archived version of the website should be available on your personal device.

Get involved with preserving women’s football online with the UK Web Archive
The UK Web Archive works across the six UK Legal Deposit Libraries and with other external partners to try and bridge gaps in our subject expertise. But we can’t curate the whole of the UK web on our own. We need your help to ensure that information, discussion and creative output related to women’s football are preserved for future generations. Anyone can suggest UK published websites to be included in the UK Web Archive by filling in our nominations form.

Keep an eye on the UKWA blog and Twitter account to find out more details on our forthcoming collection to preserve the UEFA Women's Euro 2022 competition taking place across England from July 6 to July 31, 2022. 

06 January 2022

UKWA 2021 Technical update

By Andy Jackson, UKWA Technical Lead, British Library

During the last quarter of 2021, the technical services that make up the UK Web Archive underwent a lot of changes behind the scenes. These changes should help us to improve our services, so it’s worth explaining a little about what’s been going on.

Starting the Hadoop 3 Migration
Our Hadoop cluster is now quite old, and updating this to a newer version has been a long-standing issue. The old Hadoop version no longer gets updates, and is not supported by modern tools and libraries, which prevents us from making the most of what’s available.

For a long time, it was unclear how best to proceed – an in-place update seemed too risky, but a cluster-to-cluster migration appeared to require too much hardware. So, over recent years, we have spent time learning how to set up and maintain a Hadoop 3 cluster, and evaluating different migration strategies, focusing on how we might maintain service during any migration.

We eventually decided a cluster-to-cluster migration should be possible, as long as we can purchase higher-density storage so we have enough headroom to migrate content over ahead of migrating hardware. Earlier in the year, following some procurement delays, we were able to purchase and establish this new Hadoop 3 cluster, with each server providing over 450TB of raw storage (compared to about 85TB per server for the older cluster).

While this was being set up, we also had to generalize our services so that all important processes can be run across both clusters, and so that WARC records can be retrieved from either. This has been quite time-consuming, but as 2021 drew to a close (and space on the older cluster was getting tight!), we were finally able to shift things so that newly-harvested content is written to the new Hadoop 3 cluster.

Behind the scenes, our file tracking database was updated to scan both clusters and act as a record of which files are where, and to update this record hourly rather than just once per day. A new WARC Server component was created that takes Wayback requests for WARC records, uses the tracking database to work out which cluster they are on, and then grabs and returns the WARC record in question.
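To make that lookup step concrete, here is a minimal Python sketch. It is not the actual WARC Server code: the class and function names, the cluster labels ("old" and "hadoop3") and the in-memory dictionaries standing in for the tracking database and the two clusters are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical reader signature: (path, offset, length) -> bytes
Reader = Callable[[str, int, int], bytes]


@dataclass(frozen=True)
class WarcLocation:
    cluster: str  # hypothetical label, e.g. "old" or "hadoop3"
    path: str     # where the WARC file lives on that cluster


class WarcServer:
    """Route a request for a WARC record to the cluster that holds the file."""

    def __init__(self, tracking_db: Dict[str, WarcLocation],
                 readers: Dict[str, Reader]) -> None:
        self.tracking_db = tracking_db  # stands in for the file tracking database
        self.readers = readers          # one reader per cluster

    def fetch(self, filename: str, offset: int, length: int) -> bytes:
        location = self.tracking_db.get(filename)
        if location is None:
            raise KeyError(f"{filename} is not in the tracking database")
        read = self.readers[location.cluster]
        return read(location.path, offset, length)


def make_fake_reader(store: Dict[str, bytes]) -> Reader:
    """Build a reader over an in-memory dict, standing in for a real cluster client."""
    def read(path: str, offset: int, length: int) -> bytes:
        return store[path][offset:offset + length]
    return read


def build_demo_server() -> WarcServer:
    """Assemble a server over two fake clusters (illustrative data only)."""
    old_store = {"/warcs/a.warc.gz": b"0123456789"}
    new_store = {"/warcs/b.warc.gz": b"abcdefghij"}
    tracking_db = {
        "a.warc.gz": WarcLocation("old", "/warcs/a.warc.gz"),
        "b.warc.gz": WarcLocation("hadoop3", "/warcs/b.warc.gz"),
    }
    readers = {
        "old": make_fake_reader(old_store),
        "hadoop3": make_fake_reader(new_store),
    }
    return WarcServer(tracking_db, readers)


server = build_demo_server()
print(server.fetch("b.warc.gz", 2, 3))  # b'cde'
```

In this sketch the caller only supplies a file name, an offset and a length, so a file can move between clusters without the playback side changing, provided its tracking record is updated.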

In the future, the tracking database will be used to help orchestrate the movement of content to Hadoop 3, with hardware being shifted over as it becomes available. The new WARC Server means that we will be able to maintain an uninterrupted service throughout.

But to avoid interruption now, we also needed to enable access to the newer content on Hadoop 3 by indexing it for playback. To this end, a new CDX indexer implementation has been created that can be run on either cluster (built with Webrecorder’s Python tools rather than Java). As before, the tracking database is used to keep track of what’s been indexed, but both clusters can now be indexed promptly.

Similarly, although not fully moved into production yet, the Document Harvester document extractor and the Solr full-text indexing tasks have been re-written to be able to run on either cluster, and be more robust than the prior implementations.

At the time of writing, the main public website and the internal Storage Report have not been fully moved over to run across both systems, so there may be some slight inconsistencies there in the short term. However, we expect to resolve this in the next week or two.

Task Orchestration via Apache Airflow
This large set of changes has also been used as an opportunity to update how our critical web-archiving tasks are implemented and orchestrated. We were using the Luigi framework to define tasks and their dependencies, but over time we have found this to be problematic in a number of ways:

  • The code that performs tasks and the code that orchestrates those tasks were mixed together in the same source files. This made it very hard to work on improving any individual task on its own, and made testing difficult.
  • The Luigi task scheduling seems to be unreliable, with processors occasionally getting stuck and not making any progress, or not raising any errors on failure. This particularly affected the Document Harvester, leading to a number of outages.
  • The Luigi task management interface is not very useful. It does not make it easy to look at previous runs, and presents very little detail.
  • The way Luigi encourages task dependencies to be coded makes it very difficult to clear out those dependencies so tasks can be re-run.

Therefore, while updating the various web archive tasks, they have been modified to run under Apache Airflow.

Apache airflow

This is a popular and very widely used workflow definition and scheduling system, with both Google and Amazon offering Airflow as a fully managed cloud service, as well as a healthy open source community around it. Along with this choice of workflow platform, we have also chosen to implement each task tool as a separate standalone Python command-line program. This means:

  • Task code is separate from orchestration, can be developed independently, and tasks can be deployed as Docker containers, which keeps the underlying software dependencies apart.
  • We get to use the Airflow scheduler, which appears to be more reliable, will warn us when tasks get stuck or fail, and provides Prometheus integration for monitoring.
  • The Airflow Web UI is very detailed, allows access to task logs, summaries of runs and statistics, makes workflow management easier, and provides a framework for documenting each workflow.
  • The Airflow Web UI also makes it easy to clear the status of failed workflow runs so they can be re-run as needed.

Over time, we expect to move all web archiving tasks over to this system.
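As a rough illustration of what "a separate standalone Python command-line program" can look like, here is a minimal sketch. The task itself (counting file names that look like WARCs) and all the names in it are invented for this example and are not taken from the UKWA codebase. The point is only that the task logic is a plain function, the command-line wrapper is thin, and nothing in the file knows about the scheduler.

```python
import argparse
import sys
from typing import List, Optional


def count_warc_files(filenames: List[str]) -> int:
    """Task logic (hypothetical): count names that look like WARC files."""
    return sum(1 for name in filenames if name.endswith((".warc", ".warc.gz")))


def main(argv: Optional[List[str]] = None) -> int:
    """Thin command-line wrapper; returns a process exit status."""
    parser = argparse.ArgumentParser(
        description="Count WARC files (illustrative task only)."
    )
    parser.add_argument("filenames", nargs="*", help="file names to inspect")
    args = parser.parse_args(argv)
    print(count_warc_files(args.filenames))
    return 0


if __name__ == "__main__":
    status = main()
    if status != 0:
        sys.exit(status)
```

Because the logic lives in `count_warc_files`, it can be unit-tested directly, while an orchestrator only needs to run the script (or a container wrapping it) and check its exit status.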

W3ACT
W3ACT is used by UKWA curators and other authorised users to add targets and manage quality assurance and licensing. There have only been minor updates to the W3ACT curation service lately, rolled out towards the end of December.

  • QA Wayback is now running PyWB version 2.6.3 for improved playback (e.g. ukwa-pywb#70).
  • Improvements to how the W3ACT authentication cookie is handled, resolving w3act#662.

UKWA Website
Most of the recent work on the UKWA website (www.webarchive.org.uk) user interface has focused on improving the presentation of our large set of curated collections by grouping them into categories. This work is still being discussed and developed internally, so isn’t part of the public website yet. However, we’re making good progress and hope to release a new version of the website over the coming weeks.

Apart from the interface itself, some additional work has been done to update the internal services (e.g. update PyWB to version 2.6.3 and add the WARC Server to read content from both Hadoop clusters), and to move the deployment to our newer production platform. As indicated above, these updates should be rolled out shortly.

2021 Domain Crawl
As in 2020, the 2021 Domain Crawl was run on the Amazon Web Services cloud. This time, following improvements to Heritrix and building on prior experience, the crawl ran more smoothly and efficiently than in 2020, using less memory and disk space for the crawl frontier. The crawler was started up early in August for penetration testing, and then taken down while the security concerns were addressed. The actual crawl began on the 24th of August, starting with 10 million seed URLs, and the vast majority of the crawl had completed by mid-November. Most of the 27 million hosts we visited were crawled completely, but ~57,200 hosts did hit the 500MB size cap. However, some of these were content distribution networks (CDNs), i.e. services hosting resources for other sites, so some caps were lifted manually and the crawl was allowed to continue.

URL rates in UKWA domain crawl

On the 30th of December, the crawl was stopped, having processed 2.04 billion URLs and downloaded 99.6 TB of data (uncompressed). However, a lot of the CDN content remained uncollected, and would take a very long time to collect under Heritrix’s normal ‘politeness’ rules. In the future, it would be good to find a way to allow Heritrix to crawl these sites much more quickly, without having to manually intervene to decide which hosts are CDNs.
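The headline figures above can be turned into a few rough averages. The short Python sketch below does that arithmetic. It assumes decimal terabytes (10^12 bytes) and treats the whole period from the 24th of August to the 30th of December 2021 as crawl time, even though most of the crawl had finished by mid-November, so the average rate understates the rate while the crawl was most active. The variable names are just for this example.

```python
from datetime import date

# Figures quoted in the post
urls_processed = 2.04e9
bytes_downloaded = 99.6e12        # 99.6 TB, assuming decimal terabytes
hosts_visited = 27e6
hosts_capped = 57_200             # approximate count of hosts hitting the 500MB cap
crawl_start = date(2021, 8, 24)
crawl_stop = date(2021, 12, 30)

# Derived averages
avg_bytes_per_url = bytes_downloaded / urls_processed
capped_share = hosts_capped / hosts_visited
crawl_days = (crawl_stop - crawl_start).days
avg_urls_per_second = urls_processed / (crawl_days * 86_400)

print(f"Average size per URL: {avg_bytes_per_url / 1e3:.1f} kB")  # 48.8 kB
print(f"Hosts hitting the cap: {capped_share:.2%}")               # 0.21%
print(f"Crawl length: {crawl_days} days")                         # 128 days
print(f"Average rate: {avg_urls_per_second:.0f} URLs/s")          # 184 URLs/s
```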

At this time, it has not been decided whether the 2022 Domain Crawl will be run in the cloud or from our Boston Spa site. Either way, we expect to begin the process of transferring domain crawl 2020/2021 content from AWS to our Hadoop 3 cluster over this next year.

Upcoming work
In the next quarter (Jan-Mar 2022), as well as the future updates outlined above, we are also expecting to:

  • Receive hardware for the additional Hadoop 3 replication cluster, then start setting it up and populating it ahead of it being transferred to the National Library of Scotland later in the year
  • Improve monitoring of the process of moving WARCs and logs to Hadoop (in part to ensure we spot problems with the Document Harvester earlier)
  • Add improved reporting services, replacing the current Storage Report with one that is up-to-date and runs across both clusters (ukwa-notebook-apps#12)
  • Integrate static documentation and translations into the main website, via a simple CMS (ukwa-services#48). This will make it easier to add more pages and manage the translation of those pages to/from Welsh and Scottish Gaelic.
  • Begin implementing the NPLD Player, which we need in order to improve reading-room access across the Legal Deposit libraries. We’re currently finalizing the details of how our external partner will help us do this, and more details will be made available over the next couple of months.

15 December 2021

How a web designed for the visually impaired is a better web for everyone

By Jason Webber, Web Archive Engagement Manager, The British Library

This Disability History Month, staff from across the British Library have collaborated on a series of blog posts to highlight stories of disability and disabled people in the Library’s collections. Each week a curator will showcase an item from the collections and present it alongside commentary from a member of the British Library’s Disability Support Network. These selections are a snapshot insight into the Library’s holdings of disability stories, and we invite readers to use these as a starting point to explore the collections further and share your findings with us.

The web, created over 30 years ago, has revolutionised the world of information sharing but it has not always been an ideal space for all users and in particular those who are visually impaired. By regularly capturing copies of websites over time, Web Archives can document changes and see the progress on accessibility.

During the 1990s and early 2000s it was not unusual for websites to use small, fixed type, poor colour contrast, animations, dense text and many other techniques that can make them harder to read or view. For example, this was the first website I helped maintain back in 1999 when I had just started in the Court Service web team. Although not too bad for the time, it does illustrate in some ways what web design and accessibility could look like over 20 years ago.

Court Service website 1999

The Court Service (then part of the Lord Chancellor’s Dept) website in 1999, captured by the UK Government Web Archive. 

Contrast the 1999 Court Service website with the thoroughly modern and accessible GOV.UK website whose team work extremely hard to make it as easy to view and use as possible.

Accessibility-govUK2021

Gov.UK archived website from 2021

Improvements
Whilst far from perfect, the modern web is a much better place now for visually impaired people but how did this change come about?

In 1995, just as the web was gaining in popularity, the landmark ‘Disability Discrimination and Equality Act’ came into force in the UK (Note: this legislation has been updated many times since then). At a similar time the ‘Web Content Accessibility Guidelines (WCAG)’ were being developed. Also, charities such as the Royal National Institute for the Blind (RNIB) have been huge champions for online accessibility, even offering a badge of approval to compliant sites.

Legislation, guidance and campaigning have all helped to move web designers and website owners into thinking about all their audience and improving standards.

Principles of web accessibility
At a basic level, websites should be available to everyone and, with just a few principles in place, this is entirely achievable. Text should be set so that the user can scale the font size, and images should have descriptive captions and alternative text. Videos and multimedia should have subtitles or captions. If websites are structured correctly, they allow screen readers to ‘speak’ the website to the user. In 2021, all websites should follow these and several other recommendations in order to be compliant. Read the full WCAG guidance for more.
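For readers who maintain a website, one of these principles is easy to check automatically. The sketch below uses Python's standard `html.parser` module to list images that have no `alt` attribute at all. The class and function names are made up for this example, and it only covers this single check: an empty `alt=""` is left alone because that is the usual way to mark purely decorative images, and the script cannot judge whether the alternative text that is present is actually useful.

```python
from html.parser import HTMLParser
from typing import List


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> tag that has no alt attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.missing: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Only flag images where alt is absent; alt="" marks a decorative image.
        if "alt" not in attributes:
            self.missing.append(attributes.get("src") or "(no src)")


def find_images_missing_alt(html: str) -> List[str]:
    """Return the src values of images in the HTML that lack an alt attribute."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


sample = (
    '<p><img src="logo.png" alt="Library logo">'
    '<img src="chart.png">'
    '<img src="spacer.gif" alt=""></p>'
)
print(find_images_missing_alt(sample))  # ['chart.png']
```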

Another example could be RNIB’s own website that has undergone considerable change and improvement over the years. See these archived websites from 2008 and 2021.

RNIB website 2008

RNIB archived website from 2008 

RNIB website 2021

RNIB archived website from 2021

A better web for everyone!
Making the web accessible for visually impaired people is something that benefits everyone. Bigger text with more ‘white space’ and high colour contrast on a page makes for much easier (and quicker) reading. Many people today with no visual impairment use captions and subtitles on videos they watch, either to keep the volume low (or off) or simply because it makes things easier to understand.

From the website owner’s point of view, why would anyone want to discourage people from using their website, whether that means reading their news, latest blog or educational resource or, if they are a business, buying their products or services?

Making an accessible web is a WIN-WIN for us all and we should be grateful for the hard work of those who got us where we are today and who are still striving for improvements.

Read more information on accessibility in the early web.


Reflection from British Library staff Disability Support Network member
I completely agree with Jason: making websites, or anything in life, accessible for people with impairments and disabilities does benefit everyone. Very often actions taken to make something accessible for one kind of disability actually benefit many others. For example, many of the website guidelines will benefit those with neurodiverse differences as well as visual impairments. Lots more can still be done to make web content accessible, particularly with a growing amount of information shared via social media as opposed to a website. Making things accessible often just takes some time; not everything has a financial implication. One example is taking the time to write Alt Text and Image Descriptions.

I often find that design and aesthetics are still a barrier to making things accessible. If the outcome of making something accessible doesn’t fit in with the aesthetics and design branding of an organisation, they often won’t bother making the effort to make it accessible. Making information accessible doesn’t have to compromise on design, people just need to change their perceptions and their approach, and make adaptations.

Sarah

12 November 2021

Welsh language websites within the UK Web Archive

By Aled Betts, Acquisitions Librarian and Web Archivist, National Library of Wales

The National Library of Wales have been collecting Welsh language websites to archive for the UK Web Archive since 2004. In 2018, we decided to collate these websites and include them in a dedicated Collection in order to make them more accessible to researchers.

Significantly, 2018 was an important milestone for the Welsh language as it was 25 years since the passing of the Welsh Language Act in 1993, which gives effect to the principle that in the conduct of public business in Wales, the English and Welsh languages should be treated ‘on the basis of equality’. It is also now 10 years since the passing of the Welsh Language (Wales) Measure 2011, which gave the Welsh language official status in Wales. For Government and public bodies, the principle that the Welsh language will not be treated less favourably than English is observed. As a result, the Welsh language is clearly visible and widespread on the web, as many websites are now bilingual by law.

However, the aim of the Welsh Language Collection was not simply to list websites that were published through the medium of Welsh. The focus was more on those websites and organisations whose aim was to promote and facilitate the use of the Welsh language in all walks of life. The Collection also covers websites relating to Welsh language communities, online and physical, where Welsh is the medium of communication. It also looks at bodies that promote Welsh umbrella organisations as well as groups that campaign and lobby for the language. Furthermore, as we have been collecting Welsh language websites since 2004, we were able to showcase many of these websites and show how much they had changed over the last 17 years!

Here is just a small sample of the type of websites covered in the Welsh Language Collection.

Advocacy, campaigning and lobbying
Much of the work promoting the Welsh language across Wales is done by Mentrau Iaith (English: Language Initiatives). These are community-based organisations that operate to raise the profile of the Welsh language in a specific area. The percentage of Welsh speakers varies considerably. For instance, the highest percentage of Welsh speakers can be found in Gwynedd (64%) and the lowest in Blaenau Gwent (8%), so the challenges in each area differ. In order to capture this important work, we also archived their Twitter feeds. These feeds show us how these initiatives are promoting the Welsh language in their respective areas. Furthermore, the Menter Iaith (English: Language initiative) umbrella body website is one of the earliest sites we captured, a site we first archived in 2006.

Welsh-language-02

Mentrau Iaith (English: Language initiative) website in 2021

Mentrau Iaith website

Mentrau Iaith (English: Language initiative) website in 2006

Over the last 2 decades, we have seen bodies and organisations evolve, grow and, in some cases, disappear. A statutory body set up under the Welsh Language Act 1993 was Bwrdd yr Iaith Gymraeg (English: Welsh Language Board). The board was responsible for administering the Welsh Language Act and for seeing that public bodies in Wales kept to its terms. The Welsh Language Board was abolished in 2012 and, following the passing of the 2011 Welsh Language (Wales) Measure, powers were transferred to the Welsh Government and the Welsh Language Commissioner, a new body promoting and facilitating the use of the Welsh language. Fortunately, we have captured this transfer of power as we have been archiving the Welsh Language Board website since 2008 and the Welsh Language Commissioner since 2012. In both cases, open access has been granted.

Welsh-language-03

Bwrdd yr Iaith Gymraeg (English: Welsh Language Board) website in 2008

Welsh-language-04

Comisiynydd y Gymraeg (English: Welsh Language Commissioner) website in 2021

Arts and Culture
The Welsh language has a lively and vibrant arts, music and literature scene. Nowhere is this better exemplified than by the Eisteddfod Genedlaethol (English: National Eisteddfod) and Urdd Gobaith Cymru, the Welsh language national voluntary youth organisation, which runs the Urdd Eisteddfod, arguably Europe's largest youth festival. Both sites have been archived since the early 2000s. The National Eisteddfod is held in a different location each year, alternating between north and south Wales, so the content naturally changes every year. The first National Eisteddfod we archived was Eisteddfod Genedlaethol Cymru Casnewydd a’r Cylch (English: National Eisteddfod of Wales Newport and surrounding area) in 2004 and our first Urdd National Eisteddfod was Eisteddfod yr Urdd Sir Ddinbych (English: Urdd Eisteddfod Denbighshire) in 2006! Again, open access has been granted, so both are available to view anywhere.

Welsh-language-05

The Eisteddfod Genedlaethol Cymru Casnewydd a’r Cylch (English: National Eisteddfod of Wales Newport and surrounding area) 2004

Welsh-language-06

Urdd Eisteddfod Denbighshire 2006

Alongside the all-important bodies, we archive a plethora of arts and culture websites, from record labels to folk groups, theatrical bodies, local eisteddfodau and Welsh language festivals. The same goes for the buoyant Welsh literature and publishing scene, with close to a hundred websites listed within our ‘literature and publishing’ sub-section.

Education and Learning
An all-important sub-section is Education and Learning. Here two types of websites dominate. The first is education and learning through the medium of Welsh. We archive Welsh-medium education websites ranging from Mudiad Meithrin (English: Nursery Movement), formed in 1971 to nurture early-years Welsh speakers, to Coleg Cymraeg Cenedlaethol (English: Welsh National College), formed in 2011 to develop Welsh-language courses and resources for Higher Education students.

Welsh-language-07

Coleg Cymraeg Cenedlaethol (English: Welsh National College) website in 2011

Secondly, the web has seen an explosion of language learning websites globally. This is also apparent for the Welsh language, where such websites allow those wishing to learn Welsh as a second language to do so through the internet.

Welsh-language-08

SaySomethinginWelsh website in 2011

As of 2021, the collection holds between 500 and 600 websites and is still growing. It is already a significant collection, as many of the websites have been collected since the early days of web archiving in 2004. The principle of equality has been an underlying theme in Welsh language discourse, and legislation was passed to meet this demand. The Collection explores how the promotion and support of the Welsh language have changed over the past 20 years, and also shows how legislation has helped shape this change.

19 October 2021

Clouds and blackberries: how web archives can help us to track the changing meaning of words

By Dr Barbara McGillivray (Turing Fellow), Pierpaolo Basile (Assistant Professor in Computer Science, University of Bari), Dr Marya Bazzi (Turing Fellow) and Dr Jenny Basford, Jason Webber (British Library)

NOTE: This is a re-blog from the Alan Turing Institute, with permission.

The meaning of words changes all the time. Think of the word ‘blackberry’, for example, which has been used for centuries to refer to a fruit. In 1999, a new brand of mobile devices was launched with the name BlackBerry. Suddenly, there was a new way of using this old word. ‘Cloud’ is another example of a well-established word whose association with ‘cloud computing’ only emerged in the past couple of decades. Linguists call this phenomenon ‘semantic change’ and have studied its complex mechanisms for a long time. What has changed in recent years is that we now have access to huge collections of data which can be mined to find these changes automatically. Web archives are a great example of such collections, because they contain a record of the changing content of web pages.

But how can we automatically detect in a huge web archive when a word has changed its meaning? A common strategy is to build geometric representations of words called word embeddings. Word embeddings use lots of data about the context in which words are used so that similar words can be clustered together. We can then do operations on these embeddings, for example to find the words that are closest (and most similar in meaning) to a given word. It’s a useful technique, but building embeddings takes a lot of computing power. Having access to pre-trained embeddings can therefore make a big difference, enabling those in the scientific community without sufficient computational resources to participate in this research.
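To make the idea of "closest words" more concrete, here is a minimal sketch in Python. It is not the DUKweb code and does not read the DUKweb files: the three-dimensional vectors in `toy_embeddings` are invented purely for illustration (real embeddings have hundreds of dimensions), and the function names `cosine_similarity` and `nearest_words` are our own. It uses cosine similarity, a common way of measuring how close two embeddings are.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def nearest_words(target, embeddings, k=3):
    """Return the k words whose vectors are most similar to the target word's."""
    target_vec = embeddings[target]
    scored = [
        (word, cosine_similarity(target_vec, vec))
        for word, vec in embeddings.items()
        if word != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Invented toy vectors (an assumption for illustration, not real data).
toy_embeddings = {
    "blackberry": [0.9, 0.1, 0.0],
    "raspberry": [0.8, 0.2, 0.1],
    "jam": [0.7, 0.3, 0.0],
    "phone": [0.1, 0.9, 0.2],
    "email": [0.0, 0.8, 0.3],
}

for word, score in nearest_words("blackberry", toy_embeddings, k=2):
    print(word, round(score, 3))
```

With these made-up vectors, the two nearest neighbours of ‘blackberry’ are ‘raspberry’ and ‘jam’, because their vectors point in almost the same direction.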

A team of researchers from The Alan Turing Institute and the Universities of Bari, Oxford and Warwick, in collaboration with the UK Web Archive team based at the British Library, has now released DUKweb, a set of large-scale resources that make pre-trained word embeddings freely available. Described in this article, DUKweb was created from the JISC UK Web Domain Dataset (1996-2013), a collection of all .uk websites archived by the Internet Archive between 1996 and 2013. (This dataset is held and maintained by the UK Web Archive, which has been collecting websites since 2005, initially on a selective basis and since 2013 at a whole domain level.) DUKweb contains 1.3 billion word occurrences and two types of word embeddings for each year of the JISC UK Web Domain Dataset. The size of DUKweb is 330GB.

Researchers can use DUKweb to study semantic change in English between 1996 and 2013, looking at, for instance, the effects of the growth of the internet and social media on word meanings. For example, if the word ‘blackberry’ is used mostly to refer to fruits in 1996 and to mobile phones in 2000, the 1996 embedding for this word will be quite different from its 2000 embedding. In this way, we can find words that may have changed meaning in this time period. The figure below (from Tsakalidis et al., 2019) shows four words whose contexts of use have changed in the last couple of decades: ‘blackberry’, ‘cloud’, ‘eta’ and ‘follow’. The bars indicate words most similar to these four words in 2000 (red bars) and in 2013 (blue bars). The scale along the bottom gives a measure of the change.

figure 02 - analysis - clouds, blackberries
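The comparison of a word's embeddings from different years can also be sketched in a few lines of Python. This is a simplified illustration rather than the method used in the paper: the two-dimensional vectors in `toy_1996` and `toy_2000` are invented, the names `change_score` and `rank_by_change` are our own, and we assume that the embeddings for the two years live in the same coordinate space so that they can be compared directly. The change score used here is simply one minus the cosine similarity.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def change_score(word, early, late):
    """Higher score = the word's two yearly vectors point in more different directions."""
    return 1.0 - cosine_similarity(early[word], late[word])


def rank_by_change(early, late):
    """List the words present in both years, most changed first."""
    shared = set(early) & set(late)
    return sorted(shared, key=lambda w: change_score(w, early, late), reverse=True)


# Invented toy vectors (an assumption for illustration, not real data).
# Assumes both years share one coordinate space.
toy_1996 = {"blackberry": [0.9, 0.1], "apple": [0.8, 0.2], "river": [0.5, 0.5]}
toy_2000 = {"blackberry": [0.2, 0.9], "apple": [0.5, 0.6], "river": [0.5, 0.5]}

for word in rank_by_change(toy_1996, toy_2000):
    print(word, round(change_score(word, toy_1996, toy_2000), 3))
```

In this toy example ‘blackberry’ comes out as the most changed word and ‘river’ as the least, since its vector is identical in both years.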

The resources that underpin DUKweb are hosted on the British Library’s research repository, and are available for anyone in the world to download, reuse and repurpose for their own projects. This repository is part of the BL’s Shared Research Repository for cultural heritage organisations, which brings together the research outputs produced by participating institutions, and makes them discoverable to anybody with an internet connection. Providing a stable, dedicated location to hold heritage datasets in order to share them with a wider research community has been one of the key drivers in the implementation and development of this repository service. We are grateful to the British Library’s Repository Services team for supporting this collaboration between the UK Web Archive team and the Turing by making the content for DUKweb available.

Read the paper: DUKweb: diachronic word representations from the UK Web Archive corpus