UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites


Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

12 July 2023

UK Web Archive Technical Update - Summer 2023

By Andy Jackson, Web Archive Technical Lead, British Library

This is a summary of what’s been going on since the 2023 Q1 report.

At the end of the last quarter, we launched the 2023 Domain Crawl. This started well (as described in the 2023 Q1 report) but a few days later it became clear the crawl was going a bit too well. We were collecting so quickly, we started to run out of space on the temporary store we use as a buffer for incoming content.

The full story of how we responded to this situation is quite complicated, so I wrote up the detailed analysis in a separate blog post. But in short, we took the opportunity to move to a faster transfer process and switch to a widely-used open source tool called Rclone. After about a week of downtime, the crawl was up and running again, and we were able to keep up and store and index all the new WARC files as they come in.

Since then, the crawl has been running pretty well, but there have been some problems…

2023 Domain Crawl Storage and Queues

The crawler uses disk space in two main ways: the database of queues of URLs to visit (a.k.a. the crawl frontier), and the results of the crawl (the WARCs and logs). The work with Rclone helped us get the latter under control, with the move from /mnt/gluster/dc2023 to sharing the main /opt drive and uploading directly to Hadoop. These uploads run daily, leading to a saw-tooth pattern as free space gets rapidly released before being slowly re-consumed.

But the frontier shares the same disk space, and can grow very large during a crawl. So it’s important we keep an eye on things to make sure we don’t run out of space. In the past, before we made some changes to Heritrix itself, it was possible for a domain crawl to consume huge amounts of disk space. Once, we hit over 100TB for the frontier, which becomes very difficult to manage. In recent domain crawls, thanks to our configuration changes, we’ve managed to get this down to more like 10TB.

But, as you can see, around the 13th of June, we hit some kind of problem, where the apparent number of queues in the frontier started rapidly increasing, as did the rate at which we were consuming disk space. We deleted some crawler checkpoints to recover some space, as we very rarely need to restart the crawl from anything other than the most recent daily checkpoint, but this only freed up modest amounts of space. Fortunately, the aggressive frontier growth seemed to subside before we ran out of space, and the crawl is now stable again.

Unfortunately, it’s not clear what happened. Based on previous crawls, it seems unlikely that the crawler suddenly discovered many more millions of web hosts at this point in the crawl. In the past, the number of queues has been consistently up to around 20 million at most, so this leap to over 30 million is surprising. It is possible we hit some weird web structures, but it’s difficult to tell as we don’t yet have reliable tools for quickly analysing what’s going on in this situation.
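One simple way to spot this kind of jump sooner is to compare each day's queue count with the previous day's and flag unusually large increases. The sketch below is illustrative only: the function name, the 5% threshold and the sample figures are all assumptions made for this example, not values taken from our monitoring system.

```python
from typing import List, Tuple


def flag_rapid_growth(
    daily_counts: List[int], max_daily_growth: float = 0.05
) -> List[Tuple[int, float]]:
    """Return (day_index, growth_fraction) for each day whose count grew
    faster than max_daily_growth relative to the previous day."""
    flagged = []
    for day in range(1, len(daily_counts)):
        previous = daily_counts[day - 1]
        if previous <= 0:
            # Skip days where a relative growth rate is undefined.
            continue
        growth = (daily_counts[day] - previous) / previous
        if growth > max_daily_growth:
            flagged.append((day, growth))
    return flagged


# Made-up frontier queue counts, one sample per day (hypothetical data).
sample = [
    18_000_000, 18_200_000, 18_400_000,
    22_000_000, 27_000_000, 30_500_000, 30_600_000,
]

for day, growth in flag_rapid_growth(sample):
    print(f"day {day}: queue count grew by {growth:.1%}")
```

A check like this only says that something changed, not why, so it would complement rather than replace proper frontier analysis tools.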

Suspiciously, just prior to this problem, we resolved a different issue with the system used to record what URLs had been seen already. This had been accidentally starved of resources, causing problems when the crawler was trying to record what URLs had been seen. This led to the gaps in the crawl monitoring data just prior to the frontier growth, as the system stopped working and required some reconfiguration. It’s possible this problem left the crawler in a bit of a confused state, leading to mis-management of the frontier database. Some analysis of the crawl will be needed to work out what happened.

In the last quarter, the new URL search feature was deployed on our BETA service. Following favourable feedback on the new feature, the main https://www.webarchive.org.uk/ service has been updated to match. We hope you find the direct URL search useful.

We’ve also updated the code that recognises whether a visitor is in a Legal Deposit reading room, as it wasn’t correctly identifying readers at Cambridge University Library. Finally, there was an issue with how the CAPTCHAs on the contact and nomination forms were being validated, which has also been resolved.
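This post does not describe how the reading room check works. Purely as an illustration, and assuming a check based on the visitor's IP address, the sketch below matches an address against a table of network ranges. The function name, the table and the ranges are hypothetical; the ranges are reserved documentation networks (RFC 5737), not real library networks.

```python
import ipaddress
from typing import Optional

# Hypothetical mapping of library names to network ranges.
# These are documentation-only ranges, used here as placeholders.
READING_ROOM_NETWORKS = {
    "British Library": [ipaddress.ip_network("192.0.2.0/24")],
    "Cambridge University Library": [
        ipaddress.ip_network("198.51.100.0/24"),
        ipaddress.ip_network("203.0.113.0/25"),
    ],
}


def reading_room_for(ip_text: str) -> Optional[str]:
    """Return the library whose range contains ip_text, or None."""
    try:
        address = ipaddress.ip_address(ip_text)
    except ValueError:
        # Not a valid IP address at all.
        return None
    for library, networks in READING_ROOM_NETWORKS.items():
        if any(address in network for network in networks):
            return library
    return None


print(reading_room_for("203.0.113.10"))
print(reading_room_for("203.0.113.200"))
```

With a table like this, a library whose readers are not recognised can be fixed by correcting or adding a range, without touching the matching logic.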

Our colleagues from Webrecorder delivered the initial set of changes to the ePub renderer, making it easier to cite a paragraph of one of our Legal Deposit eBooks. Given how long the ePub format has been around, it is perhaps surprising that support for ‘obvious’ features like citations and printing is still quite immature, inconsistent and poorly-standardised. To make citation possible, we have ended up adopting the same approach as Calibre’s Reference Mode and implemented a web-based version that integrates with our access system.

We’ve also worked on updating the service documentation based on feedback from our Legal Deposit Library partners, resolved some problems with how the single-concurrent-use locks were being handled and managed, and implemented most of the translations for the Welsh language service. The translations should be complete shortly, and an updated service can then be rolled out, including the second set of changes from Webrecorder (focused on searching the text of ePub documents).

Replication to NLS

The long process of establishing a replica of our holdings at the National Library of Scotland (NLS) is finally nearing completion. We have an up-to-date replica, and have been attempting to arrange the transfer of the servers. This turned out to be a bit more complicated than we expected, so has been delayed, but should be completed in the next few weeks.

Minor Updates

For curators, one small but important fix was improving how the W3ACT curation tool validates URLs. This was thought to have been fixed already, but the W3ACT software was not using URL validation consistently and this meant it was still blocking the creation of crawl target records with top-level domains like .sport (rather than the more familiar .uk or .com etc.). On June 23rd, we released version 2.3.5 of W3ACT, which should finally resolve this issue.
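To illustrate the kind of problem involved, the sketch below validates a URL by its syntax rather than against a fixed list of known top-level domains, so newer ones like .sport are accepted. It is not the W3ACT code (W3ACT is not written in Python); the function name and the exact rules are assumptions chosen for this example.

```python
import re
from urllib.parse import urlsplit

# One DNS label: letters, digits and hyphens, 1-63 characters,
# not starting or ending with a hyphen. (ASCII only in this sketch;
# internationalised names would need converting to punycode first.)
LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_plausible_target_url(url: str) -> bool:
    """Accept http(s) URLs whose host is a syntactically valid domain name,
    without checking the top-level domain against any fixed list."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    if not host or "." not in host:
        return False
    labels = host.split(".")
    if not all(LABEL.match(label) for label in labels):
        return False
    # A purely numeric final label suggests an IP address, not a domain.
    return not labels[-1].isdigit()


for candidate in ("https://example.sport/fixtures", "https://localhost/"):
    print(candidate, is_plausible_target_url(candidate))
```

Whatever rule is chosen, the important point is to apply the same validation function everywhere a URL is checked, since inconsistent validation was the underlying problem described above.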

Apart from that, we also updated Apache Airflow to version 2.5.3, and leveraged our existing Prometheus monitoring system to send alerts if any of our SSL certificates are about to expire.
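The underlying check is simple date arithmetic. The sketch below is independent of Prometheus and is not our alerting configuration: it just works out how many days remain before a certificate's expiry date and compares that with a warning window. The function names and the 14-day window are assumptions made for this example; the date string uses the same format as the 'notAfter' field returned by Python's ssl.SSLSocket.getpeercert().

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before the certificate expiry time given in not_after,
    e.g. 'Jul 11 00:00:00 2023 GMT'. 'now' is seconds since the epoch."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def needs_alert(
    not_after: str, warn_days: float = 14, now: Optional[float] = None
) -> bool:
    """True if the certificate expires within warn_days (or already has)."""
    return days_until_expiry(not_after, now) < warn_days


# Fixed reference time so the example does not depend on today's date.
reference = ssl.cert_time_to_seconds("Jul  1 00:00:00 2023 GMT")
print(needs_alert("Jul 11 00:00:00 2023 GMT", now=reference))
print(needs_alert("Sep  1 00:00:00 2023 GMT", now=reference))
```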

03 July 2023

RESAW 2023 Conference Report from the UK Web Archive

By Cui Cui Bodleian Libraries/University of Sheffield Information School, Nicola Bingham, Helena Byrne, British Library, Alice Austin Edinburgh University.

RESAW 2023 Exploring the Archived Web During a Highly Transformative Age

2023 was the fifth RESAW conference. RESAW stands for Research Infrastructure for the Study of Archived Web Materials. Established in 2012, it aims to promote a collaborative European research infrastructure for the study of archived web materials and holds a conference every two years. The 2023 conference was held in Marseille on June 5-6 under the theme ‘Exploring the Archived Web During a Highly Transformative Age’. There was a packed programme with a number of UK-based presentations, especially from the UK Web Archive teams based at the Bodleian Libraries, the British Library and Archive of Tomorrow project partner, the University of Edinburgh.

The keynote presentations from the conference were streamed live and the recording of the day two keynote ‘Saving Ukrainian Cultural Heritage Online’ by Sebastian Majstorovic (European University Institute) is available on the Inspé Aix-Marseille YouTube channel.

In this blog post participants from the UK Web Archive teams have reported back on their conference experience.

Bodleian Libraries/University of Sheffield Information School 

Cui Cui, Web Archivist / PhD researcher

The experience of presenting two papers at the fifth RESAW conference turned out to be a highly emotional one for me. The first presentation, alongside my fellow web archivist Alice Austin from the University of Edinburgh, marked the end of the Archive of Tomorrow project. It gave me a chance to reflect on the work we carried out for the project. The second presentation concluded the initial phase of my PhD research project on participatory web archiving. Presenting at the conference compelled me to summarise the findings from a survey I delivered last year, aiming to gain insights into the current practices of participatory web archiving. This experience not only marked a significant milestone, but also served as a starting point to bring theories and practices together to develop better web archives.

During a panel discussion titled “Interrogating the logics of web archiving in the era of platformization”, Jessica Ogden, Katie Mackinnon and Emily Maemura posed some critical questions about web archiving practices. Who are we collecting for, what shall we collect and how can we approach this process ethically? They put content creators at the centre of these considerations and challenged web archivists to reflect critically on our practices and ethics. It is reassuring that we are not alone in grappling with these complex issues, as these questions echo the constant dilemmas we face as web archivists. In particular, the Archive of Tomorrow project highlighted the double-bind situations we encountered when dealing with ethical considerations and piloted engagement work with content creators. From both researchers’ and archivists’ perspectives, it is clear that these concerns call for more evidence-based studies and a deeper understanding of the views held by content creators and a wide range of other stakeholders.

Overall, the RESAW conference provided a thought-provoking experience. It allowed me to reflect on our work, consolidate my understanding, and recognise the need for continued efforts to address these complex issues.

British Library

Nicola Bingham, Lead Curator of Web Archives

I felt very privileged to attend this conference at the Mucemlab in Marseille, set in the courtyard of Fort Saint-Jean, with a stunning mix of old and new architecture and amazing sea views. During the conference, I found numerous presentations informative, engaging, thought-provoking and humorous. Two in particular, however, sparked profound reflections on curatorial praxis within the context of my own work.

Henrik Smith-Sivertsen took the audience on a captivating journey into the world of digital music archiving. With a focus on three distinct songs, he illustrated how the mediascapes in which they were published have a significant impact on the archiving process. Through his exploration, he highlighted the challenges of capturing and preserving complex digital objects from social media platforms and streaming services. The question of which version(s) to capture became a pivotal point of discussion, raising awareness of the dynamic nature of digital music and the evolving digital landscape it resides in. A thought-provoking video presentation showcased the different online iterations of Lukas Graham's "7 Years" from 2015. The variations in platforms, remixes, and user-generated content surrounding this song demonstrated the diverse ways in which music proliferates and evolves online. The presentation served as a powerful reminder of the challenges faced by archivists when attempting to capture and preserve such dynamic and multi-faceted digital musical artefacts.

Tiancheng Leo Cao from the University of Texas at Austin's intriguing paper focused on the changing meanings of openness within the museum context. He shed light on the gradual shift from an institution-oriented understanding to an access-oriented interpretation, prioritising the needs and participation of the public. I was struck by how this ideology parallels our thinking in the UK Web Archive where efforts are being made to embed more participation in the curatorial process. By involving communities, ensuring diverse perspectives, and including multiple voices, heritage organisations can create a more inclusive and representative platform for preserving our digital heritage.

Helena Byrne, Curator of Web Archives 

This was my second time attending a RESAW conference. The first I attended was in 2017, as part of the Web Archiving Week event held in London, when the IIPC Web Archiving Conference and RESAW collaborated on organising a full week of web archiving activities. At RESAW 2023 I co-presented two presentations, both on day two of the conference. These were both collaborations that came out of the WARCnet network. The first was a joint presentation with Emily Maemura (University of Illinois) where we fed back some initial findings from the series of workshops we facilitated on ‘Describing Collections with Datasheets for Datasets’. The second presentation was a joint presentation with Sharon Healy (Maynooth University) on ‘Assessing the Scholarly Use of Web Archives in Ireland’. In this presentation we highlighted a section from a much larger report that will be published as part of the WARCnet Papers and Special Reports.

A key highlight for me in the programme was the session 'Building the Next Generation of Web Archive Analysis Service'. This panel gave an overview of the development of the Archives Unleashed project from 2017. The project is now winding up and will be supported by the Internet Archive who will be releasing a subscription service to Archives Research Compute Hub (ARCH) this summer. I've been lucky enough to attend Archives Unleashed events in 2017 and 2019 so it was really great to see how the project has changed over time. I wish the Archives Unleashed team all the best.

University of Edinburgh

Alice Austin, Web Archivist

The Archive of Tomorrow project team took two papers to RESAW this year. The first was a deep-dive into the Trans Health sub-category within the Talking About Health collection. The second, presented jointly with my fellow web archivist Cui Cui of the Bodleian Libraries, delivered a condensed version of the project’s Final Report, and reflected on the challenges, wins and losses of the project as a whole.

A few related themes emerged from this year’s papers. A number of speakers reflected on the value of the archived web as a source for ‘bottom-up’ perspectives on the impact of online spaces in the development of narratives at a personal and social level. Arguing that the events of 9/11 galvanised emerging web archiving efforts, Ian Milligan’s paper explored how the resultant archived pages provide a rich source for future historians wanting to understand how that day evolved; Dana Diminescu’s paper on the archive of the ‘Comme a la maison’ platform examined how changes in the language of hospitality used online can reflect changes in societal understanding of the migrant experience; and Anya Shchetvina’s paper discussed how web-based communication objects can become recontextualised as memory objects.

Another theme concerned how to do web archiving in an age of ‘platformisation’. A trio of papers by Emily Maemura, Jess Ogden, and Kate MacKinnon explored this in detail, raising important questions about how web archiving practices might better serve the communities that they draw from. Camille Riou considered the vulnerability of data in a capitalist world in the context of the withdrawal of Twitter’s API for academic research, and Cade Diehm and Benjamin Royer of the New Design Congress presented an excellent overview of the sector’s readiness to grapple with issues of the polycrisis such as colonialism, privatisation and datafication. 

The sixth RESAW Conference will be held in 2025 at University of Siegen in Germany. The theme for the conference is ‘Histories of the Datafied Web: Infrastructures, metrics, aesthetics’. More details about the conference and the call for papers will be announced in due course. 

28 June 2023

IIPC Web Archiving Conference 2023 Report from the UK Web Archive

By Nicola Bingham, Helena Byrne, Ian Cooke, Carlos Lelkes-Rarugal, Andrew Jackson, Richard Price British Library, Leontien Talboom Cambridge University Library, Mark Simon Haydn National Library of Scotland.

IIPC WAC2023 Conference Banner

The IIPC 2023 Web Archiving Conference was hosted by the Netherlands Institute of Sound and Vision in Hilversum and co-organised by KB, National Library of the Netherlands. There was an online session held on May 3rd and the main in-person event took place on May 11th and 12th. There was a packed programme that included Q&A sessions for pre-recorded presentations for the online day and presentations, workshops, lightning talks as well as posters for the in-person event. This was the first in-person IIPC conference since 2019 when the event was hosted by the National and University Library in Zagreb (NSK), Croatia.

Many UK Web Archive colleagues from Bodleian Libraries, the British Library, Cambridge University Library and National Library of Scotland attended the conference both as delegates and presenters. In this blog post they have reported back on their conference experience.

British Library

Nicola Bingham, Lead Curator of Web Archiving

Attending the IIPC conference in person for the first time since 2019 was a great experience. The combination of reconnecting with colleagues after four long years and the (literally) colourful ambience of the Beeld & Geluid (Institute for Sound & Vision) created an atmosphere brimming with renewed energy and optimism. I will highlight just a few of the presentations and conversations that were interesting from my point of view.

I enjoyed hearing about De Digitale Stad Herleeft (the Digital City Revived) from Marleen Stikker, founder and ‘mayor’ of DDS, Marieke Brugman of UNESCO and Tjarda de Haan, Bits and Bytes United. Presentations focused on the “webarchaeological excavations” which took place to reconstruct, preserve, store and make accessible this unique digital heritage based on KB’s XS4ALL web collection - which was listed as UNESCO Memory of the World Heritage for the Dutch list and is now under review for the worldwide list.

I enjoyed insights into diversity and co-curation from Jesper Verhoef, a Researcher-in-Residence at KB working on "Mapping the Dutch LGBT+ Web Archive". Jesper's work utilises KB's collections to explore the unique web sphere formed by LGBTQ+ - or queer - people and how this evolved over time. It sparked intriguing insights and perspectives which could be applied to our own LGBTQ+ collection.

Collaboration and innovation in web archiving were recurring themes at the conference. Valuable insights were shared by the team from the Library of Congress, emphasising their investment in and education of curators to effectively participate in the web archiving process. 

Finally, I had the privilege of presenting the research by WG2 of the WARCnet project, ‘Surveying the Landscape of COVID-19 Web Collections in European GLAM Institutions’ in a session dedicated to Covid-19 collections. Our findings shed light on the scope of these collections, how they were defined, and the common challenges institutions face in making them accessible for research purposes. 

Helena Byrne, Curator of Web Archives 

I participated in both the online and in-person events, as a collaborator in a presentation on the online day and as co-facilitator of a workshop at the in-person event. I was involved in the ‘Developing a Reborn Digital Archival Edition as an Approach for the Collection, Organisation, and Analysis of Web Archive Sources’ project with Sharon Healy (Maynooth University) and Juan-José Boté-Vericad (Universitat de Barcelona). Along with Emily Maemura (University of Illinois) we facilitated Workshop-01 ‘Describing Collections with Datasheets for Datasets’. This was part of a series of workshops we hosted to see if the Datasheets for Datasets framework could be applied to UK Web Archive collections published as data.

As a participant there were so many great takeaways from this conference. One of the sessions that stands out most for me is ‘Renewal in Web Archiving: Towards More Inclusive Representation and Practices’, on day two of the conference. The conversations in this session were really useful in helping me think about how we can continue to develop more inclusive collections and opportunities to engage in the curation process. In this session we heard about the next steps for the Archiving the Black Web (ATBW) project. Although this is a USA-based project, its impact will be global as they are currently developing a training programme to improve the curation and research use of the archived black web.

Andrew Jackson, Web Archive Technical Lead

I was involved in a couple of tool workshops during the conference, where it was great to see the interest in shared tooling, and the collaborative commitments this implies. I was also interested in how many of the presentations related to issues around information literacy. For more, see my blog post Reflections on the IIPC Web Archiving Conference 2023.

Ian Cooke, Head of Contemporary British & Irish Publications

This year’s conference was a strong reminder that web archiving is about people - the people whose lives and experiences are expressed in the collections we build; the people whose imaginations shaped the way we use, and have used, the web over time; and the people who are working across collecting, preserving and researching the archived web.

There was a great mix of presentations, blending new developments in technologies, evolving research methods, and approaches to creating and understanding collections, in ways that were accessible to all attendees. Giulia Carla Rossi and I were both pleased to talk about the development of our practice at the British Library, and legal deposit libraries, in collecting ‘emerging formats’.  

The IIPC itself is celebrating its 20th year, and the conference reflected that sense of celebration. It also demonstrated the maturing of practice, and reflection on web archiving methods and goals, at many of the organisations represented. A highlight of the conference was the presentations by Makiba Foster and Zakiya Collier on the Archiving the Black Web project, and the potential of web archiving to contribute to ‘black self-education practices, collective study and librarianship’. Foster and Collier argued for well-resourced institutions to take responsibility for providing support to community heritage organisations in building inclusive collections, and also stressed the need for ethical considerations, in particular regarding the rights of people represented within collections, when building collections.        

Overall, it was a privilege to take part in the conference and to have the time to connect in person with a community of web archive practitioners and researchers, being able to share knowledge and experience and reminding ourselves of what we have in common.

Carlos Lelkes-Rarugal, Assistant Web Archivist

I very much enjoyed my second attendance of an IIPC annual web archiving conference. 2019 was my first one, so I didn’t quite know what to expect. Suffice to say, the 2023 WAC was just as successful and another enjoyable, unique experience.

There’s such a diverse background of people. I think this is because web archiving is approached very differently, as each organisation has its particular way of going about it, which is why there is such an emphasis on sharing knowledge and information. I attended many talks and learnt about new methods of quality assurance, the infrastructure set up of institutions, policies on collecting; whichever presentation it is, you can be sure there’s something innovative going on that could be applied to your domain.

The UK Web Archive itself represents the six UK Legal Deposit Libraries, and as such, we’re inherently maintaining relationships but more importantly trying to build new relationships for new opportunities, collaborations, and potential partnerships. We’re a small team (larger than others) but still relatively small when considering the scope of our work, and I think this is exactly what the IIPC can help with. Like many organisations, the UK Web Archive does at times find web archiving to be a challenge, and as such, the IIPC helps foster a network of people who are willing to share their knowledge and expertise so that we can connect with them to tackle these emerging and ever-evolving challenges. There’s a collective effort to further web archiving, we’re trying to advance a field that has a lot of potential, so if you’re interested, please join this invaluable community.

Richard Price, Head of Contemporary British Collections

I attended this conference to reacquaint myself with web archiving in a little more detail than I have for some years. It was a privilege to attend, seeing so many different kinds of response from the international community and, if I may say so, I felt especially proud of my colleagues at the British Library for their presentations and workshops. If there was a common thread through the papers it was that the problem-solving and information-sharing intrinsic to the web archiving community are values translated from the early days of the web itself – that substantial part of the early Internet that was altruistic and public-minded – and, in today’s archiving world, underpinned by layers of technical, social, and curatorial expertise. Thank you to IIPC and to Sound and Vision at Hilversum, and to all those presenting and attending!

Cambridge University Library

Leontien Talboom, Technical Analyst

This was my first time attending IIPC apart from a very brief appearance on a panel in 2022. I was fortunate enough to be a co-presenter on two talks during the conference. One was with my colleague Mark Haydn where we presented on the datasets that we were able to create during the Archive of Tomorrow project and the other was with my colleague Caylin Smith where we explored the difficulties and opportunities of capturing the University of Cambridge domain. 

Both presentations were really enjoyable and it was great to get feedback and questions from colleagues across our field. As this was my first time attending IIPC I wasn’t sure what to expect. However, I was pleasantly surprised by the wide range of topics and formats discussed. One that really stood out to me was the work of Emily Escamilla who talked about reference rot and what would happen if GitHub was to disappear. This really showcased how much as an academic sector we rely on these types of sources to be around when referencing them, but this is not necessarily a given. 

National Library of Scotland

Mark Haydn, Metadata Analyst

It has been a few years since I've been at an in-person conference, and I had forgotten how nice it can be to visit another city and spend a few days immersed in presentations and conversations with people working in the same area. Sometimes this meant hearing about something immediately relevant to my own metadata work at the National Library of Scotland, like hearing Tom Storrar of the UK Government Web Archive assess how effective their work ramping up collecting early in the pandemic to capture frequent website updates had been, or listening to members of the ResPaDon Project detail their experiences extending regional access to web archives collections across France. Other presentations served as an opportunity to better understand topics being explored further afield: there were many demonstrations of potential uses of AI, not all of them ominous, ranging from automatically producing descriptive summaries of technical metadata, for use in Library of Congress catalogue records, to generating a generic Stirring Plenary Speech at short notice.

As well as listening in, my colleague Leontien Talboom and I presented some of our work on the Archive of Tomorrow project, summarising the progress that's been possible since the development of the British Library's web archive metadata export. We heard about other institutional and international approaches and platforms for looking at web archives at scale, like Archive-It's ARCH tools, and caught fellow Archive of Tomorrow web archivist Cui Cui's discussion of knowledge sharing before heading back to the UK.

The 2024 IIPC Web Archive Conference will be hosted by the Bibliothèque nationale de France (BnF) 24-26 April. Follow the IIPC Twitter account for updates and the call for papers due out in early autumn.

26 June 2023

LGBTQ+ Connections and Community

By Ash Green, CILIP LGBTQ+ Network, and Goldsmith University

The Marlborough Pub and Theatre

I was browsing through the LGBTQ+ Lives Online collection recently, and reminded myself that I had added The Marlborough Pub and Theatre to it when I first began co-curating the collection. As far as I can remember, it was one of the first sites I added to the archive. I wanted it in there because it had been an important part of my coming out around 2017. I had a personal connection to it, and I wanted there to be a record of the impact it had on me. I know future explorers of the UK Web Archive won’t know why that site is archived, but maybe they will stumble across this blog post in connection to it and understand its importance to at least one LGBTQ+ person – me.

So, why did I specifically want this site in there? Well, in 2017, when I was working out what support there was for me as a trans/gender non-conforming person, I discovered The Clare project, which is a trans support group in Brighton. I went along to it, and afterwards we went to The Marlborough Theatre and Pub, which was a venue with a long history of support for the LGBTQ+ community. The pub was the sort of place where I didn’t know anyone, but just being there made me feel okay about who I was. It was the first time being in an LGBTQ+ venue had felt like that to me. And I realised that there were other people there who seemed to be on similar paths in their lives. It was a reassuring place, and it was a place where I learnt about how diverse the LGBTQ+ community was. I remember going to a queer cabaret there, and it was such an amazing, heart-warming, queer, eye-opening and fun night. The pub is still there – now called The Actors. I’m not a regular visitor, and if you mention my name in there, they won’t know who I am. But when I call in from time to time when I’m in Brighton, I still get that sense of belonging to a community even if I’m quietly sitting in a corner reading on my own. It is a place that re-energises me.

It got me wondering about other sites in the LGBTQ+ Lives Online collection focused on artistic communities that may have had a similar impact on others to the one The Marly had on me.

So, for example, what joy did members of South Wales Gay Men's Chorus, Songbirds Choir, or True Colours LGBT Choir feel when they first sang with these choirs?

How excited were listeners when they heard a new track on LGBT Underground that struck a strong emotional chord with them, and has stayed with them forever?

How did filmmakers feel when their first film appeared at the Scottish Queer International Film Festival, LezDiff, or the Iris Prize? And who in the audience saw something for the first time at these film festivals that resonated strongly with them?

And what sense of connection and belonging did those in queer / LGBTQ+ art groups such as The Queer Dot, Sanctuary Queer Arts, Wise Thoughts, and VFD find within their arts communities?

And maybe there are LGBTQ+ people who attended Queen Jesus, Teatro do Mundo, or even The Marlborough theatre performances, who realised the voice on stage was talking directly to them, and they clearly understood its message in relation to who they are as an LGBTQ+ person.

I know I can’t possibly be the only LGBTQ+ person who feels a strong connection with a place or community like these. Maybe you have a story to share about one of the sites in the collection? Or maybe you have a site like one of these that you would like us to add. You can nominate sites for inclusion here: https://www.webarchive.org.uk/nominate

We can’t curate the whole of the UK web on our own. We need your help to ensure that information, discussions, personal experiences and creative outputs related to the LGBTQ+ community are preserved for future generations. Anyone can suggest UK-published websites to be included in the UK Web Archive by filling in the nomination form linked above.

If you would like to explore any of the sites mentioned in this blog post, you can find them in the Arts, Literature, Music & Culture subsection of the LGBTQ+ Lives Online collection: https://www.webarchive.org.uk/en/ukwa/collection/3090

19 June 2023

Reflections on the IIPC Web Archiving Conference 2023

By Andrew Jackson, Web Archive Technical Lead

Tessa Walsh (Webrecorder), Anders Klindt (Royal Danish Library), Ilya Kreymer (Webrecorder) & Andy Jackson (British Library) demonstrating the new Browsertrix features in the workshop 'Browser-Based Crawling for All: Getting Started with Browsertrix Cloud'

My main goal for the conference was to support the adoption and development of shared open source tools. I've been involved in the IIPC project Browser-based crawling for all, and at the conference I helped run a workshop where attendees could start exploring Browsertrix Cloud and give feedback to the project and to Webrecorder. There were some initial problems with the capacity of the demo system, but these were quickly resolved and the workshop was a success and provided useful feedback for future work.

I also ended up chairing the SolrWayback session, which showed many great examples of how that search interface and the underlying indexing tools (developed by UKWA) have been used by different web archives to help explore and analyse their collections. It's heartening to see more and more web archives doing this kind of thing.

There were a lot of good presentations and discussions around tools, but I'd particularly like to recommend that you all check out Warchaeology by the National Library of Norway Web Archive, and Scoop by the Harvard Library Innovation Laboratory.

Both the Scoop presentation and the Bellingcat keynote provided important insights into what it takes for web archives to be legally-admissible evidence (see also e.g. this post about Scoop and this post from Bellingcat). There are interesting questions here about our tools and workflows, like whether the WARC or WACZ formats are sufficient in their current form, and whether there are opportunities for deeper collaboration across the domains of cultural heritage, law, and open source investigation.

Finally, across a number of presentations, the conference also raised questions about the current and future role of cultural heritage institutions. Are our approaches to information literacy fit for an age of fake news and ChatGPT pollution? Is there something libraries and archives can learn from how Bellingcat and fact checkers like Full Fact are helping people find reliable information and avoid conspiracy theories? Can web archives do more to fight disinformation? I look forward to seeing more about this at future conferences!

04 May 2023

Regal Reflections: Exploring a New UK Web Archive Collection on King Charles III

Nicola Bingham, Lead Curator of Web Archiving, British Library

It has been 70 years since a new monarch was crowned in the UK. As we bear witness to a new era of the British monarchy and reflect on its role within the UK, the UK Web Archive is recording and preserving this momentous occasion by capturing websites in a special collection about King Charles III. Work started in earnest on this collection on 8th September 2022, when the late Queen, Elizabeth II, passed away and Charles became King. However, it also forms part of a larger series of collections about the British monarchy in the early 21st Century, curated by staff in the UK Legal Deposit Libraries.

Through this series of special collections, we can trace how the Royal Family has adopted the internet to communicate more efficiently with their supporters, members of the public, and other stakeholders as well as to promote their charitable causes and connect with younger generations who are more likely to engage with social media. As well as ‘official’ information, the UK Web Archive is also capturing user-generated content from a wide range of publishers including the general public, as recorded in websites, blogs, and social media posts, much of which is not available through traditional historical records.

In building this collection we have several priorities. As with all our collecting activity, our mission is to save ephemeral digital content ensuring it is preserved for the historical record. A good illustration of this is that the official website of Charles, Prince of Wales, published in his former position as heir apparent, no longer exists on the internet and is only available in the web archive.

Screenshot of the archived website of the Prince of Wales. Image of the Prince walking in a garden

Archived copy of www.princeofwales.gov.uk/ in the UK Web Archive (21/06/2019) https://www.webarchive.org.uk/wayback/archive/20190621085304/https://www.princeofwales.gov.uk/

We hope that the collection can help to provide a more comprehensive understanding of King Charles III and his impact on society, by preserving a diverse range of viewpoints and perspectives. There is a huge groundswell of affection for the new King, and the Royal Family in general, and a great sense of celebration and optimism in the lead-up to the Coronation on 6th May. However, there is, of course, also opposition, scepticism, and criticism, all of which is reflected online. It is important to capture all sides of the conversation to provide a balanced view of the Royal Family and create a digital legacy that will be of interest to researchers to study, and future generations to appreciate.

Another of our aims is to represent different communities across the UK and Commonwealth in the UK Web Archive. The collection will reflect how towns, cities, and villages celebrate the Coronation. Many people will be holding street parties, such as the residents of Calderdale, West Yorkshire, where residents are encouraged to get together and make the Coronation Weekend a community celebration to remember.

Seal of King Charles III - red background and white seal

In Glasgow, organisations and communities are encouraged to engage in various Coronation initiatives and events in order to create a positive lasting legacy. The Big Help Out, for example, is an opportunity to highlight the positive impact of volunteering. It is hoped that the extra bank holiday for the Coronation will be remembered as a day of donating time and skills to help charities, causes, and the vulnerable.

Along with street parties, other traditions surrounding significant royal events include the manufacture and purchase of souvenirs. This article on the V&A’s website, preserved in the UK Web Archive, shows a few examples of souvenirs from past events, such as the 'Jubilee' biscuit tin made in 1887 for the Carlisle-based biscuit manufacturer Carr & Co. to commemorate Queen Victoria's Golden Jubilee, and the 'Coronation Coach' biscuit tin resembling the ornate coach used by King George VI and Queen Elizabeth on their Coronation Day on 12 May 1937. Of course, now that online shopping is ubiquitous, any type of royal-themed memorabilia or amenity can be purchased, from the more traditional, such as this mug from the National Archives shop, to the more esoteric, such as hiring a King Charles lookalike.

One of the more peculiar aspects of the British monarchy is that special occasions are often associated with an official dish. Queen Elizabeth had curried chicken for her Coronation, which was a relatively exotic choice in the Britain of the 1950s, while King Charles has a ceremonial quiche (disappointingly not named Quiche l’Reign), which is intended for people to cook at home as part of the Coronation Big Lunch.

Tweet from the Prime Minister's Twitter account discussing the upcoming Coronation.

Image from UK Government Twitter showing the Queen’s Coronation banquet: UK Prime Minister (@10DowningStreet) / Twitter (webarchive.org.uk)

In conclusion, the UK Web Archive is a collection affording a unique opportunity to witness and record unfolding historical events. As a historical figure, Charles III and the events that occur during his reign will be of significant interest to researchers, scholars, and the general public. Please do visit the King Charles III collection in the UK Web Archive, and if you know of a website that should be included in this collection, please nominate it here: https://www.webarchive.org.uk/en/ukwa/info/nominate

 

20 April 2023

UK Web Archive Technical Update - Spring 2023

By Andy Jackson, Web Archive Technical Lead, British Library

This is a summary of what’s been going on since the 2022 Q4 report.

Summarising Our Holdings

We regularly report on our holdings so other teams across the Legal Deposit Libraries have an understanding of how much data we hold and how we grow over time. Until recently, the reporting mechanism we used did not fully take into account the storage used across different clusters, and on Amazon Web Services.

In January the old reporting mechanism was replaced with a new implementation, better integrated with our other systems and covering all storage services. The Airflow scheduler (discussed in previous reports) generates updated lists of holdings from different systems, and a Jupyter notebook is then used as a dashboard. This is made accessible via the W3ACT curation service, unlike the old system, which was only available to British Library staff.

While it doesn’t get updated automatically, there’s also an older copy of the notebook on GitHub. See UK Web Archive Holdings Summary Report. As you can see there, the UK Web Archive now holds over 1.4 PB of WARCs and logs.

The new system for Reading Room access to Non-Print Legal Deposit material has also made steady progress. An alpha version of the system has been rolled out across all LDLs so staff can access the service for testing, and a beta service is being rolled out to run alongside the current system in reading rooms. The deployment of the services themselves has also been automated, using GitLab CI/CD to update the systems rather than relying on updating them by hand.

Staff testing raised some additional requirements to be met before the service roll-out can proceed. Working with Webrecorder to meet these requirements will be the focus for the next quarter.

UKWA Website

Edited 28th April 2023 to include translation updates.

The main website has been updated to run version 2.6.9 of our PyWB playback engine, and version 1.4.5 of the main search interface. Version 1.4.5 does not change the site's basic functionality, but does significantly improve the Scottish Gaelic version of the site.

However, we’ve also looked at more significant changes to the public interface to the archive.

Firstly, we’d like to update to a newer version of PyWB, which now features an updated timeline and calendar display. Secondly, some experimentation with letting search engines index selected websites showed that it may be necessary to include links to the archived sites somewhere in the main site so that the search engine's crawler finds and prioritizes those URLs for indexing. To test this out, a page has been added to the site that lists any archived sites that require indexing, and that page has been included in the site map.

Finally, we’ve found a lot of queries are better answered by direct URL search than keyword search, so we wanted to find ways to better integrate PyWB’s URL search functionality with the main site. To make URL search easier to use, we want to change the main search interface on the front page of the website to spot URL searches and direct the user to the right results.
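
To give a feel for what "spotting URL searches" could involve, here is a minimal Python sketch. It is not the logic used on the UKWA site: the function names (looks_like_url, route_query) and the heuristic itself (a single token whose host part contains a dot) are assumptions made purely for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical heuristic, not the rule used by the UKWA website:
# a host is one or more dot-separated labels followed by a top-level label.
_HOST_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)


def looks_like_url(query: str) -> bool:
    """Return True if the query looks like a URL or a bare hostname."""
    text = query.strip()
    # Empty or multi-word input is treated as a keyword search.
    if not text or any(ch.isspace() for ch in text):
        return False
    # Add a scheme if missing so urlparse places the host in .hostname.
    candidate = text if "://" in text else "http://" + text
    try:
        host = urlparse(candidate).hostname or ""
    except ValueError:
        # urlparse rejects some malformed input (e.g. unbalanced brackets).
        return False
    return bool(_HOST_RE.match(host))


def route_query(query: str) -> str:
    """Pick a search type: 'url' for URL lookups, 'text' for full-text search."""
    return "url" if looks_like_url(query) else "text"


if __name__ == "__main__":
    for q in ["www.bl.uk", "https://www.webarchive.org.uk/nominate", "coronation street party"]:
        print(q, "->", route_query(q))
```

A real implementation would need to decide how to treat single words that happen to be valid hostnames on a local network, or queries that mix a URL with extra keywords; this sketch simply sends anything ambiguous to full-text search.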

The BETA version of the website has been updated to include these changes, and is now available for review. If you have any feedback, please let us know.


Image: The BETA homepage for the UK Web Archive, offering URL or Full Text search

Web Archive Discovery tool updates

One long-standing issue we have is that our full-text search does not contain recent material, and over the next year we hope to revisit the scaling problems we’ve seen and try to improve the situation.

As an initial step towards this, we spent some time updating our search tools. The webarchive-discovery indexer has been updated to use version 2 of Apache Tika, along with upgrades to other dependencies like the Nanite wrapper that makes it possible for us to use The National Archives’ PRONOM/DROID format identification engine. These changes are quite significant, so the version number has been bumped from 3.3.x to 3.4.x.

We are also considering an alternative workflow, where we store the extracted metadata in an intermediate form, rather than going directly to Apache Solr or Elasticsearch. To enable us to experiment with this approach, the indexer has been modified to support writing the extracted metadata to JSON Lines output files so that we can use it to support multiple forms of indexing or analysis.
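
As a rough illustration of why an intermediate JSON Lines file is convenient, the Python sketch below writes one JSON object per line and then reads the file back for a simple analysis. The field names (url, content_type) are hypothetical and are not taken from the webarchive-discovery output schema; the indexer itself is a separate tool and is not shown here.

```python
import io
import json
from collections import Counter


def write_jsonl(records, stream):
    """Write each record as one JSON object per line."""
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Yield one record per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records; the real extracted metadata will have different fields.
records = [
    {"url": "http://example.org/", "content_type": "text/html"},
    {"url": "http://example.org/report.pdf", "content_type": "application/pdf"},
    {"url": "http://example.org/about", "content_type": "text/html"},
]

# An in-memory buffer stands in for an output file on disk.
buffer = io.StringIO()
write_jsonl(records, buffer)

# The same file could feed a search index or, as here, a quick tally.
buffer.seek(0)
type_counts = Counter(r["content_type"] for r in read_jsonl(buffer))
print(type_counts)
```

Because each line is an independent record, files like this can be split, concatenated and streamed without loading everything into memory, which is what makes them a reasonable hand-off point between extraction and several different kinds of indexing or analysis.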

2023 Domain Crawl Preparation

As discussed in the previous report, this year we are bringing the domain crawl back on-site rather than running on the cloud. The technical preparation for this was fairly straightforward, given the deployment of the crawl is largely automated. The main change from the last on-site crawl is that we switched to using a server with plenty of fast SSD disks. The cloud crawls had shown us how much the whole thing can benefit from faster disks, so we have attempted to match that when running on our own servers.

Add some updated seed lists from Nominet and from our curators, and we are ready to roll on the anniversary of the first Non-Print Legal Deposit domain crawl. That one started on the 12th of April 2013, and so we’ve chosen that for our start date this year. This will be part of the wider celebrations from across the legal deposit libraries.


Addendum - 13th April 2023

Due to staff holidays, we are only now publishing this quarterly report, so we can add some notes on the launch of the 2023 domain crawl.

The crawl was set up on the 11th, and loaded with the 11 million seed URLs from Nominet and the 27,059 domain crawl seeds from W3ACT (including 13,460 non-UK seeds). On the morning of the 12th, the crawl was launched, and seems to be running well, at around 400 URLs per second. If the system can sustain this rate, which corresponds to around one billion URLs per month, the whole crawl should complete in 2-3 months' time.
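
The rate-to-volume arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 30-day month is an assumption, and the totals printed for two and three months are what the stated rate and duration imply, not figures reported by the crawl.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def urls_per_month(urls_per_second, days_per_month=30):
    """Convert a sustained crawl rate into URLs per (assumed 30-day) month."""
    return urls_per_second * SECONDS_PER_DAY * days_per_month


rate = 400  # URLs per second, as quoted above
monthly = urls_per_month(rate)
print(f"{monthly:,.0f} URLs per month")

# Totals implied by sustaining that rate for two or three months.
for months in (2, 3):
    print(f"{months} months: about {months * monthly / 1e9:.1f} billion URLs")
```

At 400 URLs per second this gives 1,036,800,000 URLs per 30-day month, which matches the "around one billion URLs per month" estimate.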


Image: Dashboard for the first 24 hours of the 2023 Domain Crawl

For more information on the anniversary of Non-Print Legal Deposit, see Celebrating ten years of collecting the UK Web Space.

04 April 2023

Celebrating ten years of collecting the UK Web Space

Nicola Bingham, Lead Curator, Web Archiving, British Library

This April, we are celebrating ten years of collecting and preserving digital publications in the UK such as websites, e-books, and online journals, under legal deposit regulations. The UK Web Archive forms an important part of our collecting activity, across all six legal deposit libraries. We aim to preserve a copy of every UK website that we can identify, reflecting the broad range of experience and expression across the UK.

Large upper case text in a dark colour that reads - Everything Forever. The subtitle is - 10 Years Electronic Legal Deposit. At the bottom of the image is the logo of the six UK Legal Deposit Libraries - British Library, Bodleian Libraries, Cambridge University Library, National Library Scotland, The Library of Trinity College Dublin and the National Library of Wales.

The UK Web Archive provides a detailed insight into the evolution of online public communication over the past two decades. Communication on the web is central to understanding the history, politics, culture and society of the 21st century. However, we know that information shared publicly on the web is rapidly changed, deleted and replaced. The UK Web Archive helps people to understand current events, and the recent past, by preserving that information before it is lost.

Here are a few examples of topics and themes that we have preserved in the archive:

  • General elections: We have archived websites related to every UK general election since 2005. These websites provide a fascinating insight into the political campaigns, issues, and debates of each election.
  • London Olympics and Paralympics 2012: These websites document the planning, organisation, and events of the games, as well as the cultural and social impact they had on the UK.
  • Brexit: This collection documents the political, social, and economic impacts of Brexit. It contains official sources as well as voices from all sides of the debate across the UK.
  • Online Enthusiast Communities: This collection provides insight into hobbyists in the UK. It covers a wide range of interests from more traditional areas, such as stamp collecting and cycling, to the more esoteric, such as the UK Roundabout Appreciation Society.

The UK Web Archive is used by researchers to answer significant questions on various topics. Recent examples include:

The UK Web Archive has been in existence since 2004. Legal deposit regulations came into effect on 6 April 2013 which increased our capacity to collect the UK’s online heritage and ensure it is available for future generations to research and study.

Prior to these regulations, we had to ‘hand pick’ websites to archive, and then could only proceed with written permission of the website owner. From 6 April 2013, the six legal deposit libraries of the UK and Ireland (the British Library, the National Library of Scotland, the National Library of Wales, the Bodleian Libraries, Cambridge University Library and the Library of Trinity College Dublin) were empowered to collect and preserve all web content that could be identified as published in the UK. Since then, we have been archiving the UK Web at the “domain” level and hold many millions of websites, amounting to over a petabyte of digital content. The 11th annual “domain crawl” will be launched this week.

How can I access it?
Anyone can access the UK Web Archive, free of charge, at the six UK Legal Deposit Libraries.

You can search the archive, and view thousands of openly accessible archived websites at https://www.webarchive.org.uk/

Help us build the archive
Even though we aim to collect as much of the UK Web as possible, we miss many websites as we cannot automatically identify all of them as being published in the UK. If you know of a UK website that should be preserved, please suggest it here: https://www.webarchive.org.uk/en/ukwa/info/nominate