By Carlos Lelkes-Rarugal, Assistant Web Archivist
When people think of web archiving, social media is often overlooked as a content source. Although it's impossible to capture everything posted on social media platforms, at the UK Web Archive, we strive to archive the public profiles of key figures like politicians, journalists, athletes and industry leaders. However, archiving social media presents significant challenges for any institution trying to capture and preserve it. Recently, a new tool has helped us archive social media content with greater success.
This blog post outlines our approach to archiving social media for the 2024 General Election, highlighting what worked well and identifying areas for improvement.
Challenges of the Previous Workflow
In an earlier blog post, we discussed our efforts in collecting content for the 2024 General Election. While we updated the user nomination process, we still relied on the same website crawler, Heritrix. Here is a simplified version of the previous workflow (sketched in code after the list):
- Nominate a seed
- Validate seed and create a metadata record
- Send seed and metadata to the Heritrix crawler
- Archive, process, and store the website
- Make the archived website available for viewing
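To make the shape of this pipeline concrete, here is a minimal sketch in Python. It is purely illustrative: the function names and metadata fields are hypothetical and do not reflect the actual UK Web Archive infrastructure.

```python
# Hypothetical sketch of the previous Heritrix-based workflow; function names
# and metadata fields are illustrative, not actual UK Web Archive code.

from dataclasses import dataclass

@dataclass
class Seed:
    url: str        # the nominated URL
    metadata: dict  # descriptive record created when the seed is validated

def validate(nominated_url: str) -> Seed:
    """Validate the nomination and attach a metadata record."""
    if not nominated_url.startswith(("http://", "https://")):
        raise ValueError(f"Not a valid seed: {nominated_url}")
    return Seed(url=nominated_url, metadata={"collection": "2024 General Election"})

def send_to_heritrix(seed: Seed) -> str:
    """Hand the seed to Heritrix; in reality this is the crawl-scheduling infrastructure."""
    return f"crawl-job-for-{seed.url}"

def make_available(crawl_id: str) -> None:
    """Process, store, and make the archived website available for viewing."""
    print(f"{crawl_id} archived and available")

for nomination in ["https://example-candidate.org.uk"]:
    make_available(send_to_heritrix(validate(nomination)))
```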
This workflow enabled us to archive thousands of websites daily, thanks to Heritrix’s robust capabilities. However, despite its effectiveness at archiving static websites, Heritrix is less adept at capturing dynamic content such as maps or social media. While we can archive video, UK Non-Print Legal Deposit regulations prevent us from archiving video-streaming platforms like YouTube or TikTok.
The Challenges of Archiving Dynamic Content
Dynamic content is notoriously difficult to archive. Automated crawlers like Heritrix struggle with elements that rely heavily on JavaScript, asynchronous loading, or user interactions—common features of social media platforms. Heritrix cannot simulate these browser-based interactions, meaning critical content can be missed.
The challenge for web archiving institutions is compounded by the rapid evolution of social media platforms, which continually update their designs and policies, often implementing anti-crawling measures. For example, X (formerly Twitter) once allowed open access to its API. In April 2023, however, the platform introduced a paid API and a pop-up login requirement to view tweets, essentially blocking crawlers. This shift mirrors a broader trend among social media platforms to protect user data from unauthorised scraping and repurposing, a practice often linked to the training of AI models.
While archiving dynamic content is a known problem, finding tools capable of managing these complexities has proven difficult. Webrecorder, an open-source tool, offers one potential solution. It allows users to record their interactions within a web browser, capturing the resources loaded during the browsing session. This content is then packaged into a file, enabling the recreation of entire web pages. While Webrecorder has evolved, it is only part of the solution.
Introducing Browsertrix
Heritrix and Browsertrix both offer valuable solutions for web archiving but operate on different scales. Heritrix’s strength lies in its ability to handle a high volume of websites efficiently, but it falls short with dynamic content. Browsertrix, by contrast, excels at capturing interactive, complex web pages, though it can require more manual intervention.
Despite the increased time and effort involved, Browsertrix offers several key advantages:
- High-Fidelity Crawling: Browsertrix can accurately archive dynamic and interactive social media content.
- Ease of Use: Its user-friendly interface and comprehensive documentation made Browsertrix relatively easy for our team to adopt. Plus, its widespread use within the International Internet Preservation Consortium (IIPC) means additional support is readily available.
Archiving Social Media: A New Approach
One of the most significant challenges in archiving social media is dealing with login authentication. Most social platforms now require users to log in to access content, making it impossible for Heritrix to proceed beyond the login page. Heritrix does not create a browser environment, let alone maintain cookies or browser sessions, so it cannot simulate user browser interactions that are sometimes necessary to view or download content.
This is where Browsertrix excels. Operating within a web browser environment, Browsertrix can handle login credentials, enable browser events like drop-down menus, and capture content that loads asynchronously, such as social media posts. Essentially, it records a user’s browsing session, capturing the resources that make up the visible web page.
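As an illustration of the same idea using the open-source Browsertrix Crawler (the crawling engine behind Browsertrix), a crawl can be pointed at a pre-made browser profile that already holds a logged-in session. This is only a hedged sketch: the Docker image and flags follow the Browsertrix Crawler documentation and should be checked against the version you run, and the account URL, profile file, and paths are illustrative.

```python
# Hedged sketch: launching a Browsertrix Crawler crawl from Python with a
# browser profile that contains a logged-in session. Verify flag names against
# your installed version; the URL and profile file name are illustrative.

import os
import subprocess

def crawl_with_login(account_url: str, profile_tar: str, minutes: int = 3) -> None:
    subprocess.run(
        [
            "docker", "run",
            "-v", f"{os.getcwd()}/crawls:/crawls",           # where output and profiles live
            "webrecorder/browsertrix-crawler", "crawl",
            "--url", account_url,
            "--profile", f"/crawls/profiles/{profile_tar}",  # pre-created logged-in profile
            "--timeLimit", str(minutes * 60),                # crawl time limit, in seconds
            "--generateWACZ",                                # package the capture as a WACZ
        ],
        check=True,
    )

crawl_with_login("https://x.com/example_mp", "x-login-profile.tar.gz")
```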
During the 2024 General Election, we ran Browsertrix alongside Heritrix. Heritrix handled the majority of the simpler website nominations, such as MP and party websites, while Browsertrix focused on more complex social media accounts.
Workflows and Resources for the 2024 General Election
Although we planned to integrate Browsertrix into our archiving efforts for the 2024 General Election, unforeseen delays meant that we only gained access to the tool on June 28th—just one week before polling day on July 5th. However, prior planning helped us decide on key social media accounts.
Key considerations for this workflow included:
- Collaboration with Legal Deposit Libraries
- Limited time frame
- Archiving multiple social media accounts
- Daily archiving schedules
- Finite Browsertrix resources
We had an organisational account with five terabytes of storage and 6,000 minutes of processing time. However, as with any web archiving, the actual crawl times and data requirements were difficult to predict due to the variable size and complexity of websites.
This is why we try to manage our crawls with general parameters assigned to each seed, such as the frequency of a crawl or a data cap. In an ideal world, we would crawl every seed every minute with unlimited data, but there is a cost to everything; our strategy therefore relies on the expertise of curators and archivists to determine parameters that give a best-effort capture while using our hardware as efficiently as possible.
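As a back-of-the-envelope illustration of this kind of budgeting (the per-crawl time and daily frequency below are assumptions drawn from figures discussed later in this post, not a record of our exact allocation):

```python
# Rough feasibility check of a daily crawl schedule against the processing
# budget. The account count and three-minute crawl time are the figures
# discussed below; treating them as uniform across all platforms is a
# simplification for illustration.

BUDGET_MINUTES = 6_000     # organisational allowance of processing time
accounts = 138             # nominated social media accounts
minutes_per_crawl = 3      # time limit that worked well for most X accounts
crawls_per_day = 1         # daily archiving schedule

daily_cost = accounts * minutes_per_crawl * crawls_per_day
print(f"~{daily_cost} minutes per day; budget covers ~{BUDGET_MINUTES // daily_cost} days")
# ~414 minutes per day; budget covers ~14 days (before any QA reports are run)
```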
The first task in Browsertrix was to decide which social media platform to tackle first, based on how many accounts had been nominated for each. In total, we had 138 social media accounts to archive:
- 96 X accounts
- 25 Facebook accounts
- 17 Instagram accounts
X was by far the most active platform, making it a priority. After some trial and error, we found that a three-minute crawl time produced high-quality captures for most accounts. Here are some of the settings that were adjusted, in various combinations (a sketch of one possible configuration follows the list):
- Start URL Scope
- Extra URL Prefixes in Scope
- Exclusions
- Additional URLs
- Max Pages
- Crawl Time Limit
- Crawl Size Limit
- Delay After Page Load
- Behaviour Timeout
- Browser Windows
- Crawler Release Channel
- User Agent
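Purely as an illustration of how these fields fit together for a single account, a workflow's settings could be recorded in a structure like the one below. The values are examples chosen to show the shape of a configuration, not the exact settings we applied to any account, and option names vary between Browsertrix versions.

```python
# Illustrative record of one crawl workflow, mirroring the Browsertrix settings
# listed above. Every value is an example, not our actual configuration.

x_account_workflow = {
    "start_url": "https://x.com/example_mp",    # hypothetical account
    "start_url_scope": "pages on same domain",  # check exact option names in your version
    "extra_url_prefixes_in_scope": [],
    "exclusions": ["*/photo/*"],                # example exclusion pattern
    "additional_urls": [],
    "max_pages": 50,
    "crawl_time_limit_minutes": 3,              # the limit that worked for most X accounts
    "crawl_size_limit_gb": 1,
    "delay_after_page_load_seconds": 5,
    "behaviour_timeout_seconds": 90,
    "browser_windows": 2,
    "crawler_release_channel": "default",
    "user_agent": None,                         # left as the crawler default
}
```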
For X specifically, we staggered crawls by 30 minutes to avoid triggering account blocks. This came with its own challenges, as we had no system in place to manage scheduling and social media login details. For this reason, the Browsertrix application was managed solely by one experienced member of staff, rather than by the curators who nominated the accounts, so that the social media logins and the scheduling of crawl jobs could be handled in one place. In practice, this meant maintaining a spreadsheet detailing the numerous social media accounts, their logins, and the various crawling parameters.
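The 30-minute stagger itself is simple to derive from such a spreadsheet. The sketch below assumes the spreadsheet has been exported to a CSV with hypothetical column names ("platform" and "account_url"); our actual layout differed.

```python
# Sketch: stagger X crawl start times by 30 minutes from a spreadsheet export.
# The CSV column names are hypothetical; only the staggering logic matters.

import csv
from datetime import datetime, timedelta

STAGGER = timedelta(minutes=30)

def build_schedule(csv_path: str, first_start: datetime) -> list[tuple[datetime, str]]:
    with open(csv_path, newline="") as f:
        x_accounts = [row for row in csv.DictReader(f) if row["platform"] == "X"]
    return [(first_start + i * STAGGER, row["account_url"])
            for i, row in enumerate(x_accounts)]

for start, url in build_schedule("election_accounts.csv", datetime(2024, 7, 1, 9, 0)):
    print(start.strftime("%d %b %H:%M"), url)
```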
Quality Assurance
Quality assurance (QA) is a crucial but time-consuming aspect of web archiving, especially when dealing with dynamic content. Browsertrix offers a QA tool that generates reports analysing the quality of individual crawl jobs, including screenshot comparisons and resource analysis. However, this feature can be resource-intensive; for instance, a QA report for a single Facebook capture required approximately 30 minutes of processing time. Given our limitation of 6,000 minutes of processing time and the large volume of crawl jobs, we had to selectively perform QA on key crawl jobs rather than generating reports for every one.
Browsertrix’s extensive documentation provides more detail on its QA process, which we found valuable for managing our resources effectively during this large-scale archival effort. Users can run spot checks on crawl jobs, choosing those that might benefit from a QA report; this gives a sense of how healthy a capture is and allows the user to adjust the Browsertrix settings. Another approach is to offload quality assurance so that it is performed outside Browsertrix: the user can download the WACZ files and interrogate them to check their contents against the live website, again carrying out spot checks to see whether certain significant resources were captured.
Looking at the live website in a web browser, users can analyse the network traffic and see which resources are loading, usually through the browser developer tools. Each resource that loads during this analysis has an exact URI, which can then be searched for within the WACZ file. Bear in mind that this sort of comparison with the live website should be done soon after crawling has completed; otherwise you may be comparing against a URL whose content has changed significantly from what was initially crawled.
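Because a WACZ file is a ZIP package whose archive/ directory holds the WARC records, this kind of spot check can be scripted. The sketch below uses the warcio library to stream through the records and look for a target URI; the WACZ file name and the URI are illustrative.

```python
# Spot check: does a specific resource URI appear anywhere in a WACZ capture?
# Requires warcio (pip install warcio). File name and target URI are examples.

import zipfile
from warcio.archiveiterator import ArchiveIterator

def uri_in_wacz(wacz_path: str, target_uri: str) -> bool:
    with zipfile.ZipFile(wacz_path) as wacz:
        for name in wacz.namelist():
            if name.startswith("archive/") and name.endswith((".warc", ".warc.gz")):
                with wacz.open(name) as warc_stream:
                    for record in ArchiveIterator(warc_stream):
                        if record.rec_headers.get_header("WARC-Target-URI") == target_uri:
                            return True
    return False

print(uri_in_wacz("x-example-mp.wacz",
                  "https://pbs.twimg.com/profile_banners/example/banner.jpg"))
```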
Some of the QA considerations that guided us include:
- If issues are found, what, if anything, can be realistically done to remedy them?
- Is it an issue with the crawler or with the playback software?
- How much time can you apportion to QA without it impacting other work?
- Will the time given over to QA yield an appropriate benefit?
- Can your QA scale?
Where to go from here?
The 2024 General Election marked the first time we used Browsertrix alongside Heritrix for social media archiving. While the process presented challenges, particularly around managing login authentication and processing constraints, Browsertrix proved to be an invaluable tool for capturing complex, dynamic content. By refining our workflows and balancing the use of both crawl streams, we were able to archive a significant portion of relevant social media content. Looking forward, we will continue to develop and improve our tools and strategies, collaborating with partners and sharing our experience and knowledge with the wider web archiving community.