UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

11 August 2014

Web Archiving in the JavaScript Age

Among the responses to our earlier post, 'How much of the UK’s HTML is valid?', Gary McGath’s 'HTML and fuzzy validity' deserves to be highlighted, as it explores an issue very close to our hearts: how to cope when the modern web is dominated by JavaScript.

The Age of JavaScript
In particular, he discusses one of the central challenges of the Age of JavaScript: making sure you have copies of all the resources that are dynamically loaded as the page is rendered. We tend to call this ‘dependency analysis’, and we consider this to be a much more pressing preservation risk than bit rot or obsolescence. If you never even know you need something, you’ll never go get it and so never even get the chance to preserve it.
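To make the problem concrete, here is a minimal sketch in Python (the page and the extractor are both invented for illustration) of why static link extraction misses JavaScript-loaded dependencies: the raw HTML mentions only the script itself, while the data and image the script pulls in at render time never appear as literal URLs.

```python
import re

# A hypothetical page: the only resource visible in the raw HTML is app.js.
# The chart image and the JSON data are requested later, by the script.
html = """
<html>
  <body>
    <div id="chart"></div>
    <script src="/static/app.js"></script>
  </body>
</html>
"""

# A deliberately naive static extractor: pull out src/href attribute values.
static_urls = re.findall(r'(?:src|href)="([^"]+)"', html)
print(static_urls)  # ['/static/app.js']

# What app.js actually fetches when the page is rendered (invisible above):
#   GET /api/data.json
#   GET /static/chart.png
# Unless the crawler renders the page (or analyses the script), it never
# learns that these URLs exist, so it never gets the chance to archive them.
```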

The <script> tag
To give you an idea of the problem, the following graph shows how the usage of the <script> tag has varied over time:

[Graph: usage of the <script> tag over time]

In 1995, almost no pages used the <script> tag, but fifteen years later over 95% of web pages required JavaScript. This has been a massive sea change in the nature of the World Wide Web, and web archives have had to react to it or face irrelevance.

Tools
For example, for its Archive-It service the Internet Archive has developed the Umbra tool, which uses a browser-testing engine based on Google Chrome to process URLs sent from the Heritrix crawler, extract the additional URLs that the content depends upon, and send them back to Heritrix to be crawled.
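The core idea can be sketched in a few lines of Python. The snippet below uses the Playwright headless-browser library as a stand-in for the actual Umbra/Heritrix stack (so the function name and setup are illustrative, not the real tooling): it renders a URL, records every additional request the page makes, and returns those URLs as candidates to hand back to the crawler.

```python
from playwright.sync_api import sync_playwright

def discover_dependencies(url: str) -> list[str]:
    """Render a page in a headless browser and return the extra URLs it loads."""
    discovered = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Record every request the page issues while rendering:
        # scripts, stylesheets, images, XHR/fetch calls, fonts, etc.
        page.on("request", lambda request: discovered.append(request.url))
        page.goto(url, wait_until="networkidle")
        browser.close()
    # Everything except the original URL is a dependency that the static
    # HTML alone would not necessarily have revealed.
    return [u for u in discovered if u != url]

if __name__ == "__main__":
    for dep in discover_dependencies("https://example.org/"):
        print(dep)  # candidate URLs to feed back to the crawl frontier
```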

We use a similar system during our crawls, including domain crawls. However, rendering web pages takes time and resources, so we don't render every one of the billions of URLs in each domain crawl. Instead, we render all host home pages, along with the ‘catalogued’ URLs that our curators have indicated are of particular interest. The architecture is similar to Umbra's, built around our own page rendering service.
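As a rough illustration of that selection policy (the function and the catalogued-URL set below are invented for this post, not our actual crawl configuration), the decision about whether to send a URL to the rendering service might look something like this:

```python
from urllib.parse import urlsplit

# Hypothetical set of curator-selected ('catalogued') URLs of particular interest.
CATALOGUED_URLS = {
    "https://example.org/collections/ww1/",
    "https://example.org/exhibitions/magna-carta/",
}

def should_render(url: str) -> bool:
    """Only a small fraction of the billions of crawled URLs get rendered:
    host home pages, plus URLs curators have flagged as catalogued."""
    parts = urlsplit(url)
    is_homepage = parts.path in ("", "/") and not parts.query
    return is_homepage or url in CATALOGUED_URLS

print(should_render("https://example.org/"))                      # True (home page)
print(should_render("https://example.org/collections/ww1/"))      # True (catalogued)
print(should_render("https://example.org/news/item-12345.html"))  # False
```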

We’ve been doing this since our first domain crawl in 2013, and so this seems to be one area where web archives are ahead of Google in its attempts to understand web pages better.

Screenshots
Furthermore, given that we have to render the pages anyway, we have used this as an opportunity to take screenshots of the original web pages during the crawl, and to add those screenshots to the archival store (we’ll cover more of the details in a later blog post). This means we are in a much better position to evaluate any future preservation actions that might require reconstructing the rendering process, and we expect these historical screenshots to be of great interest to the researchers of the future.
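As a sketch of what that screenshot step can look like (again using Playwright, plus the warcio library, as stand-ins for our own rendering service and archival store), the rendered page is captured as a PNG and written into a WARC file as a resource record:

```python
from io import BytesIO

from playwright.sync_api import sync_playwright
from warcio.warcwriter import WARCWriter

def archive_screenshot(url: str, warc_path: str = "screenshots.warc.gz") -> None:
    """Render a page, capture a screenshot, and store it as a WARC resource record."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        png_bytes = page.screenshot(full_page=True)
        browser.close()

    with open(warc_path, "ab") as fh:
        writer = WARCWriter(fh, gzip=True)
        # Store the image against a distinct URI so it sits alongside the
        # crawled records without clashing with the page's own capture.
        record = writer.create_warc_record(
            "screenshot:{}".format(url),
            "resource",
            payload=BytesIO(png_bytes),
            warc_content_type="image/png",
        )
        writer.write_record(record)

archive_screenshot("https://example.org/")
```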

By Andy Jackson, Web Archiving Technical Lead, The British Library
