The <script> tag
To give you an idea of the problem, the following graph shows how the usage of the <script> tag has varied over time:
For example, for the Internet Archive’s Archive-It Service, they have developed the Umbra tool, which uses a browser testing engine based on Google Chrome to process URLs sent from the Heritrix crawler, extract the additional URLs that content depends upon, and send them back to Heritrix to be crawled.
We use a similar system during our crawls, including domain crawls. However, rendering web pages takes time and resources, so we don’t render every single URL of the billions in each domain crawl. Instead, we render all host home-pages, and we render the ‘catalogued’ URLs that our curators have indicated are of particular interest. The architecture is similar to that used by Umbra, based around our own page rendering service.
We’ve been doing this since the first domain crawl in 2013, and so this seems to be one area where the web archives are ahead of Google and their attempts to understand web pages better.
Furthermore, given we are having to render the pages anyway, we have used this as an opportunity to take screenshots of the original web pages during the crawl, and to add those screenshots to the archival store (we’ll cover more of the details on that in a later blog post). This means we are in a much better position to evaluate any future preservation actions we might require reconstructing the rendering process and we expect these historical screenshots to be of great interest to the researchers of the future.
By Andy Jackson, Web Archiving Technical Lead, The British Library