UK Web Archive blog

05 September 2012

How to Make Websites More Archivable?

I was contacted by an organisation which is going to be disbanded in a couple of months. When the organisation disappears, so will its website. Fortunately we have already archived a few instances of their website in the UK Web Archive.

The lady who contacted me, however, complained that the archival copies are incomplete, as they do not include the “database”, and she would like to deposit a copy with us. On examination, it turned out that a section called “events”, which has a calendar interface, was not copied by our crawler. I also found that two other sections, whose content is pulled dynamically from an underlying database, seem to be accessible only via a search interface. These would have been missed by the crawler too.

The above situation reflects some common technical challenges in web archiving. The calendar is likely to send the crawler inadvertently into a so-called “crawler trap”, as the crawler would follow the hyperlinked dates on the calendar endlessly. For that reason, the “events” section was excluded from our previous crawls. The database-driven search interface presents content in response to searches or interactions, which the crawler cannot perform. Archiving crawlers are generally capable of capturing explicitly referenced content which can be served by requesting a URL, but they cannot deal with URLs which are not explicitly in the HTML but are embedded in JavaScript or Flash presentations, or are generated dynamically.
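Why JavaScript-built links are invisible can be illustrated with a minimal sketch. Like an archiving crawler, the parser below only collects hrefs that appear literally in the HTML; a link assembled inside a script is never seen. The page and URLs are invented for illustration, not taken from the site in question.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, much as a crawler would."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page = """
<a href="/projects/list">All projects</a>
<script>
  // This anchor is built at run time; a crawler never executes it.
  document.write('<a href="/projects/' + region + '">Projects</a>');
</script>
"""

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # only the explicitly referenced URL is found
```

Only `/projects/list` is discovered; the script-generated URL is simply not there as far as the crawler is concerned.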

We found out the earliest and latest dates related to the events in the organisation’s database and used these to limit the date range the crawler should follow. We then successfully crawled the “events” section without trapping our crawler. For the other two sections, we noticed that the live website also has a map interface which provides browsable lists of projects per region. Unfortunately only the first pages are available, because the links to subsequent pages are broken on the live site. The crawler copied the website as it was, including the broken links.
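The date-range restriction can be sketched as a simple scoping rule. The URL pattern and year range below are invented for illustration; the real site and crawler configuration will differ.

```python
import re

# Hypothetical calendar URLs of the form /events/YYYY-MM/.
# The range comes from the earliest and latest events in the database.
EARLIEST, LATEST = 2005, 2012

def in_scope(url):
    """Accept a calendar URL only if its year falls inside the known range."""
    m = re.search(r"/events/(\d{4})-\d{2}/", url)
    if not m:
        return True  # not a calendar page; leave it to normal scoping rules
    return EARLIEST <= int(m.group(1)) <= LATEST

urls = ["/events/2010-06/", "/events/1999-01/", "/events/2031-12/", "/about/"]
accepted = [u for u in urls if in_scope(u)]
print(accepted)  # calendar pages outside the range are rejected
```

With a rule like this, the crawler can follow the calendar’s date links without wandering off into decades of empty months.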

There are a few basic things which, if taken into account when a website is designed, will make it a lot more archivable. These measures aid preservation and help avoid information loss if, for any reason, a website has to be taken offline.

1. Make sure important content is also explicitly referenced.
This requirement is not in contradiction with having cool, interactive features. All we ask is that you provide an alternative, crawler-friendly means of access, using explicit or static URLs. A rule of thumb is that each page should be reachable from at least one static URL.
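A hypothetical fragment shows what this can look like in practice: the search form cannot be used by a crawler, but the static links beneath it reach the same content through explicit URLs. All names and paths here are invented.

```html
<!-- Interactive access: invisible to an archiving crawler -->
<form action="/projects/search" method="get">
  <input type="text" name="q">
  <input type="submit" value="Search projects">
</form>

<!-- Crawler-friendly alternative: explicit, static URLs to the same content -->
<ul>
  <li><a href="/projects/region/north-east/">Projects in the North East</a></li>
  <li><a href="/projects/region/london/">Projects in London</a></li>
</ul>
```

Human visitors can keep using the search box; the crawler simply follows the list.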

2. Have a site map.
Provide a site map, in XML or in HTML, listing the pages of your website so that they are reachable by crawlers and human users alike.
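A minimal site map in the standard sitemaps.org XML format might look like this (the URLs are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.org/</loc>
  </url>
  <url>
    <loc>http://www.example.org/events/2012-09/</loc>
  </url>
  <url>
    <loc>http://www.example.org/projects/region/london/</loc>
  </url>
</urlset>
```

Placed at the root of the site (conventionally as /sitemap.xml), it gives a crawler an explicit list of every page, including ones that are otherwise only reachable through interactive features.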

3. Make sure all links work on your website.
If your website contains broken links, copies of your website will also have broken links.
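Broken internal links are easy to catch before a crawl ever happens. Below is a minimal offline sketch, using an invented two-page site: it pulls the href targets out of each page and flags any internal target that does not exist.

```python
import re

# A tiny, illustrative site: path -> HTML body. In practice these
# would be fetched pages rather than hard-coded strings.
site = {
    "/": '<a href="/projects/page1">Projects</a>',
    "/projects/page1": '<a href="/projects/page2">Next page</a>',
    # "/projects/page2" was never published, so the link above is broken.
}

broken = []
for path, body in site.items():
    for target in re.findall(r'href="([^"]+)"', body):
        # Only check internal (site-relative) links here.
        if target.startswith("/") and target not in site:
            broken.append((path, target))

print(broken)  # each entry is (page containing the link, missing target)
```

Running a check like this regularly means the archived copy inherits a working site rather than a broken one.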

There are more things one can do to make websites archivable. Google, for example, has published guidelines for webmasters on helping its crawler find, crawl, and index websites; many of the best practices mentioned there apply to archiving crawlers too. Although archiving crawlers work in a way that is very similar to search engine crawlers, it is important to understand the difference: search engine crawlers are only interested in files which can be indexed, whereas archiving crawlers intend to copy all files, of all formats, belonging to a website.

Helen Hockx-Yu, Head of Web Archiving, British Library
