How good is good enough? – Quality Assurance of harvested web resources
Quality Assurance is an important element of web archiving. It refers to the evaluation of harvested web resources to determine whether pre-defined quality standards have been attained.
So the first step is to define quality, which should be a straightforward task considering that the aim of web harvesting is to capture or copy resources as they are on the live web. Getting identical copies would seem to be the ultimate quality standard.
Current harvesting technology unfortunately does not deliver 100% replicas of web resources. One could draw up a long list of known technical issues in web preservation: dynamic scripts, streaming media, social networks, database-driven content… The definition of quality quickly turns into a statement of what is acceptable, or of how good is good enough. Web curators and archivists regularly look at imperfect copies of web resources and make trade-off decisions about their validity as archival copies.
We use four aspects to define quality:
1. Completeness of capture: whether the intended content has been captured as part of the harvest.
2. Intellectual content: whether the intellectual content (as opposed to styling and layout) can be replayed in the Access Tool.
3. Behaviour: whether the harvested copy can be replayed with the behaviour present on the live site, such as the ability to browse between links interactively.
4. Appearance: whether the harvested copy preserves the look and feel of the live website.
When applying these quality criteria, more emphasis is placed on the intellectual content than on the appearance or behaviour of a website. As long as most of the content of a website is captured and can be replayed reasonably well, the harvested copy is submitted to the archive for long-term preservation, even if the appearance is not 100% accurate.
Example of a "good enough" copy of a web page, despite two missing images
We also have a list of what is “not good enough”, which helps separate the “bad” from the “good enough”. An example is so-called “live leakage”, a common problem in replaying archived resources, which occurs when links in an archived resource resolve to the current copy on the live site instead of to the archival version within the web archive. This is a particular concern when the leakage is to a payment gateway, which could confuse users into making payments for items that they do not intend to purchase or that do not exist. There are certain remedial actions we can take to address the problem, but there is as yet no global fix. Suppressing the relevant page from the web archive is often a last resort.
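By way of illustration only (this is not our actual workflow), leakage of this kind can at least be detected with a small script that parses a replayed page and flags absolute links that escape the archive's replay prefix. The replay prefix and page URL below are hypothetical placeholders.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Hypothetical replay prefix of a wayback-style access tool. Absolute links in
# a replayed page that do not start with this prefix (and are not relative)
# will resolve to the live web -- i.e. "live leakage".
REPLAY_PREFIX = "https://webarchive.example.org/wayback/"


class LinkCollector(HTMLParser):
    """Collect href and src attribute values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def find_leaks(archived_page_url):
    """Return absolute links in a replayed page that point outside the archive."""
    html = urlopen(archived_page_url).read().decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(html)
    return [link for link in collector.links
            if link.startswith(("http://", "https://"))
            and not link.startswith(REPLAY_PREFIX)]


# Hypothetical archived page; every link printed here would take a user out of
# the archive and onto the live site when clicked.
for leak in find_leaks(REPLAY_PREFIX + "20130101000000/http://www.example.org/"):
    print(leak)
```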
Quality assurance in web archiving currently relies heavily on visual comparison of the harvested and live versions of a resource, and on review of previous harvests and crawl logs. This is time-consuming and does not scale. For large-scale web archive collections, especially those based on national domains, it is impossible to carry out the selective approach described above. Quality assurance, if undertaken, often relies on sampling. Some automatic solutions have been developed in recent years which, for example, examine HTTP status codes to identify missing content. Automatic quality assurance is an area where more development would be welcome.
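As a minimal sketch of the kind of automatic check mentioned above, the script below scans a CDX file (the plain-text index format commonly used by wayback-style tools), tallies HTTP status codes and lists URLs recorded as 404. It assumes the common space-delimited CDX layout in which the third field is the original URL and the fifth is the status code; the file name in the usage line is hypothetical.

```python
import sys
from collections import Counter


def report_missing(cdx_path):
    """Count HTTP status codes in a CDX index and list URLs recorded as 404.

    Assumes the common space-delimited CDX layout in which the third field is
    the original URL and the fifth field is the HTTP status code.
    """
    counts = Counter()
    missing = []
    with open(cdx_path, encoding="utf-8") as cdx:
        for line in cdx:
            fields = line.split()
            if len(fields) < 5 or fields[0] == "CDX":
                continue  # skip the header line and malformed lines
            url, status = fields[2], fields[4]
            counts[status] += 1
            if status == "404":
                missing.append(url)
    return counts, missing


if __name__ == "__main__":
    # Usage: python report_missing.py crawl-2013.cdx  (hypothetical file name)
    counts, missing = report_missing(sys.argv[1])
    print(counts.most_common())
    for url in missing:
        print("missing:", url)
```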
Helen Hockx-Yu, Head of Web Archiving, British Library
Very nice article.
I agree that web archiving quality is a huge issue. This is why I am developing ArchiveReady.com, a tool to check website archivability.
A website is archivable if it has certain characteristics and supports certain technologies; ArchiveReady.com tries to check all of this automatically.
In more detail, the following website attributes are considered necessary for a website to support web archiving:
1. Follow web standards (HTML & CSS validation)
2. Provide correct HTTP headers
3. Provide images and other files in open formats. Don't use proprietary data formats.
4. Don't use many external resources from different places (e.g. widgets, external images)
5. Use the robots.txt protocol
6. Use sitemap.xml and RSS to let web archiving bots discover your content
The aforementioned website attributes can help web archiving in three areas:
a) Performance
b) Accuracy
c) Long Term Preservation
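A few of these checks are straightforward to script. The following is a minimal sketch of the idea, not ArchiveReady's actual implementation: it probes a site for robots.txt and sitemap.xml and checks whether the home page responds with a Content-Type header. The domain in the example is a placeholder.

```python
import urllib.request


def probe(url):
    """Return (status code, Content-Type header) for a URL, or (None, None) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status, response.headers.get("Content-Type")
    except OSError:  # URLError, HTTPError and timeouts are all OSError subclasses
        return None, None


def archivability_report(site):
    """Very rough checks inspired by the attributes listed above."""
    report = {}
    # Correct HTTP headers: is the home page reachable and does it declare a Content-Type?
    status, content_type = probe(site)
    report["homepage reachable"] = status == 200
    report["content-type header"] = bool(content_type)
    # robots.txt protocol and sitemap.xml for content discovery
    report["robots.txt present"] = probe(site.rstrip("/") + "/robots.txt")[0] == 200
    report["sitemap.xml present"] = probe(site.rstrip("/") + "/sitemap.xml")[0] == 200
    return report


# Placeholder domain; substitute the site you want to assess.
for check, ok in archivability_report("http://www.example.org/").items():
    print(f"{check}: {'OK' if ok else 'check manually'}")
```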