Analysing File Formats in Web Archives

Knowledge of file formats is crucial to digital preservation. Without it, it is impossible to define a preservation strategy. Andy Jackson, Web Archiving Technical Lead at the British Library, explains how to analyse formats used in archived web resources for digital preservation purposes. This is also posted as an Open Planets Foundation Blog.

The UK Web Archive recently released a new suite of visualisations and datasets. Amongst these is a format profile, summarising the data formats (MIME types) in the JISC UK Web Domain Dataset (1996-2010). The dataset contains some 2.5 billion HTTP 200 responses spanning 1996 to 2010, neatly packed into ARC files and stored on our HDFS cluster. Storing it in HDFS allows us to run Map-Reduce tasks over the whole dataset and analyse the results.

Given this infrastructure, my first thought was to use it to test and compare format identification processes by running multiple identification tools over the same corpus. By analysing the depth and coverage of the results, we can estimate which tools are better suited to which types of resources and collections. Furthermore, much as double re-keying can be used to establish 'ground truth' for OCR data, each tool acts as an independent opinion on the format of a resource, and so gives us a little more confidence in their assertions when they are found to coincide. This allows us to focus our attention on where the tools disagree, and helps to ensure that our efforts to improve those tools will have the greatest impact.
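
As a rough illustration of the 'independent opinions' idea, here is a minimal Python sketch that tallies where two tools agree and where they differ. It is not the code used for this analysis: the function name, the record layout (a pair of MIME type strings, with None where a tool gave no answer) and the sample values are all assumptions made for the example.

```python
from collections import Counter

def compare_identifications(records):
    """Tally agreement between two identification tools.

    `records` is an iterable of (tool_a_type, tool_b_type) pairs, where
    each value is a MIME type string or None if the tool gave no answer.
    (This layout is an assumption made for the example.)
    """
    tally = Counter()
    disagreements = Counter()
    for a, b in records:
        if a is None or b is None:
            tally["unidentified_by_one_or_both"] += 1
        elif a == b:
            tally["agree"] += 1
        else:
            tally["disagree"] += 1
            # Keep the conflicting pair so the most common conflicts
            # can be inspected first.
            disagreements[(a, b)] += 1
    return tally, disagreements

# Made-up sample results, purely for illustration.
sample = [
    ("text/html", "text/html"),
    ("application/pdf", "application/pdf"),
    ("text/plain", "text/html"),
    ("image/png", None),
    ("text/plain", "text/html"),
]
tally, disagreements = compare_identifications(sample)
print(dict(tally))
print(disagreements.most_common(1))
```

Sorting the disagreements by frequency is one simple way to decide which conflicts between the tools deserve attention first.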

To this end, I wrapped up Apache Tika and the DROID binary signature identifier as part of a Map-Reduce task and ran them over the entire corpus. I mapped the results of both to a formalised extended MIME type syntax, such that each PUID has a unique MIME type of the form 'application/pdf; version=1.4', and used that to compare the results of the tools.
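
To make the extended MIME type idea concrete, the following Python sketch builds and parses strings of the form 'application/pdf; version=1.4'. The helper names and the tiny PUID lookup table are hypothetical, and the two PUID entries should be checked against the PRONOM registry before being relied upon; this is an illustration of the syntax rather than the mapping actually used.

```python
def to_extended_mime(base_type, version=None):
    """Build an extended MIME type such as 'application/pdf; version=1.4'."""
    if version is None:
        return base_type
    return f"{base_type}; version={version}"

def parse_extended_mime(value):
    """Split an extended MIME type into (base_type, parameters)."""
    parts = [p.strip() for p in value.split(";")]
    base = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, _, val = part.partition("=")
            params[key.strip().lower()] = val.strip().strip('"')
    return base, params

# Hypothetical lookup table for illustration only; verify every entry
# against PRONOM before use.
PUID_TO_MIME = {
    "fmt/17": ("application/pdf", "1.3"),
    "fmt/18": ("application/pdf", "1.4"),
}

def puid_to_extended_mime(puid):
    """Return the extended MIME type for a PUID, or None if unmapped."""
    entry = PUID_TO_MIME.get(puid)
    return to_extended_mime(*entry) if entry else None

def same_base_type(a, b):
    """True if two extended MIME types agree once parameters are ignored."""
    return parse_extended_mime(a)[0] == parse_extended_mime(b)[0]

print(puid_to_extended_mime("fmt/18"))
print(same_base_type("application/pdf", "application/pdf; version=1.4"))
```

Comparing on the base type first and the version second lets one tool's more general answer (just 'application/pdf') count as compatible with another tool's more specific one.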

Of course, as well as establishing trust in the tools, this kind of data helps us start to explore the way format usage has changed over time, and is a necessary first step in understanding the nature of format obsolescence. As a taster, here is a chart showing the usage of different versions of HTML over time:

As you can see, each version rises to dominance and then fades away, but the fade slows down each time. Across the 2010 time-slice, all the old versions of HTML are still turning up in the crawl. You can find some more information and results on the UK Web Archive site.

Finally, as well as exporting the format identifiers, I also used Apache Tika to extract any information it found about the software or hardware platform the resource was created on. All of this information was combined with the MIME type declared by the server and then aggregated by year to produce a rich and complex longitudinal multi-tool format profile for this collection.
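
The aggregation step can be pictured as a map phase that emits one key per response and a reduce phase that sums the counts. The sketch below does this in plain Python rather than on Hadoop. The record field names are hypothetical, and it assumes each record carries a 14-digit YYYYMMDDhhmmss crawl timestamp; the sample records are made up.

```python
from collections import Counter

def map_record(record):
    """Emit a ((year, server, tika, droid), 1) pair for one response.

    Assumes `record` is a dict with a 14-digit 'timestamp' plus the
    three format opinions (field names are hypothetical).
    """
    year = record["timestamp"][:4]
    key = (
        year,
        record["server_type"],
        record["tika_type"],
        record["droid_type"],
    )
    return key, 1

def reduce_counts(pairs):
    """Sum the counts for each distinct key."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return totals

# Made-up records, purely for illustration.
records = [
    {"timestamp": "19970101120000", "server_type": "text/html",
     "tika_type": "text/html", "droid_type": "text/html; version=3.2"},
    {"timestamp": "19970302093000", "server_type": "text/html",
     "tika_type": "text/html", "droid_type": "text/html; version=3.2"},
    {"timestamp": "20100615083000", "server_type": "application/pdf",
     "tika_type": "application/pdf",
     "droid_type": "application/pdf; version=1.4"},
]
profile = reduce_counts(map_record(r) for r in records)
for key, count in sorted(profile.items()):
    print(key, count)
```

Keeping all three opinions in the key, rather than collapsing them to a single 'best' answer, preserves the disagreements between the server and the tools in the final profile.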

If this is of interest to you, please go and download the dataset and start exploring it. Please let me know if you find it useful, and please share any interesting results you dig out of it.

UK Web Archive blog
