06 January 2014
Scalable Preservation Environments: the nuts and bolts of digital preservation software tools
The British Library is a partner in the SCAPE Project, a Seventh Framework Programme (FP7) project co-funded by the European Union. Its aim is to enhance the state of the art of digital preservation in three ways: by developing infrastructure and tools for scalable preservation actions; by providing a framework for automated, quality-assured preservation workflows; and by integrating these components with a policy-based preservation planning and watch system. Other partners include leading European libraries, universities and companies. A full list is available on the SCAPE website.
CC BY-NC 3.0
The British Library's Digital Preservation Team undertakes the R&D necessary to ensure the Library is able to implement the right technology and best practices to support digital preservation, at the right time. We have previously blogged here about our “Twelve Principles of Digital Preservation”.
Staff from the Digital Preservation Team - whilst representing the British Library’s interests within the project - lead the project in two key areas: we chair the technical coordination committee responsible for all technical developments within the project, and we lead a work package on creating and evaluating the execution of workflows for large scale digital repositories. We are also involved in two other “testbed” work packages related to web archiving and research datasets, as well as work packages surrounding the take-up of project outputs involving dissemination, demonstrations and training.
Our technical work within the project includes development and enhancement of characterisation and quality assurance tools, together with associated large-scale workflows for characterisation of content within web archives, file format validation and identification of DRM in ebooks, and quality-assured file format migration of TIFF files to JP2. Similar work by other partners includes characterisation of large audio/video files, audio migration, large-scale ingest to a repository, ARC to WARC migration and other types of file format migration.
For execution of these tools and workflows across large-scale data sets, the project uses Apache Hadoop. At the tool level, however, each piece of software is discrete and can be used separately or within other large-scale processing frameworks. The project is also creating services around policy-based preservation planning (Plato) and watch (Scout), and defining the necessary interfaces to enable all these entities to work together.
Some of the digital preservation tools and services that have been developed within the project include:
Tools:
xcorrSound - a suite of tools for automated quality assurance of audio migration processes.
The tools can:
- Find overlaps between sequential audio files
- Find occurrences of a smaller section of audio within a larger dataset
- Compare two audio files to see how they correlate
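To make the idea of correlating audio concrete, here is a minimal sketch in Python of locating a short run of samples inside a longer one by normalised cross-correlation. It is an illustration of the general technique only, not xcorrSound's actual implementation, and the function names (normalised_correlation, find_best_offset) are made up for this example. It works on plain lists of sample values and uses a brute-force search, which would be far too slow for real audio files.

```python
import math


def normalised_correlation(a, b):
    """Correlation coefficient of two equal-length sample sequences (-1.0 to 1.0)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    # A constant (e.g. silent) window has no variance; treat it as uncorrelated.
    return num / den if den else 0.0


def find_best_offset(haystack, needle):
    """Slide needle along haystack and return (offset, score) of the best match.

    Hypothetical helper for illustration. Returns (0, -1.0) if needle is
    longer than haystack, because no window can be compared.
    """
    best_offset, best_score = 0, -1.0
    for offset in range(len(haystack) - len(needle) + 1):
        window = haystack[offset:offset + len(needle)]
        score = normalised_correlation(window, needle)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score


if __name__ == "__main__":
    long_clip = [0.0, 0.0, 1.0, 3.0, -2.0, 4.0, 0.0, 0.0]
    short_clip = [1.0, 3.0, -2.0, 4.0]
    print(find_best_offset(long_clip, short_clip))
```

Because the score is normalised, a copy of the clip at a different volume still scores close to 1.0, which is the kind of property that is useful when checking that a migrated file still contains the same audio as the original.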
Matchbox can automatically find duplicate images, for example duplicate scans, or match images from two separate scans of a book.
Jpylyzer - a JP2 (JPEG2000 part 1) validator and properties extractor.
This tool can be used to:
- Verify if JP2 files conform to the JP2 specification
- Extract information about the encoding profile used for the file. This can be compared to an institutional encoding profile for verification
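The two checks above can be sketched in a few lines of Python. This is not Jpylyzer's API or its validation logic, which is far more thorough. It only tests for the 12-byte signature box that opens every JP2 file, and then compares a dictionary of extracted properties against a required profile. The property names and values in the example are hypothetical, chosen purely for illustration.

```python
# The JP2 signature box: length 12, type 'jP  ', content 0D 0A 87 0A.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")


def has_jp2_signature(data):
    """Return True if the byte string starts with the JP2 signature box.

    A necessary condition for a valid JP2 file, but nowhere near sufficient.
    """
    return data[:12] == JP2_SIGNATURE


def profile_mismatches(extracted, required):
    """Compare extracted properties with an institutional profile.

    Both arguments are plain dicts (an assumption made for this sketch).
    Returns {property: (required_value, found_value)} for every property
    that is missing or differs; an empty dict means the file conforms.
    """
    return {
        name: (wanted, extracted.get(name))
        for name, wanted in required.items()
        if extracted.get(name) != wanted
    }


if __name__ == "__main__":
    # Hypothetical property names and values, for illustration only.
    institutional_profile = {"levels": 6, "transformation": "5-3 reversible"}
    found = {"levels": 6, "transformation": "9-7 irreversible"}
    print(has_jp2_signature(JP2_SIGNATURE + b"remaining file content"))
    print(profile_mismatches(found, institutional_profile))
```

Returning the mismatches rather than a simple pass or fail means a workflow can report exactly which encoding setting departed from the profile.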
c3po (screencast) - a software tool for visualising and investigating the content types contained within a collection
Nanite can characterise files contained in web archives (arc/warc) without first extracting the files. The tool can be used on a Hadoop cluster.
Pagelyzer - visual, structural and hybrid comparison of web pages.
Services:
Plato is a preservation planning tool that integrates content characterisation, preservation actions and automated object comparison.
Scout is a preservation watch system that consolidates information from several sources (web, content, registries, policies) and monitors that information against a defined policy.
As you can see, a wide variety of tools is being produced or enhanced within the project, and there are many more that are not listed here. If you are interested in finding out more about any of these tools, take a look at http://www.scape-project.eu/tools. More in-depth blog posts can be found on the Open Planets Foundation blog: http://www.openplanetsfoundation.org/blog.
William Palmer
Digital Preservation Technical Lead, SCAPE Project.