Collection Care blog

Behind the scenes with our conservators and scientists

06 January 2014

Scalable Preservation Environments: the nuts and bolts of digital preservation software tools

The British Library is a partner in the SCAPE Project, a Seventh Framework Programme (FP7) project co-funded by the European Union. Its aim is to enhance the state of the art of digital preservation in three ways: by developing infrastructure and tools for scalable preservation actions; by providing a framework for automated, quality-assured preservation workflows; and by integrating these components with a policy-based preservation planning and watch system. Other partners include leading European libraries, universities and companies. A full list is available on the SCAPE website.

Digital preservation tools
A cartoon illustration of a small man in grey scale unlocking a box with vibrant colours and symbols such as musical notes, pictures, media players and written text

Image licence: CC BY-NC 3.0

The British Library's Digital Preservation Team undertakes the R&D necessary to ensure the Library is able to implement the right technology and best practices to support digital preservation, at the right time. We have previously blogged here about our “Twelve Principles of Digital Preservation”.

Staff from the Digital Preservation Team - whilst representing the British Library’s interests within the project - lead the project in two key areas: we chair the technical coordination committee responsible for all technical developments within the project, and we lead a work package on creating and evaluating the execution of workflows for large scale digital repositories. We are also involved in two other “testbed” work packages related to web archiving and research datasets, as well as work packages surrounding the take-up of project outputs involving dissemination, demonstrations and training.

Our technical work within the project includes the development and enhancement of characterisation and quality assurance tools, together with the associated large-scale workflows for characterisation of content within web archives, file format validation and identification of DRM in ebooks, and quality-assured file format migration of TIFF files to JP2. Similar work by other partners includes characterisation of large audio/video files, audio migration, large-scale ingest to a repository, ARC to WARC migration and other types of file format migration.

To execute these tools and workflows across large-scale data sets, the project uses Apache Hadoop. At the tool level, however, each piece of software is discrete and can be used on its own or within other large-scale processing frameworks. The project is also creating services for policy-based preservation planning (Plato) and watch (Scout), and defining the interfaces needed for all these components to work together.

Some of the digital preservation tools and services that have been developed within the project include:

Tools:

xcorrSound  - a suite of tools for automated quality assurance of audio migration processes.

xcorrSound
Image of a blue sound wave in the xcorrSound software

The tools can:

  • Find overlaps between sequential audio files
  • Find occurrences of a smaller section of audio within a larger dataset
  • Compare two audio files to see how they correlate
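
To give a feel for the second and third tasks in the list above, here is a minimal cross-correlation sketch. It is not xcorrSound's implementation: it works on short lists of made-up sample values rather than real audio files, and the function names `normalised_correlation` and `best_offset` are hypothetical, chosen only for this illustration.

```python
import math


def normalised_correlation(a, b):
    """Correlation of two equal-length sample lists, in the range [-1, 1]."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    # A constant signal has no variance, so report "no correlation".
    return num / den if den else 0.0


def best_offset(haystack, needle):
    """Slide `needle` along `haystack`; return (offset, score) of the best match."""
    best = (0, -1.0)
    for offset in range(len(haystack) - len(needle) + 1):
        window = haystack[offset:offset + len(needle)]
        score = normalised_correlation(window, needle)
        if score > best[1]:
            best = (offset, score)
    return best


# Made-up samples standing in for decoded audio.
original = [0.0, 0.2, 0.9, 0.4, -0.3, -0.8, -0.1, 0.6, 0.1, -0.5]
clip = original[3:7]  # a short section cut from the original

offset, score = best_offset(original, clip)
print(offset, round(score, 3))
```

A score close to 1 at some offset suggests the smaller section occurs there; a migrated file that correlates poorly with its source at every offset would be a candidate for manual checking. Real audio needs far more efficient methods than this brute-force loop.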

Matchbox can automatically find duplicate images, for example duplicate scans, or match images from two separate scans of a book.

Matchbox
Images of several scanned pages with red and green lines linking copies together
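
As a rough illustration of duplicate detection, the sketch below uses a simple "average hash", which is a different and much cruder technique than whatever Matchbox itself uses. The pixel grids are invented greyscale values, and the names `average_hash`, `hamming` and the threshold of 2 are assumptions made for this example only.

```python
def average_hash(pixels):
    """Turn a greyscale grid into a list of bits: 1 where a pixel is above the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming(hash_a, hash_b):
    """Number of positions at which two hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


# Invented 4x4 greyscale "scans"; real pages would be downscaled first.
page_a = [
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
]
# The same page scanned slightly brighter.
page_b = [[value + 5 for value in row] for row in page_a]
# A different page.
page_c = [[10, 10, 200, 200] for _ in range(4)]

THRESHOLD = 2  # assumed cut-off: at most this many differing bits counts as a duplicate

print(hamming(average_hash(page_a), average_hash(page_b)) <= THRESHOLD)
print(hamming(average_hash(page_a), average_hash(page_c)) <= THRESHOLD)
```

Because the hash only records whether each pixel is above or below the page's own mean, a uniform brightness change between two scans does not alter it.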

Jpylyzer  - a JP2 (JPEG2000 part 1) validator and properties extractor.

Jpylyzer
illustration of a hot air balloon on Jpylyzer software with several cream numbers on top of the image on the left and a duplicate scan on the right

This tool can be used to:

  • Verify if JP2 files conform to the JP2 specification
  • Extract information about the encoding profile used for the file. This can be compared to an institutional encoding profile for verification
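
The two uses above can be sketched in a few lines. This is not Jpylyzer's code or API: the first function only checks the 12-byte signature box that begins every JP2 file, which is a tiny fraction of full validation, and the second compares a dictionary of extracted properties with an institutional profile. The property names and values shown are hypothetical.

```python
# The JP2 signature box: length 12, type 'jP  ', content 0D 0A 87 0A.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")


def has_jp2_signature(data: bytes) -> bool:
    """True if the data starts with the JP2 signature box (not a full validation)."""
    return data[:12] == JP2_SIGNATURE


def profile_mismatches(extracted: dict, profile: dict) -> dict:
    """Map each property that differs to an (expected, found) pair."""
    return {
        key: (expected, extracted.get(key))
        for key, expected in profile.items()
        if extracted.get(key) != expected
    }


# Hypothetical institutional profile and hypothetical extracted properties.
institutional_profile = {"transformation": "5-3 reversible", "levels": 6, "layers": 1}
extracted_properties = {"transformation": "9-7 irreversible", "levels": 6, "layers": 1}

print(has_jp2_signature(JP2_SIGNATURE + b"...rest of file..."))
print(profile_mismatches(extracted_properties, institutional_profile))
```

In practice the extracted properties would come from the validator's output rather than being typed in by hand; an empty result from the comparison means the file matches the profile on every property the profile lists.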

c3po (screencast) - a software tool for visualising and investigating the content types contained within a collection

Nanite can characterise files contained in web archives (ARC/WARC) without first extracting the files. The tool can be used on a Hadoop cluster.
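
The idea of characterising content in place can be sketched as follows. This is not Nanite's implementation: the archive records are represented as in-memory (URI, payload) pairs with invented contents, no ARC or WARC parsing is shown, and identification uses only a handful of well-known file signatures. The names `identify` and `characterise` are hypothetical.

```python
from collections import Counter

# A few well-known leading byte signatures and the formats they indicate.
MAGIC = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def identify(payload: bytes) -> str:
    """Identify a payload from its first bytes, without writing it to disk."""
    for signature, mime_type in MAGIC:
        if payload.startswith(signature):
            return mime_type
    return "application/octet-stream"


def characterise(records):
    """Count the formats found across an iterable of (uri, payload) records."""
    return Counter(identify(payload) for _uri, payload in records)


# Invented records standing in for entries read from a web archive file.
records = [
    ("http://example.org/report.pdf", b"%PDF-1.4 ..."),
    ("http://example.org/logo.png", b"\x89PNG\r\n\x1a\n..."),
    ("http://example.org/photo.jpg", b"\xff\xd8\xff\xe0..."),
    ("http://example.org/unknown", b"\x00\x01\x02"),
    ("http://example.org/manual.pdf", b"%PDF-1.7 ..."),
]

print(characterise(records))
```

Because each record is identified independently and the counts are simply added together, this kind of work divides naturally across the nodes of a cluster.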

Pagelyzer - visual, structural and hybrid comparison of web pages.

Pagelyzer
Image comparing two similar web pages with text and images outlined with red and green squares

Services:

Plato is a preservation planning tool that integrates content characterisation, preservation actions and automated object comparison.

Scout is a preservation watch system that consolidates information from several sources (web, content, registries, policies) and monitors that information against a defined policy.

Scout
Image showing Scout system with blue icons, depicting content, policies, registries, web and human knowledge with arrows pointed to a large eye (the scout) which has an arrow coming out the bottom pointing towards an envelope labeled 'risk notification'
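
Monitoring information against a defined policy can be pictured with a very small sketch. It does not reflect Scout's actual data model or interfaces: the policy format, the property names and the function `check_policy` are all assumptions made for illustration.

```python
def check_policy(measurements: dict, policy: dict) -> list:
    """Return one notification string for each policy rule that is broken."""
    notifications = []
    for prop, (kind, limit) in policy.items():
        value = measurements.get(prop)
        if value is None:
            notifications.append(f"{prop}: no measurement available")
        elif kind == "max" and value > limit:
            notifications.append(f"{prop}: {value} exceeds maximum {limit}")
        elif kind == "min" and value < limit:
            notifications.append(f"{prop}: {value} is below minimum {limit}")
    return notifications


# Hypothetical policy: each property has a ("max" or "min", limit) rule.
policy = {
    "percent_invalid_files": ("max", 1.0),
    "tools_supporting_format": ("min", 2),
    "percent_unidentified_files": ("max", 5.0),
}

# Hypothetical measurements gathered from content and registries.
measurements = {
    "percent_invalid_files": 3.2,
    "tools_supporting_format": 4,
}

for message in check_policy(measurements, policy):
    print("Risk notification:", message)
```

Treating a missing measurement as something to report, rather than silently passing it, is a design choice in this sketch: a watch system that cannot see a property cannot vouch for it.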

As you can see, a wide variety of tools is being produced or enhanced within the project, and there are many more that are not listed here. If you are interested in finding out more about any of these tools, take a look at http://www.scape-project.eu/tools. More in-depth blog posts can be found on the Open Planets Foundation blog: http://www.openplanetsfoundation.org/blog.

William Palmer

Digital Preservation Technical Lead, SCAPE Project.
