Collection Care blog

55 posts categorized "Digitisation"

06 January 2014

Scalable Preservation Environments: the nuts and bolts of digital preservation software tools

The British Library is a partner in the SCAPE Project, a Seventh Framework Programme (FP7) project co-funded by the European Union. Its aim is to enhance the state of the art of digital preservation in three ways: by developing infrastructure and tools for scalable preservation actions; by providing a framework for automated, quality-assured preservation workflows; and by integrating these components with a policy-based preservation planning and watch system. Other partners include leading European libraries, universities and companies. A full list is available on the SCAPE website.

Digital preservation tools
A cartoon illustration of a small man in grey scale unlocking a box with vibrant colours and symbols such as musical notes, pictures, media players and written text

CC BY-NC 3.0

The British Library's Digital Preservation Team undertakes the R&D necessary to ensure the Library is able to implement the right technology and best practices to support digital preservation, at the right time. We have previously blogged here about our “Twelve Principles of Digital Preservation”.

Staff from the Digital Preservation Team - whilst representing the British Library’s interests within the project - lead the project in two key areas: we chair the technical coordination committee responsible for all technical developments within the project, and we lead a work package on creating and evaluating the execution of workflows for large scale digital repositories. We are also involved in two other “testbed” work packages related to web archiving and research datasets, as well as work packages surrounding the take-up of project outputs involving dissemination, demonstrations and training.

Our technical work within the project includes development and enhancement of characterisation and quality assurance tools and associated large scale workflows for characterisation of content within web archives, file format validation & identification of DRM in ebooks, and quality assured file format migration of TIFF files to JP2. Similar work by other partners includes characterisation of large audio/video files, audio migration, large scale ingest to a repository, arc to warc migration and other types of file format migration.

For execution of these tools and workflows across large scale data sets, the project uses Apache Hadoop. At the tool level however, software is discrete and can be used separately or within other large scale processing frameworks. The project is also creating services around policy-based preservation planning (Plato) and watch (Scout), and defining the necessary interfaces to enable all these entities to work together.

Some of the digital preservation tools and services that have been developed within the project include:

Tools:

xcorrSound  - a suite of tools for automated quality assurance of audio migration processes.

XCorrSound
Image of blue sound wave on XCorrSound software

The tools can:

  • Find overlaps between sequential audio files
  • Find occurrences of a smaller section of audio within a larger dataset
  • Compare two audio files to see how they correlate
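To give a feel for the second bullet, here is a minimal sketch of locating a short clip inside a longer signal using normalised cross-correlation. It is a brute-force illustration written for this post, not xcorrSound's own implementation, and the function name `best_match_offset` and the toy signal are assumptions made up for the example.

```python
import math


def best_match_offset(haystack, needle):
    """Return (offset, score) where needle best matches haystack.

    The score is the normalised cross-correlation, in the range [-1, 1].
    """
    n = len(needle)
    if n == 0 or n > len(haystack):
        raise ValueError("needle must be non-empty and no longer than haystack")

    needle_mean = sum(needle) / n
    needle_centred = [x - needle_mean for x in needle]
    needle_norm = math.sqrt(sum(x * x for x in needle_centred))

    best_offset, best_score = 0, -1.0
    for offset in range(len(haystack) - n + 1):
        window = haystack[offset:offset + n]
        window_mean = sum(window) / n
        window_centred = [x - window_mean for x in window]
        window_norm = math.sqrt(sum(x * x for x in window_centred))
        if needle_norm == 0 or window_norm == 0:
            continue  # a flat signal has no defined correlation
        dot = sum(a * b for a, b in zip(needle_centred, window_centred))
        score = dot / (needle_norm * window_norm)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score


# Toy "audio": a sine wave with a distinctive burst written in at sample 40.
signal = [math.sin(0.3 * i) for i in range(100)]
burst = [1.0, -0.5, 0.8, -0.9, 0.2, 0.7]
signal[40:46] = burst

offset, score = best_match_offset(signal, burst)
print(f"best match at sample {offset}, correlation {score:.3f}")
```

A score close to 1 means the clip and that stretch of the signal have the same shape. Real audio files contain millions of samples, so a practical tool would need something much faster than this sample-by-sample loop.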

Matchbox can automatically find duplicate images, for example duplicate scans, or match images from two separate scans of a book.

Matchbox
Images of several scanned pages with red and green lines linking copies together

Jpylyzer  - a JP2 (JPEG2000 part 1) validator and properties extractor.

Jpylyzer
illustration of a hot air balloon on Jpylyzer software with several cream numbers on top of the image on the left and a duplicate scan on the right

This tool can be used to:

  • Verify if JP2 files conform to the JP2 specification
  • Extract information about the encoding profile used for the file. This can be compared to an institutional encoding profile for verification
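The second use can be sketched as a simple comparison between extracted properties and an institutional profile. The property names and values below are hypothetical placeholders, not Jpylyzer's actual output format, and `profile_mismatches` is a helper invented for this illustration.

```python
def profile_mismatches(extracted, profile):
    """Return {property: (expected, actual)} for every property that deviates."""
    mismatches = {}
    for name, expected in profile.items():
        actual = extracted.get(name)  # None if the property was not reported
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


# Hypothetical institutional encoding profile (names and values are illustrative).
institutional_profile = {
    "transformation": "9-7 irreversible",
    "levels": 5,
    "layers": 8,
    "order": "RPCL",
}

# Hypothetical properties as they might be extracted from one JP2 file.
extracted_properties = {
    "transformation": "9-7 irreversible",
    "levels": 6,
    "layers": 8,
    "order": "RPCL",
}

mismatches = profile_mismatches(extracted_properties, institutional_profile)
for name, (expected, actual) in mismatches.items():
    print(f"{name}: expected {expected!r}, found {actual!r}")
```

An empty result would mean the file matches the profile on every property the profile lists.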

c3po (screencast) - a software tool for visualising and investigating the content types contained within a collection

Nanite can characterise files contained in web archives (arc/warc) without first extracting the files. The tool can be used on a Hadoop cluster.

Pagelyzer - visual, structural and hybrid comparison of web pages.

Pagelyzer
Image comparing two similar web pages with text and images outlined with red and green squares

Services:

Plato is a preservation planning tool that integrates content characterisation, preservation actions and automated object comparison.

Scout is a preservation watch system that consolidates information from several sources (web, content, registries, policies) and monitors that information against a defined policy.

Scout
Image showing Scout system with blue icons, depicting content, policies, registries, web and human knowledge with arrows pointed to a large eye (the scout) which has an arrow coming out the bottom pointing towards an envelope labeled 'risk notification'

As you can see there is a wide variety of tools being produced or enhanced within the project. There are many more that are not listed. If you are interested in finding out more about any of these tools take a look at http://www.scape-project.eu/tools. More in-depth blog posts can be found on the Open Planets Foundation blog: http://www.openplanetsfoundation.org/blog.

William Palmer

Digital Preservation Technical Lead, SCAPE Project.

31 December 2013

New Year’s Resolution: 300 ppi?

Did you know that image resolution has absolutely nothing to do with how an image looks on a screen? It is a fairly safe bet that more of our collections will be digitised in the next few years. As technology moves on with great pace there is often debate as to the “best resolution” that images should be captured at. But what does that actually mean? This post will try to explain what is meant by the terms pixel and image resolution, and will demonstrate the relationship between them.

Pixels and megapixels

Digital images are made up of thousands or even millions of pixels (picture elements). A pixel is the smallest addressable element in a display device with a specific assigned value that can be read by a computer and mapped onto a grid to recreate an image. Each pixel is a sample of an original image, so the more samples there are, the more accurate the representation of the original. We can change the appearance of an image by manipulating the pixels or by getting rid of some of them to reduce the file size. Below we see a digital image of the Gospel of St John from the Lindisfarne Gospels (British Library, Cotton MS Nero D.IV). It is obvious that the image with more pixels is of a higher quality than that with fewer pixels.

Unpixelated

Figure 1: Cropped portrait of St John wearing purple, gold and green robes. 

Pixelated

Figure 2: Pixelated close-up image of the portrait of St John. 

Figure 1 has more pixels and so produces a more accurate representation of the subject matter. Figure 2 looks “pixelated” due to the visibility of the pixel boundaries.

 

How pixels control resolution

Pixels control image resolution because the closer together the pixels are placed (i.e. the more there are per inch), the denser the image becomes with detail. Similarly, the fewer pixels an image has per inch, the further apart they are spaced, resulting in less detail and an image of poor quality.

Image resolution is therefore concerned with the number of pixels per inch (ppi) printed out on a piece of paper, and the size of those pixels. Since the software takes care of the pixel size, it’s really just the ppi that you need to think about.

Let’s try to understand that better by taking a look at an image captured with a DSLR camera. Below is a photograph of our new multispectral imaging system opened in the open source image processing software package ImageJ.

Screen shot of open image

Figure 3: Full-size, uncompressed photograph opened in image processing software package ImageJ.

If we look at the title bar of the image we can see some details about the image file.

Screen shot of open image title bar

Figure 4: The title bar tells us the name of the image file, the percentage size in brackets, and the number of pixels.

The title bar (DSC_0074.JPG (16.7%)) tells us that this file is only opening up to 16.7% of full size. The image is just too large to open on the screen at 100%. Below the title bar we can also see that the size of the image is 6,000 x 4,000 pixels (i.e. there are 6,000 pixels running along the image from left to right and 4,000 pixels running from top to bottom). That sounds like a lot of pixels. If we now zoom in on any part of the image we can see these pixels as little squares of colour.

Zooming in

Figure 5: A cropped portion of the original photograph.

Zooming in

Figure 6: Zoomed in portion of the original image. 

Zooming in

Figure 7: At maximum zoom it becomes apparent that the image is made up of pixels.

If there are 6,000 pixels along the top of the image, and 4,000 pixels along the side, then my incredible math skills suggest that there must be 24,000,000 (= 4,000 x 6,000) pixels in total, or 24 million pixels, or 24 megapixels (MP). A quick glance at the camera manual will show that this camera (Nikon D5200) has in fact got a 24 MP CMOS sensor, so our powers of deduction are correct.
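For anyone who would rather let the computer do the arithmetic, the same sum can be written as a tiny Python sketch. The function name `megapixels` is just a label chosen for this example.

```python
def megapixels(width_px, height_px):
    """Total pixel count expressed in millions of pixels (MP)."""
    return width_px * height_px / 1_000_000


total_pixels = 6000 * 4000
mp = megapixels(6000, 4000)
print(f"{total_pixels:,} pixels = {mp:g} MP")
```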

 

Resolution doesn’t mean anything until you go to print

We now know that there are 6,000 x 4,000 pixels in our image. Great! But what does that mean if we want to print out this image on a piece of paper? How does a pixel correlate to the size of the page? Will the image fill the whole page or will it just appear as a tiny thumbnail? Take a look at the image resolution by opening the image up in another great open source image processing package called GIMP, and opening the Set Image Print Resolution window.

Set Image Print Resolution

Figure 8: Set image print resolution page. 

Here we can see that the X and Y resolution is 300 pixels/in which means that for every inch of paper we have, there will be 300 pixels printed. So if we have 6,000 pixels along the top and 4,000 along the side that means we must have 6,000/300 = 20 inches along the top and 4,000/300 = 13.333 inches along the side… and if we look at the print size in the window above we can see that has already been calculated for us.
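The calculation is simply pixels divided by pixels per inch. As a rough sketch (the function name `print_size_inches` is made up for this example and is not part of GIMP):

```python
def print_size_inches(width_px, height_px, ppi):
    """Printed width and height in inches at a given pixels-per-inch setting."""
    if ppi <= 0:
        raise ValueError("ppi must be positive")
    return width_px / ppi, height_px / ppi


width_in, height_in = print_size_inches(6000, 4000, 300)
print(f"{width_in:.3f} x {height_in:.3f} inches")
```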

20 by 13+ inches is quite a large size. How can we print it out smaller to fit on our page? We need to fit more pixels into each inch, and since the size of an inch can’t change then the size of the pixels must change. That is done automatically for us by GIMP or Photoshop, or whatever image processing software package you are using. Let’s say we set our image resolution to be 600 pixels per inch. In that case we can see that the print size has adjusted to a much more manageable 10 x 6.67 inches. The resolution changes as the physical image size changes because the number of pixels that make up the image is being spread over a greater or lesser area.

Set Image Print Resolution

Figure 9:  By increasing the number of pixels per inch we can fit our image into a smaller area of the page.
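We can also turn the question around and ask what resolution is needed to make the image fit a given area. The sketch below assumes a hypothetical printable area of 10 x 8 inches; the function name `ppi_to_fit` is likewise invented for the example.

```python
def ppi_to_fit(width_px, height_px, max_width_in, max_height_in):
    """Smallest pixels-per-inch value at which the image fits inside the area."""
    if max_width_in <= 0 or max_height_in <= 0:
        raise ValueError("the printable area must have positive dimensions")
    # Whichever dimension is the tighter fit decides the resolution.
    return max(width_px / max_width_in, height_px / max_height_in)


# Hypothetical printable area of 10 x 8 inches for the 6,000 x 4,000 pixel image.
ppi = ppi_to_fit(6000, 4000, 10, 8)
printed_width_in = 6000 / ppi
printed_height_in = 4000 / ppi
print(f"{ppi:g} ppi gives {printed_width_in:.2f} x {printed_height_in:.2f} inches")
```

With these numbers the width is the limiting dimension, which gives the 600 ppi setting and the 10 x 6.67 inch print size from the example above.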

PC monitors are generally considered to be low resolution devices, meaning that images look good on screen even if they have a very small total number of pixels. This reduced number of pixels also allows images to load faster, leading to an overall better user experience. But if you try to print such an image out, you may be disappointed at the tiny image that emerges from your printer. Printers are high resolution devices and require an image to have a resolution of about 300 pixels per inch to look sharp and to be of a good quality. 300 ppi is generally accepted as the resolution for professional quality printing, but that number is increasing all of the time. There are many great articles and tutorials about this and other aspects of digital objects on the Digital Photo Essentials Tutorial for anyone new to the world of digital photography or photo-editing.

Best of luck with your New Year's Resolutions!

Christina Duffy (@DuffyChristina)
Imaging Scientist

22 December 2013

New hyperspectral imaging capabilities at the British Library

Collection Care has excitedly accepted delivery of a new hyperspectral imaging system. The system is designed specifically for archival and cultural heritage imaging for the purpose of revealing hidden and faded information. It is designed by digital imaging experts MegaVision, who are based in California. The EV™ camera includes MegaVision’s Monochrome E7 50-megapixel back, computer controlled shutter and aperture, and custom hyperspectral parfocal lens, which is responsive over the entire range of silicon sensitivity.

MegaVision system
Testing of the MegaVision system. The image shows a light brown table with the imaging system on top. To the left of the imaging system is a laptop and on the right is a book on top of a sheet of plastazote. On either side of the table are LED sidelights with diffusers; the stands of the sidelights are yellow and black.


CC by Testing of the MegaVision Cultural Heritage EV™ Imaging System showing LED sidelights with diffusers, and the E7 50 MP digital camera back on vertical mount

The system integrates two previously disparate imaging capabilities: high-resolution photography and multi-spectral imaging. Images are captured over 12 spectral bands from the near ultraviolet (365 nm) to the near infrared (1050 nm). Captured images are used for preservation and scholarly studies of British Library collections on materials such as parchment, paper, papyrus, inks and other constituents of cultural items. A series of palimpsests (parchment from which writing has been erased and overwritten) and Treasures of the British Library have been identified for imaging, which will take place in the New Year.

 

Palimpsest
Cropped black and white image of a section of a Syriac manuscript showing evidence of previous writing under text.

CC by Evidence of palimpsest detail under UV illumination in this Syriac manuscript (OMS Add 14623)

The MegaVision system replaces the Forth Photonics MuSIS system, which was purchased in 2004 for work on the Codex Sinaiticus project, and has found many applications since. MuSIS creates spectral bands using band pass filters to filter the light after it is reflected from the collection item. The MegaVision system uses narrow-band LED illumination, which subjects the collection items to only the required light energy to expose the sensitive unfiltered monochrome sensor. The LED panels are configured with visible, UV and IR bands. This selective illumination process significantly reduces the light energy falling on collection items, and has the added bonus of looking very cool indeed!

Green light
Green illumination showing stand and LED diffusers saturated in green light.

CC by Green illumination: Narrow-band LED illumination subjects collection items to different light wavelengths (red, green, blue, cyan, amber, UV, IR)

MegaVision's Photoshoot™ digital image capture software controls all aspects of capture. It also controls a colour wheel, which allows additional light modifications such as filtration to isolate fluorescence in concert with UV illumination.

The technology has been internationally heralded for its use on the Archimedes Palimpsest Project, the Gettysburg Address and the Waldseemüller map, while data is still being captured from St Catherine’s Monastery in the Sinai Desert. The datasets will become digital assets of historical and scientific value in their own right, and can be further processed to enhance regions of interest.

This is a landmark purchase for Collection Care, showing the commitment we have to furthering the understanding of our collections and the importance of science and research in archival institutions. The quest for information recovery and discovery continues!

Christina Duffy (@DuffyChristina)

Imaging Scientist

13 December 2013

Digitisation as a preservation tool; some considerations

Digitisation projects are now an increasingly common and established reality in public institutions large and small. The expectation from the public for online access has placed great pressure on public institutions which hold collections of historical and artistic value to provide it as soon as possible. Large investment in digitisation projects has had a major impact on the work pattern of many institutions, and on the collections involved in the processes related to the digitisation workflows.

I am a book conservator currently managing the conservation studio that has been created for the British Library/Qatar Foundation Partnership programme. Phase 1 runs until December 2014 and aims to digitise and make available online 500,000 images for scholars and the general public. These images will be taken from various British Library Arabic materials and it is our duty as conservators to support the digitisation process ensuring that no damage is caused to the library items processed through the digitisation workflow.

Phase box
Image of custom made phase box on a green table. The back of the box is grey and the inside is white. Resting on the centre of the box is the heavily damaged manuscript, which has a detached board.

CC by An example of a custom-made phase box for this heavily damaged manuscript

I want to present in this post some considerations about what conservation could potentially gain from these types of projects and how I think the long term preservation of historical items and their features can be improved through mass digitisation projects. The previous sentences make quite provocative statements. It is not a secret that conservators tend to look at digitisation projects, and in general at projects involving multiple processes, with caution if not suspicion. In general conservators are often against the “mass” approach and digitisation processes are primarily focused on targets that are sometimes strained under tight deadlines and budgets. This can be an unsuitable environment for the normal conservation requirements.

Conservation means attention to detail and much of the work involves time-consuming treatments carried out by skilled professionals at their benches. These treatments are often carried out to help public institutions achieve their aims and fulfil their strategic priorities. Enabling access to library collections is one of the more important principles of sustainable stewardship. Conservation at the British Library has in the last few years adopted the “fit for purpose” approach. With re-treatability and minimal intervention approaches clearly in mind, we know that today we have to plan our work in a more efficient and effective way. Planning is a fundamental step in our daily and long term work and to do so we need to know which specific goal we want to achieve.

In the present case for the British Library/Qatar Foundation Partnership programme, digital surrogates are the aim; good quality reproduction of items capable of providing online customers (scholars, readers and the general public) with the information they require. There are many steps between the shelves of the British Library storage areas and the cameras in the photographic studio. Conservators need to be present throughout each stage of this flow to support and to enable successful digitisation.

This can be difficult to achieve as full time conservators are expensive. Work needs to be customised but this certainly doesn’t mean compromising on the quality of the work carried out on collection items. In the context of the British Library/Qatar Foundation Partnership programme, a document about policies and procedures was produced by the conservation studio at the very beginning of the project. In this document we state that due to the scope and the nature of the project, we cannot treat items that are in need of conservation work that would take more than five hours. This means that generally we are not “fully” repairing the items we are processing through the workflow, but instead we are treating the items to a condition that enables digitisation.

After assessing the condition of the items brought into the project we decide if they are fit for handling, and if so they can proceed along the work flow. Quite often items with minor damage can still be digitised because the imaging and cataloguing processes (even if very intense from a handling point of view) are carried out in a highly monitored environment where we provide training for each member involved in handling library items, and constant support where needed.

We have also devised a colour-coded “traffic light” system that we use to communicate with the other strands of the project through our tracking system on an online shared drive. An orange dot, for example, placed next to other information on the shared drive highlights that an item is in need of careful handling due to its fragile or damaged state.

SharePoint
A screenshot of the SharePoint window, showing how an item is tracked through conservation. The SharePoint lists the shelfmark, batch, format, title, workflow stage, conservation indicator, its status, and assigned group. The items are pink or turquoise with an orange (in need of careful handling) or green (fit for handling) dot.

CC by Screenshot of the SharePoint window with information about items processed. Coloured dots highlight the conservation status of these items: orange: in need of careful handling/support from conservation, green: fit for handling
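The assessment rules described above can be sketched in a few lines of Python. This is only an illustration of the logic as described in this post, not the project's actual tracking system: the class and function names, the example shelfmarks and the estimated hours are all hypothetical. Only the five-hour limit and the meanings of the orange and green dots come from the text.

```python
from dataclasses import dataclass
from enum import Enum

# Limit stated in the conservation studio's policies and procedures document.
MAX_TREATMENT_HOURS = 5


class Indicator(Enum):
    GREEN = "fit for handling"
    ORANGE = "in need of careful handling/support from conservation"


@dataclass
class Item:
    shelfmark: str                    # hypothetical example values below
    fragile_or_damaged: bool          # outcome of the condition assessment
    estimated_treatment_hours: float  # conservator's estimate, if treated


def treatable_within_project(item):
    """Work that would take more than five hours is outside the project's scope."""
    return item.estimated_treatment_hours <= MAX_TREATMENT_HOURS


def indicator(item):
    """Colour of the dot shown next to the item in the tracking list."""
    return Indicator.ORANGE if item.fragile_or_damaged else Indicator.GREEN


sound = Item("EXAMPLE 1", fragile_or_damaged=False, estimated_treatment_hours=0)
torn = Item("EXAMPLE 2", fragile_or_damaged=True, estimated_treatment_hours=2)
detached = Item("EXAMPLE 3", fragile_or_damaged=True, estimated_treatment_hours=12)

for item in (sound, torn, detached):
    print(item.shelfmark, indicator(item).name, treatable_within_project(item))
```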

By doing this we ensure that all risks relating to possible damage occurring to items during handling and use are mitigated. At the same time we make possible the creation of surrogates from items that would otherwise not be available to readers in the reading rooms due to their condition, or would be available only after extensive conservation work. By providing surrogates to readers we should be able to preserve the original physical item from further handling, and this can only be achieved if an item’s access is subsequently reduced.

This is already quite an achievement - when it works, but even in such a customised capacity we can do more than that and the magic word here is “housing”. Good functional housing can be provided by creating customised, and not necessarily expensive, enclosures. If correctly used, phase boxes, folders, and Melinex enclosures provide very effective solutions to prolong the existence of fragile and endangered items.

We also provide supportive treatments such as repairs to major tears and weak areas. These are carried out only to minimise the risk of further damage during handling. This does not mean that as conservators we are sacrificing our knowledge and experience, but it means that we are shifting our expertise towards a wider and more comprehensive approach regarding what we can do for the preservation of our collection.

Conservation, as the word itself says, is the profession that aims to “conserve” items and all their historical features. Looking at the few examples below it is very clear that quite often full treatments have resulted in the complete transformation of the physical nature of the treated item. New sewing, heavy repairs applied to the supports, and new arrangements of items (loose leaves to a bound format) have completely jeopardised the understanding of the physical history of those items.

Restoration
The left side of the image shows volume flat on a green desk, the volume is in poor condition and has a red label on the front which states 'not to be issued refer reader to'. On the right side of the image there are two brown volumes in a slip case, spine out with darker brown labels with gold tooling.

CC by Two originally “similar” items have, after restoration, lost most of their original physical appearance and therefore invaluable information related to their history

I love books and I love the feeling of handling items that are as they were meant to appear when they were produced. Physical features are an integral part of the history of an object, and too often paper based items are considered only for their content.

Nothing of importance!
Leather bound volume with skinned leather and a large white label on the front which reads, in red ink, 'Nothing of importance'. There are also some annotations above this in black ink but they are not legible.

CC by Unfortunately, many bindings and other physical features have been discarded as “Nothing of importance”!

In the following image it is possible to see how good intentions translated into over-restoration. This practice has caused a lot of losses of original features and therefore vital information about the item.

Guard book
Yellow guard book of rebound documents. The book is open on a green desk showing annotated pages. Behind the volume is another guard book and a slip case in which both of the guard books nestle.

CC by Guard book of documents that were originally bound together. The paper is laminated and then “hooked” with paper hinges to be bound in the present format

It gives great personal and professional satisfaction to see my input valued and to enable others to enjoy items I am conserving in their original state. It is not always possible or even advisable to stop carrying out full treatments on damaged items altogether, but it is important to remember that we take on a great responsibility by doing so. It is a natural and understandable expectation that we want to see things “as new”, but that is not the aim of conservation.

I like to say that conservation is not about preserving what we can see, but is to be able to leave things as they are as much as possible; it is what we cannot see that really matters.

Heavily damaged manuscript
Heavily damaged manuscript with brown cover. There are three images of the manuscript. The top image shows the front of the volume, which has a white label adhered with black writing which cannot be read from this image, and a red sticker which reads 'not to be issued refer to' printed in black ink. The image on the bottom left shows the gutter of the manuscript when open and the image on the bottom right shows the inside top right corner of the board when the book is open.


CC by This heavily damaged manuscript has been digitised and re-housed in a box. By doing this we have been able to preserve all the original features of its contemporary binding, remnants of the sewing threads and materials used in the making of the cover. These details provide clues about specific crafts employed, as well as shedding new light on issues like provenance of the object. They may even inspire new approaches for the interpretation of its content

Mass processing workflows such as those employed in digitisation projects offer conservators a great opportunity to gain understanding about entire collections and not just about single items. By processing a great number of items the conservator acquires knowledge of a whole group of items leading to a wider understanding of the collections and the issues relating to them.

It is a great challenge for conservators to make the best use of this newly acquired knowledge. We have to be able to share what we learn with other strands of our institutions, and also more broadly with interested outside audiences. Information dissemination has never been easier with blogs and Twitter feeds allowing us to share our knowledge quickly and efficiently. It is an opportunity for better communication that we should embrace.

Flavio Marzo

Gulf History Arabic Science Project Conservator