26 September 2014
Applying Forensics to Preserving the Past: Current Activities and Future Possibilities
First Digital Lives Research Workshop 2014 at the British Library
With more and more libraries, archives and museums manifestly adopting forensic approaches and tools for handling and processing born digital objects both in the UK and overseas it seemed a good time to take stock. Archivists and curators were invited (via professional email listservs) to submit a short paper for an inclusive and interactive workshop stretching over two days in London.
Institutions are applying digital forensics across the entire lifecycle from appraisal through to content analysis, and have begun to establish workflows that embrace forensic techniques such as the use of write blockers for the formation of disk images, the extraction of metadata and the searching, filtering and interpreting of digital data, notably the appropriate management of sensitive information.
There are two sides to digital forensics for it begins with the protection of digital evidence and concludes with the retrospective analysis of past events and objects. Papers reflecting both aspects were submitted for the workshop (download DLRW 2014 Outline).
It provided participants with opportunities to report on current activities, highlight gaps, constraints and possibilities, and to discuss and agree collective steps and actions.
As the following list demonstrates delegates came from a diverse range of institutions: universities, libraries, galleries and archives, and the private sector.
Matthew Addis, Arkivum
Fran Baker, John Rylands Library, University of Manchester
Thom Carter, London School of Economics Library
Dianne Dietrich, Cornell University Library
Rachel Foss, British Library
Claus Jensen, Royal Library of Denmark and Copenhagen University Library
Jeremy Leighton John, British Library
Svenja Kunze, Bodleian Library, University of Oxford
John Langdon, Tate Gallery
Cal Lee, University of North Carolina at Chapel Hill
Caroline Martin, John Rylands Library, University of Manchester (contributor to paper)
Helen Melody, British Library
Stephen Rigden, National Library of Scotland
Elinor Robinson, London School of Economics Library
Susan Thomas, Bodleian Library, University of Oxford
Dorothy Waugh, Emory University
I gave an introduction to the original Digital Lives Research project and a brief overview of the ensuing internal projects at the British Library (Personal Digital Manuscripts and Personal Digital Archives), while Aquiles Alencar-Brayner gave an introduction to Digital Scholarship at the British Library including the award winning BL Labs project.
Short talks presented overviews of current activities at the National Library of Scotland, University of Manchester and London School of Economics and the establishment of forensic and digital archiving at these institutions, including the value of a secure and dedicated workspace, the use of a forensic tool for examining large numbers of emails, the integration of forensic techniques within existing working environments and practices, and the importance of tailored training.
Other talks were directed at specific applications of forensic tools in the preservation of complex digital objects in the Rose Goldsen Archive of New Media at Cornell University Library, the capture of computer games at the Royal Library of Denmark, and the challenges of capturing the floppy disks of poet and author Lucille Clifton at Emory University, these media being derived from a Magnavox Videowriter.
My colleagues Rachel Foss and Helen Melody and I presented a paper on the Hanif Kureishi Archive, a collection of paper and digital materials, recently acquired by the British Library’s literary curators: specifically, outlining the use of digital forensics for appraisal and textual analysis.
Prior to acquisition Rachel and I previewed the archive using fuzzy hashing (a technique for quickly identifying similar files).
After the archive was obtained and forensically captured, metadata were extracted from the digital objects and made available along with curatorial versions of the text documents, and Helen catalogued them using the British Library’s Integrated Archives and Manuscripts System.
One of the most exciting aspects of the archive is a set of 53 drafts of Hanif Kureishi’s novel Something To Tell You, which Rachel, Helen and I decided to explore as an example for the workshop.
Figure 1. Logical file size plotted against last modified date: an editing history
We used the sdhash tool (produced by Vassil Roussev of the University of New Orleans and incorporated within the BitCurator framework). Like the ssdeep fuzzy hashing tool (which has been incorporated into Forensic Toolkit, FTK), it identifies similarities among files but uses a distinct approach.
With BitCurator it is possible to direct sdhash at a set of files and ask the tool to first create the similarity digests and then to make pairwise comparisons across the similarity digests for all files: each pair of files being assigned a similarity digest score.
Figure 2. Similarity score (sdhash) plotted against absolute difference in indicated dates (days) between files (each point represents a pair of draft files): apparently and generally the greater the number of days between the files of a pair, the lower the similarity score
This is a preliminary analysis and readers of this blog entry who are familiar with statistical methods may recognise that it might be better to use partial regression or a similar statistical approach. A further small point, as Dr Roussev has emphasised, a 100% similarity does not mean that the files are identical; cryptographic hashes can serve this purpose and are to be incorporated in future versions of the sdhash tool which is still under active development.
Following the more formal talks we began an open discussion with the aim to identify some priority topics, and subsequently we divided into three groups to address: metadata, access and sensitivity, respectively, concluding the first day. On the second day, we focussed the conversation even more and as two groups addressed cataloguing and metadata on the one hand, and tools and workflows on the other hand.
Steps towards specific conclusions and recommended actions were made in preparation for publication and dissemination.
The desire to continue and extend the collaboration was strongly expressed, and fittingly Cal Lee concluded the workshop by updating us on developments of the BitCurator platform and the launch of the BitCurator Consortium, an important invitation for institutions to participate and for individuals to collaborate.
Many congratulations to Fran and Caroline on their email project becoming a finalist for the Digital Preservation Awards 2014: the University of Manchester Library’s Carcanet Press Archive project which among many things explored the use of the forensic tool Email Examiner along with Aid4Mail (which incidentally has a forensic version).
The workshop was jointly organised by me, Cal Lee (University of North Carolina at Chapel Hill) and Susan Thomas (Bodleian Library, University of Oxford).
Very many thanks to the delegates for all of their participation over the two days.
Jeremy Leighton John, Curator of eMSS