12 May 2011

Tales from the Digital Archive

If you would like to hear the BBC Radio 4 programme 'Tales from the Digital Archive' in which archaeologist Dr Christine Finn (a real digital archaeologist) explores personal digital archives, you can do so for a few more days via the BBC Radio 4 website.  

 

It features the British Library's eMSS Lab and three of the library's curators (myself and literary curators Rachel Foss and Helen Broderick) who are working with digital archives; as well as writers, biographers and archival specialist Erika Farr.  

 

Very many thanks to Christine for independently initiating the very timely visit to the eMSS Lab, just as the library was about to announce the purchase of the Wendy Cope Archive, and to Marya Burgess, the programme's BBC producer, who kindly arranged an extension so that the programme could be heard via iPlayer.


15 March 2011

Digital Film Making as Digital Curation

I am a member of the Digital Research & Curator Team, newly formed at the British Library as a component of a broader programme and organisational unit of Digital Scholarship. One of our first endeavours has been to reflect on the meaning and nature of Digital Curation and our role in it. One of my team mates, Maureen Pennock, who has been playing a prominent role in this process, reminded me yesterday that it clearly means a wide variety of things to a wide variety of people. Perhaps that is natural, expected and good.

 

In the context of personal digital archives, I and others have been elaborating the concept of Enhanced Curation where curators not only collect the original archive but add value to it. Thus we produce high resolution and interactive panoramic images of the creative environments of writers and scientists: their studios, studies and laboratories. Similarly, oral history audio interviews can be integrated with the collection and cataloguing of a scientific archive. Writers can be encouraged to speak about individual notebooks, drafts and correspondence.

 

We are also exploring the use of digital video as a means of capturing the personal landscape along with the thoughts and memories of the individual. One of my favourite books at the moment is Digital Film-Making by Mike Figgis. Modest in size, it is wonderfully pragmatic, wise and perceptive. 

 

[Image: cover of Digital Film-Making by Mike Figgis]
 

 

Of course, as curators we are not really in the profession of film making. Where should we draw the line in our use of digital video? It seems to me that it is not our primary aim to produce fully fledged documentaries. Instead our intention ought to be to capture some of what would otherwise be lost; and let others use the resulting unedited material as they feel is useful, which of course will change over time. Thus we seek the contextual information that will not be represented in the original archive, or at least not represented in the same detail or form. The room where the writer sat at his or her computer, surrounded by papers and books, drawing out inspiration. The garden where the physicist whiled away the hours absentmindedly pondering on deep and not-so-deep things. The personal library that supported the activities of the campaigning politician or social reformer. The studio, instruments and equipment that provided the composing musician with the necessary tools and space.

 

In a way that is what curators and archivists have always done - as far as current technology and resources permitted. Yet sometimes I wonder if this objective is forgotten. Technology changes, making some activities unnecessary, and yet perhaps we carry on doing them (because these processes are of course tried and tested, and therefore 'necessary'). Figgis makes a similar point in the context of digital video.

 

Ever thought about why the movie studios of Hollywood are in southern California? It is, he says, because of the plentiful light. But digital video does not need as much light and – according to Mike Figgis – artistically it is not necessary to make everything well lit. Digital video does not need to look like film.

 

Similarly, he argues that many film makers have become addicted to camera movement: the camera in pursuit of the individual, the grand sweeping from high up outside the house, down and in to the dining room where the family is gathered. “My quarrel with camera movement is at the point where the intelligence that had gone into deciding why the camera should move changes to the demand that the camera has to move....” “For me, the function of camera movement is to assist the storytelling. That’s all it is. It cannot be there just to demonstrate itself.”

 

So maybe that is what we need to do as we review the notion of digital curation. Reflect on what is really necessary and what is not, what is necessary in the new digital era and what is not. When we catalogue and process personal digital objects, it must be to add value; to show the characteristics and relationships of digital objects that would not be discernible and discoverable from the objects themselves (with or without forensic techniques). One of the things that we shall be learning over the coming months and years is just what can be reliably gleaned from the object itself and what cannot.

 

Some things really don’t change. In two sections entitled “Pre-Planning” and “The Good Clerk”, Figgis emphasises the absolutely essential role of information management, of note taking, in pre-planning and in post-production, activities that are deeply engrained in all archivists and curators. 

 

“For all its freedoms – its ease of movement, the lightness and cheapness of equipment, the availability of stock, all these wonderful factors that have liberated the film-making process – if there is not one person on set whose sole function is to help the editor by having a meticulously maintained book which is a log of everything that is being shot, then you will have chaos. And related to that log, every tape has to have a label on it, a colour code to say which camera it’s from. And you need a detailed shot-list. The imposition of this discipline on the camera operators is of fundamental importance”.

 

There are lots of other useful pieces of advice.

 

On envisaging the outcome: the writer suggests that film makers must learn to think not only about what they are doing with the camera in the present tense but also about what will emerge from it: “the image in the future tense”.

 

On the importance of the sound quality: “the sound of the voices and the music and the quality of the mix need to be of the highest possible level”, a sentiment that audio colleagues at the library have long expressed to me, for it is very easy to think that you can neglect audio quality.

 

On the use of the close-up, especially with people: you must not hide behind the camera’s technical ability to zoom in. In short, “You can’t steal the shot with a long lens or a zoom”, he says, citing no less an authority than Henri Cartier-Bresson (1908-2004).

 

The aspect of the book that I most like, however, is its appreciation of technology. Figgis is an artist, no doubt about it. Yet, he really respects the technology. He explores it, he tests it, he experiments with it, he loves it.  And he doesn’t just respect the equipment conceptually, he means each individual piece of kit.

 

“If it breaks and you need to throw it away, fine. But while it’s functioning, it has to be treated with love and respect”.

 

“If that seriousness doesn’t exist, if there’s a disdainful or disrespectful attitude to the camera, then the result will not be as good. I would extend that philosophy all the way through the digital film-making process and for all the tools you use – the camera, the tape, the computer. These things are yours for the period of creation, and they have to be imbued with the correct significance and seriousness, as befits the film-making process. If they’re not, then it will show”.

 

This quote happens to bring me back to my conversation with Maureen yesterday, and the sentiment shared with many archivists: digital curation must be about the whole lifecycle.

 


12 March 2011

Personal Conversations & Breaking the Semantic Barrier

A tweet that I received some days ago concerning a TED presentation reminded me of the work that Deb Roy is doing with the Human Speechome Project at MIT (see earlier blog entry). Currently he is on leave and working as Founder and CEO of Bluefin Labs on communication strategies for integrating mass media (which deliver the same content, online or offline, to many individuals) and social media (in which content created by one individual is shared with other individuals).

 

A pdf can be downloaded from the Bluefin Labs website; it requires registration but the file is supplied directly. Entitled "Mass Media, Social Media & the Semantic Barrier", it begins with a succinct outline of the impact of new communication technologies on earlier ones: "For the most part, new communication technologies push their way into human ecologies by integrating with other modes of communication, shaping but not replacing the other modes". 

 

Dr Roy is interested in the way social media can influence the impact of mass media on audience responses. He characterises clickstreams as 'linked audience response' data, and contrasts them with 'unlinked audience responses' that often emanate from social media conversations; by 'unlinked' he means that the responses to a piece of mass media are directed towards other people without being explicitly linked to the stimulating media itself. "When someone tweets out to their friends about a line they just heard on TV or posts an update about an ad they just saw, they have generated an unlinked response". While people do sometimes include a url, frequently they do not: "most comments about real-time media streams (including live events) contain no links". 

 

The paper suggests that unlinked audience response data are growing in value much faster than linked data, because social media are "unleashing those conversations" that people have about what they encounter in mass media. Dr Roy argues that the unlinked data are rich in emotional and semantic content. These data are little used due to the difficulty of characterising and understanding a large number of 'personal conversations'. "While it is easy for a person to understand a handful of sentences, it becomes impossible to do the same for a few million (let alone a few billion) sentences". "At scale, organizations [that are concerned about large numbers of people] become blind to the semantic links that bind conversations to their source". This is the semantic barrier.

 

Deb Roy’s research has been directed at just this kind of problem: the automation of language grounding, the creation of machines or tools that “learn to link language to context by observing and modeling human communication strategies”. Two ideas are employed: (i) deep machine learning algorithms that allow the capture of semantic connections between speech and video, eg during show-and-tell interactions between machine and humans; and (ii) observation and algorithmic analysis of human communication dynamics in naturalistic settings ‘in the wild’ such as the Speechome project. 

 

Bluefin has developed an automated media analysis platform to map “social media comments to mass media events such as TV shows and ads in real-time. Its purpose is to harness the power of unlinked audience response data to help organizations better communicate with individuals, and to help individuals navigate mass media in new ways”. 

 

The processing pipeline is as follows:

  • TV broadcasts are ingested and analysed using computer vision algorithms to automatically find events in video streams;
  • events are semantically analysed to create ‘audience receptors’: software programs that find and link social comments about the associated media event;
  • millions of audience receptors semantically analyse billions of unstructured social media comments and selectively bind social media comments to source media events;
  • improvements to audience receptors are continuously sought; and
  • the result is “a large, continuously updated database called the audience response matrix” that provides a comprehensive picture of the relationship between mass media and social media.
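
The binding step at the heart of this pipeline can be illustrated with a toy Python sketch. Everything in it is hypothetical: simple keyword overlap stands in for the semantic analysis, which the paper describes only in outline, and the names make_receptor and build_response_matrix, along with the min_overlap threshold, are my own inventions for illustration rather than anything taken from Bluefin.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case a text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def make_receptor(event_id, description, min_overlap=2):
    """Build a hypothetical 'audience receptor' for one media event.

    The receptor returns the event id when a comment shares at least
    min_overlap words with the event description, otherwise None.
    (Assumption: word overlap is a crude stand-in for semantic analysis.)
    """
    keywords = tokenize(description)

    def receptor(comment):
        overlap = len(keywords & tokenize(comment))
        return event_id if overlap >= min_overlap else None

    return receptor


def build_response_matrix(events, comments):
    """Bind unlinked comments to events: {event_id: [matching comments]}."""
    receptors = [make_receptor(eid, desc) for eid, desc in events.items()]
    matrix = defaultdict(list)
    for comment in comments:
        for receptor in receptors:
            event_id = receptor(comment)
            if event_id is not None:
                matrix[event_id].append(comment)
    return dict(matrix)
```

Given a dictionary of event descriptions and a list of comments, build_response_matrix returns only those comments that a receptor could bind to an event; comments that match nothing are left out, which is roughly the sense in which the binding is selective.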

 

Dr Roy concludes: “I believe real time and historic audience response data will turn out to be of immense and lasting value”. 

 

Further information can be obtained at the Technorati website.

 

It is worth reflecting on the notion and value of 'archives in the wild' highlighted by the Digital Lives research. These 'unlinked audience responses', these 'personal conversations' made through social media in response to mass media, are nothing other than a major component of an individual's personal digital legacy. The Bluefin paper further observes that many audience responses are made deliberately in a public forum, and argues that in this context "people want their voices to be heard". Equally, they may want their voices to be heard by future generations. On the other hand, as the Digital Lives Synthesis notes, some conversations remain less public, and a careful and sensitive mediation through trusted archival processes may make it possible to make use of these less public 'conversations' too: a resource, as indicated by the Bluefin paper, "of immense and lasting value".

 

This new technology clearly has potential implications beyond the matters discussed in the current blog entry. Future blog entries may pick up the topic again. 

 


07 March 2011

The Personal Cloud and the Plug Computer

"Your data, at home, in your house": Eben Moglen.

 

One of the points expressed in the Digital Lives Synthesis is the view that people need to retain (and regain) control of their personal information, their personal digital objects whether through legal requirement or technical empowerment: to have the ability to download their information locally to their own computer, to take full possession of it.

 

This is not to lament the arrival of computing in the cloud, of social networks, of crowdsourcing: far from it. These phenomena are truly wonderful, liberating, emancipating.

 

It is about what happens next.

 

Is it possible that the next phase has begun - a few small steps? Perhaps. Professor Eben Moglen of Columbia University, a law professor, has recently launched the Freedom Box Foundation. His aim is to rebuild the Internet "without governments and big companies" (to quote Jim Dwyer in The New York Times, 15 February 2011). "Federated, rather than centralised, microblogging, social networking, photo exchange, anonymous publication platforms based around cloudy webservers".

 

Ambitious and optimistic no doubt, but a key component already exists: the plug computer or server.

 

For examples of plug computers in their current manifestation, see the Pogoplug and the GuruPlug.

 

Cheap, small, with low power requirements and the potential to be adapted for more comprehensive internet applications, these are hardware devices that can be plugged into the wall, are widespread even now, and – moreover – "will get very cheap, very quick".

 

Freedom Box Foundation was set up to encourage the development of the necessary free software to make these devices easy to use.

 

A recent lecture given by Professor Moglen at the FOSDEM Conference in Belgium (Free and Open Source Software Developers’ European Meeting, 5-6 February 2011) spells out the thinking in more detail: there is a transcript available. There is also a Kickstarter appeal for funding.

 

Alongside an article about the Begram ivories of Afghanistan (previously missing but now on temporary view at the British Museum), there was a piece picking up the same theme, by John Naughton in The Observer (a Sunday newspaper in the UK), 27 February 2011, and entitled: “At long last, there's a silver lining in the age of cloud computing”.

 

He quotes Professor Moglen’s vision from an earlier lecture: “We need a really good web server you can put in your pocket and plug in any place …”.

 

"In other words, it shouldn't be any larger than the charger for your cell phone, and you should be able to plug it in to any power jack in the world, and any wire near it, or sync it up to any Wi-Fi router that happens to be in its neighbourhood. It should have a couple of USB ports that attach it to things. It should know how to bring itself up. It should know how to start its web server, how to collect all your stuff out of the social networking places where you've got it. It should know how to send an encrypted backup of everything to your friends' servers".

 

It is necessary only to add that it needs to incorporate digital preservation and digital curation thinking and functionality. Could a reborn and advanced Hoppla play a role?

 

A specific concern raised by the Digital Lives Synthesis is the ability of individuals to pass their personal information to others: to subsequent generations of their family, for instance, or to bona fide researchers when they choose to do so.

 

Whether or not the Freedom Box lives up to its promise, and however cloudy the future, the potency of family history, of memory and archival legacy, and the fundamental need for mediated and authenticated access for research by scientists and scholars, may help to drive a need for a less monopolised and centralised holding of personal data.


01 March 2011

On Fuzzy Hashes and Relatedness

At #pda2011, I reported briefly on some early tests I've done with fuzzy hashing at the British Library. Although I don't think the concept is discussed in the wonderful and definitive CLIR (Council on Library and Information Resources) document on digital forensics that emerged from the workshop held at the University of Maryland in Spring 2010 (sponsored by Mellon), it was at that workshop (organised by Matt Kirschenbaum) that I first heard of 'fuzzy hashes'.

(The report is entitled: Digital Forensics and Born-Digital Content in Cultural Heritage Collections and is coauthored by Matthew G. Kirschenbaum, Richard Ovenden and Gabriela Redwine with research assistance from Rachel Donahue. A pdf can be found here: Digital Forensics CLIR)

Barbara Gutmann of the National Institute of Standards and Technology kindly pointed me towards the relevant literature and in particular towards the work of Jesse Kornblum. (The NIST is already collecting fuzzy hash values and you can have a look at a sample of them on their website: Fuzzy Hash Datasets.)

A key publication is entitled: Identifying almost identical files using context triggered piecewise hashing, Digital Investigation 3S (2006) S91-S97. Jesse Kornblum has developed the ssdeep computer program, which computes and compares fuzzy hash values in a forensic context, while paying generous tribute to Andrew Tridgell for his invention of the spamsum algorithm designed to identify related spam.

As many readers will be well aware, cryptographic hashing algorithms are typically and desirably highly sensitive to change in a file (alteration of one bit results in a different hash value or 'digital fingerprint'). However, sometimes it is useful to be able to identify files that are closely related to each other. The fuzzy hashing technique does this by combining piecewise hashing (the hashing of segments or portions of a file) with a rolling hash computed over a small moving window of bytes. Whenever the rolling hash reaches a trigger value, the current portion ends and a new one begins, so the boundaries are set by the content itself rather than by fixed offsets. In this way the technique is able to detect files with corresponding portions. A measure can also be produced that gives a sense of how closely related a pair of files is.
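
For readers who like to see the idea in code, here is a minimal Python sketch of context triggered piecewise hashing. It is emphatically not ssdeep: the running byte sum used as the rolling hash, the trigger modulus of 64, SHA-256 as the piece hash and the use of difflib for the comparison are all simplifying assumptions of mine, and the function names fuzzy_hash and similarity are hypothetical. The 7-byte window does follow the window size used by spamsum and ssdeep.

```python
import hashlib
from collections import deque
from difflib import SequenceMatcher

WINDOW = 7        # bytes in the rolling window
BLOCK_SIZE = 64   # assumed trigger modulus: a piece ends roughly every 64 bytes
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def fuzzy_hash(data):
    """Return a signature with one character per content-defined piece."""
    signature = []
    piece = hashlib.sha256()          # hash of the current piece
    window = deque(maxlen=WINDOW)     # the last WINDOW bytes seen
    rolling = 0                       # sum of the bytes in the window
    for byte in data:
        piece.update(bytes([byte]))
        if len(window) == WINDOW:
            rolling -= window[0]      # oldest byte is about to drop out
        window.append(byte)
        rolling += byte
        # Trigger: the boundary depends only on the last few bytes, so an
        # edit early in a file does not shift boundaries later in the file.
        if rolling % BLOCK_SIZE == BLOCK_SIZE - 1:
            signature.append(ALPHABET[piece.digest()[0] % 64])
            piece = hashlib.sha256()
    # Close the final piece, so that every input yields at least one character.
    signature.append(ALPHABET[piece.digest()[0] % 64])
    return "".join(signature)


def similarity(sig_a, sig_b):
    """Score two signatures from 0 (nothing shared) to 100 (identical)."""
    return round(100 * SequenceMatcher(None, sig_a, sig_b).ratio())
```

Calling fuzzy_hash on two drafts of the same document and passing the results to similarity should give a high score, because only the pieces touched by the edit change their signature characters, whereas the SHA-256 digests of the two whole files would simply differ.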

At one place in his paper, Kornblum states: "Files with one-bit changes are almost entirely identical and share a large ordered homology. Borrowing from genetics, two chromosomes are homologous if they have identical sequences of genes in the same order. Similarly, two computer files can have ordered homologous sequences if they have large sequences of identical bits in the same order. The two files are identical except for a set of insertions, modifications, and deletions of data".

Although ssdeep itself is more powerful, AccessData has incorporated some of the functionality in recent versions of Forensic ToolKit (eg FTK 3.2); see a pdf of notes from Dustin Hurlbut. The Forensics Wiki also provides useful details.

Some preliminary testing with the Ronald Harwood Papers and the W. D. Hamilton Archive at the British Library has demonstrated the considerable potential of this approach, for example in quickly identifying related digital drafts and in giving an approximate measure of 'similarity'.

It is still early days and there are limitations but early signs are that a very useful tool is already emerging and developing that can be adapted for archival use.


The eMSS Lab: 2.0

As mentioned in my talk at the Personal Digital Archiving Conference 2011 in San Francisco, the British Library is upgrading its laboratory facility for personal digital archiving. It will be physically closer to the Digital Preservation Team and to the space used by the Open Planets Foundation which will help to encourage a Community of Practice (not to mention general discussion).

The transfer of the equipment from the existing Digital Scriptorium will begin next week. Watch this blog for posts as we install and test familiar and new equipment and procedures as part of the Personal Digital Manuscripts Project at the British Library.

The new space will cater for: Digital Forensics, Ancestral Computing, Curatorial Examination, Enhanced Curation (including Community Participation) and Adaptive Curatorship.


Happy Saint David's Day

[Image: daffodil]

Photo from Wikipedia entry on the Daffodil


Personal Digital Archiving 2011 at the Internet Archive

[Image: Internet Archive logo]

Following a stimulating visit to San Francisco to attend the PDA 2011 conference which was held at the Internet Archive, this blog is being reinvigorated.

[Image: Christian Science church building]

Many thanks to Jeff Ubois for inviting me to the meeting and program committee. You can get a good feel for the conference from twitter using #pda2011. See blogs from The Litbrarian and The WakiLibrarian. (Apologies for not writing my own notes for the blog but I need to read the speakers' lips; and unlike my daughter I have not mastered the art of writing without looking down :-).


03 July 2010

Seminar Update

Here is the latest version, as a pdf file, of the outline of the Digital Lives Research Seminar, from the Personal Digital Manuscripts Project at the British Library: Download Digital Lives Seminar 5July2010 v10

Please aim to arrive at 09:30, or at the latest before 09:45.


01 July 2010

A Seminar on Authenticity, Forensics, Materiality, Virtuality and Emulation

There are still some places at the forthcoming Digital Lives Research Seminar: Monday 5 July 2010. To book a place please contact Jeremy Leighton John at jeremy.john@bl.uk indicating which sessions you would like to attend and with which institution you are associated in some way (eg as a student).

If there is a place available and reserved for you, you will receive a confirmatory email.

Invited speakers include Christine Finn, Jussi Parikka (Anglia Ruskin University), Erika Farr and Naomi Nelson (Emory University), Daniela Petrelli (University of Sheffield), Seth Shaw (Duke University), Michael G. Olson (Stanford University), Gabriela Redwine (Harry Ransom Center, University of Texas at Austin), Kieron Wilkinson and Istvan Fabian (Software Preservation Society), Matthew G. Kirschenbaum (University of Maryland), Vincent Joguin (Joguin SAS), Matt Shreeve (Curtis+Cartwright Consulting), Simon P. Wilson (Hull History Centre) and Jeff Ubois (Fujitsu Labs of America).

There will also be presentations from members of the Personal Digital Manuscripts Project at the British Library.

There will be lectures, demonstrations and discussions: all addressing recent and emerging advances and issues in personal digital archives.

Topics will range from authenticity and the use of forensic technologies, through emulation and portable emulators, digital preservation services, initiatives with digital archives, software preservation, ancestral computers and the classic computer community, and low level disk analysis and capture, to the concepts of digital materiality and universal virtual machines.

Further details can be found in this pdf file: Download Digital Lives Seminar 5July2010 v8.

 
