Digital scholarship blog

Enabling innovative research with British Library digital collections


Tracking exciting developments at the intersection of libraries, scholarship and technology.

01 October 2014

The Art of Data


Last month the Digital Research team organised another successful Digital Conversations event. The evening, chaired by Anthony Lilley, brought together artists, researchers and art critics to reflect on projects and ideas around the use of digital data in contemporary artistic expression. Ernest Edmonds started the discussion by defining data as something that is constantly moving around, a movement which, in its turn, constantly transforms the data itself. Take communication, for example. When we communicate our ideas and conceptions of the world, we produce a sort of data that is transmitted from us to be perceived and given meaning by others. This interactive process implies that data is transformed over time, as different people can interpret the same data in different ways. Art, conceived as a mode of expression, follows the same logic: a single work can trigger different interpretations even in the same person. When we see a work of art for the first time we have a reaction that might change when we are exposed to it on a second occasion. What digital data offers to artists is the possibility of exploring new ways of representing the world, highlighting how data is constantly changing our perceptions. Digital data, according to Edmonds, is the ‘new canvas’ that artists use to make us aware of the transformation of data over time.


Michael Takeo Magruder continued the discussion by presenting some of his own artistic projects, arguing that the adoption of real time data by artists emphasizes the work of art as something in constant change, that is, something which is never really finished. Michael’s Data_Plex (economy) project was used as an example to illustrate this argument. The project was based on live data produced by the Dow Jones Industrial Average (DJI) index, represented in a simulated urban environment in the form of skyscrapers. The virtual buildings were erected or destroyed according to the fluctuations of the stock market. The audience was not only able to visualise a complex data structure in a more intelligible representation system but, more importantly, became aware of how unstable the whole economic market in the USA was, as buildings were constantly changing in size, colour and shape to represent the variations of the market. During times of financial crisis, the audience could see the virtual buildings falling down as the stocks crashed, revealing how some industries in the financial market are more prone to be affected by financial crises than others, which, in the virtual urban space created by Michael, were kept intact. On a more metaphorical level, the artwork offered a critique of capitalism as an unstable economic system that has the power to build up as well as to destroy what it has constructed.

Julia Freeman spoke of data as something that involves complexity, as we all consume data in very different ways. Data is a broad and overused term, and therefore we need to think of a ‘taxonomy of data’ in order to understand a little more about what makes data important to us. In the digital world there is a whole movement advocating for data to be opened up for anyone to use, but there is still little understanding of how this data will be used and how it can transform society. Talking about her own work, Julia explained her interest in using live data – data that comes from biological systems – to explore new ways in which we can connect with our environment beyond sensory perception. In one of her projects, The Lake, Julia tagged 16 fish from different species in a lake containing a population of circa 3,000 other fish. The idea of the project was to track the movements of the different schools by translating them into visual and acoustic data. The result was a complicated network of sounds and images that created different patterns by showing the levels of activity between fish species at different times of the day. Over a period of six weeks the project generated more than 5 million data points, producing interesting colour patterns and sound compositions.


The Lake, by Julia Freeman © 

The last speaker of the evening, Kevin Walker, started his presentation by raising a controversial point in arguing that we are living in an age of digital data terror. Digital data is a buzzword, and most of us who deal with it rarely question what sources produce the data, what the data mean, or even what sort of stories lie behind them. The role of the artist, in this context, is to interrogate technology and the data it produces. When experimenting with digital data, artists often arrive at unusual and surprising results, transforming information into experiences through design. Kevin illustrated his ideas by presenting some of the work done by his students at the Royal College of Art, who are transforming data into perceptual experiences, normally by representing this data through sounds and graphic images. Students enrolled in the Information Experience Design programme run by the RCA are encouraged to integrate digital data from various sources into visual displays that translate the data into meaningful information for the audience, in the same way described by the other speakers in the evening. These works emphasize digital data as a re-usable source that can be recycled and transformed into art. The question for the future, as Kevin points out, is how artists will move from working with digital data to dealing with quantum data. This question remains open, so we should watch this space…

The audience participated eagerly in the discussions by posing interesting questions to the panel, many of them around the interactive nature of contemporary art in the digital environment. As explained in the presentations, art can add an essential meaning to real time data by turning it into visual and acoustic representations for the audience. As this data changes constantly, so does the work of art. This also suggests another interesting point, which relates to the difficult task of preserving these works for future generations. Since transformation in real time permeates the aesthetic concept of digital data in contemporary artistic expression, it is probably a good time for us to rethink our concept of preservation in a world of constant change.

You can watch the event here and below. Special thanks to Susan Ferreira for recording the video!

Digital Conversations #5: Digital data and artistic expression from Aquiles Baryner on Vimeo.


By Aquiles Alencar-Brayner

Curator, Digital Research

26 September 2014

Applying Forensics to Preserving the Past: Current Activities and Future Possibilities


First Digital Lives Research Workshop 2014 at the British Library



With more and more libraries, archives and museums manifestly adopting forensic approaches and tools for handling and processing born digital objects both in the UK and overseas it seemed a good time to take stock. Archivists and curators were invited (via professional email listservs) to submit a short paper for an inclusive and interactive workshop stretching over two days in London. 

Institutions are applying digital forensics across the entire lifecycle, from appraisal through to content analysis, and have begun to establish workflows that embrace forensic techniques such as the use of write blockers for the creation of disk images, the extraction of metadata, and the searching, filtering and interpreting of digital data, notably the appropriate management of sensitive information.
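To make the metadata-extraction step concrete, here is a minimal, stdlib-only sketch (an illustration, not the workflow of any particular institution named above): it records the kind of preservation metadata such a workflow typically captures for each file, namely size, last-modified timestamp and a cryptographic hash.

```python
import hashlib
import os
from datetime import datetime, timezone

def file_manifest_entry(path):
    """Record basic preservation metadata for one file:
    size, last-modified timestamp, and a cryptographic hash."""
    stat = os.stat(path)
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            sha256.update(chunk)
    return {
        "path": path,
        "size_bytes": stat.st_size,
        "modified_utc": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "sha256": sha256.hexdigest(),
    }
```

In a real workflow this would run against files extracted from a write-blocked disk image rather than against the live filesystem.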

There are two sides to digital forensics, for it begins with the protection of digital evidence and concludes with the retrospective analysis of past events and objects. Papers reflecting both aspects were submitted for the workshop (download DLRW 2014 Outline).

The workshop provided participants with opportunities to report on current activities, highlight gaps, constraints and possibilities, and to discuss and agree collective steps and actions.




As the following list demonstrates, delegates came from a diverse range of institutions: universities, libraries, galleries and archives, and the private sector.

Matthew Addis, Arkivum

Fran Baker, John Rylands Library, University of Manchester 

Thom Carter, London School of Economics Library

Dianne Dietrich, Cornell University Library

Rachel Foss, British Library

Claus Jensen, Royal Library of Denmark and Copenhagen University Library

Jeremy Leighton John, British Library

Svenja Kunze, Bodleian Library, University of Oxford

John Langdon, Tate Gallery

Cal Lee, University of North Carolina at Chapel Hill

Caroline Martin, John Rylands Library, University of Manchester (contributor to paper)

Helen Melody, British Library

Stephen Rigden, National Library of Scotland

Elinor Robinson, London School of Economics Library

Susan Thomas, Bodleian Library, University of Oxford 

Dorothy Waugh, Emory University




I gave an introduction to the original Digital Lives Research project and a brief overview of the ensuing internal projects at the British Library (Personal Digital Manuscripts and Personal Digital Archives), while Aquiles Alencar-Brayner gave an introduction to Digital Scholarship at the British Library, including the award-winning BL Labs project.

Short talks presented overviews of current activities at the National Library of Scotland, the University of Manchester and the London School of Economics, and of the establishment of forensic and digital archiving at these institutions, including the value of a secure and dedicated workspace, the use of a forensic tool for examining large numbers of emails, the integration of forensic techniques within existing working environments and practices, and the importance of tailored training.

Other talks were directed at specific applications of forensic tools in the preservation of complex digital objects in the Rose Goldsen Archive of New Media at Cornell University Library, the capture of computer games at the Royal Library of Denmark, and the challenges of capturing the floppy disks of poet and author Lucille Clifton at Emory University, these media being derived from a Magnavox Videowriter.




My colleagues Rachel Foss and Helen Melody and I presented a paper on the Hanif Kureishi Archive, a collection of paper and digital materials, recently acquired by the British Library’s literary curators: specifically, outlining the use of digital forensics for appraisal and textual analysis.  

Prior to acquisition Rachel and I previewed the archive using fuzzy hashing (a technique for quickly identifying similar files). 
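Fuzzy hashing can be illustrated in miniature. The toy sketch below is a deliberate simplification, not how ssdeep works internally: it hashes fixed-size chunks of each text and scores the overlap between the two chunk-signature sets, whereas real fuzzy-hashing tools choose chunk boundaries from the content itself, which makes the comparison robust to insertions and deletions. The overlap-scoring idea, however, is the same.

```python
import hashlib

def chunk_signatures(text, chunk_size=16):
    """Hash fixed-size chunks of the input. Real fuzzy-hashing tools
    such as ssdeep use content-defined (context-triggered) boundaries
    rather than fixed offsets, making them robust to insertions."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return {hashlib.md5(c.encode()).hexdigest()[:8] for c in chunks}

def similarity(a, b):
    """Percentage overlap (0-100) between the two signature sets."""
    sigs_a, sigs_b = chunk_signatures(a), chunk_signatures(b)
    if not sigs_a or not sigs_b:
        return 0
    return round(100 * len(sigs_a & sigs_b) / len(sigs_a | sigs_b))
```

Identical texts score 100, unrelated texts score near 0, and near-duplicate drafts fall somewhere in between, which is what makes the technique useful for a quick preview of an archive.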



After the archive was obtained and forensically captured, metadata were extracted from the digital objects and made available along with curatorial versions of the text documents, and Helen catalogued them using the British Library’s Integrated Archives and Manuscripts System.



One of the most exciting aspects of the archive is a set of 53 drafts of Hanif Kureishi’s novel Something To Tell You, which Rachel, Helen and I decided to explore as an example for the workshop. 




Figure 1. Logical file size plotted against last modified date: an editing history


We used the sdhash tool (produced by Vassil Roussev of the University of New Orleans and incorporated within the BitCurator framework). Like the ssdeep fuzzy hashing tool (which has been incorporated into Forensic Toolkit, FTK), it identifies similarities among files but uses a distinct approach.


With BitCurator it is possible to direct sdhash at a set of files, asking the tool first to create the similarity digests and then to make pairwise comparisons across the digests for all files, with each pair of files being assigned a similarity score.
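The all-pairs structure of that comparison can be sketched as follows. Here difflib's sequence matching stands in for sdhash's similarity digests: this illustrates the pairwise design, not a reimplementation of sdhash itself.

```python
import difflib
from itertools import combinations

def pairwise_similarity(drafts):
    """Compare every pair of drafts and return a score (0-100) per pair.
    difflib's ratio is a stand-in for sdhash's digest comparison;
    the all-pairs structure is the same."""
    scores = {}
    for (name_a, text_a), (name_b, text_b) in combinations(drafts.items(), 2):
        ratio = difflib.SequenceMatcher(None, text_a, text_b).ratio()
        scores[(name_a, name_b)] = round(100 * ratio)
    return scores
```

Note that for n drafts this produces n(n-1)/2 comparisons, so 53 drafts of Something To Tell You yield 1,378 pairs.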




Figure 2. Similarity score (sdhash) plotted against absolute difference in indicated dates (days) between files (each point represents a pair of draft files): in general, it appears, the greater the number of days between the files of a pair, the lower the similarity score


This is a preliminary analysis, and readers of this blog entry who are familiar with statistical methods may recognise that it might be better to use partial regression or a similar statistical approach. A further small point: as Dr Roussev has emphasised, a 100% similarity score does not mean that the files are identical; cryptographic hashes can serve this purpose and are to be incorporated in future versions of the sdhash tool, which is still under active development.
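For readers who would like to quantify the trend visible in a plot like Figure 2 before reaching for partial regression, a simple correlation coefficient is a reasonable first step. The sketch below uses invented, purely illustrative numbers, not the Kureishi data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

With toy inputs such as day gaps [1, 5, 30, 90, 200] against similarity scores [95, 88, 60, 35, 10], pearson returns a strongly negative coefficient (about -0.95), matching the downward trend the figure suggests.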


Following the more formal talks we began an open discussion with the aim of identifying some priority topics, and subsequently we divided into three groups to address metadata, access and sensitivity respectively, concluding the first day. On the second day, we focused the conversation further, dividing into two groups to address cataloguing and metadata on the one hand, and tools and workflows on the other.

Steps towards specific conclusions and recommended actions were made in preparation for publication and dissemination. 

The desire to continue and extend the collaboration was strongly expressed, and fittingly Cal Lee concluded the workshop by updating us on developments of the BitCurator platform and the launch of the BitCurator Consortium, an important invitation for institutions to participate and for individuals to collaborate. 

BitCurator is going from strength to strength: receiving an extension of the project, formally launching the BitCurator Consortium, and releasing Version 1.0 of the BitCurator software.  


Many congratulations to Fran and Caroline on their email project becoming a finalist for the Digital Preservation Awards 2014: the University of Manchester Library’s Carcanet Press Archive project, which, among many things, explored the use of the forensic tool Email Examiner along with Aid4Mail (which, incidentally, has a forensic version).




The workshop was jointly organised by me, Cal Lee (University of North Carolina at Chapel Hill) and Susan Thomas (Bodleian Library, University of Oxford).  

Very many thanks to the delegates for all of their participation over the two days. 

Jeremy Leighton John, Curator of eMSS 


15 September 2014

Finding Jokes - The Victorian Meme Machine


Posted on behalf of Bob Nicholson.

The Victorian Meme Machine is a collaboration between the British Library Labs and Dr Bob Nicholson (Edge Hill University). The project will create an extensive database of Victorian jokes and then experiment with ways to recirculate them over social media. For an introduction to the project, take a look at this blog post or this video presentation.

Stage One: Finding Jokes

Whenever I tell people that I’m working with the British Library to develop an archive of nineteenth-century jokes, they often look a bit confused. “I didn’t think the Victorians had a sense of humour”, somebody told me recently. This is a common misconception. We’re all used to thinking of the Victorians as dour and humourless; as a people who were, famously, ‘not amused’. But this couldn’t be further from the truth. In fact, jokes circulated at all levels of Victorian culture. While most of them have now been lost to history, a significant number have survived in the pages of books, periodicals, newspapers, playbills, adverts, diaries, songbooks, and other pieces of printed ephemera. There are probably millions of Victorian jokes sitting in libraries and archives just waiting to be rediscovered – the challenge lies in finding them.   

In truth, we don’t know how many Victorian gags have been preserved in the British Library’s digital collections. Type the word ‘jokes’ into the British Newspaper Archive or the JISC Historical Texts collection and you’ll find a handful of them fairly quickly. But this is just the tip of the iceberg. There are many more jests hidden deeper in these archives. Unfortunately, they aren’t easy to uncover. Some appear under peculiar titles, others are scattered around as unmarked column fillers, and many have aged so poorly that they no longer look like jokes at all. Figuring out an effective way to find and isolate these scattered fragments of Victorian humour is one of the main aims of our project. Here’s how we’re approaching it.

Firstly, we’ve decided to focus our attention on two main sources: books and newspapers. While it’s certainly possible to find jokes elsewhere, these sources provide the largest concentrations of material. A dedicated joke book, such as this Book of Humour, Wit and Wisdom, contains hundreds of viable jokes in a single package. Similarly, many Victorian newspapers carried weekly joke columns containing around 30 gags at a time – over the course of a year, a regularly printed column yields more than 1,500 jests. If we can develop an efficient way to extract jokes from these texts then we’ll have a good chance of meeting our target of 1 million gags.


Our initial searches have focused on two digital collections:

1) The 19th Century British Library Newspapers Database.

2) A collection of nineteenth-century books digitised by Microsoft.

In order to interrogate these databases we’ve compiled a continually-expanding list of search terms. Obvious keywords like ‘jokes’ and ‘jests’ have proven to be effective, but we’ve also found material using words like ‘quips’, ‘cranks’, ‘wit’, ‘fun’, ‘jingles’, ‘humour’, ‘laugh’, ‘comic’, ‘snaps’, and ‘siftings’. However, while these general search terms are useful, they don’t catch everything. Consider these peculiarly-named columns from the Hampshire Telegraph:


At first glance, they look like recipes for buckwheat cakes – in fact, they’re columns of imported American jokes named after what was evidently considered to be a characteristically Yankee delicacy. I would never have found these columns using conventional keyword searches. Uncovering material like this is much more laborious, and requires us to manually look for peculiarly-named books and joke columns.
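The core keyword pass over column titles can be sketched like this (a hypothetical illustration, not the project's actual code). Matching keywords as word prefixes, rather than whole words, lets 'laugh' also catch 'laughter' and 'laughable', at the cost of some false positives.

```python
import re

# Keywords from the searches described above.
KEYWORDS = ["jokes", "jests", "quips", "cranks", "wit", "fun",
            "jingles", "humour", "laugh", "comic", "snaps", "siftings"]

# Leading word boundary only, so each keyword matches as a prefix.
# (This brings false positives: 'wit' also matches 'with'.)
PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")", re.IGNORECASE)

def looks_like_joke_column(title):
    """True if a column title contains any of the humour keywords."""
    return bool(PATTERN.search(title))
```

A pass like this finds the obvious titles; the peculiarly-named columns discussed above would still slip through.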

In the case of newspapers, this requires a bit of educated guesswork. Most joke columns appeared in popular weekly papers, or in the weekend editions of mass-market dailies. So, weighty morning broadsheets like the London Times are unlikely to yield many gags. Similarly, while the placement of joke columns varied from paper to paper (and sometimes from issue to issue), they were typically placed at the back of the paper alongside children’s columns, fashion advice, recipes, and other miscellaneous tit-bits of entertainment. Finally, once a newspaper has been proven to contain one set of joke columns, the likelihood is that more will be found under other names. For example, initial keyword searches seem to suggest that the Newcastle Weekly Courant discontinued its long-running ‘American Humour’ column in 1888. In fact, the column was simply renamed ‘Yankee Snacks’ and continued to appear under this title for another 8 years.

Tracking a single change of identity like this is fairly straightforward; once the new title has been identified we simply need to add it to our list of search terms. Unfortunately, the editorial whims of some newspapers are harder to follow. For example, the Hampshire Telegraph often scattered multiple joke columns throughout a single issue. To make things even more complicated, they tended to rename and reposition these columns every couple of weeks. Here’s a sample of the paper’s American humour columns, all drawn from the first 6 months of 1892:

For papers like this, the only option is to locate joke columns manually, one at a time. In other words, while our initial set of core keywords should enable us to find and extract thousands of joke columns fairly quickly, more nuanced (and more laborious) methods will be required in order to get the rest.

It’s important to stress that jokes were not always printed in organised collections. Some newspapers mixed humour with other pieces of entertaining miscellany under titles such as ‘Varieties’ or ‘Our Carpet Bag’. The same is true of books, which often combined jokes with short stories, comic songs, and material for parlour games. While it’s fairly easy to find these collections, recognising and filtering out the jokes is more problematic. As our project develops, we’d like to experiment with some kind of joke-detection tool that picks out content with similar formatting and linguistic characteristics to the jokes we’ve already found. For example, conversational jokes usually have capitalised names (or pronouns) followed by a colon and, in some cases, include a descriptive phrase enclosed in brackets. So, if a text includes strings of characters like “Jack (…):” or “She (…):” then there’s a good chance that it might be a joke. Similarly, many jokes begin with a capitalised title followed by a full-stop and a hyphen, and end with an italicised attribution. Here’s a characteristic example of all three trends in action:


Unfortunately, conventional search interfaces aren’t designed to recognise nuances in punctuation, so we’ll need to build something ourselves. For now, we’ve chosen to focus our efforts on harvesting the low-hanging fruit found in clearly defined collections of jokes.
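As a sketch of what such a tool might look for (hypothetical patterns, not the project's eventual implementation), the punctuation cues described above translate naturally into regular expressions:

```python
import re

# 1. Conversational jokes: a capitalised name or pronoun, an optional
#    bracketed aside, then a colon, e.g. 'Jack (smiling):' or 'She:'.
CONVERSATIONAL = re.compile(r"\b[A-Z][a-z]+\s*(?:\([^)]*\))?\s*:")

# 2. A capitalised title ending with a full stop and a hyphen,
#    e.g. 'A GOOD ONE.-'.
TITLED = re.compile(r"^[A-Z][A-Za-z '’]+\.\s*-")

def might_be_joke(text):
    """Crude detector based on the punctuation cues above."""
    return bool(CONVERSATIONAL.search(text) or TITLED.match(text))
```

Real OCR'd text would add noise (stray characters, mis-read punctuation), so patterns like these would be one signal among several rather than a definitive test.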

The project is still in the pilot stage, but we’ve already identified the locations of more than 100,000 jokes. This is more than enough for our current purposes, but I hope we’ll be able to push onwards towards a million as the project expands. The most effective way to do this may well be to harness the power of crowdsourcing and invite users of the database to help us uncover new sources. It’s clear from our initial efforts that a fully-automated approach won’t be effective. Finding and extracting large quantities of jokes – or, indeed, any specific type of content – from among the millions of pages of books and newspapers held in the library’s collection requires a combination of computer-based searching and human intervention. If we can bring more people on board we’ll be able to find and process the jokes much faster.

Finding gags is just the first step. In the next blog post I’ll explain how we’re extracting joke columns from the library’s digital collections, importing them into our own database, and transcribing their contents. Stay tuned!