Digital scholarship blog


02 April 2014

Unconferencing; or a digital scholarship training experiment

The British Library's Digital Scholarship Training programme aims to provide library colleagues with the skills and knowledge to best exploit the digital transformations taking place around us, both in and outside the research community.

To close the third semester of this programme we in Digital Research decided to embark on something of an experiment: to transform our one-day 'What is Digital Scholarship?' course (one of the sixteen one-day courses we offer) into a staff-only unconference on Digital Scholarship and Working Innovatively with Digital Collections.

For those out of the unconference loop, an 'unconference' brings together delegates under a particular theme, but the schedule for the day is entirely created by the attendees. Anyone can propose a session beforehand or on the morning of the event. The day begins with all who have proposed sessions having an opportunity to briefly pitch their ideas. A vote is then cast (at our event each attendee had three votes, cast as ticks/crosses/marks placed next to session names on a flip-chart, so relatively anonymous!) and the sessions with the most votes form the final schedule for the day (as we couldn't guarantee that there would be room for every proposed session, we asked colleagues not to over-prepare!). Pitchers then act as facilitators for their sessions. Not all participants are required to propose a session - most, in fact, come along for the ride!


As digital scholarship and innovation are happening across the British Library, we wanted to give colleagues the opportunity to share their interests and skills with others. We suggested that sessions could be on anything related to digital scholarship and innovation with digital cultural heritage collections in the broadest sense, and to avoid the event amounting to little more than a series of talks we asked colleagues to fit their proposals into one or more of the following categories:

  • Talk ... such as a presentation on a digital project.
  • Make ... such as a session where attendees collaboratively build or work on something, like tagging or geo-referencing a collection.
  • Teach ... such as a session where you show a group of people how to do something, such as how to update Wikipedia articles.
  • Play ... such as a discussion aimed at generating fresh, creative ideas for innovating with our digital collections or services.

We then put together a skeleton schedule (Pitching and Voting session 10-11.15 - First sessions 11.30-12.30 - Break 12.30-13.30 - Second sessions 13.30-14.30 - Third sessions 14.45-15.45 - Wrap-up, discussion & reflection 16.00-16.30) and put out a call for contributions via various internal channels.

On the day, every proposed session passed a threshold of interest and we hosted ten sessions across three rooms. These ranged from a discussion of open licensing, an introduction to editing Wikipedia, a talk about Chinese social media (who knew our one million Flickr images are blocked by the Great Firewall of China?), and a workshop on creative reuses of Europeana content, to an update on our web archiving and associated access activities, an informal survey on how we might engage with local history communities, and a lively session around what access to our digital content should and could look like in an ideal world.


As should now be clear, having no pre-set schedule did not equate to little organisation, let alone anarchy. Rather, a carefully constructed framework needed to be built around the day to ensure everything ran smoothly. And as it turned out, the event was as creative, provocative, and fun as we had hoped, as well as being enormously productive. In particular, what emerged was clear feedback on how the Digital Research team can develop our future training provision in line with staff needs, including not only a sense that colleagues valued a varied and creative programme but also ideas about how best to introduce colleagues to digital scholarship. And so the unconference will return, but likely as an external event embedded within our otherwise staff-only programme. A few eager twitterers have expressed an interest in this already, but if you'd like to collaborate with us on an unconference around 'Digital Scholarship and Working Innovatively with Digital Collections' (working title) sometime later in 2014, then get in touch via email ([email protected]) or Twitter (@j_w_baker or @ndalyrose). We'd love to hear from you.

James Baker

Curator, Digital Research

@j_w_baker

27 March 2014

Tracking Public Domain Re-use in the Wild

We folks over at #bldigital are excited to be partnering with the Technology Strategy Board and IC tomorrow on their next Digital Innovation Contest. £25K is up for grabs to encourage digital innovation in data.

Our challenge, should you choose to accept it, is to encourage and establish the necessary feedback loop for tracking and measuring the use and impact of the public domain content we make available online.

The Library seeks to enrich the cultural life of the nation and stimulate economic and social growth, and the release of one million images (and counting) into the public domain and on to Flickr Commons is one way in which we hope to fulfil that mission.

 

But what happens when this content is released into the wild? Once online, we have little way of following that content as it is re-used, which makes it difficult to measure the creative and economic benefits of having done so.  

At the moment we try to capture innovative re-use as best we can manually, primarily by scouring social media channels for mentions, but this is neither sustainable nor scalable, and we know there is much inspired activity we are missing.

Colouring-In Pages for Children by Zoe Toft

What we need is an innovative solution for enabling the sharing of this content on platforms which are popular but outside of our control, while retaining the ability to see how it has eventually been remixed and reused.

A formidable challenge we know, and one which is shared by anyone putting content online today. Think you might be able to crack it? Visit IC Tomorrow for more details on the call. Closing date is noon on May 7, 2014.

@ndalyrose

07 February 2014

The Metadata Quest

Posted on behalf of Sara Wingate Gray, originally posted here: http://artefacto.org.uk/content/metadata-quest-part-1

Recently, we've been involved in an exciting new project, which comes out of some exploratory work we produced during the British Library Labs May 2013 hack event, part of their inaugural 2013 digital collections competition. Here at artefacto, we were particularly excited when BL Labs launched in March last year, not least because we'd been following the pioneering digital and creative libraryings of Harvard Library Lab for several years, alongside the more recent developments of the Digital Public Library of America and the wonderful work that the New York Public Library has been getting up to (check out their historical menus project for a start). What's exciting about all these developments (and there are so many we could list in this vein: Europeana ... in fact, have a list, courtesy of The Open Knowledge Foundation) is the opening up of public access to this "digital reserve of knowledge", and the potential it brings, in the case of BL Labs, for instance, "to create new narratives from the British Library’s vast incredible digital collections from 19th Century books to archived websites and wildlife sounds to manuscripts to name but a few examples."1

For us, what's also intriguing, in this new world collision and collection of objects, and people, in digital space, is how we might go about piecing together, jigsaw-like, the underlying narratives which sit within: how do we help reveal and "unlock" each object's own story?

The May 2013 hackday event at BL Labs gave us the opportunity (and the excuse) to explore this question: with access to their 68,000 digitised volumes of text (from the 19th Century Books collection), sounds (e.g. the archive of Resonance FM, the Survey of English Dialects), Ordnance Survey maps, and much, much more, it promised a veritable feast of digital content and, importantly, metadata to get our hands on.

That metadata is finally a hot topic of discussion worldwide is not only music to the ears of all librarians out there (well, ok, not all of you guys, but you're the groundswell folks!) but it also means we don't have to give you a definition. Except we probably do, since all this consorting with the NSA is frankly giving metadata a bad rap right now (and no, Guardian newspaper, metadata is not just "information generated as you use technology"). Wikipedia provides the very vaguely straightforward term "data about data" as a definition, while Zeng and Qin (2008, p.7) note that "[b]roadly speaking, metadata encapsulates the information that describes any document or object in both digital and traditional formats."2 In the context of the British Library's digital content, for instance, this could mean information about a painting's date and artist, a map's geographic range, or a sound's physical placing (to name just a few instances, or rather, metadata elements; take a look at The Library of Congress's sample of metadata for an 1864 letter from Alexander Melville Bell to Alexander Graham Bell if you really want to explore metadata in more detail).
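To make this concrete, here is a toy metadata record for an imaginary digitised map, expressed as simple element/value pairs in Python. The values are invented for illustration, not drawn from any real catalogue entry:

```python
# A toy illustration only: metadata for an imaginary digitised map,
# expressed as simple element/value pairs. All values are invented.
map_metadata = {
    "title": "Plan of the Town of Dundee",  # descriptive metadata
    "creator": "Unknown surveyor",
    "date": "1888",
    "coverage": "Dundee, Scotland",         # a map's geographic range
    "format": "image/tiff",                 # technical metadata
    "rights": "Public Domain",
}

for element, value in map_metadata.items():
    print(f"{element}: {value}")
```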

Essentially, what excites us about metadata is that, by harnessing it in different ways, new surfaces and territories can suddenly open up in a digital object's narrative; by making explicit, textually and visually, an object's creation space, or time, new threads of connections are discovered and yarns newly spun.

The result of our brief two days at the BL Labs event was a quick build of an experimental version of our imagined platform, where digital content could sit waiting to be explored in these ways, through people navigating and thinking about these facets. Although we didn't ultimately win the Labs competition, we got some great feedback which suggested we should continue with our project and idea. We named it Curatorial and went on our merry way, content in our imaginings. It was great, therefore, to find ourselves some months later participating in the Data Tales project, which gave us the opportunity to develop the platform further: under the AHRC Digital Transformations network 'Data – Asset – Method: Harnessing the Infinite Archive', we've been able to spend time re-building and imagining further what is possible, and what stories can be told, when metadata, objects and people digitally collide. Partnering with the Horizon Digital Economy Research Institute based at the University of Nottingham, Loughborough University, and the British Library for the Data Tales project has meant a great team experiment, and we presented the first results of our work together at a workshop at the British Library (24 January 2014).

One of the first ports of call when approaching this project was: how can we get our hands on the metadata we want? What types of collections (and their 'owners' or 'content holders'?) are out there? Our previous BL Labs experience grappling with the vast range of data types available from the British Library was really helpful, not least for priming us for detective work (what format is that geolocation data in exactly?), and so our sleuthing, and structuring, commenced.

"Why does a man need to tell stories to others and himself? It is a way by which the mind uses fantasy to structure the chaos of the original experience. Complex and unpredictable, the vivid experience always lacks what fiction can provide: a closed time, a hierarchy of events, the value of people, effects and causes, the connections under the actions."3

This is where the quest for metadata begins.

Why not come and see us speak at Making the Most of Metadata, on Wednesday 12 February 2014, at the British Library?

TBC.

1. http://britishlibrary.typepad.co.uk/digital-scholarship/2013/03/bl-labs-launch-event.html

2. Zeng, M. L. and Qin, J. 2008. Metadata. New York: Neal Schuman Publishers.

3. Vargas Llosa, M. 1997. The Truth of Lies. In Making Waves. New York: Farrar, Straus and Giroux.

12 December 2013

A million first steps

We have released over a million images onto Flickr Commons for anyone to use, remix and repurpose. These images were taken from the pages of 17th, 18th and 19th century books digitised by Microsoft who then generously gifted the scanned images to us, allowing us to release them back into the Public Domain.

The images themselves cover a startling mix of subjects: there are maps, geological diagrams, beautiful illustrations, comical satire, illuminated and decorative letters, colourful illustrations, landscapes, wall-paintings and so much more that even we are not aware of.

Which brings me to the point of this release. We are looking for new, inventive ways to navigate, find and display these 'unseen illustrations'. The images were plucked from the pages as part of the 'Mechanical Curator', a creation of the British Library Labs project. Each image is individually addressable online, and Flickr provides an API to access it and the image's associated description.
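For the technically curious, here is a minimal sketch (in Python, using the requests library) of pulling one image's title and description through Flickr's standard API. The API key is a placeholder you would register for yourself, and the photo id is just an illustrative example:

```python
# Minimal sketch: fetch a single image's title and description from
# the Flickr API. YOUR_API_KEY is a placeholder; the photo id is an
# illustrative example of one image from the release.
import requests

resp = requests.get(
    "https://api.flickr.com/services/rest/",
    params={
        "method": "flickr.photos.getInfo",
        "api_key": "YOUR_API_KEY",  # obtain from flickr.com/services/api
        "photo_id": "11075039705",
        "format": "json",
        "nojsoncallback": 1,
    },
)
photo = resp.json()["photo"]
print(photo["title"]["_content"])
print(photo["description"]["_content"])
```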

We may know which book, volume and page an image was drawn from, but we know nothing about a given image. Consider the image below. The title of the work may suggest the thematic subject matter of any illustrations in the book, but it doesn't suggest how colourful and arresting these images are.

(Aside from any educated guesses we might make based on the subject matter of the book of course.)


See more from this book: "Historia de las Indias de Nueva-España y islas de Tierra Firme..." (1867)

Next steps

We plan to launch a crowdsourcing application at the beginning of next year, to help describe what the images portray. Our intention is to use this data to train automated classifiers that will run against the whole of the content. The data from this will be as openly licensed as is sensible (given the nature of crowdsourcing) and the code, as always, will be under an open licence.
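We haven't built this yet, but as a very rough sketch of the kind of pipeline that 'training automated classifiers' implies: the code below assumes a hypothetical labels.csv of crowdsourced descriptions, uses crude colour-histogram features via Pillow, and stands in scikit-learn's logistic regression for whatever classifier is eventually chosen. None of this is the Library's actual approach:

```python
# Sketch only: train a classifier on crowdsourced image labels.
# Assumes a hypothetical labels.csv with columns: filename,label
import csv

import numpy as np
from PIL import Image
from sklearn.linear_model import LogisticRegression

def colour_histogram(path, bins=8):
    """Crude feature vector: a joint RGB colour histogram."""
    img = Image.open(path).convert("RGB").resize((128, 128))
    pixels = np.asarray(img).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.flatten() / pixels.shape[0]

X, y = [], []
with open("labels.csv", newline="") as f:
    for row in csv.DictReader(f):
        X.append(colour_histogram(row["filename"]))
        y.append(row["label"])  # e.g. "map", "portrait", "decoration"

clf = LogisticRegression(max_iter=1000).fit(np.array(X), y)
# The trained model can then be run against the unlabelled remainder:
print(clf.predict([colour_histogram("unseen_image.jpg")]))
```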

The manifests of images, with descriptions of the works that they were taken from, are available on github and are also released under a public-domain 'licence'. This set of metadata being on github should indicate that we fully intend people to work with it, to adapt it, and to push back improvements that should help others work with this release. 

There are very few datasets of this nature free for any use, and by putting this one online we hope to stimulate and support research concerning printed illustrations, maps and other material not currently studied. And this is only a beginning, given that the images are derived from just 65,000 volumes while the Library holds many millions of items.

If you need help or would like to collaborate with us, please contact us via email or Twitter (or me personally, on any technical aspects).

The Initial Layout

The images have been tagged to aid browsing and to provide new views on the works themselves. They are tagged by publication year (eg 1764, 1864, 1884), by book (eg 003927270000149253), by author (eg Charles Dickens) and by other means.
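As a rough illustration of browsing by tag, the sketch below queries Flickr's standard search method for images carrying a given year tag. The API key and account id are placeholders, not real credentials:

```python
# Sketch: list images carrying a given tag (e.g. a publication year)
# from the collection's Flickr account. Placeholders to fill in:
# YOUR_API_KEY, and the account's Flickr NSID.
import requests

resp = requests.get(
    "https://api.flickr.com/services/rest/",
    params={
        "method": "flickr.photos.search",
        "api_key": "YOUR_API_KEY",
        "user_id": "ACCOUNT_NSID",  # the uploading account's id
        "tags": "1864",             # tag to browse by, e.g. a year
        "per_page": 20,
        "format": "json",
        "nojsoncallback": 1,
    },
)
for photo in resp.json()["photos"]["photo"]:
    print(photo["id"], photo["title"])
```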

This structure is helpful but we can do better! We want to collaborate with researchers and anyone else with a good idea for how to mark up, classify and explore this set, with the aim of improving the data and adding to the tagging. We are looking to crowdsource information about what is depicted in the images themselves, as well as using analytical methods to interpret them as a whole.

We are very interested to hear what ideas and projects people use these images for and we would ideally like to collaborate with those who have been inspired to explore them.

Finally, while they have been released into the public domain, we would like to direct you to a post by Dan Cohen titled "CC0 (+BY)". There is no obligation for you to attribute anything to us, but we'd appreciate it. The dataset will develop and improve over time, after all!

Some examples


"Manners and Customs of the ancient Egyptians, ... Illustrated by drawings, etc. 3 vol. (A second series of the Manners and Customs of the Ancient Egyptians. 3 vol.)" by WILKINSON, John Gardner - Sir


"The United States of America. A study of the American Commonwealth, its natural resources, people, industries, manufactures, commerce, and its work in literature, science, education and self-government. [By various authors.] Edited by N. S. Shaler ... With many illustrations" by SHALER, Nathaniel Southgate.


"Comic History of Greece from the earliest times to the death of Alexander the Great ... Illustrated, etc" by SNYDER, Charles M.


"The Coming of Father Christmas" by MANNING, Eliza F.


"The Casquet of Literature, being a selection of prose and poetry from the works of the most admired authors. Edited with biographical and literary notes by C. Gibbon ... and M. E. Christie. Illustrated from original drawings by eminent artists" by GIBBON, Charles - Esq., and CHRISTIE (Mary Elizabeth) Miss

25 November 2013

Mixing the Library: Information Interaction & the DJ - Origins

Posted on behalf of Dan Norton - British Library Labs 2013 Competition Winner

Following the completion of my PhD at the University of Dundee, I spent a period of time as Artist in Residence at Hangar Centre for Art and Research, Barcelona. There, I collaborated closely with the Department of Library Science and Documentation at the University of Barcelona to explore the potential value of the DJ’s model of information interaction in the field of Library Science, particularly with the use of digital collections.

It was with this background that I decided to enter the British Library Labs 2013 competition, after participating in an online meeting:

British Library Labs Virtual Event (Google hangout), 17 May 2013

http://www.youtube.com/watch?v=RFt0NvbTFHs

My project idea was to apply the DJ's way of working with music, their model of information interaction (developed as part of my doctoral study entitled: "Mixing the Library: Information Interaction and the DJ”), in a prototype for working with multiple data-representations from the British Library digital collections. This practice-led investigation would be used to describe the interface requirements for collecting, enriching, mixing/linking, and visualizing information from large digital libraries.

The prototype would be built on previous work and would attempt to combine the two essential interface features of the DJ's model: continual visual presence of the collection as a writeable menu system; and a "mixing screen" that allows two (or more) data fragments from the collection to be presented, combined, and linked. These simple interface requirements would be used to build sequences through an archive by linking articles, and later be evaluated and extended as a tool for developing semantically rich data, to be incorporated into the RCUK SerenA project (serena.ac.uk) of which my doctoral study was a part.  
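As a toy illustration of those two interface features expressed as a data model: all the names below are invented for this sketch, not taken from the actual prototype.

```python
# Toy data model for the two interface features described above.
# All names are illustrative, not drawn from the prototype's code.
from dataclasses import dataclass, field

@dataclass
class Fragment:
    identifier: str              # e.g. a catalogue record id
    media_type: str              # "text", "image", "sound", ...
    annotations: list = field(default_factory=list)

@dataclass
class Link:
    left: Fragment
    right: Fragment
    note: str                    # why these fragments belong together

menu: dict = {}                  # the always-visible, writeable menu

def collect(fragment: Fragment) -> None:
    """Add a fragment to the persistent collection menu."""
    menu[fragment.identifier] = fragment

def mix(a: str, b: str, note: str) -> Link:
    """Bring two collected fragments together on the 'mixing screen'."""
    return Link(menu[a], menu[b], note)
```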

The tool would hopefully  operate as a powerful scholarly interface for learning and development in digital collections, for sharing annotated semantically rich data, and for communicative exchange of findings. A screencast demonstrating the interface is available here:

Prototype Mixing Interface

 

http://ablab.org/BLLabs/test/index2.html

It is worth giving some background and explanation as to why I think the DJ’s model of Information Interaction is valuable for working with multiple data types (images, video, sound, text, etc.).

Background

'The significance of the 'information explosion' may lie not in an explosion of quantity per se, but in an incalculably greater combinatorial explosion of unnoticed and unintended logical connections.' (Swanson 1996).

 

Dan Norton DJing: Wakanegra Sonidero and Greenpoint Reggae support Mungos HiFi, Palma, 2012

Creativity in the DJ’s system is apparently simple. It is entirely reliant upon bringing together sequences of material and exploring the possible combinations. This can be done simply, as a sequence of tracks one after another, or in more complicated ways involving overlaying, sampling, and mashups.

The DJ’s creativity enters the system in two fundamental information behaviours: selecting and mixing. With these two behaviours alone, personal expression and intent can enter an activity that always reuses stored content.

Selecting is the principal creative behaviour. It reduces information volume, builds useful groups of related material, and in the live event is done responsively to feedback and personal ideas.

Mixing is the second creative information behaviour. It combines the material and explores the connections between articles. Mixing formulates previously unnoticed connections from within the archive.

A Model of Learning

Learning is intrinsic to the DJ’s model of interaction. Learning and memory develop through retrieval, organisation, classification, and the addition of metadata. The association of human memory with the digital image of the collection (digital memory) creates a system in which the DJ can work in a creative flow with information, moving between idea and information.

The model incorporates retrieval and learning, with creative development and a publication workflow. Newly created texts are directly tested in a field of listeners, and may also be published and released as informational resources. The model is described in the image below.

 

DJ's Model of Information Interaction (Norton, 2013)


The next few blogposts will discuss my experiences of using the British Library’s collections, working with Labs to develop the first functioning iteration of the interface and the future developments of the work.

 

20 November 2013

The georeferencer is back!

Coinciding nicely with GIS Day 2013, the British Library's Lead Curator of Digital Mapping, Kimberly Kowal, has just released online the biggest batch yet of digitised maps needing georeferencing: 2,700 of them!

She explains, "We're asking the public to help 'place' them, i.e. identify their locations by assigning points using modern mapping. It can be a challenge, but it is an opportunity to discover familiar areas as they existed around one hundred years ago."
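Under the hood, georeferencing boils down to fitting a transformation from pixel coordinates to geographic coordinates using the points you assign. Here is a minimal sketch with numpy, using a simple affine fit and made-up control points (the Georeferencer itself is considerably more sophisticated):

```python
# Sketch: fit an affine transform pixel -> lon/lat from control points.
# All coordinates below are made-up placeholders.
import numpy as np

# (pixel_x, pixel_y) points a user has matched to (lon, lat)
pixel = np.array([(120, 80), (900, 95), (130, 640), (880, 660)], float)
geo = np.array([(-3.05, 55.98), (-2.90, 55.98),
                (-3.05, 55.90), (-2.90, 55.90)], float)

# Solve [x, y, 1] @ A = [lon, lat] in the least-squares sense
ones = np.ones((len(pixel), 1))
A, *_ = np.linalg.lstsq(np.hstack([pixel, ones]), geo, rcond=None)

def to_geo(x, y):
    """Map a pixel position on the scanned map to lon/lat."""
    return np.array([x, y, 1.0]) @ A

print(to_geo(500, 370))  # approximate lon/lat of a map pixel
```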

Read more about it over on the British Library Maps & Views Blog.

Or just dive right in and get georeferencing!

 

 

 

08 November 2013

The Sample Generator - Part 1: Origins

Posted on behalf of Pieter Francois.

Imagine being asked to describe the tool you always wanted when you were writing your PhD.

Imagine being asked, without having to worry too much about technical implementations, to make a case for a digital tool that would have:

  • saved you enormous time
  • allowed you to expand drastically the number of sources to study
  • allowed you to ask new and more relevant research questions 

Which digital tool would you choose?
What functionality seems crucial to you but is surprisingly lacking in your research area? 

It was with this frame of mind that I decided to enter the 2013 British Library Labs competition with the idea of creating a Sample Generator, i.e. a tool able to give me an unbiased sample of texts based on my search criteria. Being one of the chosen winners provided me with an opportunity to put together a small team of people from both within and outside the British Library to make it a reality.

When studying the world of nineteenth-century travel for my PhD I used the collections of the British Library extensively. Being able to look for relevant material in roughly 1.8 million records is a researcher's dream. It can also be a curse.


Snapshot of catalogue window with search word "travel"

How did the material I decided to look at fit into the overall holdings? Sure, my catalogue searches did produce plenty of relevant material, but how representative was the material I looked at of the overall nineteenth-century publication landscape? Even when assuming the British Library holdings are as good a proxy as any for the entire nineteenth-century British publication landscape, this is a very difficult question to answer. Historians and literary scholars have designed many clever methodological constructs to tackle such issues of representativity, to tackle potential biases of the studied sources and to deal with gaps in their source material. Yet very few attempts have been made to deal with these issues in a systematic way.

The ever-growing availability of large digital collections has changed the scale of this issue, but it has not changed its nature. For example, the wonderful digital '19th Century books' collection of the British Library provides access to approximately fifty thousand books in digital form, and to enthusiasts of text and sentiment mining, or scholars interested in combining distant and close reading, its potential is phenomenal. However, the impressive size of the collection does not deal with the crucial questions:

How do these books relate to the approximately 1.8 million nineteenth-century records the Library holds?

How do the digitized books of the '19th Century books' collection fit into the overall nineteenth-century publication landscape?

Large numbers can create a false sense of completeness.

The Sample Generator provides researchers with a way to understand more fully the relation between the studied sources and the overall holdings of the British Library. Whereas a traditional title-word search in the British Library Integrated Catalogue generates an often long list of hits, the Sample Generator allows you, with a few additional clicks, to generate structured, unbiased samples from this list. The key innovation is that these samples mimic the rise and fall in popularity of the searched terms over the nineteenth century, as found in the entire British Library holdings for this period.
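To illustrate the core idea (my own sketch in Python, not the project's actual code): given a hypothetical list of catalogue hits with publication years, draw a sample whose year-by-year proportions mirror those of the full result set.

```python
# Sketch of the core idea, not the Sample Generator's real code.
# `hits` is a hypothetical list of (record_id, year) catalogue results.
import random
from collections import defaultdict

def stratified_sample(hits, sample_size, seed=42):
    """Sample records so yearly proportions mirror the full result set."""
    by_year = defaultdict(list)
    for record_id, year in hits:
        by_year[year].append(record_id)

    random.seed(seed)  # a fixed seed keeps a sample reproducible
    sample = []
    for year in sorted(by_year):
        records = by_year[year]
        # each year contributes in proportion to its share of all hits
        quota = round(len(records) / len(hits) * sample_size)
        sample.extend(random.sample(records, min(quota, len(records))))
    return sample  # rounding may leave the total a record or two off
```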

Depending on the amount of research time available it is possible to change the sample size (or, for cross-validation purposes, to create several samples based on the same search criteria). Furthermore, as the Sample Generator not only works with the catalogue data (metadata) of all nineteenth-century books the British Library holds, but also keeps a special focus on the metadata of the digital '19th Century book' collection (see, http://britishlibrary19c.tumblr.com/ for a representative sample), it is possible to create samples of only digitized texts. These samples can then be further queried by using advanced text analysis and data mining tools (e.g. geo-tagging). As all the samples generated by the various searches will be stored with a unique URL, the samples become citable and they can be shared with peers and be more easily used in collaborative research.

Whereas in this phase of the project the Sample Generator has only been tried out on the nineteenth-century holdings of the British Library and on the digital '19th Century book' collection, its application is nearly universal. The Sample Generator can be implemented on any catalogue (or even bibliography) and, if relevant, links can be made to one or more digital collections.

Adding such a link with a digital collection allows users to make a different type of claim. For example, the finding 'I observed trend X in digital collection Y' is replaced by the finding 'I observed trend X in a structured unbiased sample Y which is representative of the entire catalogue/bibliography Z'. This adds an important functionality to the increasing number of large digital collections, as it removes the inherent, yet often poorly documented, biases of the digitization process (although it introduces the curatorial biases of the much larger collections, which are fortunately usually better documented and understood, as generations of scholars have come to terms with them).

Finally, the Sample Generator is a great hypothesis-testing tool. Its use allows scholars to cover a lot of ground fairly quickly by testing a range of hypotheses and ideas on relatively small sample sizes. This allows for a creative, yet structured and well documented, going back and forth between the conceptual drawing board and the data. Whereas such a structured dialogue is fundamental in the natural and social sciences, it is largely lacking in the humanities, where the dialogue between ideas and data has tended to happen in a more haphazard fashion.

The past four months were spent turning this general idea (which at times felt overly ambitious) into reality. We faced several challenges; for example, the catalogue data was incomplete and inconsistent. Furthermore, I firmly believed that it was essential to accompany the tool with some case studies highlighting its transformative potential. Given the amount of labour and the range of skill sets necessary to complete both tasks, the project had to be team-based. Without both the time and intellectual contributions of Mahendra Mahey, Ben O'Steen, Ed Turner and Justin Lane, the Sample Generator would still simply be the digital tool I always wanted to have.

Pieter Francois is one of the winners of the 2013 British Library Labs competition. He works at the Institute of Cognitive and Evolutionary Anthropology, University of Oxford, where he specializes in longitudinal analysis of archaeological and historical data.

The next blogposts of this short series, written by various members of the team, will focus on how to use the Sample Generator, on explaining the technical nuts and bolts at the back end of the tool, and on recounting the experiences of collecting the necessary data to test drive the tool.

30 October 2013

Guess the journal!

Over recent months I’ve been working on-and-off with a collection of metadata relating to articles published since 1995 in journals the Library has categorised under the ‘History’ subject heading: 382,497 rows of data (under CC0) about publication habits in the historical profession, which lend themselves to some interesting analysis and visualisation.

To recap from previous posts on this blog and on another, I started this work by extracting words which frequently occurred within journal article titles. Having filtered out words whose meaning was fuzzy (‘new’, ‘early’, ‘late’, ‘age’) or whose presence was not helpful (‘David’), I was left with this list of topwords (I’ve avoided ‘keywords’, I just don’t like the word at the moment):

africa america archaeology art britain british china chinese cultural culture development empire england europe france historical history identity life making medieval national policy political politics power revolution social society state study women world

Next I created a .csv where each row represented an occurrence of one of these 33 topwords in an article title. This totalled 209,210 rows; and though that is fewer than the original 382,497, some articles were represented more than once, as many titles contained more than one of these words.
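For those who want to reproduce something similar, here is a rough reconstruction of that step in Python. The input articles.csv, with journal, year and title columns, is an assumed format (not my actual data), and the topword list is abridged:

```python
# Rough reconstruction of the step above, not the original code.
# Assumes a hypothetical articles.csv with journal, year, title columns.
import csv

TOPWORDS = {"africa", "america", "archaeology", "art", "britain",
            "history", "power", "women"}  # abridged; the full set has 33

with open("articles.csv", newline="") as src, \
     open("occurrences.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["journal", "year", "topword"])
    for row in csv.DictReader(src):
        for word in row["title"].lower().split():
            word = word.strip(".,:;?!'\"()")
            if word in TOPWORDS:
                # one row per occurrence, so a title containing two
                # topwords yields two rows
                writer.writerow([row["journal"], row["year"], word])
```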

Before we get to the fun bit, there are a number of problems with the data that need pointing out:

  • There are some odd gaps and declines in article volume for some journals around 2005. This isn’t due to actual publication trends, so we are working on why the data isn’t accurate – huge thanks to the Metadata Services team (especially Corine Deliot) for their hard work.
  • The volume of English-language titles smothers the various Italian and – notably thanks to Zeitschrift für Geschichtswissenschaft – German titles, leaving us with very Anglophonic data. I’d like to do some translating, but for now I’ll restrict myself to trends in English language articles.
  • The data isn’t smoothed by articles per journal issue (or articles per journal per year), thus ‘power’ journals are created on sheer volume of output alone (and, as we all should know and should hope to be the mantra of future academic publication, less can be more…).
  • The data includes reviews, though this isn’t necessarily a bad thing as it adds book titles to the list of titles mined (hence why ‘David’ cropped up among the candidate topwords before filtering).
  • Some words have multiple meanings (china) or are ill-suited to simple text mining (art), but then corpus linguists have known this for years.
  • Some journals in the data are not really history journals, but rather politics and current affairs publications with a sprinkling of historical content. Archaeology is similarly problematic, but I’ve left these journals in for now out of a sense of GLAM solidarity.

Despite all of this, I’d like you to play a game of guess the journal from a network graph; a network graph representing data for the 30 highest-ranking English-language History journals in terms of article volume published between 1995 and 2013. On one hand, your doing this will help me validate that my data – and this particular way I’ve chosen to represent it (a force-directed ‘Force Atlas’ graph generated using Gephi) – has some value; Adam Crymble has a nice example of how this can be useful. On the other, it should be a bit of fun.
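If you fancy rebuilding something like this yourself, the sketch below shows one way to assemble such a bimodal journal/topword network with the networkx library before styling it in Gephi. File and column names follow the hypothetical CSV sketched earlier, not my actual data:

```python
# Sketch: build a bimodal (journal <-> topword) graph from the
# occurrences CSV sketched earlier, then export for Gephi.
import csv
from collections import Counter

import networkx as nx

edge_weights = Counter()
with open("occurrences.csv", newline="") as f:
    for row in csv.DictReader(f):
        edge_weights[(row["journal"], row["topword"])] += 1

G = nx.Graph()
for (journal, topword), weight in edge_weights.items():
    G.add_node(journal, kind="journal")
    G.add_node(topword, kind="topword")
    G.add_edge(journal, topword, weight=weight)  # thickness ~ occurrences

nx.write_gexf(G, "history_journals.gexf")  # open in Gephi for Force Atlas
```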

So, onto that long promised fun bit. Knowing the following:

  • That each number on the network represents a journal name,
  • that each word within square brackets is a topword from an article title,
  • that the thickness of the line between the word and the number represents the occurrence of that topword in the numbered journal,
  • and that the colouring represents the group (or modularity) the numbered journal has been assigned to based on the structure of the network;

can you guess which numbers the following journals are represented by? (Or is this whole thing meaningless?)

  • Antiquity
  • English Historical Review
  • International Journal of African Historical Studies
  • International Journal of Maritime History
  • Journal of American History
  • Journal of Asian Studies
  • Journal of Social History
Bimodal Force Atlas graph for History Journal Articles published 1995-2013. For more detail (and with apologies for the fuzzy compression above, you'll probably need it!), download the PNG or SVG version.

To start you off, I’ll gift you that American Historical Review is number 34 – right at the heart of the network, not surprising given its volume of output. I’ll also give you a little derived data to help you make up your mind.

Answers in the comments please!

@j_w_baker