Digital scholarship blog

Enabling innovative research with British Library digital collections


12 December 2013

A million first steps

We have released over a million images onto Flickr Commons for anyone to use, remix and repurpose. These images were taken from the pages of 17th, 18th and 19th century books digitised by Microsoft who then generously gifted the scanned images to us, allowing us to release them back into the Public Domain.

The images themselves cover a startling mix of subjects: there are maps, geological diagrams, beautiful and colourful illustrations, comical satire, illuminated and decorative letters, landscapes, wall-paintings and so much more that even we are not aware of.

Which brings me to the point of this release. We are looking for new, inventive ways to navigate, find and display these 'unseen illustrations'. The images were plucked from the pages as part of the 'Mechanical Curator', a creation of the British Library Labs project. Each image is individually addressable online, and Flickr provides an API to access it and its associated description.

We may know which book, volume and page an image was drawn from, but we know nothing about what a given image actually depicts. Consider the image below. The title of the work may suggest the thematic subject matter of any illustrations in the book, but it doesn't suggest how colourful and arresting these images are.

(Aside from any educated guesses we might make based on the subject matter of the book of course.)


See more from this book: "Historia de las Indias de Nueva-España y islas de Tierra Firme..." (1867)

Next steps

We plan to launch a crowdsourcing application at the beginning of next year, to help describe what the images portray. Our intention is to use this data to train automated classifiers that will run against the whole of the content. The data from this will be as openly licensed as is sensible (given the nature of crowdsourcing) and the code, as always, will be under an open licence.

The manifests of images, with descriptions of the works that they were taken from, are available on GitHub and are also released under a public-domain 'licence'. Hosting this metadata on GitHub should indicate that we fully intend people to work with it, to adapt it, and to push back improvements that will help others work with this release.

There are very few datasets of this nature free for any use, and by putting it online we hope to stimulate and support research concerning printed illustrations, maps and other material not currently studied, especially given that the images are derived from just 65,000 volumes while the library holds many millions of items.

If you need help or would like to collaborate with us, please contact us by email or Twitter (or me personally, on any technical aspects).

The Initial Layout

The images have been tagged to aid browsing and to provide new views on the works themselves. They are tagged by publication year (eg 1764, 1864, 1884), by book (eg 003927270000149253), by author (eg Charles Dickens) and by other means.
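
If you would rather script against the release than browse it, those tags can double as search terms through Flickr's public API. Below is a minimal sketch in Python using the standard flickr.photos.search method; the API key and account NSID are placeholders you would need to fill in yourself, and the requests library is assumed to be installed.

# A minimal sketch of searching the release by tag via the Flickr API.
# API_KEY and USER_ID are placeholders, not real credentials.
import requests

API_KEY = "YOUR_FLICKR_API_KEY"    # obtain your own key from Flickr
USER_ID = "ACCOUNT_NSID"           # the NSID of the account to search within

def search_by_tag(tag, page=1):
    """Return one page of photo records carrying the given tag."""
    params = {
        "method": "flickr.photos.search",
        "api_key": API_KEY,
        "user_id": USER_ID,
        "tags": tag,                    # e.g. a publication year such as "1864"
        "extras": "description,url_m",  # also fetch the description and a medium-size URL
        "format": "json",
        "nojsoncallback": 1,
        "page": page,
    }
    response = requests.get("https://api.flickr.com/services/rest/", params=params)
    response.raise_for_status()
    return response.json()["photos"]["photo"]

for photo in search_by_tag("1864")[:5]:
    print(photo["title"], photo.get("url_m"))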

This structure is helpful but we can do better! We want to collaborate with researchers and anyone else with a good idea for how to mark up, classify and explore this set, with the aim of improving the data and adding to the tagging. We are looking to crowdsource information about what is depicted in the images themselves, as well as using analytical methods to interpret them as a whole.

We are very interested to hear what ideas and projects people use these images for and we would ideally like to collaborate with those who have been inspired to explore them.

Finally, while the images have been released into the public domain, we would like to direct you to a post by Dan Cohen titled "CC0 (+BY)". There is no obligation for you to attribute anything to us, but we'd appreciate it. The dataset will develop and improve over time, after all!

Some examples


"Manners and Customs of the ancient Egyptians, ... Illustrated by drawings, etc. 3 vol. (A second series of the Manners and Customs of the Ancient Egyptians. 3 vol.)" by WILKINSON, John Gardner - Sir


"The United States of America. A study of the American Commonwealth, its natural resources, people, industries, manufactures, commerce, and its work in literature, science, education and self-government. [By various authors.] Edited by N. S. Shaler ... With many illustrations" by SHALER, Nathaniel Southgate.


"Comic History of Greece from the earliest times to the death of Alexander the Great ... Illustrated, etc" by SNYDER, Charles M.


"The Coming of Father Christmas" by MANNING, Eliza F.


"The Casquet of Literature, being a selection of prose and poetry from the works of the most admired authors. Edited with biographical and literary notes by C. Gibbon ... and M. E. Christie. Illustrated from original drawings by eminent artists" by GIBBON, Charles - Esq., and CHRISTIE (Mary Elizabeth) Miss

25 November 2013

Mixing the Library: Information Interaction & the DJ - Origins

Posted on behalf of Dan Norton - British Library Labs 2013 Competition Winner

Following the completion of my PhD at the University of Dundee, I spent a period of time as Artist in Residence at Hangar Centre for Art and Research, Barcelona. There, I collaborated closely with the Department of Library Science and Documentation at the University of Barcelona to explore the potential value of the DJ's model of information interaction in the field of Library Science, particularly with the use of digital collections.

It was with this background that I decided to enter the British Library Labs 2013 competition, after participating in an online meeting:

British Library Labs Virtual Event (Google Hangout), 17 May 2013

http://www.youtube.com/watch?v=RFt0NvbTFHs

My project idea was to apply the DJ's way of working with music, their model of information interaction (developed as part of my doctoral study entitled: "Mixing the Library: Information Interaction and the DJ”), in a prototype for working with multiple data-representations from the British Library digital collections. This practice-led investigation would be used to describe the interface requirements for collecting, enriching, mixing/linking, and visualizing information from large digital libraries.

The prototype would be built on previous work and would attempt to combine the two essential interface features of the DJ's model: continual visual presence of the collection as a writeable menu system; and a "mixing screen" that allows two (or more) data fragments from the collection to be presented, combined, and linked. These simple interface requirements would be used to build sequences through an archive by linking articles, and later be evaluated and extended as a tool for developing semantically rich data, to be incorporated into the RCUK SerenA project (serena.ac.uk) of which my doctoral study was a part.  

The tool would hopefully operate as a powerful scholarly interface for learning and development in digital collections, for sharing annotated, semantically rich data, and for the communicative exchange of findings. A screencast demonstrating the interface is available here:

Prototype Mixing Interface

 

http://ablab.org/BLLabs/test/index2.html

I think it is important to give some background and explain why the DJ's model of information interaction is valuable for working with multiple data types (images, video, sound, text etc.).

Background

'The significance of the 'information explosion' may lie not in an explosion of quantity per se, but in an incalculably greater combinatorial explosion of unnoticed and unintended logical connections.' (Swanson 1996).

 

Dan Norton DJing: Wakanegra Sonidero and Greenpoint Reggae support Mungos HiFi, Palma, 2012

Creativity in the DJ's system is apparently simple. It relies entirely upon bringing together sequences of material and exploring the possible combinations. This can be done simply, as a sequence of tracks one after another, or in more complicated ways involving layering, sampling, and mashups.

The DJ’s creativity enters the system in two fundamental information behaviours: selecting and mixing. With these two behaviours alone personal expression and intent can enter an activity that always reuses stored content.

Selecting is the principal creative behaviour. It reduces information volume, builds useful groups of related material, and in the live event is done responsively to feedback and personal ideas.

Mixing is the second creative information behaviour. It combines the material and explores the connection between articles. Mixing formulates previously unnoticed connections from within the archive.

A Model of Learning

Learning is intrinsic to the DJ’s model of interaction. Learning and memory develop through retrieval, organisation, classification, and addition of metadata. The association of human memory to the digital image of the collection (digital memory), creates a system in which the DJ can work in a creative flow with information, moving between idea and information.

The model incorporates retrieval and learning, with creative development and a publication workflow. Newly created texts are directly tested in a field of listeners, and may also be published and released as informational resources. The model is described in the image below.

 

DJ's Model of Information Interaction (Norton, 2013)


The next few blogposts will discuss my experiences of using the British Library’s collections, working with Labs to develop the first functioning iteration of the interface and the future developments of the work.

 

20 November 2013

The georeferencer is back!

Coinciding nicely with GIS Day 2013 the British Library's Lead Curator of Digital Mapping Kimberly Kowal has just released online the biggest batch yet of digitised maps needing georeferencing: 2,700 of them! 

She explains, "We're asking the public to help 'place' them, i.e. identify their locations by assigning points using modern mapping. It can be a challenge, but it is an opportunity to discover familiar areas as they existed around one hundred years ago."

Read more about it over on the British Library Maps & Views Blog.

Or just dive right in and get georeferencing!

 

 

 

08 November 2013

The Sample Generator - Part 1: Origins

Posted on behalf of Pieter Francois.

Imagine being asked to describe the tool you always wanted when you were writing your PhD.

Imagine being asked (without having to worry too much about technical implementations), to make a case for a digital tool that would have:

  • saved you enormous time
  • allowed you to expand drastically the number of sources to study
  • allowed you to ask new and more relevant research questions 

Which digital tool would you choose?
What functionality seems crucial to you but is surprisingly lacking in your research area? 

It was with this frame of mind that I decided to enter the 2013 British Library Labs competition with the idea of creating a Sample Generator, i.e. a tool able to give me an unbiased sample of texts based on my search criteria. Being one of the chosen winners provided me with an opportunity to put together a small team of people from both within and outside the British Library to make it a reality.

When studying the world of nineteenth-century travel for my PhD I used the collections of the British Library extensively. Being able to look for relevant material in roughly 1.8 million records is a researcher's dream. It can also be a curse.

Travel_query_primo

Snapshot of catalogue window with search word "travel"

How did the material I decided to look at fit into the overall holdings? Sure, my catalogue searches did produce plenty of relevant material, but how representative was the material I looked at of the overall nineteenth-century publication landscape? Even assuming the British Library holdings are as good a proxy as any for the entire nineteenth-century British publication landscape, this is a very difficult question to answer. Historians and literary scholars have designed many clever methodological constructs to tackle such issues of representativeness, to address potential biases in the studied sources and to deal with gaps in their source material. Yet very few attempts have been made to deal with these issues in a systematic way.

The ever-growing availability of large digital collections has changed the scale of this issue, but it has not changed its nature. For example, the wonderful digital '19th Century books' collection of the British Library provides access to approximately fifty thousand books in digital form, and to enthusiasts of text and sentiment mining, or to scholars interested in combining distant and close reading, its potential is phenomenal. However, the impressive size of the collection does not answer the crucial questions:

How do these books relate to the approximately 1.8 million nineteenth-century records the library holds?

How do the digitised books of the '19th Century books' collection fit into the overall nineteenth-century publication landscape?

Large numbers can create a false sense of completeness.

The Sample Generator provides researchers with a way to understand more fully the relation between the studied sources and the overall holdings of the British Library. Whereas a traditional title-word search in the British Library Integrated Catalogue generates an often long list of hits, the Sample Generator allows you, with a few additional clicks, to generate structured, unbiased samples from this list. The key innovation is that these samples mimic the rise and fall in popularity of the searched terms over the nineteenth century as found in the entire British Library holdings for this period.

Depending on the amount of research time available it is possible to change the sample size (or, for cross-validation purposes, to create several samples based on the same search criteria). Furthermore, as the Sample Generator not only works with the catalogue data (metadata) of all nineteenth-century books the British Library holds, but also keeps a special focus on the metadata of the digital '19th Century books' collection (see http://britishlibrary19c.tumblr.com/ for a representative sample), it is possible to create samples of only digitised texts. These samples can then be further queried using advanced text analysis and data mining tools (e.g. geo-tagging). As all the samples generated by the various searches will be stored with a unique URL, the samples become citable and can be shared with peers and more easily used in collaborative research.
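
To make the sampling idea concrete, here is a rough sketch (not the team's actual implementation) of drawing a sample whose year-by-year distribution mimics that of the full catalogue: count how many matching catalogue records fall in each publication year, allocate the requested sample size proportionally, and draw at random from the digitised records for each year. The record lists and their 'year' field are hypothetical stand-ins for the real metadata.

# A rough sketch of the sampling idea; record structures are hypothetical.
import random
from collections import Counter

def sample_generator(catalogue_hits, digitised_hits, sample_size, seed=None):
    rng = random.Random(seed)  # fixing the seed makes a sample reproducible
    year_counts = Counter(record["year"] for record in catalogue_hits)
    total = sum(year_counts.values())

    digitised_by_year = {}
    for record in digitised_hits:
        digitised_by_year.setdefault(record["year"], []).append(record)

    sample = []
    for year, count in year_counts.items():
        quota = round(sample_size * count / total)   # proportional allocation per year
        pool = digitised_by_year.get(year, [])
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample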

Whereas in this phase of the project the Sample Generator has only been tried out on the nineteenth-century holdings of the British Library and on the digital '19th Century books' collection, its application is nearly universal. The Sample Generator can be implemented on any catalogue (or even bibliography) and, where relevant, links can be made to one or more digital collections.

Adding such a link with a digital collection allows users to make a different type of claim. For example, the finding 'I observed trend X in digital collection Y' is replaced by the finding 'I observed trend X in a structured unbiased sample Y which is representative of the entire catalogue/bibliography Z'. This adds an important functionality to the increasing number of large digital collections, as it removes the inherent, yet often poorly documented, biases of the digitisation process (although it introduces the curatorial biases of the much larger collections, which are fortunately usually better documented and understood, as generations of scholars have come to terms with them).

Finally, the Sample Generator is a great hypothesis-testing tool. Its use allows scholars to cover a lot of ground fairly quickly by testing a range of hypotheses and ideas on relatively small sample sizes. This allows for a creative, yet structured and well documented, going back and forth between the conceptual drawing board and the data. Whereas such a structured dialogue is fundamental in the natural and social sciences, it is largely lacking in the humanities, where the dialogue between ideas and data has tended to happen in a more haphazard fashion.

The past four months were spent turning this general idea (which at times felt overly ambitious) into reality. We faced several challenges; for example, the catalogue data was incomplete and inconsistent. Furthermore, I firmly believed that it was essential to accompany this tool with some case studies highlighting its transformative potential. Given the amount of labour and the range of skill sets necessary to complete both tasks, the project had to be team based. Without both the time and intellectual contributions of Mahendra Mahey, Ben O'Steen, Ed Turner and Justin Lane the Sample Generator would still simply be the digital tool I always wanted to have.

Pieter Francois ~ is one of the winners of the 2013 British Library Lab competition. He works at the Institute of Cognitive and Evolutionary Anthropology, University of Oxford, where he specializes in longitudinal analysis of archaeological and historical data.

The next blogposts of this short series, written by various members of the team, will focus on how to use the Sample Generator, on explaining the technical nuts and bolts at the back end of the tool, and on recounting the experiences of collecting the necessary data to test drive the tool.

30 October 2013

Guess the journal!

Over recent months I’ve been working on-and-off with a collection of metadata relating to articles published since 1995 in journals the library has categorised under the ‘History’ subject heading: 382,497 rows of data (under CC0) about publication habits in the historical profession, which lend themselves to some interesting analysis and visualisation.

To recap from previous posts on this blog and on another, I started this work by extracting words which frequently occurred within journal article titles. Having filtered out words whose meaning was fuzzy (‘new’, ‘early’, ‘late’, ‘age’) or whose presence was not helpful (‘David’), I was left with this list of topwords (I’ve avoided ‘keywords’, I just don’t like the word at the moment):

africa america archaeology art britain british china chinese cultural culture development empire england europe france historical history identity life making medieval national policy political politics power revolution social society state study women world

Next I created a .csv where each row represented an occurrence of one of these 33 topwords in an article title. This totalled 209,210 rows; though this is fewer than the total number of article rows, many titles contained more than one of these words, so some articles were represented more than once.
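
If you wanted to rebuild something similar yourself, the sketch below shows one way of generating those occurrence rows; the input file and its column names are hypothetical rather than the actual schema of the dataset.

# A sketch of building one row per (article, topword) match in a title.
# 'history_articles.csv' and its columns are hypothetical.
import csv
import re

TOPWORDS = {"africa", "america", "archaeology", "art", "britain", "british",
            "china", "chinese", "cultural", "culture", "development", "empire",
            "england", "europe", "france", "historical", "history", "identity",
            "life", "making", "medieval", "national", "policy", "political",
            "politics", "power", "revolution", "social", "society", "state",
            "study", "women", "world"}

with open("history_articles.csv", newline="", encoding="utf-8") as infile, \
     open("topword_occurrences.csv", "w", newline="", encoding="utf-8") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    writer.writerow(["journal", "year", "topword"])
    for row in reader:
        words = set(re.findall(r"[a-z]+", row["title"].lower()))
        for word in words & TOPWORDS:   # one output row per matching topword
            writer.writerow([row["journal"], row["year"], word])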

Before we get to the fun bit, there are a number of problems with the data that need pointing out:

  • There are some odd gaps and declines in article volume for some journals around 2005. This isn’t due to actual publication trends, so we are working on why the data isn’t accurate – huge thanks to the Metadata Services team (especially Corine Deliot) for their hard work.
  • The volume of English language titles smothers the non-English titles (Italian and, notably thanks to Zeitschrift für Geschichtswissenschaft, German among them), leaving us with very Anglophonic data. I’d like to do some translating, but for now I’ll restrict myself to trends in English language articles.
  • The data isn’t smoothed by articles per journal issue (or articles per journal per year), thus ‘power’ journals are created on sheer volume of output alone (and, as we all should know and should hope to be the mantra of future academic publication, less can be more…).
  • The data includes reviews, though this isn’t necessarily a bad thing as it adds book titles to the list of titles mined (hence why ‘David’ appeared often enough to need filtering out).
  • Some words have multiple meanings (china) or are ill-suited to simple text mining (art), but then corpus linguists have known this for years.
  • Some journals in the data are not really history journals, but rather politics and current affairs publications with a sprinkling of historical content. Archaeology is similarly problematic, but I’ve left these journals in for now out of a sense of GLAM solidarity.

Despite all of this, I’d like you to play a game of guess the journal from a network graph; a network graph representing data for the 30 highest-ranking English language History journals in terms of article volume published between 1995 and 2013. On the one hand, your doing this will help me validate that my data – and this particular way I’ve chosen to represent it (a force-directed ‘Force Atlas’ graph generated using Gephi) – has some value; Adam Crymble has a nice example of how this can be useful. On the other, it should be a bit of fun.
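
For the curious, here is a sketch of how a bimodal journal-topword network of this kind could be assembled and handed to Gephi; it assumes the networkx library and reuses the hypothetical occurrence file from the earlier sketch.

# A sketch of building the bimodal network and exporting it for Gephi.
import csv
from collections import Counter
import networkx as nx

edge_weights = Counter()
with open("topword_occurrences.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        edge_weights[(row["journal"], "[" + row["topword"] + "]")] += 1

G = nx.Graph()
for (journal, topword), weight in edge_weights.items():
    G.add_node(journal, kind="journal")
    G.add_node(topword, kind="topword")
    G.add_edge(journal, topword, weight=weight)  # edge thickness ~ occurrence count

nx.write_gexf(G, "history_journals.gexf")  # open in Gephi and apply Force Atlas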

So, onto that long promised fun bit. Knowing the following:

  • That each number on the network represents a journal name,
  • that each word within square brackets is a topword from an article title,
  • that the thickness of the line between the word and the number represents the occurrence of that topword in the numbered journal,
  • and that the colouring represents the group (or modularity) the numbered journal has been assigned to based on the structure of the network;

can you guess which numbers represent the following journals? (Or is this whole thing meaningless?)

  • Antiquity
  • English Historical Review
  • International Journal of African Historical Studies
  • International Journal of Maritime History
  • Journal of American History
  • Journal of Asian Studies
  • Journal of Social History
Bimodal Force Atlas graph for History Journal Articles published 1995-2013. For more detail (and with apologies for the fuzzy compression above, you'll probably need it!), download the PNG or SVG version.

To start you off, I’ll gift you that American Historical Review is number 34 – right at the heart of the network, which is not surprising given its volume of output. I’ll also give you a little derived data to help you make up your mind.

Answers in the comments please!

@j_w_baker

07 October 2013

Peeking behind the curtain of the Mechanical Curator

The "Mechanical Curator" is an experiment, providing undirected engagement with the British Library's digital content. Undirected? 

[Image: word cloud]

Random, fortuitous, haphazard, undirected, unplanned, and most importantly, unpredictable. There are already many ways to discover great content that you know you like, but how do you find things that you cannot begin to describe?

The majority of researchers begin their search for content using a general purpose search engine (Ithaka S+R | Jisc | RLUK: UK Survey of Academics 2012 [PDF]). It is easy to forget just how phenomenally powerful these can be, leading researchers to content that they know they want. This is also their shortcoming. The normal mode of searching makes it very difficult to find things that are not yet known. Keyword searches do not make it easy to collide ideas and concepts together, to view things from different perspectives, and to see what might fit together.

While many of the major providers have made attempts to provide related content to search results, these fall short of being serendipitous. The idea of searching for content fails when the researcher does not even know what they might want to see or how to describe it in words. The Mechanical Curator approaches discovery from the opposite angle, publishing content as it sees fit without an outside agent directing what it should publish.

"I don't know art but I know it when I see it."

A small book illustration is chosen at random* from the pages of the digitised book collection and posted to a tumblr account with some descriptive information about the book it was taken from, along with its entry in the library catalogue (insofar as this is currently possible). The images are eclectic and seemingly unthemed, ranging from curious illustrations of animals to ornate, illuminated letters, and from drawings of archaeological relics to complex crystal structures.

* - The selection process is not entirely random any more, but more on this development later.


Image from ‘The British Miscellany: or, coloured figures of new, rare, or little known animal subjects, etc. vol. I., vol. II’, 003450253 page 275 by SOWERBY, James.

 

Image from ‘A System of Mineralogy … Fifth edition, rewritten and enlarged … With three appendixes and corrections. (Appendix I., 1868-1872, by G. J. Brush. Appendix II., 1872-1875, and Appendix III., 1875-1882, by E. S. Dana.)’, 004117752


Image from ‘The Struggle of the Nations. Egypt, Syria, and Assyria … Edited by A. H. Sayce. Translated by M. L. McClure. With map … and … illustrations’, 002415000 page 696 by MASPERO, Gaston Camille Charles.

The Mechanical Curator. How? Why?

James Baker has written a post illuminating some of the feelings behind the Mechanical Curator so I have constrained myself to write about how I gathered the images from the books, why I did so in the first place, my explorations with the images and why I think it is more interesting that the Mechanical Curator selects images in an almost random fashion, rather than being completely random.

Gathering the images: Context

Microsoft ran a digitisation campaign to provide content for its 'Live Book Search' from around 2005 to 2008. It partnered with a number of libraries and provided the funds and teams to digitise its partners' content.

I have spent some time reorganising and exploring the 65,000 volumes they digitised from the British Library's collection. The years the works cover range from the 14th century right up to the 20th century, with the vast majority published in the late 19th century. This also means that it has been straightforward to establish that these works are in the public domain, which is why the images are being released with an explicit CC0 licence.

The collection consists of:

  • ~65,000 zipped archives of JPEG2000 image files, with a file per page. Images of the covers and flysheets are also often included,
  • The same number of zipped archives of OCR metadata, encoding the words and letters recognised by the OCR process,
  • Simple METS metadata for the original (physical) item, which was supplied to Microsoft by the British Library,
  • Directories of the unpacked OCR XML and METS metadata, organised by the identifier of the work.

Exploring the pages

Face_detected

I was interested in mechanical ways of exploring the works: can I reuse existing face-detection techniques to find out how the depiction of faces changes over time? How can I home in on pages with interesting content like maps, people and diagrams, given that we cannot tell this from the metadata we have about these works?

Most importantly, how can I do this with the limited compute power I have? 

Guideline #1: "Make Effective Filters, and use them early and often"

There are several million pages in this collection, each one potentially containing something of interest. Not all pages will have illustrations on them, however, and processing every page to detect faces would likely be a waste of time and effort. On inspection, the OCR XML occasionally contained information on areas where the OCR software believed it had found images:

...
<ComposedBlock ID="P153_CB00001" HPOS="94" VPOS="22" WIDTH="833" HEIGHT="225" STYLEREFS="TXT_0 PAR_LEFT" TYPE="Illustration">
    <GraphicalElement ID="P153_CB00001_SUB" HPOS="94" VPOS="22" WIDTH="833" HEIGHT="225"/>
</ComposedBlock>
...

As we cannot say whether the OCR process missed any images, this isn't enough information to guarantee finding all of the illustrations. However, it was enough to build a list of the 1.1 million pages which might be interesting to scan. So, did I write some code to take the XML, parse it, pull out the right nodes... no. This leads me on to my second guideline:

Guideline #2: "Simple tools are your friends. Learn to love grep, sed, cat, awk and *nix pipes."

In this case, it was clear that the sequence of characters "<GraphicalElement ID=" was going to appear in the XML only to indicate the location of an illustration on a page. Using grep and a little bash scripting, I was able to build a list of OCR XML files which were worth looking at in more detail.

[Example path to the OCR XML for ID '000000206': "by_id/0000/000000206/ALTO"]

$ for division in `ls by_id`
> do
>   for id in `ls by_id/$division`
>   do
>     grep -l "GraphicalElement" by_id/$division/$id/ALTO/*.xml >> ~/illustrations.txt
>   done
> done

Much of the code above is there to loop through all the XML files I needed to check. There are much more concise (but less instructive!) ways to do so, and this way is hopefully clear. The bulk of the work is done by "grep -l", which prints the names of the files that contain the matching text. The '>>' redirects the output into the 'illustrations.txt' file, appending it to the end of whatever is in there already.

Guideline #3: Break down what you want to do into small sets of simple tasks, rather than trying to do it all at once*.

Instead of trying to create a big application that would automatically find, parse and act on the OCR XML in one go, I broke it into separate and straightforward tasks, like the simple one above that filtered the OCR XML and created a list of pages containing illustrations.

* This guideline stems from something I have learned from experience: You should avoid writing 'clever' code unnecessarily. You will revisit old code on occasion and you will be surprised at how quickly you forget all the clever tricks you used! Especially if you have changed programming languages and libraries since then too!

As you can perhaps see from the code, there are small scripts to create a queue of jobs from those pages, and other pieces of code that take those jobs and perform a single task on each. In this case, that task was to create a JPEG image from any small (less than roughly 8 square inches) book illustration, as indicated by the OCR XML. The idea was to create a shareable collection of the small images such that they would fit on a USB flash drive. I now have a collection of 394,882 small illustrations, which occupy 41 GB of space. As 64 GB USB drives are not too expensive, I'd argue that I achieved that goal!
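
As a rough illustration of that single task (not the production code), the sketch below parses one ALTO file and crops out each region marked as an illustration. It assumes the page image has already been converted from JPEG2000 to something Pillow can open, that the ALTO coordinates are in image pixels, and the file names are made up.

# A simplified sketch of the cropping step; paths and assumptions as above.
import xml.etree.ElementTree as ET
from PIL import Image

def extract_illustrations(alto_path, page_image_path, out_prefix):
    page = Image.open(page_image_path)
    count = 0
    for block in ET.parse(alto_path).iter():
        # ALTO marks illustration regions as ComposedBlock TYPE="Illustration"
        if block.tag.endswith("ComposedBlock") and block.get("TYPE") == "Illustration":
            x, y = int(block.get("HPOS")), int(block.get("VPOS"))
            w, h = int(block.get("WIDTH")), int(block.get("HEIGHT"))
            crop = page.crop((x, y, x + w, y + h))
            crop.save("{0}_{1:03d}.jpg".format(out_prefix, count))
            count += 1
    return count

# e.g. extract_illustrations("000000206_0153.xml", "000000206_0153.jpg", "000000206_p153")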

It was a straightforward piece of work to write a script that pushed a random image to a tumblr account, with a caption containing the small amount of metadata we had on it. This idea was driven by informal conversations the Digital Scholarship team had about how to make this collection of public domain works more accessible and remixable. The Mechanical Curator was born at that point and has posted an image to tumblr every hour since.
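
The posting step itself needs little more than the sketch below, which uses the pytumblr client library; the OAuth credentials, blog name, image list and caption are placeholders rather than the Mechanical Curator's real configuration.

# A minimal sketch of posting one randomly chosen image; all values are placeholders.
import random
import pytumblr

client = pytumblr.TumblrRestClient(
    "CONSUMER_KEY", "CONSUMER_SECRET", "OAUTH_TOKEN", "OAUTH_SECRET")

image_paths = open("illustration_paths.txt").read().splitlines()  # hypothetical list of files
choice = random.choice(image_paths)

client.create_photo(
    "example-curator.tumblr.com",        # placeholder blog name
    state="published",
    data=choice,                         # path to the chosen image
    caption="Image from book 000000206, page 153",  # would carry the real metadata
    tags=["bookimages", "publicdomain"])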

Well, it used to post images it selected on a purely random basis. It used to.

Interests and Mood for the Mechanical Curator

The random images are, by their nature, potentially very interesting and often surface works that would otherwise have been ignored due to the poor information we had on them, such as a single-word title like 'London' or a mundane description. With the Mechanical Curator, you don't know what is going to be posted next, and it became clear that this inherent randomness was intriguing and somewhat addictive.

But what if the Mechanical Curator 'curated' its output in some way? What if it gained a very slight bias in what it posted?

I worked on some code that would allow the curator to assess how similar two images were - not just in terms of the book's age, author and so on, but how similar the images were visually. This was written using OpenCV, and it generates gauges of the content in ways that are described in the code as 'slantyness', 'bubblyness' or, simply, the size of the image. It adds its judgement to the posted image using tags, saying why it finds the images similar and whether or not it has detected a face or profile within the image and whereabouts it believes it to be.

The Mechanical Curator now looks through a number of randomly selected images, and will post an image if it is similar enough to the one it most recently uploaded, both visually and in terms of metadata. However, I didn't want to tip the scales of randomness too much: if it cannot find a match after checking eight images, it gets bored and posts the eighth one as a '#new_train_of_thought'.

Due to its low boredom threshold, it regularly starts a new chain of thought and so doesn't get stuck in a loop of posting floral designs, cartoon line-art images, or of etchings!
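
To give a flavour of that behaviour, here is a toy sketch of the 'similar enough, otherwise get bored' loop. It substitutes a plain colour-histogram comparison (via OpenCV) for the curator's actual 'slantyness' and 'bubblyness' measures, which are not reproduced here, and the file paths are placeholders.

# A toy sketch of the selection loop, using a simple histogram similarity.
import random
import cv2

def similarity(path_a, path_b):
    """Correlation of colour histograms: 1.0 identical, near 0 unrelated."""
    hists = []
    for path in (path_a, path_b):
        img = cv2.imread(path)
        hist = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hists.append(cv2.normalize(hist, hist).flatten())
    return cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)

def pick_next(last_posted, candidates, threshold=0.7, patience=8):
    tried = random.sample(candidates, min(patience, len(candidates)))
    for candidate in tried:
        if similarity(last_posted, candidate) >= threshold:
            return candidate, []                 # similar enough: continue the chain
    return tried[-1], ["#new_train_of_thought"]  # bored: start a new chain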

30 September 2013

The Mechanical Curator

Last week she plucked out a lion. In portrait the beast looked solemn, pensive; yet constructed with a careful craftsmanship that burst from the page, from the screen. The caption -‘The lion with the buful voice’ - confused. “Buful”? Is this London E20? And “voice”, wherefrom? whereto? Do not lions roar rather than sing in “buful” tones?

Such is the whim of our newest colleague, The Mechanical Curator. She plucks from obscurity, places all before you, and leaves you to work out the rest. Or not. For sometimes laughter is the only valid response: why illuminate the letter ‘O’ with a grumpy looking child; a grumpy illumination, an oxymoron? Well no, screams out the title of the host work from the metadata: ‘Face to Face with the Mexicans: the domestic life, educational, social, and business ways’. The Mechanical Curator has put us ‘face to face’ with a Mexican circa 1890, or an engraved portrait of a Mexican, or an American representation of a Mexican. And so, as what at first seemed simple descends into complexity, the Mechanical Curator achieves her peculiar aim: giving knowledge with one hand, carpet bombing the foundations of that knowledge with the other.

In a sense the pursuit of knowledge is not the point. Two or so weeks ago, this hourly stream of randomly selected small illustrations and ornamentations started life as a sea of faces. A facial recognition tool was set to scan the illustrations the ALTO XML said the corpus of some fifty-thousand books, some sixty-plus-thousand volumes, contained - we could, we thought, push them out in some way, reveal a collage of past faces. The tool returned many faces, or illustrations of faces to be precise, mostly female, rarely with hats, always looking forward with mouth, nose, forehead on the vertical. Something was up. Ben explained that the tool (a pesky blackbox) was trained on modern passport photos: faces looking forward, rarely smiling, never wearing hats, always on the vertical. This couldn’t hope to capture the faces of the long-eighteenth and long-nineteenth centuries; it was to be no Invisible Australians. The tool was then turned 90° both ways. The hit count expanded, but was beset by failure: contained in physical blemishes, details in walls, flourishes and ornamentation were Victorian smilies, disguised not on purpose but by accident: happy accidents maybe, but not data pertinent to past phenomena.

The flourishes and ornamentation were, however, beautiful, arresting. Nora was enchanted, vocally. Can we capture these? Can we capture and present these? Can we capture and present these in a way that gives pause for the appreciation of even the smallest embellishment to the text?

The answer, of course, was ‘we can’. And hence we have.

Over the next year The Mechanical Curator will keep picking and publishing. She is agnostic to notions of quality, of fashion, of art. And she will change, respond, improve. She may even be given a larger pot to pick from. She is, after all, but a bantling. As she grows she will embody all the while the values of British Library Labs and the Digital Curator team: to use digital technologies to enable novel interaction with the digital collections we hold and to (where possible) publish into the public domain, or similar, those digital collections over which neither we nor anybody else have asserted copyright claims.

Without our Mechanical Curator we might never have known of the lion with the buful voice; his capacity to do something other than roar may have been left undiscovered. And from the warmth with which many of you have embraced her, this, it seems, is something many of you value. So we hope you, like us, look forward to checking in over the coming months to pick through the curiosities that she plucks from obscurity.

For more details on the hows and whats, look out for an update from Ben in the coming days.

@j_w_baker


16 September 2013

Data exploration through visualisation

The impact that a thoughtful visualisation can have should not be underestimated. However, it's easy to forget how tremendously useful visualisations are for understanding your own data, before you even know what you have.

The questions "Is there...?" and "What if...?" drive the exploration of data. Often, these questions are best answered by creating something in reply: "This is what it looks like in that context." This can be as simple as creating a chart from a spreadsheet, or pulling out all the key words and phrases and putting them all together on a single page. There are also a number of tools that will take structured data and provide different ways to examine them.

This post will not be a survey of visualisation software, but it will contain a few worked examples of how I approached some data and how I went about exploring it using the occasional visualisation. There is python code below but you won't need to know how to program to understand what it does.

Case Study: A "Random" identifier?

As part of an exploratory project, I was given access to a large amount of media 'stuff' for which the only metadata was that each item had a long hexadecimal string for an identifier, its Unique Material Identifier, or UMID. The file metadata and folder structure reflected the date at which the item was accessioned into the system, not the date that the item was published, recorded or created. I found an explanation of the UMID's internal structure here [NB: PDF] but a standard is only as standard as its implementations are, and I couldn't be confident that the UMIDs I had fit this structure. However, a random sample of UMIDs - randomly selected, not just taken 'at random' by hand - did fit the documented pattern:

[Image: the structure of a UMID, showing the 'Label', 'L', instance number and material number fields]

The 'Label' and 'L' held the same values in every UMID and could be ignored for now. (I have masked out the 'Label' number together with a portion of the UMID at the end as these contained identifying but irrelevant information about the system.) The 'L' value (0x13) simply stated the number of bytes - 19 in decimal - that followed it. We can therefore ignore both of these parts.

The 'Instance number' and the 'Material number' were far more interesting, however. Maybe these might hold some sort of pattern? A counter that goes up gradually, perhaps? I noted that the instance number is made up of three bytes, just like an HTML colour code. Maybe the data would be clearer if it were plotted as colour information, using HTML and a browser? Worth a try!

Let's start with a list of these UMIDs, in a file called 'umids.txt'. You can get a sample file to work with from here. The file contents will look like this:

0xFFFFFFFFFFFFFFFFFFFFFFFF13354abaf43bfe04396505805ef6FFFFFFFFFFFF
0xFFFFFFFFFFFFFFFFFFFFFFFF13f558cb8b406d0140650580733dFFFFFFFFFFFF
  ... 
  etc

I use python to manipulate and massage data and I use it here so you'll have to install it to follow along on your own machine. (I highly recommend learning the basics of it if you are planning to dig into some data at some point.)

Python 2.7.5 (default, Jul 30 2013, 14:34:22)
[GCC 4.8.1] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> umid_file = open("umids.txt", "r")      # open the umid list for reading ("r")

Now to build up a list of the numbers we care about in a variable called 'umid_list' by going through each line of the file and pulling out the number we want.

>>> umid_list = []
>>> for line in umid_file:
...   # The instance number starts after the 28th character and ends on the
...   # 34th. We can slice out this section and add it to the umid_list like this:
...   umid_list.append(line[28:34])
...
>>>

We now have a list of things called 'umid_list', and each thing is a 6-digit hexadecimal number. If I want to specify a colour using CSS in an HTML page, it would look like this: "color: #cbe4e4;". Let's write a little bit of code to make coloured bars in a webpage, the colours corresponding to the instance numbers:

>>> def bars(html_filename, list_of_stuff):
...   with open(html_filename, "w") as htmlfile:
...     htmlfile.write("<html><head><style>.bar { width: 100%; height: 0.3em; } </style></head><body>")
...     for instance_number in list_of_stuff:
...       htmlfile.write(
...         '<div class="bar" style="background-color: #{0};"> </div>\n'.format(instance_number))
...     htmlfile.write("</body></html>")
...
>>> bars("bars_of_umids.html", umid_list)
>>>
Example code to create the bar and block HTML pages, as well as a means to create random and ordered 'stuff' to visualise, can be found here.

What this does is create a method that takes a filename to write to and a list of stuff to use to create the webpage. It opens the file and writes the basic HTML head, including the styling for the 'bar'. In this case, it makes each bar as wide as the page and "0.3em" high - 1 em corresponds to the current font size.

The code then adds an element for each instance number, and styles the element to use the instance number as a background colour. The generated pages look something like this:

[Image: the instance numbers rendered as full-width coloured bars]

I wasn't able to discern much from this very long page. Instead of bars of colour, how about small blocks of colour? I edited the style information in the browser (via Ctrl-Shift-I in Chrome) to do so and it turned out like this:

[Image: the instance numbers rendered as small coloured blocks]

This is far more interesting! Note how there is a massive change in the range of numbers used after a certain point and the colours tend towards the red after this point too. Compared to random numbers, the difference is noticeable:

[Image: random numbers compared with the UMID instance numbers]

Already, we understand that this section is not random and that there might be some interesting information in there worth a little more exploration. The visualisation may not convey any specific information, and publishing it may be wasted effort, unless you are writing a piece about the usefulness of these intermediate visualisations, of course!
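
If you want to reproduce the comparison yourself, one quick approach (a sketch reusing the bars() function defined above) is to render the same number of random six-digit hex values in the same way:

>>> import random
>>> random_list = ["{0:06x}".format(random.randint(0, 0xFFFFFF)) for _ in range(len(umid_list))]
>>> bars("bars_of_random.html", random_list)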

Visualising 'Post-Excel' data

A final example uses the published RDF data from the British National Bibliography (BNB), which comprises around 90 million 'triples'. This is too much information to put into Excel, but it isn't really big enough to count as 'Big Data'. That said, the toolsets that have been developed to handle very large datasets are very capable when manipulating these big-but-not-big datasets. I used Apache Pig on Hadoop to do the work. I will skip the details of how I did the number-crunching for now, as it deserves its own post later!

I wanted to know, proportionally, how the number of works in a given subject varies across the years that the BNB covers. In this example, I created a list with rows containing the Dewey Decimal Classification (DDC) including the version of DDC, the year of publication and the number of works published in that year in that DDC range. I then created a webpage that uses a javascript library called D3 to draw and place circles on the page to represent this data. As this uses javascript and SVG, the visualisation should work in most normal browsers (i.e. not Internet Explorer).

[Image: interactive D3 visualisation of the number of works per DDC range by publication year]

The DDC17 to DDC23 checkboxes allow you to hide or show the circles corresponding to the various versions of the Dewey Decimal Classification. I recommend zooming out (Ctrl -) to get a general feel for the entire distribution. Note that for some publication years, multiple versions of the DDC are used to categorise the works, due to the retrospective nature of cataloguing. There are works classified with Dewey that pre-date the entire system.

While this visualisation has done its job and given me and my colleagues a better idea of what is in there, it doesn't answer a specific question as many well-regarded visualisations or infographics do. However, I hope that the usefulness of making these exploratory images has been made clear!

Please comment below if you would like more detail on how I adapted some code from an existing D3 visualisation to create this one.

In summary, try to view your data in a number of ways as you go, rather than just as a means to publish it.
