Digital scholarship blog: September 2013

3 posts from September 2013

30 September 2013

The Mechanical Curator

Last week she plucked out a lion. In portrait the beast looked solemn, pensive; yet constructed with a careful craftsmanship that burst from the page, from the screen. The caption - ‘The lion with the buful voice’ - confused. “Buful”? Is this London E20? And “voice”, wherefrom? whereto? Do not lions roar rather than sing in “buful” tones?

Image from ‘The Little Lady of Lavender. [A tale.] … Illustrated, etc’

Such is the whim of our newest colleague, The Mechanical Curator. She plucks from obscurity, places all before you, and leaves you to work out the rest. Or not. For sometimes laughter is the only valid response: why illuminate the letter ‘O’ with a grumpy-looking child; a grumpy illumination, an oxymoron? Well no, screams out the title of the host work from the metadata: ‘Face to Face with the Mexicans: the domestic life, educational, social, and business ways’. The Mechanical Curator has put us ‘face to face’ with a Mexican circa 1890, or an engraved portrait of a Mexican, or an American representation of a Mexican. And so as what at first seemed simple descends into complexity the Mechanical Curator achieves her peculiar aim: giving knowledge with one hand, carpet bombing the foundations of that knowledge with the other.

Image from ‘Face to Face with the Mexicans: the domestic life, educational, social, and business ways … of the Mexican People … With 200 illustrations. [With musical notes.]’

In a sense the pursuit of knowledge is not the point. Two or so weeks ago, this hourly stream of randomly selected small illustrations and ornamentations started life as a sea of faces. A facial recognition tool was set to scan the illustrations the ALTO XML said the corpus of some fifty-thousand books, some sixty-plus-thousand volumes, contained - we could, we thought, push them out in some way, reveal a collage of past faces. The tool returned many faces, or illustrations of faces to be precise, mostly female, rarely with hats, always looking forward with mouth, nose, forehead on the vertical. Something was up. Ben explained that the tool (a pesky blackbox) was trained on modern passport photos: faces looking forward, rarely smiling, never wearing hats, always on the vertical. This couldn’t hope to capture the faces of the long-eighteenth and long-nineteenth centuries; it was to be no Invisible Australians. The tool was then turned 90° both ways. The hit count expanded, but was beset by failure: contained in physical blemishes, details in walls, flourishes and ornamentation were Victorian smilies, disguised not on purpose but by accident: happy accidents maybe, but not data pertinent to past phenomena.

The flourishes and ornamentation were, however, beautiful, arresting. Nora was enchanted, vocally. Can we capture these? Can we capture and present these? Can we capture and present these in a way that gives pause for the appreciation of even the smallest embellishment to the text?

The answer, of course, was ‘we can’. And hence we have.

Over the next year The Mechanical Curator will keep picking and publishing. She is agnostic to notions of quality, of fashion, of art. And she will change, respond, improve. She may even be given a larger pot to pick from. She is, after all, but a bantling. As she grows she will embody all the while the values of British Library Labs and the Digital Curator team: to use digital technologies to enable novel interaction with the digital collections we hold and to (where possible) publish into the public domain, or similar, those digital collections over which neither we nor anybody else have asserted copyright claims.

Without our Mechanical Curator we might never have known of the lion with the buful voice; his capacity to do something other than roar may have been left undiscovered. And from the warmth with which many of you have embraced her this, it seems, is something many of you value. So we hope you, like us, look forward to checking in over the coming months to pick through the curiosities that she plucks from obscurity.

For more details on the hows and whats, look out for an update from Ben in the coming days.

@j_w_baker

Posted by James Baker at 11:52 AM

Tags

BL Labs, Data, Experiments

27 September 2013

Digital Conversations Event on The Scholarly Use of Web Archives

The Digital Research & Curator Team had organised a number of thought-provoking events throughout 2011 and 2012. However, these were only open to British Library staff and we thought that it was now time to share Digital Conversations with a wider audience.

Snazzy direction signs were placed around the building to help guide attendees to the venue

The theme for our first public Digital Conversations event, The Scholarly Use of Web Archives, had been suggested by our colleagues in the British Library’s Web Archiving Team, and we were delighted to invite international experts to speak: Richard Rogers, Professor of New Media and Culture and Chair of Media Studies at the University of Amsterdam, and Niels Brügger, Director and an Associate Professor of Internet Studies at the Centre for Internet Studies, Aarhus University, Denmark.

They were joined by Helen Hockx-Yu, Head of Web archiving at the British Library; David Berry, Reader in Digital Media in the School of Media, Film and Music at the University of Sussex and Michel Hockx, Professor of Chinese at SOAS, University of London.

David Gauntlett, Professor of Media and Communications at the University of Westminster, moderated the event and effectively kept the schedule running on time.

Stella Wisdom (left) introduces the event; to the right are Richard Rogers, Michel Hockx, David Berry & David Gauntlett

Many interesting questions were raised by both the speakers and the audience; these included:

Is the term web archive holding back archiving the web?
How do we capture the ephemerality and changeability of websites?
Should we and can we preserve apps?
Do we need an international project for web archiving?
What analytical tools and skills do we need to effectively use web archive resources in research?

Many of the issues raised are far too complex and profound to be solved by a single conversation! However, we hope to have raised awareness of the importance of web archives and started people thinking about how they may engage with them in their research.

There was active tweeting throughout the event using the #BLdigital hashtag and you can see these tweets on Storify.

Slides from some of the presentations can be viewed on SlideShare and you can watch the "Google and the Politics of Tabs" video shown by Richard Rogers on YouTube.

The next Digital Conversations event will be on Interactive Narratives on 4 November 2013; watch this blog for more information, including how to book a place, coming soon.

Attendees enjoying the event over a glass of wine

Posted by Stella Wisdom at 7:16 PM

Tags

Events

16 September 2013

Data exploration through visualisation

The impact of a thoughtful visualisation cannot be overestimated. However, it's easy to forget how tremendously useful visualisations are for understanding your own data, before you even know what you have.

"Textual cross references found in the Bible" © Chris Harrison http://www.chrisharrison.net/index.php/Visualizations/BibleViz

The questions "Is there...?" and "What if...?" drive the exploration of data. Often, these questions are best answered by creating something in reply: "This is what it looks like in that context." This can be as simple as creating a chart from a spreadsheet, or pulling out all the key words and phrases and putting them all together on a single page. There are also a number of tools that will take structured data and provide different ways to examine them.

This post will not be a survey of visualisation software, but it will contain a few worked examples of how I approached some data and how I went about exploring it using the occasional visualisation. There is Python code below but you won't need to know how to program to understand what it does.

Case Study: A "Random" identifier?

I was given access, as part of an exploratory project, to a large amount of media 'stuff' whose only metadata was that each item had a long hexadecimal string for an identifier: its Unique Material Identifier, or UMID. The file metadata and folder structure reflected the date on which the item was accessioned into the system, not the date that the item was published, recorded or created. I found an explanation of the UMID's internal structure here [NB: PDF], but a standard is only as standard as its implementations and I couldn't be confident that the UMIDs I had fit this structure. However, a random sample of UMIDs - randomly selected, not just taken 'at random' by hand - fit the above pattern:

The 'Label' and 'L' held the same values in every UMID. (I have masked out the 'Label' number together with a portion of the UMID at the end as these contained identifying but irrelevant information about the system.) The 'L' value (0x13) simply states the number of bytes - 19 in decimal - that follow it. We can therefore ignore both of these parts for now.

The 'Instance number' and the 'Material number' were far more interesting, however. Maybe these hold some sort of pattern? A counter that goes up gradually perhaps? I noted that the instance number is made up of three bytes, just like HTML colour codes. Maybe the data would be clearer if it were plotted out as colour information, using HTML and a browser? Worth a try!

Let's start with a list of these UMIDs, in a file called 'umids.txt'. You can get a sample file to work with from here. The file contents will look like this:

0xFFFFFFFFFFFFFFFFFFFFFFFF13354abaf43bfe04396505805ef6FFFFFFFFFFFF
0xFFFFFFFFFFFFFFFFFFFFFFFF13f558cb8b406d0140650580733dFFFFFFFFFFFF
  ... 
  etc

I use Python to manipulate and massage data and I use it here, so you'll have to install it to follow along on your own machine. (I highly recommend learning the basics of it if you are planning to dig into some data at some point.)

Python 2.7.5 (default, Jul 30 2013, 14:34:22)
[GCC 4.8.1] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> umid_file = open("umids.txt", "r")      # open the umid list for reading ("r")

Now to build up a list of the numbers we care about in a variable called 'umid_list' by going through each line of the file and pulling out the number we want.

>>> umid_list = []
>>> for line in umid_file:
...   # The digits of the instance number start after the 28th character
...   # and end on the 34th character. We can slice out this section and
...   # add it to the umid_list like this:
...   umid_list.append(line[28:34])
...
>>>
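
The slice positions follow from the layout described earlier. As a minimal sketch, assuming every line has the same shape as the masked samples (a "0x" prefix, a 12-byte label, the one-byte length, the 3-byte instance number and a 16-byte material number), the fields can be split out as below. The function name split_umid and the sample value are illustrative only and are not part of the original code.

```python
def split_umid(line):
    """Split one line of umids.txt into its parts (layout is an assumption)."""
    text = line.strip()
    if text.lower().startswith("0x"):
        text = text[2:]                    # drop the "0x" prefix
    return {
        "label": text[0:24],               # 12 bytes, masked in the samples
        "length": int(text[24:26], 16),    # 0x13 means 19 bytes follow
        "instance": text[26:32],           # 3 bytes; the same digits as line[28:34]
        "material": text[32:64],           # 16 bytes, partly masked at the end
    }

# A made-up line with the same shape as the masked samples
sample = "0x" + "F" * 24 + "13" + "354aba" + "f43bfe04396505805ef6" + "F" * 12
parts = split_umid(sample)
print("{0} {1}".format(parts["length"], parts["instance"]))
```

If real UMIDs turned out to have a different length or prefix, the offsets here (and in the slice above) would need adjusting.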

We now have a list of things called 'umid_list', and each thing is a 6-digit hexadecimal number. If I want to specify a colour using CSS in an HTML page, it would look like this: "color: #cbe4e4;". Let's write a little bit of code to make coloured bars in a webpage, the colours corresponding to the instance numbers:

>>> def bars(html_filename, list_of_stuff):
...   with open(html_filename, "w") as htmlfile:
...     htmlfile.write("<html><head><style>.bar { width: 100%; height: 0.3em; } </style></head><body>")
...     for instance_number in list_of_stuff:
...       htmlfile.write(
...           '<div class="bar" style="background-color: #{0};"> </div>\n'.format(instance_number))
...     htmlfile.write("</body></html>")
...
>>> bars("bars_of_umids.html", umid_list)
>>>

Example code to create bar and block HTML pages, as well as a means to create random and ordered 'stuff' to visualise, can be found here.

What this does is create a function that takes a filename to write to and a list of stuff to use to create the webpage. It opens the file and writes the basic HTML head, including the styling for the 'bar'. In this case, it makes each bar as wide as the page and "0.3em" high - 1em corresponds to the current font size.

The code then adds an element for each instance number, and styles the element to use the instance number as a background colour. The generated pages look something like this:

I wasn't able to discern much from this very long page. Instead of bars of colour, how about small blocks of colour? I edited the style information in the browser (via Ctrl-Shift-I in Chrome) to do so and it turned out like this:
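
The same change can be made in the generator rather than in the browser. The following is a minimal sketch, assuming that small left-floated squares are an acceptable stand-in for the style edit made by hand; the function name blocks, the class name and the default size are hypothetical choices, not the ones used for the original page.

```python
def blocks(html_filename, list_of_stuff, size="0.6em"):
    """Write one small square per value, flowing left to right across the page."""
    # Doubled braces produce literal { } in the CSS; {0} is the block size
    style = ".block {{ float: left; width: {0}; height: {0}; }}".format(size)
    with open(html_filename, "w") as htmlfile:
        htmlfile.write("<html><head><style>" + style + "</style></head><body>\n")
        for value in list_of_stuff:
            htmlfile.write(
                '<div class="block" style="background-color: #{0};"></div>\n'.format(value))
        htmlfile.write("</body></html>")

# Three example colours: two from the sample lines, one from the CSS example
blocks("blocks_of_umids.html", ["354aba", "f558cb", "cbe4e4"])
```

Because the squares wrap at the edge of the window, many more values fit on one screen than with full-width bars.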

This is far more interesting! Note how there is a massive change in the range of numbers used after a certain point and the colours tend towards the red after this point too. Compared to random numbers, the difference is noticeable:
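
A comparison list of random values can be produced along these lines. This is a sketch under the assumption that uniformly random 24-bit numbers are a fair baseline; the function name random_colours and the seed are illustrative.

```python
import random

def random_colours(count, seed=None):
    """Return `count` random 6-digit hexadecimal strings."""
    rng = random.Random(seed)          # a fixed seed makes the page repeatable
    return ["{0:06x}".format(rng.randrange(0x1000000)) for _ in range(count)]

random_list = random_colours(2000, seed=1)
```

The resulting list can be passed to the same page-writing function as the real instance numbers, for example bars("bars_of_random.html", random_list), so that the two pages can be viewed side by side.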

Already, we understand that this section is not random, and that there might be some interesting information in there that is worth a little more exploration. The visualisation may not convey any specific information and publishing it may be wasted effort, unless you are writing a piece about the usefulness of these intermediate visualisations of course!
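
The change in range that shows up in the colours can also be checked numerically. A minimal sketch, assuming the values are 6-digit hexadecimal strings like those in umid_list; the function name chunk_ranges, the chunk size and the example values are made up for illustration.

```python
def chunk_ranges(hex_values, chunk_size=100):
    """Return (start_index, lowest, highest) for each run of chunk_size values."""
    # Empty strings (e.g. from blank lines in the file) are skipped
    numbers = [int(value, 16) for value in hex_values if value.strip()]
    summary = []
    for start in range(0, len(numbers), chunk_size):
        chunk = numbers[start:start + chunk_size]
        summary.append((start, min(chunk), max(chunk)))
    return summary

# Made-up values: a narrow range followed by a much wider one
example = ["000010", "000020", "000030", "100000", "f00000", "800000"]
for start, lowest, highest in chunk_ranges(example, chunk_size=3):
    print("{0:>6}  {1:06x}  {2:06x}".format(start, lowest, highest))
```

A sudden widening between the lowest and highest value from one chunk to the next is the numeric counterpart of the visible shift in the block page.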

Visualising 'Post-Excel' data

A final example will use the published RDF data from the British National Bibliography (BNB), which comprises around 90 million 'triples'. This is too much information to put into Excel, but it isn't really big enough to count as 'Big Data'. That said, the same toolsets that have been developed to handle the very large datasets are very capable when manipulating these big-but-not-big datasets. I used Apache Pig on Hadoop to do the work. I will skip the details about how I did the number-crunching for now as it deserves its own post later!

I wanted to know, proportionally, how the number of works in a given subject varies across the years that the BNB covers. In this example, I created a list with rows containing the Dewey Decimal Classification (DDC) including the version of DDC, the year of publication and the number of works published in that year in that DDC range. I then created a webpage that uses a JavaScript library called D3 to draw and place circles on the page to represent this data. As this uses JavaScript and SVG, the visualisation should work in most normal browsers (i.e. not Internet Explorer).
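
The proportional view can be illustrated on a tiny scale. This is not the Pig job used for the real data; it is a minimal Python sketch, assuming rows of (DDC class, DDC edition, year, count) and using made-up numbers. The function name proportions_by_year is hypothetical.

```python
from collections import defaultdict

def proportions_by_year(rows):
    """Map (ddc_class, year) to that class's share of all works in the year."""
    per_class = defaultdict(int)
    per_year = defaultdict(int)
    for ddc_class, edition, year, count in rows:
        per_class[(ddc_class, year)] += count   # DDC editions are merged here
        per_year[year] += count
    return dict((key, float(total) / per_year[key[1]])
                for key, total in per_class.items())

# Made-up rows: (DDC class, DDC edition, publication year, number of works)
rows = [
    ("500", 19, 1980, 30),
    ("800", 19, 1980, 60),
    ("800", 18, 1980, 10),
    ("500", 20, 1990, 50),
    ("800", 20, 1990, 50),
]
shares = proportions_by_year(rows)
```

Each share could then drive the size of a circle, so that a subject's weight within a year is visible regardless of how many works were catalogued in total that year.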

The DDC17 to DDC23 checkboxes will allow you to hide or show the circles corresponding to the different versions of the Dewey Decimal Classification. I recommend zooming out (Ctrl -) to get a general feel of the entire distribution. Note that for some publication years, multiple versions of the DDC are used to categorise the works, due to the retrospective nature of cataloguing. There are works classified with Dewey that pre-date the entire system.

While this visualisation has done its job and given me and my colleagues a better idea of what is in there, it doesn't answer a specific question as many well-regarded visualisations or infographics do. However, I hope that the usefulness of making these exploratory images has been made clear!

Please comment below if you would like more detail on how I adapted some code from an existing D3 visualisation to create this one.

In summary, try to view your data in a number of ways as you go, rather than just as a means to publish it.

Posted by Ben O'Steen at 4:08 PM

Tags

BL Labs, Experiments, Tools