The Sample Generator - Part 1: Origins
Posted on behalf of Pieter Francois.
Imagine being asked to describe the tool you always wanted when you were writing your PhD.
Imagine being asked, without having to worry too much about technical implementation, to make a case for a digital tool that would have:
- saved you enormous time
- allowed you to expand drastically the number of sources to study
- allowed you to ask new and more relevant research questions
Which digital tool would you choose?
What functionality seems crucial to you but is surprisingly lacking in your research area?
It was in this frame of mind that I decided to enter the 2013 British Library Labs competition with the idea of creating a Sample Generator, i.e. a tool able to give me an unbiased sample of texts based on my search criteria. Being one of the chosen winners gave me the opportunity to put together a small team of people from both within and outside the British Library to make it a reality.
When studying the world of nineteenth-century travel for my PhD I used the collections of the British Library extensively. Being able to look for relevant material in roughly 1.8 million records is a researcher's dream. It can also be a curse.
Snapshot of catalogue window with search word "travel"
How did the material I decided to look at fit into the overall holdings? Sure, my catalogue searches did produce plenty of relevant material, but how representative was that material of the overall nineteenth-century publication landscape? Even assuming that the British Library holdings are as good a proxy as any for the entire nineteenth-century British publication landscape, this is a very difficult question to answer. Historians and literary scholars have designed many clever methodological constructs to tackle such issues of representativity, to address potential biases in the sources they study and to deal with gaps in their source material. Yet very few attempts have been made to deal with these issues in a systematic way.
The ever-growing availability of large digital collections has changed the scale of this issue, but not its nature. For example, the wonderful digital '19th Century books' collection of the British Library gives you access to approximately fifty thousand books in digital form. For enthusiasts of text and sentiment mining, or for scholars interested in combining distant and close reading, its potential is phenomenal. However, the impressive size of the collection does not answer the crucial questions:
How do these books relate to the approximately 1.8 million nineteenth-century records the library holds?
How do the digitized books of the '19th Century books' collection fit into the overall nineteenth-century publication landscape?
Large numbers can create a false sense of completeness.
The Sample Generator gives researchers a way to understand more fully the relation between the sources they study and the overall holdings of the British Library. Whereas a traditional title word search in the British Library Integrated Catalogue generates an often long list of hits, the Sample Generator allows you, with a few additional clicks, to generate structured unbiased samples from this list. The key innovation is that these samples mimic the rise and fall in popularity of the searched terms over the nineteenth century, as found in the entire British Library holdings for this period.
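To make the idea of a sample that "mimics the rise and fall in popularity" more concrete, here is a minimal sketch of one way such a sample could be drawn: group the hit list by decade and take from each decade a share proportional to its size. This is an illustration under stated assumptions, not the Sample Generator's actual code. The function name `proportional_sample`, the `year` field, the choice of decades as strata and the largest-remainder rounding are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def proportional_sample(records, sample_size, seed=0):
    """Draw a sample whose per-decade counts follow the hit list.

    `records` is a list of dicts with a hypothetical "year" field.
    Decades are used as strata purely for illustration.
    """
    total = len(records)
    if not 0 <= sample_size <= total:
        raise ValueError("sample_size must be between 0 and len(records)")

    # Group the hits by decade (1805 -> 1800, 1855 -> 1850, ...).
    by_decade = defaultdict(list)
    for rec in records:
        by_decade[rec["year"] // 10 * 10].append(rec)

    # Largest-remainder allocation in integer arithmetic: each decade
    # first gets the floor of its proportional share ...
    alloc = {}
    remainders = {}
    for decade, items in by_decade.items():
        alloc[decade], remainders[decade] = divmod(len(items) * sample_size, total)

    # ... and the places still open go to the largest remainders.
    leftover = sample_size - sum(alloc.values())
    for decade in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        alloc[decade] += 1

    # Draw at random, without replacement, inside each decade.
    rng = random.Random(seed)
    sample = []
    for decade in sorted(by_decade):
        sample.extend(rng.sample(by_decade[decade], alloc[decade]))
    return sample


# Made-up hit list: 10 hits early in the century, 30 mid-century, 60 late.
hits = (
    [{"id": i, "year": 1805} for i in range(10)]
    + [{"id": 100 + i, "year": 1855} for i in range(30)]
    + [{"id": 200 + i, "year": 1895} for i in range(60)]
)
sample = proportional_sample(hits, sample_size=10, seed=1)
print(sorted(Counter(rec["year"] // 10 * 10 for rec in sample).items()))
# prints [(1800, 1), (1850, 3), (1890, 6)]
```

The ten sampled records keep the 1:3:6 shape of the full hit list, which is the property described above. Which records are picked inside each decade is left to chance.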
Depending on the amount of research time available, it is possible to change the sample size (or, for cross-validation purposes, to create several samples based on the same search criteria). Furthermore, the Sample Generator not only works with the catalogue data (metadata) of all nineteenth-century books the British Library holds, but also keeps a special focus on the metadata of the digital '19th Century books' collection (see http://britishlibrary19c.tumblr.com/ for a representative sample). It is therefore possible to create samples of only digitized texts. These samples can then be further queried using advanced text analysis and data mining tools (e.g. geo-tagging). As all the samples generated by the various searches will be stored with a unique URL, the samples become citable, can be shared with peers and can be used more easily in collaborative research.
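As a rough illustration of how several samples from the same search can be kept apart and cited, the sketch below draws one sample per random seed and derives a short, stable identifier from the search criteria, the sample size and the seed. How the Sample Generator actually builds its URLs is not described here, so `sample_id`, `draw_samples`, the criteria fields and the hashing scheme are all assumptions made for the example.

```python
import hashlib
import json
import random


def sample_id(criteria, sample_size, seed):
    """Return a short, stable identifier for one sample (hypothetical scheme)."""
    payload = json.dumps(
        {"criteria": criteria, "size": sample_size, "seed": seed},
        sort_keys=True,  # same inputs always serialise the same way
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def draw_samples(record_ids, sample_size, seeds):
    """Draw one reproducible sample of record ids per seed."""
    ordered = sorted(record_ids)  # fixed order, so a seed always gives the same draw
    return {seed: random.Random(seed).sample(ordered, sample_size) for seed in seeds}


# Made-up search criteria and record ids, for illustration only.
criteria = {"title_word": "travel", "digitized_only": True}
record_ids = range(1, 501)

samples = draw_samples(record_ids, sample_size=20, seeds=[1, 2, 3])
for seed, ids in samples.items():
    print(sample_id(criteria, 20, seed), ids[:5])
```

Because each sample is fully determined by its criteria, size and seed, a colleague holding the identifier's inputs can regenerate the same list of records, and comparing the samples from seeds 1, 2 and 3 gives a simple cross-check on any trend seen in only one of them.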
Although in this phase of the project the Sample Generator has only been tried out on the nineteenth-century holdings of the British Library and on the digital '19th Century books' collection, its application is nearly universal. The Sample Generator can be implemented on any catalogue (or even bibliography) and, if relevant, links can be made to one or more digital collections.
Adding such a link with a digital collection allows users to make a different type of claim. For example, the finding 'I observed trend X in digital collection Y' is replaced by the finding 'I observed trend X in a structured unbiased sample Y which is representative of the entire catalogue/bibliography Z'. This adds an important functionality to the increasing number of large digital collections, as it removes the inherent, yet often poorly documented, biases of the digitization process. It does introduce the curatorial biases of the much larger collections, but these are fortunately usually better documented and understood, as generations of scholars have come to terms with them.
Finally, the Sample Generator is a great hypothesis-testing tool. Its use allows scholars to cover a lot of ground fairly quickly by testing a range of hypotheses and ideas on relatively small samples. This allows for a creative, yet structured and well-documented, going back and forth between the conceptual drawing board and the data. Whereas such a structured dialogue is fundamental in the natural and social sciences, it is largely lacking in the humanities, where this dialogue between ideas and data has tended to happen in a more haphazard fashion.
The past four months were spent on turning this general idea (which at times felt overly ambitious) into reality. We faced several challenges; for example, the catalogue data was incomplete and inconsistent. Furthermore, I firmly believed that it was essential to accompany this tool with some case studies highlighting its transformative potential. Given the amount of labour and the range of skill sets necessary to complete both tasks, the project had to be team based. Without the time and intellectual contributions of Mahendra Mahey, Ben O'Steen, Ed Turner and Justin Lane, the Sample Generator would still simply be the digital tool I always wanted to have.
Pieter Francois is one of the winners of the 2013 British Library Labs competition. He works at the Institute of Cognitive and Evolutionary Anthropology, University of Oxford, where he specializes in longitudinal analysis of archaeological and historical data.
The next blogposts of this short series, written by various members of the team, will focus on how to use the Sample Generator, on explaining the technical nuts and bolts at the back end of the tool, and on recounting the experiences of collecting the necessary data to test drive the tool.