'Enabling Complex Analysis of Large Scale Digital Collections', a project funded by the Jisc Research Data Spring, empowers researchers to turn their research questions into computational queries and gathers social and technical requirements for infrastructures and services that allow computational exploration of big humanities data. Melissa Terras, Professor of Digital Humanities at UCL and Principal Investigator for the project, blogged in May about initial work to align our data - ALTO XML for 60k+ 17th, 18th, and 19th century books - with the performance characteristics of UCL's High Performance Computing Facilities. We have been learning a huge amount about the complexities of redeploying architectures designed to work with scientific data (massive yet structured) to the processing of humanities data (not massive, but unstructured). As part of this learning, in June we ran two workshops to which we invited a small, hand-picked group of researchers (from doctoral candidates to mid-career scholars) with queries they wanted to ask of the data that couldn't be satisfied by the sort of search and discovery orientated graphical user interfaces typically served up to them.
The researchers were clustered into three groups by their interests, with one group looking for words/strings over time, a second for words/strings in context, and a third for patterns relating to non-textual elements. Each group rotated between three workstations. At one workstation James Hetherington worked with them to realise their questions as queries that returned useful derived data. At a second they collaborated with Martin Zaltz Austwick to explore and experiment with ways in which they could represent the data visually. And at a third workstation David Beavan captured their thoughts on the process (such as, does the time spent waiting for results to return affect your interpretation of those results?), their sense of how computational queries could enrich their research, and their learning outcomes in terms of next steps.
Some very sensible best practices emerged from this work: the need to build multiple datasets (counts of books per year, words per year, pages per book, words per book) to normalise results against in different ways; the necessity of explaining and clearly documenting the decisions taken when processing the data (for example taking the earliest year found in the metadata for a given book as the publication year, even if we know that to be incorrect); and the value of having a fixed, definable chunk of data for researchers to work with and explain their results in relation to (and in turn for us, the risks associated with adding more data to the pot at a later date).
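To make the normalisation point concrete, here is a minimal sketch in Python. It is not the project's actual code: the function names and the sample figures are hypothetical, invented purely for illustration. It applies the documented "earliest year in the metadata" rule, builds a words-per-year dataset, and uses it to turn raw hit counts into a rate per 1000 words.

```python
from collections import Counter


def publication_year(candidate_years):
    """Documented processing decision: use the earliest year in the metadata."""
    return min(candidate_years)


def rate_per_thousand(hits_per_year, words_per_year):
    """Normalise raw hit counts against the total number of words per year."""
    return {
        year: 1000 * hits / words_per_year[year]
        for year, hits in hits_per_year.items()
        if words_per_year.get(year)  # skip years with no known word count
    }


# Made-up sample: (years found in metadata, words in book, hits for a term)
books = [
    ([1801, 1805], 50_000, 10),
    ([1801], 150_000, 10),
    ([1850, 1849], 400_000, 20),
]

words_per_year = Counter()
hits_per_year = Counter()
for years, words, hits in books:
    year = publication_year(years)
    words_per_year[year] += words
    hits_per_year[year] += hits

rates = rate_per_thousand(hits_per_year, words_per_year)
for year in sorted(rates):
    print(year, round(rates[year], 3))
```

In this toy sample both years have the same raw count of 20 hits, yet the normalised rates differ because the volume of text per year differs, which is why raw counts alone can mislead.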
Moreover, we have outputs on our GitHub repos that you can work with. We have queries (written in Python) that provide a framework from which you might search for words, phrases, or non-textual elements in this or comparable collections of digital text. We have data from searches across the whole collection on occurrences of disease related words, on the contexts in which librarians appear, and on the location and relative size in the page of every non-textual element (ergo, in most cases, illustrations). And we have visualisations, with associated code and IPython Notebooks, of these results. These include a graph of disease references over time per 1000 words (an interactive version is available if you download this html and open it in your browser); a point map charting the size over time of circa 1 million figures (as a percentage of the size of the page they appear in); and, moving our macroscope closer, graphs that show the size of images across the length of single books, that map the illustrative 'heartbeat' of those books, alongside a hacky workflow for getting to that point.
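As a rough illustration of the figure-size measure, the sketch below computes the area of each illustration as a percentage of its page. It is an assumption-laden example rather than the queries in our repos: it assumes ALTO files in which pages are `Page` elements and figures are `Illustration` elements, each carrying `WIDTH` and `HEIGHT` attributes, and the function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def figure_sizes(alto_xml):
    """Return each illustration's area as a percentage of its page area."""
    root = ET.fromstring(alto_xml)
    percentages = []
    for page in root.iter():
        if local_name(page.tag) != "Page":
            continue
        page_area = float(page.get("WIDTH")) * float(page.get("HEIGHT"))
        for element in page.iter():
            # Assumption: non-textual elements are tagged as Illustration.
            if local_name(element.tag) == "Illustration":
                area = float(element.get("WIDTH")) * float(element.get("HEIGHT"))
                percentages.append(100 * area / page_area)
    return percentages


# Hypothetical one-page document, for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <Illustration ID="I1" HPOS="500" VPOS="600" WIDTH="1000" HEIGHT="1500"/>
        <TextBlock ID="T1" HPOS="100" VPOS="2200" WIDTH="1800" HEIGHT="400"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

print(figure_sizes(SAMPLE))
```

Collecting these percentages page by page through a single book would give the kind of sequence that the 'heartbeat' graphs plot.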
The next step is to package these outputs up as 'recipe books' demonstrative of the steps needed to work with large and complex digital collections. We hope that the community - Systems Architects designing services, Research Software Engineers collaborating in humanities research, Humanists dabbling with data and code - can learn from these, build them into their workflows, and push forward our collective ability to make the best of these digital collections.
James Baker -- Curator, Digital Research -- @j_w_baker