14 September 2015
#CitizenHums and #MakingBigDataHuman
Last week I had the pleasure of attending the Citizen Humanities Comes of Age: Crowdsourcing for the Humanities in the 21st Century symposium. The two-day event was organised by King's College London’s Department of Digital Humanities (DDH) and Stanford University’s Center for Spatial and Textual Analysis (CESTA) with the aim of exploring “ways in which humanities and cultural heritage research is enriched through scholarly crowdsourcing”.
Through the magic of Twitter I was also able to (sort of) attend the Making Big Data Human conference (luckily my colleague Stella was actually there for the full experience; more on that in a future post). A really nice correspondence of ideas kept emerging between the two events, too many to capture here, but a couple of good examples came during a talk @jfwinters gave.
This statement resonated with a topic we were grappling with at the same time over at #CitizenHums. Would it be better practice for institutional crowdsourcing initiatives to be led by specific research questions rather than, say, by collections? While making as much of the British Library's collections accessible to as wide an audience as possible will remain a central driver, it’s a useful reminder that data collection decisions are always being made that will affect how researchers can reuse this content in future. To avoid creating potentially irreparable gaps in our datasets, we’d do well always to consider, even if informally, the wide variety of specific research uses to which the data we’re crowdsourcing might later be put.
This is true of any data collection exercise, and it resonated with an interesting conversation around what @Mia Ridge describes as the machine learning + crowdsourcing ecosystem. The planning of any crowdsourcing project must consider what can be done programmatically at each stage: pre-processing can make the tasks we ask of volunteers more efficient, while datasets and volunteer responses can be made open to help inform the development of machine learning driven solutions.
When we released 1 million images online from British Library collections, we hoped to surface tags from the Flickr crowd to make the images more discoverable, but we also hoped that the dataset might prove a good training ground for machine learning. Happily, both humans and machines have since taken up the challenge, and the results have been staggering in terms of making that massive collection more accessible.
Similarly, we are posting all the data and contributions from LibCrowds, our card catalogue conversion experiment, online in the hope that they might be used to train OCR for typeset and handwritten text, or to test other automatic ways in which complex but common library issues such as these can be resolved programmatically alongside human interventions.
All in all, a fantastic two days, and many thanks to @StuartDunnCeRch and all the organisers for putting it together!
Nora McGregor
Digital Curator, Digital Research Team
@ndalyrose