Why not cite data?

Rachael Kotarski, our Content Expert for scientific datasets, explains why citing data as well as the article is the way forward.

In a previous post, Lee-Ann Coleman looked at citations in science, asking what should be cited, and what a citation means. The answers to these questions are not necessarily simple, but one response we have been hearing (and that we support) is that data needs to be cited.

Citing data not only gives credit to those who created or gathered it, but can also give some kudos to the repository that looks after it. Despite the fact that data is also key to verifying and validating research, it is not yet standard practice to cite it when writing a paper. And even if it is cited, it is rarely done in a way that allows you to identify and access that data.

Citation should connect the literature to its data foundations. Image source: Shutterstock.

As part of the Opportunities for Data Exchange (ODE) project, we investigated data citation and the ways in which data centres, publishers, libraries and researchers can encourage better data citation.

What does ‘better data citation’ look like and how do we encourage it to happen? We examined three aspects of current practice in order to answer this question:

How is data cited?
What data is cited?
Where is data cited within the article?

How to cite
A data citation needs to contain enough information to find and verify the data that was used, as well as give credit to those who spent considerable time/money/effort generating or collecting the data. The DataCite recommended data citation is just one example of how to include details that support these aims (and it’s pretty simple!):

Creator (publication year): Title. Publisher. Identifier.
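
As a rough illustration, the pattern above can be assembled programmatically. The sketch below uses Python; the function name format_data_citation and its parameters are made up for this post and are not part of any DataCite tool, and the values passed in are placeholders rather than a real dataset.

```python
def format_data_citation(creators, year, title, publisher, identifier):
    """Build a citation string in the pattern:
    Creator (publication year): Title. Publisher. Identifier.
    """
    # Multiple creators are separated by semicolons.
    creator_text = "; ".join(creators)
    return f"{creator_text} ({year}): {title}. {publisher}. {identifier}"


# Placeholder values only: this is not a real dataset or DOI.
example = format_data_citation(
    creators=["Surname, Forename", "Other, Author"],
    year=2013,
    title="Example dataset title",
    publisher="Example Repository",
    identifier="doi: 10.1234/example",
)
print(example)
```

Keeping the parts as separate fields, rather than one free-text string, is what lets a repository or publisher reformat the same citation to suit a different house style.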

What to cite
Data are not necessarily fixed, stable or homogeneous objects, so citing them can be considerably more complicated than citing articles. For testing reproducibility, it is important that data are cited as they were used, regardless of subsequent changes to the data or subsets of it. Aspects such as the version used or the date downloaded should also be captured in the citation, where necessary. Linking users via an identifier (such as a DOI as used by DataCite) to the location of that exact version or subset of the data is also important. An example of citing a specific wave of data from GESIS demonstrates this:

Förster, Peter; Brähler, Elmar; Stöbel-Richter, Yve; Berth, Hendrik (2012): Saxonian longitudinal study – wave 24, 2010. GESIS Data Archive, Cologne. ZA6242 Data file version 1.0.0, doi: 10.4232/1.11322
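
The DOI at the end of that citation is what makes the data findable: adding it to the doi.org resolver address gives a link that should lead to the record for that version. A minimal sketch in Python, where doi_to_url is a hypothetical helper name chosen for this post:

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Turn a DOI, with or without a leading 'doi:' label, into a resolver link."""
    cleaned = doi.strip()
    # Drop the 'doi:' label that citations often put in front of the identifier.
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    # Percent-encode unusual characters but keep the slash between prefix and suffix.
    return DOI_RESOLVER + quote(cleaned, safe="/")


print(doi_to_url("doi: 10.4232/1.11322"))
```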

Where to cite in the article
Where you cite data in the article may depend on the form of the data being cited. For example, data obtained via colleagues but not widely available may be best mentioned in the acknowledgements, and data identified by accession numbers could be cited inline in the body of the article. But the interviewees who participated in the ODE study largely advocated citing datasets in the full reference list, to promote tracking and credit. To do this, data needs a full, stable citation, which in turn depends on reliable, long-term storage and management of the data. Of course, publisher requirements also play an important role. But that’s a post for another day!

These are the three ‘simple’ steps to better citation of data, but there are still cultural and behavioural barriers to sharing data. In the ODE report we concluded that the whole community (researchers, publishers, libraries and data centres) has a role in promoting and encouraging data citation.

The recent Out of Cite, Out of Mind report has since updated and greatly extended the ODE work, with an excellent set of first principles for data citation:

CODATA-ICSTI Task Group on Data Citation Standards and Practices (2013) Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data. Data Science Journal vol. 12 p. CIDCR1-CIDCR75 doi: 10.2481/dsj.OSOM13-043

I recommend it – and encourage anyone thinking about citing their data (or anyone else’s) to stop thinking and start doing it.
