
jeromyanglim commented on August 11, 2024

Journals make it difficult

  • Some journals do not accept LaTeX or PDF (i.e., the typical output formats of reproducible journal article submissions)
  • Some journals have very specific style and formatting requirements that are difficult to comply with using relatively automated LaTeX-style approaches.
  • The infrastructure for archiving and sharing such reproducible research documents is not in place, or it is not obvious where to find it or how to use it.

Lack of knowledge of how to perform reproducible research

  • Researchers may be trained in other workflows that fall short of complete reproducibility. (Reproducibility is a matter of degree.) For instance, by my observation, many researchers in psychology and the social sciences adopt a workflow that combines GUI-centric data analysis software such as SPSS with GUI-based word processors such as MS Word.
  • Related to this problem is a lack of examples and training material on how to implement reproducible research in a discipline-specific fashion.

It creates more work

  • It is a lot of work to polish a journal article, and opening up the internal workings of a journal article may require those workings to have the same degree of polish.
  • Also, more broadly, complete automation of steps like formatting of tables, numbers, and graphs can be quite time consuming, particularly while you are still learning reproducible research tools (a minimal sketch of what such automation looks like follows this list).
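
To make the automation concrete, here is a minimal R sketch of the kind of scripted table and figure generation that an R Markdown / knitr workflow embeds in a document. The data file and column names are hypothetical placeholders; the functions used (read.csv, knitr::kable, ggplot2) are standard.

```r
# Packages for formatted tables and scriptable figures
library(knitr)    # kable(): renders a data frame as a formatted table
library(ggplot2)  # figures regenerate automatically when the data change

# Hypothetical raw data file; substitute your own
dat <- read.csv("data/raw_responses.csv")

# Summary table built by code rather than copy-and-paste
summary_tab <- aggregate(score ~ condition, data = dat,
                         FUN = function(x) round(mean(x), 2))
kable(summary_tab, col.names = c("Condition", "Mean score"))

# A figure produced by the same script, so it always matches the data
ggplot(dat, aes(x = condition, y = score)) +
  geom_boxplot() +
  labs(x = "Condition", y = "Score")
```

Because every number, table, and figure is generated by code, rerunning the script after a data correction updates the whole document; the up-front cost is learning the tooling.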

There are no incentives

  • There are often minimal incentives to share reproducible research.

There are a few exceptions:

  • Some grants encourage or even notionally require data sharing.
  • A couple of journals encourage sharing of reproducible research repositories.
  • You could argue that other researchers are more likely to build on, and cite, research that allows them to examine the raw data and the analysis steps.

I also feel that it is not enough to simply share a repository; it is important to make the repository user-friendly.
User-friendly could mean:

  • Sharing the repository in a way that makes it easy to obtain (e.g., findable from a Google search or via a clear link from the journal article; ideally a single click to obtain the repository; no paywall; no log-in required)
  • Allowing web navigation of the repository (e.g., as on GitHub)
  • Providing documentation of the elements of the repository and how it can be used
  • Providing adequate information on permitted re-uses (i.e., licensing)
  • Making it clear what software needs to be installed to run the code (a minimal install script along these lines is sketched after this list)
  • Using open source, cross-platform, and popular software is also an advantage
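
On the software-requirements point, a short script at the top of the repository can both install dependencies and record the versions used. This is only a sketch; the package names are placeholders, but the functions are base R.

```r
# install_dependencies.R -- run once before reproducing the analysis.
# The package list is illustrative; list whatever your code actually uses.
required <- c("knitr", "ggplot2")
missing  <- setdiff(required, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)

# Record the exact software versions the analysis was run under,
# so readers can diagnose version-related discrepancies.
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
```

Committing the resulting sessionInfo.txt alongside the code gives later readers a fighting chance of rebuilding the original environment.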

Deprivation of future papers

  • Some datasets yield multiple papers. Researchers may fear that publishing the raw data will allow another researcher to take the data and publish a paper based on it.
  • Even if this is unlikely, and even if the researcher has no specific plans to publish anything more, it may still be a concern.

Fear of making it too easy for the competition

I see science as a collaborative process. One of the major benefits of reproducible research is that it helps others see exactly how to analyse research data of a given sort.

However, it is possible that some researchers might see this as a negative thing as they seek to be a dominant figure in a particular area.

Some analysis software makes automation difficult or impossible

  • Some data analysis software does not have a scripting language that would permit incorporation into an automated reproducible workflow.
  • Some data analysis software does have a scripting language, but it differs substantially from the primary interface, and thus requires substantial investment in order to learn it.
  • Some data analysis software provides a poor interface for automating extraction of content.

Naturally, this raises the question of why anyone would use "un-automatable" software. However,

  • The researcher may be trained in the software that can't be automated.
  • Such software may have features unavailable in software that can be automated.
  • Software that can't be automated may be more user friendly for the researcher.

Fear of a mistake being publicly identified

  • In a reproducible data analysis, not only is the finished product on display, but so is much of the inner workings. If an error did occur in the statistical analysis, this can much more readily be identified by others than if only summary statistics are provided.
  • There have been various high-profile cases of fraudulent data analysis. Some of these involved creating data that did not exist. Another involved selectively deleting data from groups until statistical significance was achieved. In both kinds of case, detection would have been relatively trivial had the researchers supplied a reproducible research document with raw data. While this is a great argument for requiring researchers to supply reproducible research documents, it is also a reason why researchers might not want to provide them.

There is a wide spectrum of data analytic misconduct. If we take a legal perspective, we can think of different kinds of intentions (intentional, reckless, negligent) and consequences (how consequential was it to the paper's findings, etc.).

I have heard advocates of open source software state that one reason open source software is better than proprietary software is that the code is on display to the community. A similar process could plausibly operate in a reproducible data analysis context: researchers would be more inclined to adopt workflows and procedures that keep their analyses clean and tidy, and more likely to incorporate quality control procedures that check for possible errors (of the kind sketched below).
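
As an example of such in-line quality control, an analysis script can assert properties the data must satisfy before any statistics are computed. A minimal sketch, with hypothetical file and column names:

```r
# Sanity checks that halt the analysis early if the data look wrong.
dat <- read.csv("data/raw_responses.csv")  # hypothetical raw data file

stopifnot(
  nrow(dat) > 0,                            # data actually loaded
  !any(is.na(dat$score)),                   # no missing outcome values
  all(dat$score >= 0 & dat$score <= 100),   # scores within the valid range
  all(table(dat$condition) >= 2)            # every condition has cases
)
```

When such checks are part of a shared, reproducible script, readers can see not only the analysis but also the error-detection discipline behind it.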

It would be interesting to see how journals deal with the potential increase in errata that might emerge. At present, while journals permit the incorporation of errata, publishing one generally seems to be a fairly big deal. In contrast, software is often framed as a work under development, where bugs are identified and gradually fixed. Admittedly, in some respects journal articles are more static in their scope and application than software projects are.

Ethical concerns related to data sharing

  • This mainly applies to sharing raw data. Maintaining absolute anonymity can be challenging: even if obvious identifying information, such as names, phone numbers, email addresses, and home addresses, is removed, there are many ways that data can be de-anonymised.
  • Even in situations where it seems likely that the data cannot easily be de-anonymised, and even if the data itself is not sensitive, researchers may still be worried about the faint possibility that it might be.

Compliance with ethics committees

  • In theory, satisfying ethical concerns and complying with ethics committees would be the same thing. In practice, however, there may be situations where it is ethically acceptable to share anonymised raw data, yet it is more work to justify this to an ethics committee. Research might be delayed if the committee questions the data sharing policy, so there is an incentive to take the easy option of stricter control over data sharing.
  • In other cases, ethics applications may be completed quickly without consideration of data sharing aims. It may then be very difficult to share the data later, once participants have completed consent forms whose provisions limit data sharing.

Limitations imposed by collaboration

  • Even if some researchers understand reproducible data analysis, if a collaborator does not, it may be deemed easier to fall back on the common language of word processors.

Copyright restrictions

In some instances, sharing various algorithms or metadata may be prohibited by copyright restrictions.

  • Item text for many psychological tests is copyrighted, preventing the sharing of this information.
  • In some cases the copyright status of scientific content may be ambiguous, as often seems to be the case with psychological tests that have been published in scientific journals. Thus, there may be a decision to err on the side of caution.

from rmarkdown-rmeetup-2012.
