datalad / datalad-paper-joss

Repository for JOSS paper on DataLad

License: MIT License

Shell 6.76% Makefile 1.28% TeX 91.97%

datalad-paper-joss's Introduction

DataLad JOSS paper repository

It was prepared as a separate repository so that the main repository does not have to carry the burden of figures etc. Later it might be moved to, or included (as a submodule) in, the main repository.

datalad-paper-joss's Issues

Primary demo

At the moment there is just one:

datalad search haxby

While it is iconic, it is not connected to the added value of DataLad as presented in the paper (neither to the key features in their current form, nor to those proposed in #64).

If we keep search as the demo, I propose to switch to another search term that is self-explanatory. I tested a few cases. I would be fine with any of them:

movie fmri
one-back task
diffusion mri
openneuro
emotion face

However, I think a better use case would be datalad run -- as it is (or could be) immediately connected with the presented key features.
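
For illustration, a `datalad run` demo could look roughly like the sketch below. It only assembles the command line and does not execute anything by default. The script name, file paths, and commit message are hypothetical placeholders that do not come from the manuscript; `-m`, `--input`, and `--output` are existing `datalad run` options.

```python
import shlex
import shutil
import subprocess


def build_run_command(command, inputs=(), outputs=(), message=None):
    """Assemble an argv list for `datalad run` (nothing is executed here)."""
    argv = ["datalad", "run"]
    if message:
        argv += ["-m", message]
    for path in inputs:
        argv += ["--input", path]
    for path in outputs:
        argv += ["--output", path]
    # The command to be recorded is passed as one trailing argument.
    argv.append(command)
    return argv


def execute(argv):
    """Run the assembled command; needs DataLad and a dataset as the cwd."""
    if shutil.which(argv[0]) is None:
        raise RuntimeError("datalad executable not found")
    return subprocess.run(argv, check=True)


# Hypothetical demo: every path and the script name are placeholders.
demo = build_run_command(
    "python code/plot.py data/raw.csv figures/plot.png",
    inputs=["data/raw.csv"],
    outputs=["figures/plot.png"],
    message="Regenerate figure from raw data",
)
print(shlex.join(demo))
```

Calling `execute(demo)` inside a dataset would then record the command together with its declared inputs and outputs, which is the link to the reproducible-execution feature that a search-based demo lacks.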

Substantial scholarly effort: prior citations

Google survey done on 20210319, stopped on page 6.

"academic papers" by others:

Li, Q.; Xue, R. The Pipeline of Processing fMRI Data with Python Based on the Ecosystem NeuroDebian. Preprints 2019, 2019040027 (doi: https://doi.org/10.20944/preprints201904.0027.v2).
Far, M. S., Stolz, M., Fischer, J. M., Eickhoff, S. B., & Dukart, J. (2021). JuTrack: A Digital Biomarker Platform for Remote Monitoring in Neuropsychiatric and Psychiatric Diseases. arXiv preprint arXiv:2101.10091. https://arxiv.org/abs/2101.10091
Langer, A., & Hai, D. V. N. Comparison of existing decentralized RDM solutions. https://vsr.informatik.tu-chemnitz.de/projects/2019/solidrdp/resources/ComparisonOfExingRdmSolutions.pdf
Ioanas, H. I., Saab, B., & Rudin, M. (2017). Gentoo Linux for Neuroscience-a replicable, flexible, scalable, rolling-release environment that provides direct access to development software. Research Ideas and Outcomes, 3, e12095. https://doi.org/10.3897/rio.3.e12095
Manera, A. L., Dadar, M., Fonov, V., & Collins, D. L. (2020). CerebrA, registration and manual label correction of Mindboggle-101 atlas for MNI-ICBM152 template. Scientific Data, 7(1), 1-9. https://doi.org/10.1038/s41597-020-0557-9
Esteban, O., Ciric, R., Finc, K., Blair, R. W., Markiewicz, C. J., Moodie, C. A., ... & Gorgolewski, K. J. (2020). Analysis of task-based functional MRI data preprocessed with fMRIPrep. Nature protocols, 1-17. https://doi.org/10.1038/s41596-020-0327-3
Esteban, O., Blair, R. W., Nielson, D. M., Varada, J. C., Marrett, S., Thomas, A. G., ... & Gorgolewski, K. J. (2019). Crowdsourced MRI quality metrics and expert quality annotations for training of humans and machines. Scientific data, 6(1), 1-7. https://doi.org/10.1038/s41597-019-0035-4
Visconti di Oleggio Castello, M., Chauhan, V., Jiahui, G. et al. An fMRI dataset in response to “The Grand Budapest Hotel”, a socially-rich, naturalistic movie. Sci Data 7, 383 (2020). https://doi.org/10.1038/s41597-020-00735-4
Arco, J. E., González-García, C., Díaz-Gutiérrez, P., Ramírez, J., & Ruz, M. (2018). Influence of activation pattern estimates and statistical significance tests in fMRI decoding analysis. Journal of neuroscience methods, 308, 248-260. https://doi.org/10.1016/j.jneumeth.2018.06.017
Teijeiro, T. (2020). Recommendations for the MIP Technical Development During HBP SGA3. https://infoscience.epfl.ch/record/276489
Horien, C., Noble, S., Greene, A. S., Lee, K., Barron, D. S., Gao, S., … Scheinost, D. (2020). A hitchhiker’s guide to working with large, open-source neuroimaging datasets. Nature Human Behaviour. doi:https://dx.doi.org/10.1038/s41562-020-01005-4
Mak, M., Ren, L., Kong, L., & Wong, I. (2020). Validity of a physical activity tracker for heart rate measurement during aerobic exercise in people with Parkinson's disease. Parkinsonism & Related Disorders, 79, e42-e43. https://doi.org/10.1016/j.parkreldis.2020.06.172
Keshavan, A., & Poline, J. B. (2019). From the wet lab to the web lab: a paradigm shift in brain imaging research. Frontiers in neuroinformatics, 13, 3. https://doi.org/10.3389/fninf.2019.00003
Das, S., Lecours Boucher, X., Rogers, C., Makowski, C., Chouinard-Decorte, F., Oros Klein, K., ... & Evans, A. C. (2018). Integration of “omics” data and phenotypic data within a unified extensible multimodal framework. Frontiers in neuroinformatics, 12, 91. https://doi.org/10.3389/fninf.2018.00091
Wiener, M., Sommer, F. T., Ives, Z. G., Poldrack, R. A., & Litt, B. (2016). Enabling an open data ecosystem for the neurosciences. Neuron, 92(3), 617-621. https://doi.org/10.1016/j.neuron.2016.10.037
Rakhimov, O., & Umarov, F. (2020). Clinical assessment of ongoing constipation in patients with Parkinson's disease and solution with alternative approach. Parkinsonism & Related Disorders, 79, e41-e42. https://doi.org/10.1016/j.parkreldis.2020.06.169
Huckins JF, daSilva AW, Wang R, Wang W, Hedlund EL, Murphy EI, Lopez RB, Rogers C, Holtzheimer PE, Kelley WM, Heatherton TF, Wagner DD, Haxby JV and Campbell AT (2019) Fusing Mobile Phone Sensing and Brain Imaging to Assess Depression in College Students. Front. Neurosci. 13:248. doi: https://doi.org/10.3389/fnins.2019.00248
Wittkuhn, L., Schuck, N.W. Dynamics of fMRI patterns reflect sub-second activation sequences and reveal replay in human visual cortex. Nat Commun 12, 1795 (2021). https://doi.org/10.1038/s41467-021-21970-2

"academic papers" by us:

DuPre, E., Hanke, M., & Poline, J. B. (2020). Nature abhors a paywall: How open science can realize the potential of naturalistic stimuli. Neuroimage, 216, 116330. https://doi.org/10.1016/j.neuroimage.2019.116330
Yarkoni, T., Markiewicz, C. J., de la Vega, A., Gorgolewski, K. J., Salo, T., Halchenko, Y. O., ... & Blair, R. (2019). PyBIDS: Python tools for BIDS datasets. Journal of open source software, 4(40). https://dx.doi.org/10.21105%2Fjoss.01294
Nastase, S. A., Halchenko, Y. O., Connolly, A. C., Gobbini, M. I., & Haxby, J. V. (2018). Neural responses to naturalistic clips of behaving animals in two different task contexts. Frontiers in neuroscience, 12, 316. https://doi.org/10.3389/fnins.2018.00316
Hanke, M., Pestilli, F., Wagner, A. S., Markiewicz, C. J., Poline, J. B., & Halchenko, Y. O. (2021). In defense of decentralized research data management. Neuroforum, 1. https://doi.org/10.1515/nf-2020-0037
Cheng, C. P., & Halchenko, Y. O. (2020). A new virtue of phantom MRI data: explaining variance in human participant data. F1000Research, 9. https://dx.doi.org/10.12688%2Ff1000research.24544.1
Bannier, E., Barker, G., Borghesani, V., Broeckx, N., Clement, P., Emblem, K. E., ... & Zhu, H. (2021). The Open Brain Consent: Informing research participants and obtaining consent to share brain imaging data. https://doi.org/10.1002/hbm.25351
Ghosh, S. S., Poline, J. B., Keator, D. B., Halchenko, Y. O., Thomas, A. G., Kessler, D. A., & Kennedy, D. N. (2017). A very simple, re-executable neuroimaging publication. F1000Research, 6. https://dx.doi.org/10.12688%2Ff1000research.10783.2
Häusler, C. O. & Hanke, M. (2021). A studyforrest extension, an annotation of spoken language in the German dubbed movie “Forrest Gump” and its audio-description. F1000Research, 10:54. https://doi.org/10.12688/f1000research.27621.1
Wachtler, T., Bauer, P., Denker, M., Grün, S., Hanke, M., Klein, J., Oeltze-Jafra, S., Ritter, P., Rotter, S., Scherberger, H., Stein, A. & Witte, O.W. (2021). NFDI-Neuro: Building a community for neuroscience research data management in Germany. Neuroforum, 27(1). https://doi.org/10.1515/nf-2020-0036
Dar, A. H., Wagner, A. S. & Hanke, M. (2020). REMoDNaV: Robust Eye-Movement Classification for Dynamic Stimulation. Behavior Research Methods. https://doi.org/10.3758/s13428-020-01428-x

Substantial scholarly effort: age

Development started 8 years ago

commit 8a6d1c60e7fdf41943360f5ae0c8df0ce682c677
Author: Yaroslav Halchenko <[email protected]>
Date:   Tue May 21 16:50:21 2013 -0400

    original pieces for gitweb

Since then, master has accumulated ~13500 commits. There are 2500 finished PRs, and 700 open and 2300 closed issues.

Invitations (via github) to co-author a DataLad paper for JOSS (last call)

Dear Contributors to DataLad:

We have tried to email you but failed for one reason or another.

Please see and follow the instructions below, which we emailed previously. We currently aim to submit next Tue (Apr 20th):

Thank you for your previous contribution to DataLad (https://github.com/datalad/datalad), by code, issues, or feedback.

We are working on a manuscript to be submitted to the Journal of Open Source Software (https://joss.theoj.org) to describe DataLad, and would like to acknowledge your contribution(s). We are inviting you to co-author the paper, or, alternatively, give us permission to thank you in the Acknowledgements section of the paper.

If you would like to co-author the paper, please review the authorship criteria of JOSS at https://joss.readthedocs.io/en/latest/submitting.html#authorship and pay particular attention to potential implications of the "co-authors agree to be accountable for all aspects of the work" rule. If you personally consider a co-authorship appropriate under these conditions, please

submit a Pull Request with changes to http://github.com/datalad/datalad-paper-joss/blob/master/paper.md in which you
- uncomment (remove leading #) your record
- add your details (name, ORCID, affiliation) and/or adjust your name (if needed).
vote on your choice of title on #9

If you would like to just be acknowledged, please either reply to this email stating that, or submit a PR with your name appropriately listed in the Acknowledgements section of https://github.com/datalad/datalad-paper-joss/blob/master/paper.md and remove the pre-created record with your name from the header.

If you would like to be neither listed among the co-authors nor acknowledged, we would appreciate it if you replied and let us know.

We are planning to submit the manuscript next week (on/after April 12), and would appreciate it if you acted on this invitation by the end of this week.
Co-author records that remain commented out will be removed before submission.

Thank you again for your contribution to DataLad!

Sincerely,
DataLad Team

Length of manuscript

In its present state, the manuscript is approaching 2x the suggested maximum length (the suggested range is 250-1000 words).

Is this a concern?
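
A rough word count is enough to track this. The sketch below assumes that paper.md starts with a YAML front matter block delimited by `---` lines and that whitespace-separated tokens are an acceptable approximation of "words". How JOSS itself counts is not stated here, so treat the result as an estimate. The function names are made up for this example.

```python
import re


def count_body_words(markdown_text):
    """Count whitespace-separated tokens, skipping a leading YAML front matter."""
    lines = markdown_text.splitlines()
    if lines and lines[0].strip() == "---":
        # Drop everything up to and including the closing '---' line.
        for idx in range(1, len(lines)):
            if lines[idx].strip() == "---":
                lines = lines[idx + 1:]
                break
    return len(re.findall(r"\S+", "\n".join(lines)))


def length_status(n_words, lower=250, upper=1000):
    """Compare a word count against the suggested 250-1000 word range."""
    if n_words < lower:
        return "below"
    if n_words > upper:
        return "above"
    return "within"


sample = "---\ntitle: x\n---\n\n# Summary\n\nDataLad manages data.\n"
n = count_body_words(sample)
print(n, length_status(n))
```

For the real manuscript, the text would be read from paper.md, e.g. with `pathlib.Path("paper.md").read_text()`.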

Select title

Not mandated, but ": " is a common title pattern in the journal.

Candidates from various sources

👍 DataLad: data management system for discovery, management, and publication of digital objects of science
🚀 DataLad: perpetual decentralized management of digital objects for collaborative open science
😄 DataLad: decentralized management of digital objects for open science
❤️ DataLad: decentralized Research Data Management
👀 DataLad: distributed system for joint management of code, data, and computational environments
🎉 DataLad: distributed system for joint management of code and data
👎 DataLad: distributed system for joint management of code, data, and their relationship

Clarification: 👎 is the refinement of 🎉, and votes for 🎉 will be added to 👎 (unless double-voted).
I encourage those who voted for 🎉 to revote for 👎 if they agree. If you don't agree, please comment to support your choice of 🎉 over 👎.
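
The merging rule above can be sketched as follows. Voter names and option keys are hypothetical. Each option maps to the set of people who voted for it, so a person who voted for both options is counted only once after the merge.

```python
def merge_votes(votes, source, target):
    """Fold the voters of `source` into `target`, counting each voter once."""
    merged = {option: set(voters) for option, voters in votes.items()}
    # Set union drops double votes: a voter present in both sets counts once.
    merged[target] = merged.get(target, set()) | merged.pop(source, set())
    return {option: len(voters) for option, voters in merged.items()}


# Hypothetical ballots: "tada" stands for the original title, and
# "thumbs_down" for its refinement.
ballots = {
    "tada": {"alice", "bob"},
    "thumbs_down": {"bob", "carol"},
    "rocket": {"dave"},
}
print(merge_votes(ballots, source="tada", target="thumbs_down"))
```

Using sets rather than plain counts is what makes the "unless double-voted" condition hold without any extra bookkeeping.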

Figure selection

According to https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements:

Your paper (paper.md and BibTeX files, plus any figures) must be hosted in a Git-based repository together with your software (although they may be in a short-lived branch which is never merged with the default).

that means we do not have to suffer from a complex, heavy image file in the main repo. Hence I propose to not go for a figure that is not minimized for size. Moreover, it should also not be imaging specific, but still sciency. I propose this one as a starting point:

Scope limit to datalad-core?

Intuitively, I'd say we limit the scope to datalad-core, leaving space for other focused publications. The notion of extensions is already in the manuscript. I just want to make this explicit.

Paper structure/content

The structure is mandated:

https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain

A list of the authors of the software and their affiliations, using the correct format (see the example below).
A summary describing the high-level functionality and purpose of the software for a diverse, non-specialist audience.
A Statement of Need section that clearly illustrates the research purpose of the software.
A list of key references, including to other software addressing related needs. Note that the references should include full names of venues, e.g., journals and conferences, not abbreviations only understood in the context of a specific discipline.
Mention (if applicable) a representative set of past or ongoing research projects using the software and recent scholarly publications enabled by it.
Acknowledgement of any financial support.

I think we should follow this (in particular the order, which is presently not the case).

I would also propose the following:

make it very brief in general: mention things like datasets.d.o and extensions, but not details -- we should reasonably be able to select the JOSS publication as a citation for a broad range of things, but we should not publish some half-baked intro now that makes subsequent specialized publications more difficult
exceptions to that (that I can see right now) would be aspects related to the development and adoption process, such as
- extensive tests, not just unit tests, in each PR
- extensibility to be able to add features quickly, without having to pay much of a cost in -core, and also to not force independent contributors into a specific process and speed
- how datalad is being used in existing projects by others (rather than speculating what it could be used for)

Revise DataLad additions over git/git-annex section

This section is arguably the key section of "Statement of need" and in turn the entire paper. Currently it puts forth 5 reasons:

1. They are generic and lack support for domain-specific solutions
2. They require a layer above to establish a distribution
3. Modularization is needed to scale
4. Annotation of changes is not "re-executable"
5. Git and git-annex do not necessarily facilitate the best scientific workflow

I would propose to trim the list, and to straighten the argument:

A. Seamless nesting of independent modular units (with emphasis on "seamless", which is what DataLad adds to Git's submodules)
B. Reproducible execution (or capture of actionable provenance)
C. Interoperability adapters and interfaces (more of a collection of the former, rather than a definition of the latter)

I think 1-5 are outcomes that can be achieved with A-C, rather than the technological contribution.

The current text seems to be easily sortable under A, B, and C to illustrate more or less intuitive use cases, why one would want such features.

The description of B could be extended to reach beyond provenance capture and hint at a wider metadata support.

complete and harmonize affiliations

Maybe we should/could just use the institution (like I did with Dartmouth College) without specific departments/institutes within? That would collapse many of the McGill affiliations. (I didn't check yet what the JOSS requirement is.)
add town, state, United States for those that lack them

Authorship: opt-in

The rules are (from https://joss.readthedocs.io/en/latest/submitting.html#authorship):

Purely financial (such as being named on an award) and organizational (such as general supervision of a research group) contributions are not considered sufficient for co-authorship of JOSS submissions, but active project direction and other forms of non-code contributions are. The authors themselves assume responsibility for deciding who should be credited with co-authorship, and co-authors must always agree to be listed. In addition, co-authors agree to be accountable for all aspects of the work, and to notify JOSS if any retraction or correction of mistakes are needed after publication.

If we agree on the scope of the paper being datalad-core #1 this makes 34 contributors to the code on its github repo obvious co-author candidates. I can only think of one person with "directional" influence that is not on this list. My proposal would be to approach them, asking whether they would want to participate in the drafting of the manuscript, and thereby become co-authors under the terms quoted above.

I particularly do not mind a long author list. And I do not see the need for a "minimum code contribution" or anything like that. All these people either are or were active contributors or early adopters that registered that fact with a contribution of some kind. I also do not mind extending that list -- I just thought it would be a good starting point.

edit by @yarikoptic : a list of contributors prepared / acted on in a separate repo (should have just kept everything here) -- https://github.com/datalad/datalad-git-bug-dumps (json files with emails are under annex and not shared ATM)

Stats on datasets.d.o

A testament of this is datasets.datalad.org, created as the project’s initial goal to provide a data distribution with unified access to already available public data archives in neuroscience, such as crcns.org and openfmri.org. It is curated by the DataLad team, and provides, at the time of publication, streamlined access to over 250 TBs of data across a wide range of projects and archives in a fully modularized way.

The paper has the above, which is critical and the key evidence that the beast works. However, rather than "250TB" (where we claim that git-annex handles any size already on its own), we should add the number of datasets, and number of dataset sources/portals as indicators of how much versatility is captured by DataLad (not just git-annex) in this single collection.

Add section/paragraph on design principles

In technical talks I tend to include the following list of design principles for DataLad:

There are only two recognized entities: datasets and files
A dataset is a Git repository with an optional annex
Minimization of custom procedures and data structures: Users must not lose data or data access if DataLad were to vanish
Complete decentralization, no required central server or service.
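
To make the second principle concrete, here is a minimal sketch that classifies a directory purely from its on-disk layout. It assumes that a plain `.git` directory marks a Git repository and that a `.git/annex` directory marks an initialized annex. It does not reproduce how DataLad itself detects datasets, and repositories whose `.git` is a file rather than a directory are not handled. The function name and labels are made up for this example.

```python
from pathlib import Path
import tempfile


def classify(path):
    """Return 'dataset+annex', 'dataset', or 'not a dataset' for a directory.

    Assumption: `.git/` marks a Git repository and `.git/annex/` marks an
    initialized annex. This is an illustration, not DataLad's own logic.
    """
    git_dir = Path(path) / ".git"
    if not git_dir.is_dir():
        return "not a dataset"
    if (git_dir / "annex").is_dir():
        return "dataset+annex"
    return "dataset"


# Build two fake layouts in a temporary directory and classify them.
with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "plain"
    annexed = Path(tmp) / "annexed"
    (plain / ".git").mkdir(parents=True)
    (annexed / ".git" / "annex").mkdir(parents=True)
    for candidate in (Path(tmp), plain, annexed):
        print(candidate.name, "->", classify(candidate))
```

The point of the sketch is that nothing DataLad-specific is needed to tell what a dataset is, which is the essence of the third principle as well.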

I believe that in their simplicity they can be instrumental in communicating the underlying mindset. Some aspects are already included in the text, but it still makes sense to me to simply present them in this refined form -- possibly right at the start of Overview of DataLad

Add or purge metadata?

I still think that we should bring in a 4th point which would touch on metadata support in DataLad:

we do have it, even if it is being reworked
- we do point to other extensions, and I think it would only be beneficial to point to datalad-metalad as "the future"
it would add value for "DataLad" as addition to git/git-annex
- flexible metadata extraction and aggregation is a very unique feature
- it could interest some metadata-savvy folks struggling with their data/metadata archives
we did get questions/queries about metadata on issues and neurostars, so some users are at least interested in it or do use it
Figure 1 mentions it ("and metadata", "metadata aggregation") -- if we decide not to bother, we should purge it from the figure

WDYT?

It is annoying to get the draft.pdf from github actions -- use magic!

Will make an attempt

Complete acknowledgements

Before this issue can be closed, all eventual authors must have signed off here.

#57
TVB-Cloud
HBP
...

I would prefer to structure the acknowledgements with the following order:

Thx to people of particular importance (possibly indicating their role)
All contributors (list)
Grant numbers (list)

Substantial scholarly effort: lines of code

sloccount on master

Totals grouped by language (dominant language first):
python:       63823 (70.02%)
javascript:     25500 (27.98%)
sh:            1826 (2.00%)

Total Physical Source Lines of Code (SLOC)                = 91,149
Development Effort Estimate, Person-Years (Person-Months) = 22.84 (274.13)
 (Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months)                         = 1.76 (21.10)
 (Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule)  = 12.99
Total Estimated Cost to Develop                           = $ 3,085,895
 (average salary = $56,286/year, overhead = 2.40).

we should ignore the non-Python lines, which leaves: ~64k lines
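
The effect of dropping the non-Python lines on the estimate can be checked by re-applying the Basic COCOMO formulas quoted in the sloccount output. The sketch below uses only the constants printed above (2.4, 1.05, 2.5, 0.38, salary $56,286/year, overhead 2.40); the function name is made up for this example.

```python
def basic_cocomo(sloc, salary=56286.0, overhead=2.40):
    """Basic COCOMO estimates, using the formulas from the sloccount output."""
    ksloc = sloc / 1000.0
    person_months = 2.4 * ksloc ** 1.05           # development effort
    schedule_months = 2.5 * person_months ** 0.38  # schedule estimate
    return {
        "person_months": person_months,
        "person_years": person_months / 12,
        "schedule_months": schedule_months,
        "developers": person_months / schedule_months,
        "cost": person_months / 12 * salary * overhead,
    }


all_code = basic_cocomo(91149)     # total SLOC reported above
python_only = basic_cocomo(63823)  # Python lines only

for label, est in (("all languages", all_code), ("python only", python_only)):
    print(
        f"{label}: {est['person_months']:.2f} person-months, "
        f"{est['schedule_months']:.2f} months, "
        f"{est['developers']:.2f} developers, ${est['cost']:,.0f}"
    )
```

For the full 91,149 lines this matches the figures reported by sloccount up to rounding. The Python-only row is the number that would go into the paper if the non-Python lines are excluded.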

Neuroscience focus yes|no?

The present abstract states:

Born from the idea to provide a unified data distribution for neuroscience

While I believe this is not meant to be a scope limit, it nevertheless gives me that impression.

Q: Do we agree that there should not be the notion of "datalad is RDM for neuroscience"?

NeuroHub or CONP funding?

I don't know the details of the financing/accounting, so I'm not sure if they've directly funded development, but I think I've been to events/hackathons/etc. where development on DataLad was funded by:

NeuroHub
CONP

Though, to be honest, the boundaries of what's funding for DataLad or git-annex or just developers funded from somewhere attending a hackathon and what's significant enough to appear in the funding statement is a little blurry to me.

Reconsider "contributions" section

ATM it states the license, where to find 3rd-party terms, and that there is a CONTRIBUTION file. The first two aspects are only vaguely related to "contributions", the latter is merely a reference.

Shouldn't we rather say:

MIT license being "permissive" -- encourage unconstrained use and re-use in any context -- promote contribution rather than fork
make explicit that anyone is welcome to contribute under these terms (no CLA or other bullshit)
reference CONTRIBUTION as a source for technical and procedural information on how to contribute best

Migrate into the datalad repository once the paper is ready

In order to submit to JOSS: