Provide guidance on reproducible GIS best practices

GIS code is often a challenge. We need some pointers to guidance so that GIS steps do not have to be done by hand.

Add link to information on what "no license" means

https://choosealicense.com/no-permission/

Bug in link to Danish registers.

guidance/addtl-data-citation-guidance.md

Line 250 in 51f0bc7

    
           In some cases, governments have list of their (named) registers. For instance, Statistics Denmark provides the full list of registers at [http://www.dst.dk/extranet/forskningvariabellister/Oversigt%20over%20registre.html](http://www.dst.dk/extranet/forskningvariabellister/Oversigt%20over%20registre.html). These can be used to craft data citations, for instance

Add DIME/ DRIP guide to list of readings

https://worldbank.github.io/dime-data-handbook/

Provide guidance on when public and confidential data are commingled

Some code commingles the data early on, and prevents the creation of analysis and output even if that analysis does not rely on confidential data.

Addressing this may involve creating empty variables, placeholder values, synthetic data, or simply good programming practices.
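One way to picture the "empty variables" approach is the following minimal Python sketch. It is an illustration under stated assumptions, not a prescribed method: the file path, the column names (`id`, `age`, `earnings`), and the function names are all hypothetical. When the confidential file is absent, the confidential columns are filled with empty placeholders, so the analysis that relies only on public data still runs.

```python
import csv
from pathlib import Path
from statistics import mean

# Hypothetical file and column names, for illustration only.
CONFIDENTIAL_FILE = Path("data/confidential/earnings.csv")
CONFIDENTIAL_COLUMNS = ["earnings"]


def load_confidential(path, ids, columns):
    """Return {id: {column: value}}, with empty placeholders where data are unavailable."""
    values = {i: {c: None for c in columns} for i in ids}
    if not Path(path).exists():
        return values  # replicator has no access: keep the placeholders
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["id"] in values:
                values[row["id"]] = {c: float(row[c]) for c in columns}
    return values


def run_analysis(public_rows, confidential_path=CONFIDENTIAL_FILE):
    """Merge public and confidential data, then compute whatever can be computed."""
    ids = [r["id"] for r in public_rows]
    confidential = load_confidential(confidential_path, ids, CONFIDENTIAL_COLUMNS)
    merged = [{**r, **confidential[r["id"]]} for r in public_rows]

    # This result relies on public data only and is always produced.
    results = {"mean_age": mean(r["age"] for r in merged)}

    # This result needs confidential data and is skipped when only placeholders exist.
    earnings = [r["earnings"] for r in merged if r["earnings"] is not None]
    results["mean_earnings"] = mean(earnings) if earnings else None
    return results


public = [{"id": "a", "age": 30}, {"id": "b", "age": 40}]
print(run_analysis(public, confidential_path="does/not/exist.csv"))
```

The design choice here is to merge late and to let each output check for its own inputs, so that a missing confidential file disables only the results that need it.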

Talk about author contributions

https://journals.plos.org/plosbiology/s/authorship#loc-author-contributions and https://casrai.org/credit/. While not typical in the social sciences (in particular in economics), this might warrant a discussion.

Cases where data (and code) authorship != paper authorship are not at all common in economics. One example I can think of is https://www.aeaweb.org/articles?id=10.1257/pol.20170704, with replication package https://doi.org/10.3886/E110642V1 explicitly single-authored by one of the authors. I've also seen some papers (need to dig out references) where the GitHub repository with the code might be "Author A + RA" whereas the paper is "Author A + Author B".

The generic guidance I have in mind is https://casrai.org/credit/, which has been adopted by some journals, such as PLOS https://journals.plos.org/plosbiology/s/authorship#loc-author-contributions, where it applies to the paper. If considering the data + the code as distinct first-class research objects, the CRediT taxonomy directly applies there as well, and there's nothing that requires that the two be the same.

Data citations access date clarification

Question received:

For the access date for cited datasets, what is the definition of “access date”? I believe it is the date the data were actually accessed. Is this correct? The wording of the question was whether it is “the dates we first accessed the actual data itself” or, instead, “the date we accessed the websites describing the data.” I believe it is the former.

It is, in fact, the former.

Need to add this to relevant FAQ and guidance.

Suggested improvement: filter-branch for splitting repos

I read this comment in the FAQ:

If not, then what I suggest is to do the following:

- Clean up the repo (possibly in a branch).
- On GitHub, there is no way to fork to your own space, and a fork would carry the entire history anyway. So this assumes manual interaction (I'm going to assume you use the command line for this; this works in git-bash, or bash on Linux/OSX).
- Create a new clone of your (now cleaned) repo, and switch to the clean branch:

```
git clone (WHATEVER)
cd whatever
git checkout cleaned
```

- Now wipe out all git information:

```
rm -rf .git
```

- Create a new repo:

```
git init
```

- Add all files:

```
git add *
```

I think there are two things going on here, which can be handled differently:

- The author wishes to wipe the commit history (e.g. for privacy reasons). Then the advice above is most germane.
- The author keeps several papers related to an ongoing or long-running project in a single repo, and wants to isolate the code for a tidy submission alongside an "offshoot paper".

In the latter case, I'd recommend suggesting the git filter-branch approach, which is essentially a way to split off a subset of a repo into a new repo (e.g. a subdirectory paper-1). It will conveniently inherit the commit history as well. See guide:

https://help.github.com/en/articles/splitting-a-subfolder-out-into-a-new-repository

I think it would help if the FAQ delineated these two competing goals and stated more clearly the differences/approaches here.

Templates for README

Fix new SCSS issue

Maybe GitHub made an update to their minimal theme?

Add / align sample data citations with AJPS/Odum guide

see https://reusableresearch.com/slides/mgooch.pdf

Templates for replication archive

Where do data citations go

In the template, we should make clearer where data citations go. I often have authors send me a README that is very good, but none of them put data citations in the references. Lars explained: in the data availability statement (DAS), the author explains how to get access to the data and cites it, but the full reference then goes in the reference list. I don't think this is explicit enough.

Example:
"Trade data for 1974-2000 were downloaded from the NBER-UN World Trade Flows dataset (Feenstra & Lipsey, 2005) originally created by Feenstra et al. (2005). Data can be directly downloaded using https://cid.econ.ucdavis.edu/nberus.html. We use files wtf74 through wtf00. The data are in the public domain."

And then later in the references:

Feenstra & Lipsey. 2005. "NBER-United Nations Trade Data 1962-2000". Center for International Trade, UC Davis [distributor]. https://cid.econ.ucdavis.edu/nberus.html accessed on 2021-03-24.

Merge newer document on separate data deposits with existing one

Add Gates Open Research link to repository list, and to data preparation list

https://gatesopenresearch.org/for-authors/data-guidelines#prepareyourdata

Note that Gates uses F1000 for publication, so there's likely duplication there.

Additional links for sample data citations

https://guides.nyu.edu/c.php?g=276919&p=1846529

Add example for anonymous source to data citation guidance

In a case where the authors cannot name the company, you would cite as follows:

"Anonymous Firm (DATE OF FILE CREATION)"

and have in the references the following entry:

Anonymous Firm. (DATE OF FILE CREATION), Property Insurance Claims. Accessed via Cornell Restricted Access Data Center (CRADC). Last accessed on (DATE).

where

Author = "Anonymous Firm"
Title (of dataset) = "Property Insurance Claims"
Distributor = the secure access center where the data reside/were accessed = "CRADC"
Date = date the data were created (authors should know that)
Version = in this case, proxied by "Last accessed on (DATE)"

Possibly add datakit as template

See https://datakit.ap.org/ - seems quite flexible, and could be easily adapted to specific labs or work groups.

Downside: requires Python (for command line aficionados)

Add examples of institutional-level repositories

Example: IZA

https://datasets.iza.org/dataset/1386/the-age-twist-in-employers-gender-requests-evidence-from-four-job-boards

Others should be added to this issue, or a separate pull request made referencing this issue.

Reference Curating4Reproducibility

https://curating4reproducibility.org/

Docker guidance

See draft here: https://github.com/social-science-data-editors/guidance/blob/master/docker-guidelines.md

@floswald Any additions/comments?

Add links to discussion of self-contained replication packages

See links in Twitter thread.

Discussion about setwd() and similar:

Responses: 1. Jenny Bryan's take

Add guidance about reproducibility of Jupyter and/or Colab notebooks

https://doi.org/10.1371/journal.pcbi.1007007 (via @darribas) but also pointer to https://github.com/jupyter/nbdime

Migrate to better tool

https://jupyterbook.org/en/stable/publish/gh-pages.html might do the trick - ability to classify articles by type, search efficiently, and possibly include some computational examples.

Add Statistics Canada guide to data citations

https://www150.statcan.gc.ca/n1/en/catalogue/12-591-X

Add guidance on citing your own data

Add to the guidance on Citing Data and Code

Data is published

The DOI is thus public, and all repositories will provide a suggested citation. One can also use https://www.doi2bib.org/ or https://citation.crosscite.org/ to get a citation.
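For a published deposit, a formatted citation can also be requested from the DOI resolver itself through content negotiation, which is the mechanism that https://citation.crosscite.org/ builds on. The following is a minimal Python sketch, assuming network access for the actual retrieval. The function names are hypothetical, and the DOI is the replication package mentioned in the author contributions issue, used here only as an illustration.

```python
import urllib.parse
import urllib.request

DOI_RESOLVER = "https://doi.org/"


def build_citation_request(doi, style="apa"):
    """Build (but do not send) a content-negotiation request for a formatted citation."""
    url = DOI_RESOLVER + urllib.parse.quote(doi)
    # The Accept header asks for a bibliography entry instead of the landing page.
    headers = {"Accept": f"text/x-bibliography; style={style}"}
    return urllib.request.Request(url, headers=headers)


def fetch_citation(doi, style="apa", timeout=30):
    """Send the request and return the citation text (requires network access)."""
    request = build_citation_request(doi, style)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


# Illustrative DOI; replace with the DOI of the deposit to be cited.
request = build_citation_request("10.3886/E110642V1")
print(request.full_url)
print(request.get_header("Accept"))
# To retrieve the citation itself: print(fetch_citation("10.3886/E110642V1"))
```

The returned text should still be checked against the repository's own suggested citation, since the two may differ in detail.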

Data is not published

This is trickier. The data does not necessarily have a title that is related to the paper. Some repositories allow authors to "reserve" a DOI (Zenodo) or to delay publication. For some repositories, the DOI, while not officially reserved, can be derived from information already available (see this FAQ for openICPSR; something similar may be possible at Dataverse).

In some cases, authors may be able to delay publication, and coordinate it with the publication of the article (openICPSR, possibly Zenodo).

In all cases

The data deposit should be cited in the main manuscript, and referenced in the data availability statement (some journals) or the README (other journals).

Add GIS/ Python training resources

https://twitter.com/OmerOzakEcon/status/1359882541977714692

In case it is useful to someone, I have written an intro to python and GIS for my students

https://github.com/ozak/CompEcon

You can even try it out and use the system online for basic stuff. Or follow instructions to have a working GIS environment on your computer.

Need better logo

Add the Stata "project" command as a suggested solution (Stata specific)

https://ideas.repec.org/c/boc/bocode/s457685.html

Add guidance and requirement/suggestion for creating feasible computation when full computation is long

Some computations, as presented in the paper, may run for a very long time.

Authors should be encouraged (required?) to provide instructions on how to run "feasible" computations (for example, 10 bootstrap replications instead of 1,000, or a 1% sample instead of the full sample) and to explain how this might affect the output.

This is different from creating synthetic data.
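A "feasible" run can be as simple as a single switch that reduces the number of replications and the sample size. The following is a minimal Python sketch under stated assumptions: the environment variable `QUICK_RUN`, the function names, and the toy data are hypothetical, and the statistic (a bootstrap standard error of a mean) stands in for whatever the paper actually computes.

```python
import os
import random
from statistics import stdev


def settings(quick):
    """Full settings as in the paper, or reduced settings for a feasible run."""
    if quick:
        return {"reps": 10, "sample_frac": 0.01}
    return {"reps": 1000, "sample_frac": 1.0}


def bootstrap_se(data, reps, sample_frac, seed=12345):
    """Bootstrap standard error of the mean on a (possibly reduced) sample."""
    rng = random.Random(seed)  # fixed seed so that repeated runs agree
    n = max(2, int(len(data) * sample_frac))
    subsample = rng.sample(data, n)  # drawn without replacement
    draws = []
    for _ in range(reps):
        resample = rng.choices(subsample, k=n)  # drawn with replacement
        draws.append(sum(resample) / n)
    return stdev(draws)


# Hypothetical switch: set QUICK_RUN=1 in the environment for the feasible run.
quick = os.environ.get("QUICK_RUN", "0") == "1"
config = settings(quick)

data = [float(i % 97) for i in range(2000)]  # toy data, for illustration only
se = bootstrap_se(data, **config)
print(f"quick={quick} reps={config['reps']} sample_frac={config['sample_frac']} se={se:.4f}")
```

The README would then state that the quick run yields noisier estimates than those reported in the paper, and roughly how long each mode takes.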

Nature's list of repositories is the only one [...] that has a requirement that the repository support anonymous sharing for the purpose of double-blind peer review. [...] The discovery of such an existing list of repositories was very helpful for me in drafting our own data policy because many journals and publishers that are currently requiring data sharing either (a) use single-blind or transparent peer review or (b) only require data and code sharing at the publication stage, not at the submission stage.

social-science-data-editors / guidance Goto Github PK
