repronim / module-reproducible-basics

Module 0: Reproducible Basics

Home Page: http://www.repronim.org/module-reproducible-basics

License: Other

Makefile 0.85% HTML 4.75% JavaScript 76.44% R 0.63% Python 16.24% Shell 0.06% Ruby 0.07% SCSS 0.97%

module-reproducible-basics's Introduction

ReproMan

ReproMan aims to simplify the creation and management of computing environments in neuroimaging. While it concentrates on neuroimaging use cases, it is by no means limited to this field of science, and its tools will find utility in other fields as well.

Status

ReproMan is under rapid development. While the code base is still growing, the focus is increasingly shifting towards robust and safe operation with a sensible API. There has been no major public release yet, as the organization and configuration are still subject to considerable reorganization and standardization.

See CONTRIBUTING.md if you are interested in internals and/or contributing to the project.

Installation

ReproMan requires Python 3 (>= 3.8).

Linux and OSX (Windows support yet TODO) - via pip

By default, installation via pip (pip install reproman) installs the core functionality of reproman, allowing for managing datasets etc. Additional installation schemes are available, so you can get an enhanced installation via pip install 'reproman[SCHEME]' (see the examples after the list below), where SCHEME can be

  • tests to also install the dependencies used by reproman's battery of unit tests
  • full to install all possible dependencies, e.g. DataLad
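For example (a sketch; the set of available extras may vary between releases):

pip install reproman            # core functionality only
pip install 'reproman[tests]'   # core plus unit-test dependencies
pip install 'reproman[full]'    # all optional dependencies, e.g. DataLad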

Installation through pip also requires some external dependencies that are not shipped with it (e.g. docker, singularity, etc.); please refer to the next section for those.

Debian-based systems

On Debian-based systems we recommend enabling NeuroDebian, from which we will soon provide recent releases of ReproMan. We will also provide backports of all necessary packages from that repository.

Dependencies

Python 3.8+, with header files that may be needed to build some extensions without wheels. These are provided by python3-dev on Debian-based systems and python-devel on Red Hat systems.
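For example (a sketch; exact package names can differ across distribution releases):

sudo apt-get install python3-dev    # Debian/Ubuntu
sudo yum install python-devel       # Red Hat (python3-devel on newer releases)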

Our setup.py and the corresponding packaging describe all necessary Python dependencies. On Debian-based systems we recommend enabling NeuroDebian, since we use it to provide backports of recent fixed external modules we depend upon. Additionally, if you would like to develop and run our battery of tests, see CONTRIBUTING.md regarding additional dependencies.

A typical workflow for reproman run

This example is heavily based on the "Typical workflow" example created for ///repronim/containers, to which we refer you to learn more about the YODA principles etc. In this reproman example we will pursue exactly the same goal -- running MRIQC on a sample dataset -- but this time utilizing ReproMan's ability to run the computation remotely. DataLad and ///repronim/containers will still be used for data and container logistics, while reproman will establish a small HTCondor cluster in the AWS cloud, run the analysis, and fetch the results.

Step 1: Create the HTCondor AWS EC2 cluster

If this is the first time you are using ReproMan to interact with AWS cloud services, you first need to provide ReproMan with secret credentials for AWS. To do so, edit its configuration file (~/.config/reproman/reproman.cfg on Linux, ~/Library/Application Support/reproman/reproman.cfg on OSX), filling out the ...s:

[aws]
access_key_id = ...
secret_access_key = ...

Disclaimer/Warning: Never share or post those secrets publicly.

If reproman fails to find this information, the error message "Unable to locate credentials" will appear.

Run the following (this needs to be done only once; it makes the resource available for reproman login or reproman run):

reproman create aws-hpc2 -t aws-condor -b size=2 -b instance_type=t2.medium

to create a new ReproMan resource: 2 AWS EC2 instances, with HTCondor installed (we use NITRC-CE instances).
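To check on it afterwards, you can list the resources ReproMan knows about (assuming the ls subcommand available in recent ReproMan versions):

reproman ls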

Disclaimer/Warning: It is important to monitor your cloud resources in the cloud provider dashboard(s) to ensure there are no runaway instances etc., to help avoid incurring heavy costs for cloud services.

Step 2: Create analysis DataLad dataset and run computation on aws-hpc2

The following script is an exact replica of the one from ///repronim/containers, in which only the datalad containers-run command (which fetches data and runs the computation locally and serially) is replaced with reproman run, which publishes the dataset (without data) to the remote resource, fetches the data there, runs the computation via HTCondor in parallel across the 2 nodes, and then fetches the results back:

#!/bin/sh
(  # so it could be just copy pasted or used as a script
PS4='> '; set -xeu  # to see what we are doing and exit upon error
# Work in some temporary directory
cd $(mktemp -d ${TMPDIR:-/tmp}/repro-XXXXXXX)
# Create a dataset to contain mriqc output
datalad create -d ds000003-qc -c text2git
cd ds000003-qc
# Install our containers collection:
datalad install -d . ///repronim/containers
# (optionally) Freeze container of interest to the specific version desired
# to facilitate reproducibility of some older results
datalad run -m "Downgrade/Freeze mriqc container version" \
    containers/scripts/freeze_versions bids-mriqc=0.16.0
# Install input data:
datalad install -d . -s https://github.com/ReproNim/ds000003-demo sourcedata
# Setup git to ignore workdir to be used by pipelines
echo "workdir/" > .gitignore && datalad save -m "Ignore workdir" .gitignore
# Execute desired preprocessing in parallel across two subjects
# on remote AWS EC2 cluster, creating a provenance record
# in git history containing all condor submission scripts and logs, and
# fetching them locally
reproman run -r aws-hpc2 \
   --sub condor --orc datalad-pair \
   --jp "container=containers/bids-mriqc" --bp subj=02,13 --follow \
   --input 'sourcedata/sub-{p[subj]}' \
   --output . \
   '{inputs}' . participant group -w workdir --participant_label '{p[subj]}'
)

The "ReproMan: Execute" documentation section provides more information on the principles underlying the reproman run command.

Step 3: Remove resource

Once everything is computed and fetched, and you are satisfied with the results, use reproman delete aws-hpc2 to terminate the remote cluster in AWS and avoid unnecessary charges.

License

MIT/Expat

Disclaimer

ReproMan is in a beta stage -- the majority of the functionality is usable, but documentation and API enhancements are still a work in progress. Please do not be shy about filing an issue or a pull request. See CONTRIBUTING.md for guidance.

module-reproducible-basics's People

Contributors

aaren, abbycabs, abought, adswa, aflaxman, alistairwalsh, arokem, bkatiemills, cdw, chrispycheng, dbkeator, fmichonneau, gvwilson, jbpoline, jdblischak, jpallen, montoyjh, nikhilweee, pbanaszkiewicz, pipitone, rgaiacs, rrlove, satra, snastase, synesthesiam, tbekolay, twitwi, valentina-s, wking, yarikoptic

module-reproducible-basics's Issues

Solution Provided for Exercise requires unfamiliar command variant-shell change

2 things:

  1. tcsh hasn't actually been introduced yet, even though it is listed as one of the shells commonly used by neuroimaging projects. A novice user will be distracted by 'how do I give the command' or 'is there a general command' rather than by identifying which shell to use, and may never reference the list.
  2. A novice may read the solution and conclude that tcsh is a general command to change shells in the current session, not having seen that command before (I did not see it anywhere in the tutorial). But IF I understand the question, there are numerous correct answers (e.g. any of the shells listed).
    One way to clarify might be to edit the solution to provide two different examples of shell-change commands, e.g.:

e.g. change shell to KornShell:
% ksh or $ ksh (same command, different prompt)
OR
e.g. change shell to a C-programming-language-style shell:
% csh (or tcsh)

Plus a comment: a successful shell change will be indicated with ~>

Q: Update to more state-of-the-art .gitattributes suggestions?

The version control part in this module suggests the following addition to .gitattributes in the task "How can we add the file a.txt directly under git, and file b.dat under git-annex?":

% cat << EOF > .gitattributes
* annex.largefiles=(not(mimetype=text/*))
*.dat annex.largefiles=anything
EOF

I have two remarks:

First, if someone follows the exercise, this command may fail or do something unexpected: if executed in the git-annex repo of the previous exercise, this heredoc would overwrite an existing .gitattributes file.

Second, with this configuration, empty (text) files would be annexed (see datalad/datalad#3663). I would suggest using the configuration that datalad's current text2git procedure uses: ((mimeencoding=binary)and(largerthan=0)).
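With that configuration, and appending rather than overwriting to also address the first remark, the exercise step could look like this (a sketch):

% cat >> .gitattributes << EOF
* annex.largefiles=((mimeencoding=binary)and(largerthan=0))
*.dat annex.largefiles=anything
EOF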

I'll send a PR later today that demonstrates what I mean.

place a common thread through all the lessons

@ReproNim/trd-training
Examples and exercises could refer back to the simple workflow paper, thus providing a real-life example of the applicability of the presented/learned skills and binding together all the modules (not only reproducible-basics).

Undefined Concepts - Environment Variables, Environment

Brief definitions would be helpful.

'Environment Variables' are introduced as 'not a feature of a shell' --
a) it would be helpful to have a defining statement re: what environment variables ARE
b) the organization and content of this section suggest that environment variables are paths (and only paths, as all the examples appear to be paths - LD_LIBRARY_PATH, PYTHONPATH). Is this true? If not, a contextual explanation would be helpful.
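(For what it's worth, environment variables are not only paths; a quick POSIX-shell illustration with non-path variables:)

% export EDITOR=vim          # a program name, not a path list
% export LANG=en_US.UTF-8    # a locale setting
% env | grep -E '^(EDITOR|LANG)='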

'Environment' = computational environment? Or something more specialized?
It would be helpful to define the term in a few words; it would also likely help to introduce Environment before Environment Variables.

I see it first under Possible Conflicts:
"It might happen that PATH points to one environment first, while LD_LIBRARY_PATH points to libraries from another environment...."

Final Exercise in Command Line/Shell section very challenging - not solvable without external reference?

Is this solvable without referring to external materials? I found these very challenging and did not solve them, which is still bugging me; Tetris might have helped recharge my brain.

Exercise
Choose shunit2 or bats (or both) and

  • re-write the above test for 1dsum using one of the frameworks. If you do not have AFNI available, you could test the generic bc or dc command-line calculators possibly available on your system.
  • add additional tests to "document" the behavior of 1dsum whenever:
    • the input file is empty
    • multiple files are provided
    • some values are negative
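For reference, a minimal bats sketch of such a test might look like the following (1dsum's exact output formatting and the file names are assumptions here; adapt to what the lesson's original test checks):

#!/usr/bin/env bats
# test_1dsum.bats -- an illustrative sketch, not the lesson's official solution

setup() {
  TMPD=$(mktemp -d)    # fresh scratch directory for each test
}

teardown() {
  rm -rf "$TMPD"
}

@test "1dsum adds positive values" {
  printf '1\n2\n3\n' > "$TMPD/vals.1D"   # populate the input file
  run 1dsum "$TMPD/vals.1D"
  [ "$status" -eq 0 ]
  # compare result with the target value, whitespace-normalized;
  # adjust if 1dsum prints a different numeric format
  [ "$(echo "$output" | tr -d '[:space:]')" = "6" ]
}

@test "1dsum handles negative values" {
  printf '5\n-2\n' > "$TMPD/vals.1D"
  run 1dsum "$TMPD/vals.1D"
  [ "$status" -eq 0 ]
  [ "$(echo "$output" | tr -d '[:space:]')" = "3" ]
}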

Unfamiliar commands and usage - Unit Testing example script

Command Line/Shell section (Unit-Testing)

Sample script for testing basic correct operation of AFNI 1dsum command

Sorting this example out was very challenging (for me) and required some sleuthing and a lot of time to think through the commands in use, particularly the 'populate file' and 'compare result with target value' lines. This could be OK if the expectation is that trainees will be independent and spend the time, but it may defeat some.

My own bigger concern was that after poring over it, 'maybe' I got it, but very possibly not well enough to apply it to a different problem; it was fleeting intuition that got me through some parts.

(The 1dsum command itself is unfamiliar to a novice, but it seems not essential to following the script.)

shell: mention ability to tune up PS* variables (in particular PS1)

Apparently some terminals/shells have a "short" version of PS1 which shows only the current directory name (not the full path etc). This sometimes makes it difficult to maintain a coherent mental picture of one's actual location within the hierarchy, leading to all kinds of mishaps. Knowing where you are is rather important for reproducibility.
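A concrete suggestion for the lesson could be a prompt that always shows the full working directory (a bash sketch; the escape sequences are bash-specific):

export PS1='\u@\h:\w\$ '   # renders as user@host:/full/working/path$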

Commonly used shells section; elaborate on POSIX

What's the point of listing all these shells? Am I supposed to be able to access each of them? Which are relevant to this material? I presume that bash (and perhaps csh/tcsh) is what we're really supposed to concentrate on, so this material should be presented in a way that makes that clear. Also, what is POSIX, and why do I care?

Git-annex exercise is broken

(In the wake of reviewing training modules for the fellowship workshop, I'll report bugs I find in the form of issues)

The git remote add command that adds datasets.datalad.org/workshops/nipype-2017/ds000114/.git as a remote lacks a protocol specification.

Here is what happens if you execute the commands as they are given in the exercise on git-annex in http://www.repronim.org/module-reproducible-basics/02-vcs/:
(handbook) ╭─adina@muninn /tmp
╰─➤ git clone https://github.com/datalad/ds000114
Cloning into 'ds000114'...
remote: Enumerating objects: 150826, done.
remote: Total 150826 (delta 0), reused 0 (delta 0), pack-reused 150826
Receiving objects: 100% (150826/150826), 24.55 MiB | 5.33 MiB/s, done.
Resolving deltas: 100% (57100/57100), done.
(handbook) ╭─adina@muninn /tmp
╰─➤ cd ds000114
(handbook) ╭─adina@muninn /tmp/ds000114 on nipype_test1
╰─➤ git annex get sub-*/anat/sub-*_T1w.nii.gz
(merging origin/git-annex into git-annex...)
(recording state in git...)
get sub-01/anat/sub-01_T1w.nii.gz 
  Remote origin not usable by git-annex; setting annex-ignore
(from web...) 
download failed: Not Found

download failed: Not Found

download failed: Not Found

  Unable to access these remotes: web

  Try making some of these repositories available:
  	00000000-0000-0000-0000-000000000001 -- web
   	03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
   	135caf6c-5166-45e8-8d46-4ff5e08985b3 -- datalad
   	30a3fc48-d8bf-4c77-9b0b-975a9008b2bb -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000114
   	4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
   	b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives

  (Note that these git remotes have annex-ignore set: origin)
failed
get sub-02/anat/sub-02_T1w.nii.gz (from web...) 
download failed: Not Found

download failed: Not Found

download failed: Not Found

  Unable to access these remotes: web

  Try making some of these repositories available:
  	00000000-0000-0000-0000-000000000001 -- web
   	03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
   	135caf6c-5166-45e8-8d46-4ff5e08985b3 -- datalad
   	30a3fc48-d8bf-4c77-9b0b-975a9008b2bb -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000114
   	4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
   	b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives

  (Note that these git remotes have annex-ignore set: origin)
failed
(recording state in git...)
git-annex: get: 2 failed
(handbook) ╭─adina@muninn /tmp/ds000114 on nipype_test1
╰─➤ git annex get derivatives/freesurfer/sub-*/mri/T1.mgz                   1 ↵
get derivatives/freesurfer/sub-01/mri/T1.mgz (not available) 
  Try making some of these repositories available:
  	03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
   	4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
   	b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives

  (Note that these git remotes have annex-ignore set: origin)
failed
get derivatives/freesurfer/sub-02/mri/T1.mgz (not available) 
  Try making some of these repositories available:
  	03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
   	4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
   	b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives

  (Note that these git remotes have annex-ignore set: origin)
failed
git-annex: get: 2 failed
(handbook) ╭─adina@muninn /tmp/ds000114 on nipype_test1
╰─➤ git remote add datalad datasets.datalad.org/workshops/nipype-2017/ds000114/.git
(handbook) ╭─adina@muninn /tmp/ds000114 on nipype_test1
╰─➤ git fetch datalad
fatal: 'datasets.datalad.org/workshops/nipype-2017/ds000114/.git' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Here is a fix:

-% git remote add datalad datasets.datalad.org/workshops/nipype-2017/ds000114/.git
+% git remote add datalad http://datasets.datalad.org/workshops/nipype-2017/ds000114/.git
I can get everything afterwards:
(handbook) ╭─adina@muninn /tmp
╰─➤ git clone git@github.com:datalad/ds000114.git blub                               128 ↵
Cloning into 'blub'...
remote: Enumerating objects: 150826, done.
remote: Total 150826 (delta 0), reused 0 (delta 0), pack-reused 150826
Receiving objects: 100% (150826/150826), 24.55 MiB | 5.31 MiB/s, done.
Resolving deltas: 100% (57100/57100), done.
(handbook) ╭─adina@muninn /tmp
╰─➤ cd blub
(handbook) ╭─adina@muninn /tmp/blub on nipype_test1
╰─➤  git annex get sub-*/anat/sub-*_T1w.nii.gz
(merging origin/git-annex into git-annex...)
(recording state in git...)
get sub-01/anat/sub-01_T1w.nii.gz Invalid command: 'git-annex-shell 'configlist' '/~/datalad/ds000114.git''
  You appear to be using ssh to clone a git:// URL.
  Make sure your core.gitProxy config option and the
  GIT_PROXY_COMMAND environment variable are NOT set.

  Remote origin does not have git-annex installed; setting annex-ignore

  This could be a problem with the git-annex installation on the remote. Please make sure that git-annex-shell is available in PATH when you ssh into the remote. Once you have fixed the git-annex installation, run: git annex enableremote origin
(from web...) 
download failed: Not Found

download failed: Not Found

download failed: Not Found

  Unable to access these remotes: web

  Try making some of these repositories available:
  	00000000-0000-0000-0000-000000000001 -- web
   	03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
   	135caf6c-5166-45e8-8d46-4ff5e08985b3 -- datalad
   	30a3fc48-d8bf-4c77-9b0b-975a9008b2bb -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000114
   	4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
   	b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives

  (Note that these git remotes have annex-ignore set: origin)
failed
get sub-02/anat/sub-02_T1w.nii.gz (from web...) 
download failed: Not Found

download failed: Not Found

download failed: Not Found

  Unable to access these remotes: web

  Try making some of these repositories available:
  	00000000-0000-0000-0000-000000000001 -- web
   	03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
   	135caf6c-5166-45e8-8d46-4ff5e08985b3 -- datalad
   	30a3fc48-d8bf-4c77-9b0b-975a9008b2bb -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000114
   	4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
   	b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives

  (Note that these git remotes have annex-ignore set: origin)
failed
(recording state in git...)
git-annex: get: 2 failed
(handbook) ╭─adina@muninn /tmp/blub on nipype_test1
╰─➤ git annex get derivatives/freesurfer/sub-*/mri/T1.mgz                              1 ↵
get derivatives/freesurfer/sub-01/mri/T1.mgz (not available) 
  Try making some of these repositories available:
  	03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
   	4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
   	b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives

  (Note that these git remotes have annex-ignore set: origin)
failed
get derivatives/freesurfer/sub-02/mri/T1.mgz (not available) 
  Try making some of these repositories available:
  	03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
   	4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
   	b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives

  (Note that these git remotes have annex-ignore set: origin)
failed
git-annex: get: 2 failed
(handbook) ╭─adina@muninn /tmp/blub on nipype_test1
╰─➤ git remote add datalad http://datasets.datalad.org/workshops/nipype-2017/ds000114/.git
(handbook) ╭─adina@muninn /tmp/blub on nipype_test1
╰─➤ git fetch datalad
remote: Counting objects: 17874, done.
remote: Compressing objects: 100% (2944/2944), done.
remote: Total 17874 (delta 15630), reused 17035 (delta 14884)
Receiving objects: 100% (17874/17874), 1.97 MiB | 1.07 MiB/s, done.
Resolving deltas: 100% (15630/15630), completed with 3020 local objects.
From http://datasets.datalad.org/workshops/nipype-2017/ds000114/
 * [new branch]          git-annex           -> datalad/git-annex
 * [new branch]          master              -> datalad/master
 * [new branch]          nipype_test1        -> datalad/nipype_test1
 * [new branch]          synced/nipype_test1 -> datalad/synced/nipype_test1
 * [new tag]             1.0.0               -> 1.0.0
 * [new tag]             2.0.0               -> 2.0.0
 * [new tag]             2.0.0+1             -> 2.0.0+1
 * [new tag]             2.0.1               -> 2.0.1
(handbook) ╭─adina@muninn /tmp/blub on nipype_test1
╰─➤  git annex get derivatives/freesurfer/sub-*/mri/T1.mgz
get derivatives/freesurfer/sub-01/mri/T1.mgz (merging datalad/git-annex into git-annex...)
(recording state in git...)
(from datalad...) 
(checksum...) ok                    
get derivatives/freesurfer/sub-02/mri/T1.mgz (from datalad...) 
(checksum...) ok                  
(recording state in git...)
(handbook) ╭─adina@muninn /tmp/blub on nipype_test1
╰─➤ git annex get sub-*/anat/sub-*_T1w.nii.gz
get sub-01/anat/sub-01_T1w.nii.gz (from web...) 
(checksum...) ok                      
get sub-02/anat/sub-02_T1w.nii.gz (from web...) 
(checksum...) ok                  
(recording state in git...)

cc @yarikoptic:

  • The task reads as if sub-*/anat/sub-*_T1w.nii.gz should be retrieved successfully, but this is not the case (at least when I tried it) -- is it supposed to fail?
  • I'll PR a fix for the missing protocol later today

Undefined Terms - Command Line/Shell section

'resolve' - as in 'unresolved commands' [commands that cannot be completed?]
HPC (acronym) - High-Performance Computing?
PID - process ID?
'overlay distribution' - this seems like an important concept operationally; a definition would help me
'readline library' [I derived the following from https://en.wikipedia.org/wiki/GNU_Readline]:
the Readline library is a software library that provides line-editing AND history capabilities for interactive programs with a command-line interface, supporting both Emacs and vi editing modes

Command line/shell - unclear learning objective (built-in commands)

The example of unresolvable built-in commands uses 'which' and 'pwd'.

I wasn't sure what I was supposed to learn from the illustration:

a) what is the relationship between 'which' and 'pwd' as non-built-in commands? Or IS that the relationship between them, that both are NOT built in and therefore both yield 'invalid option' results?
b) the mention of which as a built-in command in zsh suggests the possibility of including it in the illustration: a paired example of built-in versus not built-in? (See the sketch below.)
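(One way such a paired illustration could look, assuming bash; the type built-in reports whether a command is built in or external, and the exact path may vary:)

% type pwd
pwd is a shell builtin
% type which
which is /usr/bin/which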

"change the current shell" exercise

So, I'm using a Windows computer. I can see that I'm running the bash shell. However, tcsh is not installed, so I cannot start it. It would be helpful if there were a command to tell me what shells I have access to on a given computer, and, if you really want me to use tcsh, how to get it if I don't have it.
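(On Unix-like systems, one conventional answer is the /etc/shells registry of valid login shells, though shells installed outside of it will not be listed there:)

% cat /etc/shells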

compare available git tutorials/training materials

It would be nice if we had some "rating" of the materials we present. E.g. some tutorials may aim to present the tools more as a black box and depict their functionality, while others try to present the "philosophy"/internals of the tools, theoretically making the user more knowledgeable and better able to navigate and recover from failures.

E.g. here is a set of git tutorials which imho approach Git from different sides,

and it would be good to know what the suggested order of exploring them should be... or is one of them so superior that the others wouldn't be needed?

[HWK]: Repronim instructor fellow training feedback

  • Page 0:
    • "reproducibility requires knowledge of what, when, and how" --> I think repetition or re-running requires this, replication or reproduction is a bit of another concept that can use bits of each of these pieces of information, but is broader and relies on much more (at least by the definitions I attribute to each). Would be good to clarify terms often; the paper I link in ReproNim/module-intro#6 is helpful here.
    • clarify intention: who is this for? True beginners, or people that know a bit? Is the goal just to have them know basic infrastructure they should use in this space, or understand why using the shell and git are important?
    • "very unlikely that you have managed to completely avoid using those tools" --> fix language; feels a bit alienating for those who truly are new to this
  • Page 1:
    • Far too large teaching-to-practicing ratio
    • Is it said anywhere why knowing how to use a shell is valuable for scientists? Where will it come in handy? Why can't they just use GUIs and Jupyter Notebooks for everything? (I know the answers to these, but don't think we say them)
    • In general, I'm very anti-duplicating content. Would it make sense to just open a PR with the pieces we want to add to the SW Carpentry lectures? Then this page would just 1) link to it, and 2) have short "practicals" in which we explain a setting we'd encounter in a typical scientific workflow and have them solve it (with hidden solutions available).
    • In places there is a lot of detail that I really don't think is needed for the "basics"... e.g. LD_LIBRARY_PATH could be in a supplementary section; it's not something that most people will need to pay much attention to, and certainly not before aliases or history become relevant to them.
    • Would break this into several more digestible slides.
  • Page 2:
    • Again, far too large teaching-to-practice (TP) ratio, in my opinion
    • Similar comments to above: it may be worth folding some bits into the SWC lecture and leaving our pages to specific case studies that people will care about. Very regularly when teaching I would get a "but why are we using/learning this" unless I situate a tool specifically within a context where they recognize its immediate value. For git, the hilarious xkcd comic on filenames, among others, can help make these problems feel more real.
  • Page 3:
    • Decrease TP ratio
    • no real mention of pip? It is the standard for Python, and Conda really is an extra layer that's a) not shipped with Python, and b) not necessary in many situations (like containers, which we'll get to later).
    • I also think virtualenv is valuable to mention in this context because you can decouple package managers and where they install their libraries; recognizing libraries as files on a system, you can re-point your package manager and environment to install, recognize, source, uninstall, or test against different versions of similar software.
  • Page 4:
    • Decrease teaching time; this doesn't require 3 hours if we're mostly covering it at a conceptual level
    • Add text elaborating on/summarizing the links in the text.
    • Link to choose a license
  • Page 5:
    • Have it be more "student led" and provide scaffolding rather than instructions (with hidden answers) --> "how can you identify if the problem is unique to you?" "what do developers need to know so that they can help you?" "if they need to reproduce it, what details do they need to do that?" "how can you record these details most easily when doing your work?" "how can you communicate these details?"

Target audience for this module?

Thinking about this on my own, it appears the target audience for this module substantially overlaps with that of dataprocessing, such that the concepts covered here could be introduced as needed into that module to solve a person's data-processing problems. I can see how FAIR-data, statistics, and dataprocessing help solve inter-related but separable problems, but I'm having a harder time placing what problem reproducible-basics solves.

What problems I think the modules solve

FAIR-data: how do I share/find my data?
datapreprocessing: how do I preprocess/analyze my data reproducibly?
statistics: how do I make appropriate models/interpret my data?
reproducible-basics: understand reproducibility???

I'm trying to think from the perspective of someone who would like to attend a workshop, and I'm having trouble pinning down what concrete problem 'understanding reproducibility' solves, or whether there is another problem the module solves that translates easily to someone's goals.

I do think that, for the dataprocessing module, there are additional worthwhile concepts covered in this module that are not in git-novice or shell-novice, and I'm curious what other people think about merging these two modules (and redistributing/reformatting the lessons that don't fit into dataprocessing into the introduction/FAIR-data modules)?

Undefined Concept/Function - Library/Shared Library

Command Line/Shell section

"Library" hasn't been introduced in SC. tutorial or preceding materials
Relationship between shared library and shell also seems important, not clear to me

Library of what, specifically? shell-specific tools associated with particular functions?

Undefined Command - sudo

The magnitude of the potential undesirable consequences of using undefined variables in this example will be lost on anyone who doesn't already know the command:
sudo rm -rf
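To spell out the danger: combined with an undefined variable, such a command can expand to the filesystem root (BUILDDIR here is a hypothetical, unset variable; do NOT run this):

# If BUILDDIR is unset, "$BUILDDIR"/ expands to just "/",
# so this would attempt to delete the entire filesystem with root privileges.
sudo rm -rf "$BUILDDIR"/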

repronim-training --check-environment

Probably not the best repo to file this issue against, but recording the idea: a helper command to verify the availability and compatibility of the components (e.g. datalad, singularity) used/needed by the different training modules. It could be a quick check, and ideally be capable of just running all scripted examples from the training materials (like we did half-manually for the datalad/reproin/glm lessons).
