repronim / module-reproducible-basics
Module 0: Reproducible Basics
Home Page: http://www.repronim.org/module-reproducible-basics
License: Other
Command Line/Shell section
"Library" hasn't been introduced in SC. tutorial or preceding materials
Relationship between shared library and shell also seems important, not clear to me
Library of what, specifically? shell-specific tools associated with particular functions?
Thinking about this on my own, it appears the target audience for this module substantially overlaps with dataprocessing, such that concepts covered here could be introduced as needed into that module to solve the person's data processing problems. I can see how FAIR-data, statistics, and dataprocessing help solve inter-related but separable problems, but I'm having a harder time placing what problem reproducible-basics solves.
FAIR-data: how do I share/find my data?
datapreprocessing: how do I preprocess/analyze my data reproducibly?
statistics: how do I make appropriate models/interpret my data?
reproducible-basics: understand reproducibility???
I'm trying to think from the perspective of someone who would like to attend a workshop, and I'm having trouble seeing what concrete problem understanding reproducibility solves, or whether the module solves another problem that can easily translate to someone's goals.
I do think that, for the dataprocessing module, there are additional worthwhile concepts to cover that are not in git-novice or shell-novice but are covered in this module, and I'm curious what other people think about merging these two modules (and redistributing/reformatting lessons that don't fit into dataprocessing into the introduction/FAIR-data modules).
(In the wake of reviewing training modules for the fellowship workshop, I'll report bugs I find in the form of issues)
The git remote add command that adds datasets.datalad.org/workshops/nipype-2017/ds000114/.git as a remote lacks a protocol specification.
(handbook) ╭─adina@muninn /tmp
╰─➤ git clone https://github.com/datalad/ds000114
Cloning into 'ds000114'...
remote: Enumerating objects: 150826, done.
remote: Total 150826 (delta 0), reused 0 (delta 0), pack-reused 150826
Receiving objects: 100% (150826/150826), 24.55 MiB | 5.33 MiB/s, done.
Resolving deltas: 100% (57100/57100), done.
(handbook) ╭─adina@muninn /tmp
╰─➤ cd ds000114
(handbook) ╭─adina@muninn /tmp/ds000114 on nipype_test1
╰─➤ git annex get sub-*/anat/sub-*_T1w.nii.gz
(merging origin/git-annex into git-annex...)
(recording state in git...)
get sub-01/anat/sub-01_T1w.nii.gz
Remote origin not usable by git-annex; setting annex-ignore
(from web...)
download failed: Not Found
download failed: Not Found
download failed: Not Found
Unable to access these remotes: web
Try making some of these repositories available:
00000000-0000-0000-0000-000000000001 -- web
03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
135caf6c-5166-45e8-8d46-4ff5e08985b3 -- datalad
30a3fc48-d8bf-4c77-9b0b-975a9008b2bb -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000114
4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives
(Note that these git remotes have annex-ignore set: origin)
failed
get sub-02/anat/sub-02_T1w.nii.gz (from web...)
download failed: Not Found
download failed: Not Found
download failed: Not Found
Unable to access these remotes: web
Try making some of these repositories available:
00000000-0000-0000-0000-000000000001 -- web
03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
135caf6c-5166-45e8-8d46-4ff5e08985b3 -- datalad
30a3fc48-d8bf-4c77-9b0b-975a9008b2bb -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000114
4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives
(Note that these git remotes have annex-ignore set: origin)
failed
(recording state in git...)
git-annex: get: 2 failed
(handbook) ╭─adina@muninn /tmp/ds000114 on nipype_test1
╰─➤ git annex get derivatives/freesurfer/sub-*/mri/T1.mgz 1 ↵
get derivatives/freesurfer/sub-01/mri/T1.mgz (not available)
Try making some of these repositories available:
03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives
(Note that these git remotes have annex-ignore set: origin)
failed
get derivatives/freesurfer/sub-02/mri/T1.mgz (not available)
Try making some of these repositories available:
03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives
(Note that these git remotes have annex-ignore set: origin)
failed
git-annex: get: 2 failed
(handbook) ╭─adina@muninn /tmp/ds000114 on nipype_test1
╰─➤ git remote add datalad datasets.datalad.org/workshops/nipype-2017/ds000114/.git
(handbook) ╭─adina@muninn /tmp/ds000114 on nipype_test1
╰─➤ git fetch datalad
fatal: 'datasets.datalad.org/workshops/nipype-2017/ds000114/.git' does not appear to be a git repository
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
Here is a fix:
-> > % git remote add datalad datasets.datalad.org/workshops/nipype-2017/ds000114/.git
+> > % git remote add datalad http://datasets.datalad.org/workshops/nipype-2017/ds000114/.git
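The failure above comes from git treating a remote location without a protocol as a local path. As a minimal sketch of a pre-check, assuming a hypothetical helper named has_protocol (it is not part of git or of the lesson):

```shell
# Hypothetical helper (not part of git): succeed only when the remote
# location carries an explicit protocol such as http:// or https://.
# Note: scp-style locations (user@host:path) are also valid for git but
# are deliberately not recognized by this simple check.
has_protocol() {
    case "$1" in
        *://*) return 0 ;;
        *)     return 1 ;;
    esac
}

url="datasets.datalad.org/workshops/nipype-2017/ds000114/.git"
if has_protocol "$url"; then
    echo "ok: $url"
else
    # Without a protocol, git interprets the string as a local path.
    echo "missing protocol, try: http://$url"
fi
```

A remote that was already added with the wrong location can be corrected in place with git remote set-url datalad followed by the full URL, rather than removing and re-adding it.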
(handbook) ╭─adina@muninn /tmp
╰─➤ git clone git@github.com:datalad/ds000114.git blub 128 ↵
Cloning into 'blub'...
remote: Enumerating objects: 150826, done.
remote: Total 150826 (delta 0), reused 0 (delta 0), pack-reused 150826
Receiving objects: 100% (150826/150826), 24.55 MiB | 5.31 MiB/s, done.
Resolving deltas: 100% (57100/57100), done.
(handbook) ╭─adina@muninn /tmp
╰─➤ cd blub
(handbook) ╭─adina@muninn /tmp/blub on nipype_test1
╰─➤ git annex get sub-*/anat/sub-*_T1w.nii.gz
(merging origin/git-annex into git-annex...)
(recording state in git...)
get sub-01/anat/sub-01_T1w.nii.gz Invalid command: 'git-annex-shell 'configlist' '/~/datalad/ds000114.git''
You appear to be using ssh to clone a git:// URL.
Make sure your core.gitProxy config option and the
GIT_PROXY_COMMAND environment variable are NOT set.
Remote origin does not have git-annex installed; setting annex-ignore
This could be a problem with the git-annex installation on the remote. Please make sure that git-annex-shell is available in PATH when you ssh into the remote. Once you have fixed the git-annex installation, run: git annex enableremote origin
(from web...)
download failed: Not Found
download failed: Not Found
download failed: Not Found
Unable to access these remotes: web
Try making some of these repositories available:
00000000-0000-0000-0000-000000000001 -- web
03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
135caf6c-5166-45e8-8d46-4ff5e08985b3 -- datalad
30a3fc48-d8bf-4c77-9b0b-975a9008b2bb -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000114
4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives
(Note that these git remotes have annex-ignore set: origin)
failed
get sub-02/anat/sub-02_T1w.nii.gz (from web...)
download failed: Not Found
download failed: Not Found
download failed: Not Found
Unable to access these remotes: web
Try making some of these repositories available:
00000000-0000-0000-0000-000000000001 -- web
03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
135caf6c-5166-45e8-8d46-4ff5e08985b3 -- datalad
30a3fc48-d8bf-4c77-9b0b-975a9008b2bb -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000114
4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives
(Note that these git remotes have annex-ignore set: origin)
failed
(recording state in git...)
git-annex: get: 2 failed
(handbook) ╭─adina@muninn /tmp/blub on nipype_test1
╰─➤ git annex get derivatives/freesurfer/sub-*/mri/T1.mgz 1 ↵
get derivatives/freesurfer/sub-01/mri/T1.mgz (not available)
Try making some of these repositories available:
03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives
(Note that these git remotes have annex-ignore set: origin)
failed
get derivatives/freesurfer/sub-02/mri/T1.mgz (not available)
Try making some of these repositories available:
03030b7b-99e3-409b-a7ac-bdf8b7bb2ef4 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/nipype-workshop-2017/ds000114
4554a6aa-3346-4092-a88c-202f1681d95f -- yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114
b2096681-bbfc-4e3a-a832-30d7bbf373a0 -- datalad-archives
(Note that these git remotes have annex-ignore set: origin)
failed
git-annex: get: 2 failed
(handbook) ╭─adina@muninn /tmp/blub on nipype_test1
╰─➤ git remote add datalad http://datasets.datalad.org/workshops/nipype-2017/ds000114/.git
(handbook) ╭─adina@muninn /tmp/blub on nipype_test1
╰─➤ git fetch datalad
remote: Counting objects: 17874, done.
remote: Compressing objects: 100% (2944/2944), done.
remote: Total 17874 (delta 15630), reused 17035 (delta 14884)
Receiving objects: 100% (17874/17874), 1.97 MiB | 1.07 MiB/s, done.
Resolving deltas: 100% (15630/15630), completed with 3020 local objects.
From http://datasets.datalad.org/workshops/nipype-2017/ds000114/
* [new branch] git-annex -> datalad/git-annex
* [new branch] master -> datalad/master
* [new branch] nipype_test1 -> datalad/nipype_test1
* [new branch] synced/nipype_test1 -> datalad/synced/nipype_test1
* [new tag] 1.0.0 -> 1.0.0
* [new tag] 2.0.0 -> 2.0.0
* [new tag] 2.0.0+1 -> 2.0.0+1
* [new tag] 2.0.1 -> 2.0.1
(handbook) ╭─adina@muninn /tmp/blub on nipype_test1
╰─➤ git annex get derivatives/freesurfer/sub-*/mri/T1.mgz
get derivatives/freesurfer/sub-01/mri/T1.mgz (merging datalad/git-annex into git-annex...)
(recording state in git...)
(from datalad...)
(checksum...) ok
get derivatives/freesurfer/sub-02/mri/T1.mgz (from datalad...)
(checksum...) ok
(recording state in git...)
(handbook) ╭─adina@muninn /tmp/blub on nipype_test1
╰─➤ git annex get sub-*/anat/sub-*_T1w.nii.gz
get sub-01/anat/sub-01_T1w.nii.gz (from web...)
(checksum...) ok
get sub-02/anat/sub-02_T1w.nii.gz (from web...)
(checksum...) ok
(recording state in git...)
cc @yarikoptic: sub-*/anat/sub-*_T1w.nii.gz should be retrieved successfully, but this is not the case (at least when I tried it) -- is it supposed to fail?
So, I'm using a Windows computer. I can see that I'm running the bash shell. However, tcsh is not installed, so I cannot start it. It would be helpful if there was a command to tell me what shells I have access to on a given computer, and, if you really want me to use tcsh, how to get it if I don't have it.
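One way to answer the "which shells do I have" question is sketched below. It assumes a POSIX-like shell; the helper name shell_status is made up for this illustration, and /etc/shells may be missing on some systems (for example in some Windows bash environments), so that part is guarded.

```shell
# List the login shells registered on this system, if the file exists.
if [ -r /etc/shells ]; then
    grep -v '^#' /etc/shells || true
fi

# Hypothetical helper: report whether a shell is found on PATH.
# `command -v` prints the location of a program and fails if there is none.
shell_status() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: available at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
    fi
}

for candidate in sh bash tcsh csh ksh zsh; do
    shell_status "$candidate"
done
```

If tcsh is reported as not found, it has to be installed through whatever package manager the platform provides; the exact command depends on the system.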
The current datalad exercise doesn't work "out of the box", and will likely fail soon even if one sets up webdav access as Box.com drops webdav support, it seems.
I volunteer to add a new short exercise in the next few days.
In https://github.com/ReproNim/module-reproducible-basics/pull/21/files#diff-9a1015a3c8a282fec2e64768cd192adbR13 @chrispycheng added a (self)wrap up section.
I wonder if we could/should extend it somehow. E.g.:
@ReproNim/trd-training
Examples and exercises could reflect back to the simple workflow paper thus providing a real life example of applicability of the presented/learned skills, and bind together all modules (not only the reproducible-basics)
It would be nice if we had some "rating" of the materials we present. E.g. some tutorials might aim to present tools more as a black box and depict their functionality, while others try to present the "philosophy"/internals of the tools, thus theoretically making the user more knowledgeable and better able to navigate and recover from failures.
E.g. here is a set of git tutorials which imho approach Git from different sides
and it would be good to know what the suggested order of exploring them should be, or whether one of them is so superior that the others wouldn't be needed.
Example of unresolvable built-in commands 'which' and 'pwd'
I wasn't sure what I was supposed to learn from the illustration
a) what is relationship between 'which' and 'pwd' as not built-in commands, or IS that the relationship between them, that both are NOT built in, and therefore both yield 'invalid option' results?
b) mention of which as built-in command in zsh suggests possibility of including that in the illustration-- paired example of built in versus not built in?
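A paired illustration along the lines suggested in b) could look like the sketch below. The function name classify is hypothetical, and the answers for pwd and which differ between shells (which is exactly the point being raised), so no particular output is claimed for them.

```shell
# Hypothetical helper: say whether a name resolves to a program on disk
# or is handled by the shell itself.
classify() {
    location=$(command -v "$1" 2>/dev/null) || location=""
    case "$location" in
        "")  echo "$1: not found" ;;
        /*)  echo "$1: external program at $location" ;;
        *)   echo "$1: shell-internal" ;;   # builtin, keyword, function, or alias
    esac
}

for name in cd pwd which; do
    classify "$name"
done
```

Running the same loop in bash and in zsh side by side would show whether which is treated differently by the two shells.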
The version control part in this module suggests the following addition to .gitattributes in the task "How can we add the file a.txt directly under git, and file b.dat under git-annex?":
% cat << EOF > .gitattributes
* annex.largefiles=(not(mimetype=text/*))
*.dat annex.largefiles=anything
EOF
I have two remarks:
First, if someone follows the exercise, this command may fail or do something unexpected, because if executed in the git-annex repo of the previous exercise, this here doc would overwrite an existing .gitattributes file.
Second, with this configuration, empty (text) files would be annexed (see datalad/datalad#3663). I would suggest using the current configuration that datalad's text2git procedure uses: ((mimeencoding=binary)and(largerthan=0)).
I'll send a PR later today that demonstrates what I mean.
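Both remarks can be illustrated together. The sketch below is an assumption about how the exercise might be reworded, not the content of the PR: it appends with >> so that an existing .gitattributes survives, and it uses the expression quoted above. A scratch directory stands in for the repository of the previous exercise, and the first line written is only a placeholder for whatever that exercise produced.

```shell
# Scratch directory standing in for the existing repository.
demo_dir=$(mktemp -d)
echo "# settings from the previous exercise (placeholder)" > "$demo_dir/.gitattributes"

# Append (>>) instead of overwrite (>), so earlier settings are kept.
cat << 'EOF' >> "$demo_dir/.gitattributes"
* annex.largefiles=((mimeencoding=binary)and(largerthan=0))
*.dat annex.largefiles=anything
EOF

cat "$demo_dir/.gitattributes"
# Remove the scratch directory with: rm -r "$demo_dir"
```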
Apparently some terminals/shells have a "short" version of PS1 which shows only the current directory name (not the full path etc.). This sometimes makes it difficult to keep a coherent mental picture of one's actual location within the hierarchy, leading to all kinds of mistakes. Knowing where you are is kinda important for reproducibility.
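For readers hitting this, a small sketch of the two usual remedies follows. The prompt escapes shown are specific to bash, and the particular prompt layout is just one possible choice, not a recommendation from the lesson.

```shell
# Always available: print the full path of the current directory.
pwd

# bash-specific prompt escapes: \W shows only the last path component,
# while \w shows the full path (with the home directory abbreviated to ~).
PS1='\u@\h:\w\$ '
```

Putting the PS1 line into ~/.bashrc would make the change persistent for interactive bash sessions.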
LD_LIBRARY_PATH could be in a supplementary section, but it's not something that most people will need to pay much attention to, and certainly not before aliases or history will be relevant to them.
pip? These are the standard for Python, and Conda really is an extra layer that's not a) shipped with Python, and b) necessary in many situations (like containers, which we'll get to later).
virtualenv is valuable to mention in this context because you can decouple package managers and where they install their libraries; recognizing libraries as files on a system, you can re-point your package manager and environment to install, recognize, source, uninstall, or test against different versions of similar software.
The magnitude of the potential undesirable consequences of using undefined variables in this example will be lost on anyone who doesn't already know the command:
sudo rm -rf
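To make the danger concrete without deleting anything, the sketch below uses echo and a no-op in place of the real command. SCRATCH_DIR is a hypothetical variable name chosen for this illustration.

```shell
# SCRATCH_DIR is a hypothetical variable that a script forgot to set.
unset SCRATCH_DIR

# The unset variable expands to nothing, so the path collapses to "/".
unsafe_target="${SCRATCH_DIR}/"
echo "rm -rf would receive: $unsafe_target"

# Guard 1: with `set -u`, expanding an unset variable is an error.
# The subshell keeps the setting (and the abort) contained.
if ( set -u; : "${SCRATCH_DIR}/" ) 2>/dev/null; then
    nounset_status="ran"
else
    nounset_status="blocked"
fi

# Guard 2: ${VAR:?message} aborts when VAR is unset or empty.
if ( : "${SCRATCH_DIR:?SCRATCH_DIR must be set}" ) 2>/dev/null; then
    guard_status="ran"
else
    guard_status="blocked"
fi

echo "set -u: $nounset_status, \${VAR:?}: $guard_status"
```

Seeing the bare "/" printed is usually enough to convey why an undefined variable in front of a recursive delete is so dangerous.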
Is this solvable without referring to external materials? I found these very challenging, and did not solve, which is still bugging me, Tetris might have helped recharge my brain.
Exercise
Choose shunit2 or bats (or both) and
re-write above test for 1dsum using one of the frameworks. If you do not have AFNI available, you could test generic bc or dc command line calculators possibly available on your system.
Add additional tests to “document” behavior of 1dsum whenever
input file is empty
multiple files are provided
some values are negative
Command Line/Shell section challenge-
Shebang apparently not previously introduced, (did not see anywhere in Software Carpentry tutorial or Repronim material up to this point)
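A short illustration of what a shebang does might help here. This is a minimal sketch, assuming a system where /bin/sh exists and the temporary directory allows executing files; the script name hello.sh is made up.

```shell
# Write a tiny script whose first line is a shebang.
demo_dir=$(mktemp -d)
cat << 'EOF' > "$demo_dir/hello.sh"
#!/bin/sh
# The first line above is the shebang: it names the interpreter
# that the system should use to run the rest of this file.
echo "hello from sh"
EOF

# Make the file executable and run it directly, without typing "sh".
chmod +x "$demo_dir/hello.sh"
shebang_output=$("$demo_dir/hello.sh")
echo "$shebang_output"
```

Without the shebang line, what happens on direct execution depends on the calling shell, which is why scripts meant to be shared usually state their interpreter explicitly.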
Command Line/Shell section (Unit-Testing)
Sample script for testing basic correct operation of AFNI 1dsum command
Sorting this example out was very challenging (for me) and required some sleuthing and a lot of time to think through the commands in use, particularly the "populate file" and "compare result with target value" lines. This could be OK if the expectation is that trainees will be independent and spend the time, but it may defeat some.
My own bigger concern was that after poring over it, 'maybe' I got it, but very possibly I didn't get it well enough to apply to a different problem, it was fleeting intuition that got me to some things.
(-1dsum command itself is unfamiliar to novice, but not essential to following the script, it seems)
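The two steps named above (populate a file, compare the result with a target value) can be shown in isolation. The sketch below is not the lesson's script: sum_column is a hypothetical stand-in for the command under test that handles integers only, so the example runs without AFNI.

```shell
# Hypothetical stand-in for the command under test: add up the integers
# found one per line in the file given as first argument.
sum_column() {
    total=0
    while read -r value; do
        if [ -n "$value" ]; then
            total=$((total + value))
        fi
    done < "$1"
    echo "$total"
}

# Step 1: populate a temporary input file with known values.
input_file=$(mktemp)
printf '%s\n' 1 2 3 > "$input_file"

# Step 2: run the command and capture what it printed.
result=$(sum_column "$input_file")

# Step 3: compare the result with the value we expect for this input.
target=6
if [ "$result" = "$target" ]; then
    test_status="PASS"
else
    test_status="FAIL (got $result, expected $target)"
fi
echo "sum of 1 2 3: $test_status"

rm -f "$input_file"
```

The same three steps carry over to any command-line tool: only the populate line, the command, and the target value change.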
What's the point of listing all these shells? Am I supposed to be able to access each of them? Which are relevant to this material? I presume that bash (and perhaps csh/tcsh) is what we're really supposed to be concentrating on, so this material should be presented in a way that makes that clear. Also, what is POSIX and why do I care???
'resolve' - as in 'unresolved commands'. [commands that cannot be completed?]
HPC (acronym). - High Performance Computing?
PID - process ID?
'overlay distribution'. - this seems like an important concept operationally, definition would help me
'readline library' [I derived the following from https://en.wikipedia.org/wiki/GNU_Readline]
Readline library = software library that provides line-editing AND history capabilities for interactive programs with command line interface
Supports both Emacs and vi editing
See e.g. http://www.repronim.org/sfn2018-training/02-01-shell/ where we had some additional hints like to quit vim and some visualization
'ran' should often be 'was run' or 'has been run' -- this can be checked in later version, just a general comment that the tenses need to be checked throughout for this particular verb
2 things:
e.g. Change shell TO KornShell
% ksh or $ ksh (same command, different prompt)
OR
e.g. Change shell TO a C-programming language type shell
% csh (or tcsh)
Plus a comment: Successful shell change will be indicated with ~>
Since this is based on the work of Carpentry, we might want to limit the author list and provide a joint entry for "Carpentry authors", and for that generate a .zenodo.json file with the relevant info.
meanwhile I already linked it to zenodo on https://zenodo.org/account/settings/github/
Probably not the best repo to file this issue against, but recording the idea to have a helper command to verify availability and compatibility of the components (e.g. datalad, singularity) used/needed by different training modules. It could be a quick check and ideally be capable of just running all scripted examples from the training materials (like we did half-manually for the datalad/reproin/glm lessons).
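The availability half of this idea can be sketched in a few lines. The function name missing_tools is hypothetical, the tool list is only an example, and the sketch checks presence on PATH only, not version compatibility.

```shell
# Hypothetical helper: print the name of every requested tool that is
# not found on PATH (prints nothing when all of them are present).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example component list; each training module would supply its own.
missing=$(missing_tools datalad singularity git)
if [ -z "$missing" ]; then
    echo "all components found"
else
    echo "missing components:"
    echo "$missing"
fi
```

Checking compatibility would additionally require asking each tool for its version and comparing it against what a module expects, which is left out here.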
Brief definitions would be helpful.
'Environment Variables' introduced as 'not a feature of a shell' --
a) would be helpful to have defining statement re: what environment variables ARE
b) organization and content of this section suggests that environment variables are paths (and only paths, as it looks as if the examples are all paths - LD_LIBRARY_PATH, PYTHONPATH). Is this true?
if not, contextual explanation would be helpful
'Environment' = computational environment? or something more specialized?
would be helpful to define term in a few words, also likely helpful to have Environment introduced before Environment Variables
I see it first under Possible Conflicts:
"It might happen that PATH points to one environment first, while LD_LIBRARY_PATH points to libraries from another environment...."
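On question b), environment variables are not limited to paths: they are name/value pairs that a process hands down to the processes it starts. A minimal sketch, using made-up variable names (LESSON_NOTE and LESSON_GREETING are not from the lesson):

```shell
# A plain shell variable is visible only inside the current shell.
LESSON_NOTE="not exported"

# An exported variable becomes part of the environment, so child
# processes started from this shell receive a copy of it.
export LESSON_GREETING="hello"

# Ask a child shell what it can see; the single quotes make sure the
# variables are expanded by the child, not by the current shell.
child_view=$(sh -c 'echo "note=[$LESSON_NOTE] greeting=[$LESSON_GREETING]"')
echo "$child_view"
```

The child sees the exported greeting but not the plain variable. PATH, LD_LIBRARY_PATH and PYTHONPATH follow the same mechanism; they just happen to hold lists of directories.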