
The DataLad handbook 📙

This is a living resource on why and, more importantly, how to use DataLad. The rendered version lives at https://handbook.datalad.org and is currently under initial development.

The handbook is a practical, hands-on crash course to learn and experience DataLad. You do not need to be a programmer, computer scientist, or Linux-crank. If you have never touched your computer's shell before, you will be fine. Regardless of your background and personal use cases for DataLad, the handbook will show you the principles of DataLad, and from chapter 1 onwards you will be using them.

Find more general information about the idea behind the handbook in the poster presented at OHBM 2020, or dive straight into your DataLad adventure.

Contributing

Contributions in any form - pull requests, issues, content requests/ideas, ... - are always welcome. If you are using the handbook and find that something does not work, please let us know. Likewise, if you are using DataLad for your individual project, consider contributing by telling us about your use case. You can find out more about how to contribute here; a list of all contributors so far is below, in CONTRIBUTORS.md, and in .zenodo.json.

Notes for Instructors

The book is the basis for workshops and lectures on DataLad and data management. The handbook's course repository contains, among other things, live casts of the code examples in this book and slides. It is constantly growing, and everyone is free to use the material under the license terms below. Contributions and feedback are very welcome.

License

CC-BY-SA: You are free to

  • share - copy and redistribute the material in any medium or format
  • adapt - remix, transform, and build upon the material for any purpose, even commercially

under the following terms:

  1. Attribution - You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  2. ShareAlike - If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Contributors ✨

Thanks go to these wonderful people (emoji key):

  • Adina S. Wagner: 💻 🖋 📖 🎨 🤔 🚇 🚧 📆 👀 📓 📢 ⚠️ 🐛 💡 💬 ♿️
  • Laura Waite: 🤔 🚧 👀 📢 💬 🖋
  • Michael Hanke: 💬 🐛 💻 🖋 📖 🎨 💡 🤔 🚇 🚧 🔌 📆 👀 🔧 ⚠️ 📢 📓 ♿️
  • Kyle Meyer: 🐛 👀 💬 🖋 🤔
  • Marisa Heckner: 🤔 📓 🐛 🖋
  • Benjamin Poldrack: 💬 🤔 💡 ✅
  • Yaroslav Halchenko: 👀 🖋 🤔 🐛
  • Chris Markiewicz: 🐛
  • Pattarawat Chormai: 🐛 💻
  • Lisa N. Mochalski: 🐛 🖋 💡 🤔
  • Lisa Wiersch: 🐛
  • Jean-Baptiste Poline: 🖋
  • Nevena Kraljevic: 📓
  • Alex Waite: 👀 🐛 🤔
  • Lya K. Paas: 🐛 💻
  • Niels Reuter: 🖋
  • Peter Vavra: 🤔 📓
  • Tobias Kadelka: 📓
  • Peer Herholz: 🤔
  • Alexandre Hutton: 🖋 🐛
  • Sarah Oliveira: 👀 🤔
  • Dorian Pustina: 🤔
  • Hamzah Hamid Baagil: 📓 🐛
  • Tristan Glatard: 🐛 🖋
  • Giulia Ippoliti: 🖋 💡
  • Christian Mönch: 🖋
  • Togaru Surya Teja: 🖋
  • Dorien Huijser: 🐛 📓
  • Ariel Rokem: 🐛
  • Remi Gau: 🐛 🤔 🚧 👀 🚇 💻 🎨
  • Judith Bomba: 🐛
  • Konrad Hinsen: 🐛
  • Wu Jianxiao: 🐛
  • Małgorzata Wierzba: 📓 👀 ✅
  • Stefan Appelhoff: 🚇 🔧 🐛
  • Michael Joseph: 🤔 🖋 🐛
  • Tamara Cook: 👀 🚇
  • Stephan Heunis: 🐛 🚧 🖋 💡 👀
  • Joerg Stadler: 🐛
  • Sin Kim: 🐛 🖋 👀
  • Oscar Esteban: 🐛
  • Michał Szczepanik: 👀 🐛 🖋
  • eort: 🐛
  • Myrskyta: 🐛
  • Thomas Guiot: 🐛
  • jhpb7: 🐛
  • Ikko Ashimine: 🐛
  • Arshitha Basavaraj: 🖋 🐛 🚧
  • Anthony J Veltri: 📓
  • Isil Bilgin: 🐛 🚧
  • Julian Kosciessa: 🖋
  • Isaac To: 🚧 🖋 🐛
  • Austin Macdonald: 🐛
  • Christopher S. Hall: 🐛
  • jcf2: 🐛
  • Julien Colomb: 🖋
  • Danny Garside: 🐛 🚧
  • Justus Kuhlmann: 🖋
  • melanieganz: 🐛
  • Damien François: 🐛 🖋
  • Tosca Heunis: 🐛 📓
  • Jeremy Magland: 🐛

This project follows the all-contributors specification. Contributions of any kind welcome!


book's Issues

Cross-reference technical docs

#79 reminded me that it would be good to have a standard way to cross-reference the book content with the rest of the technical docs. For example, every command has a "manpage" that can be found via this pattern (here for create):

http://docs.datalad.org/generated/man/datalad-create.html

and a corresponding page with analog information for Python API users:

http://docs.datalad.org/generated/datalad.api.create.html

ATM when this command is introduced in http://handbook.datalad.org/basics/101-101-create.html no reference to this additional documentation is made.

I see at least two possible approaches:

  1. we make an immediate reference whenever a command first appears (in some kind of a note)

  2. we maintain a proper list of all described commands and their functionality (much like #79 is doing) and use proper index markup (http://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html?highlight=index%20generating#index-generating-markup) to cross-reference this list with the book content, and only place the cross-references to the other technical docs in this list.

I haven't played with (2) yet, but I somewhat feel that it would be best to do this in a single place, rather than all over the book.

Windows WSL2 installation/usage exploration

On a fresh Win10, immediately after install:

Enable WSL

At the moment, one needs to join the Windows Insider Program to get access to a build version that has WSL2.

Start PowerShell as an Administrator.
Run both commands; restart only after the second one (despite being prompted already after the first).

Enable-WindowsOptionalFeature -Online -FeatureName VirtualMachinePlatform
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux

Install Debian app

Same procedure as for WSL1

Enable WSL2 for the Debian app

Could not continue here, because I tried in a VirtualBox VM, and Win10 in the VM needs proper hardware virtualization for WSL2, which VirtualBox cannot provide via nested virtualization on Intel CPUs (AMD would be fine).

Next attempt with bnbdatalad:

To set the WSL version to WSL2, run this command as an administrator in PowerShell:

wsl --set-default-version 2

In the Microsoft store, search for a Debian distribution. Download and install it, then start it. Pick a user name, and a root password (repeat the password when prompted).

  • Configure the distro to use WSL 2, and verify it: run this command as an administrator in PowerShell:
wsl -l -v

This should say something like this (the important detail is VERSION 2):

      NAME      STATE      VERSION
    * Debian    Running    2
  • To enable the NeuroDebian repository and install datalad with apt-get...

    • In your Debian distribution, find out which version you are running -- it should be buster (e.g. with cat /etc/*release)
    • install a few missing tools: sudo apt-get install wget and sudo apt-get install gnupg (or gnupg2)
    • enable the NeuroDebian repository
    • sudo apt-get update, sudo apt-get upgrade
    • (one could do sudo apt-get install datalad now, but that would not be DataLad 0.12)
  • To install datalad with pip...

    • install python3-pip with sudo apt-get install python3-pip
    • to install DataLad 0.12 from master: pip3 install --user git+https://github.com/datalad/datalad.git#egg=datalad (this requires a prior sudo apt-get install git)
    • add ~/.local/bin to the PATH
    • to install git-annex: sudo apt-get install git-annex
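
For convenience, a condensed sketch of the pip route described above (run inside the Debian distribution; package names as given in the steps):

    $ sudo apt-get update
    $ sudo apt-get install python3-pip git git-annex
    $ pip3 install --user git+https://github.com/datalad/datalad.git#egg=datalad
    $ export PATH="$HOME/.local/bin:$PATH"   # make the command-line entry points available
    $ datalad --version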

🎉

DICOM to BIDS conversion

This is likely the most common use case in neuroimaging, so it would be highly desirable to have this covered. @bpoldrack's docs for datalad-hirni could serve as a starting point.

In terms of connecting this with the primary content of the book, we would need to introduce the concept of a datalad extension package, because hirni uses quite a few of them.
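
As a minimal sketch, introducing an extension could start from its installation (assuming the datalad-hirni package on PyPI; the specific commands the extension provides are left out here):

    $ pip install --user datalad-hirni
    $ datalad --help   # the extension's commands become part of the datalad CLI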

Adopt DataLad colors

ATM the Sphinx theme used for HTML and PDF output has the default colors. However, even now there is a figure included that uses the color scheme we have adopted for DataLad elsewhere (website, posters: https://f1000research.com/posters/7-1965).

It would make sense to me to adopt the same colors for the book as well (notes, todos, highlights, etc.). Let me know, and I'll give it a go.

Uniform user environment for the book examples

At the moment, the current system user environment (i.e. us) is used. That means that when I re-run and replace @adswa's output, there will always be changes.

We could include a helper script that creates a dedicated user environment on our machines that we then use to build the book examples.
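
A minimal sketch of such a helper (user name and Git identity are hypothetical placeholders):

    # create a throwaway build user with a fixed Git identity
    $ sudo useradd --create-home handbookdemo
    $ sudo -u handbookdemo git config --global user.name "Handbook Demo"
    $ sudo -u handbookdemo git config --global user.email demo@example.com
    # build the book examples as that user
    $ sudo -u handbookdemo make html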

Fixed sidebars

@marisaheckner noted that it is inconvenient to scroll all the way up to the side bar again to switch sections, and I agree. Will PR a fix in a minute.

Dedicated markup for further reading

Turn constructs like the following into dedicated directives that indicate their purpose:

.. container:: toggle

   .. container:: header

      **Addition: More on this...**

Candidate name: `moreinfo`

It would be a toggle box in HTML, and possibly a margin note in LaTeX.

Ideas on how to balance file length and recreation of datasets in individual workdirs

When recording code and its output, every single .rst file gets an individual workdir associated with it in docs/_build/wdirs/.

The dataset I'm continuously building up and extending will need to be present in every one of these workdirs (as many as there are .rst files) in order for the code to work. I don't think it is feasible to create potentially dozens of datasets, slightly increasing in size between different .rst files. But if I write everything down in a single file, I'm creating - at least for the web version of the book - a really, really long page; in principle, I'd like to chunk it up into individual pages, and also have TOC entries for each of these pages.

Does anyone have an idea on how to balance these conflicting demands? Is there a way to "pagebreak" single .rst files and create toc entries referencing specific sections in files? Or, alternatively, specify a common workdir manually?

Windows WSL installation/usage exploration

Conclusion

Files obtained via datalad under WSL1 onto the Windows filesystem are only accessible when unlocked. This dramatically limits usability. WSL1 does not allow GUI tools to run, and Windows apps will only be able to access files that are real files (not symlinks) in the WSL1 filesystem.

Moreover, the v7 adjusted branch (unlock) does not work under WSL1 (datalad/datalad#3608).

I think we cannot recommend WSL1.

A few issues and suggestions

[only relevant if WSL1 is still considered a viable deployment target]

  • When installing the Debian app, the default release is now buster, and the NeuroDebian config selection needs to be adjusted
  • It may be better to recommend that people copy and paste the exact snippet that is given on the NeuroDebian website. To make it work, wget and gnupg have to be installed.
  • apt-get install datalad will yield a version of datalad that is not compatible with the book (0.11 instead of 0.12). Either people need to follow the pip route, or DataLad 0.12 must be packaged to fix this
  • for pip to work, it needs python3-pip
  • was there any need for the first update/upgrade round? If not, it could be stripped, because it may cause download and installation of package versions that are subsequently replaced by updates from NeuroDebian
  • to install DataLad 0.12 from master: pip3 install --user git+https://github.com/datalad/datalad.git#egg=datalad (this requires a prior sudo apt-get install git)
  • if installed via pip install --user, the path ~/.local/bin needs to be added to $PATH to make the command-line entry points usable. It will also need a manual install of git-annex-standalone.

Autorunrecord can't output tree

autorunrecord crashes when trying to display the characters the tree command produces.

Example:

.. runrecord:: _examples/DL-101-5
   :language: console
   :realcommand: cd DataLad-101 && tree

   $ tree

leads to this traceback:

Traceback (most recent call last):
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/cmd/build.py", line 284, in build_main
    app.build(args.force_all, filenames)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/application.py", line 345, in build
    self.builder.build_update()
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 319, in build_update
    len(to_build))
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 332, in build
    updated_docnames = set(self.read())
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 438, in read
    self._read_serial(docnames)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 460, in _read_serial
    self.read_doc(docname)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 504, in read_doc
    doctree = read_doc(self.app, self.env, self.env.doc2path(docname))
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/io.py", line 325, in read_doc
    pub.publish()
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/core.py", line 217, in publish
    self.settings)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/io.py", line 113, in read
    self.parse()
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/readers/__init__.py", line 78, in parse
    self.parser.parse(self.input, document)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/parsers.py", line 94, in parse
    self.statemachine.run(inputlines, document, inliner=self.inliner)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 171, in run
    input_source=document['source'])
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2753, in underline
    self.section(title, source, style, lineno - 1, messages)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 327, in section
    self.new_subsection(title, lineno, messages)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 395, in new_subsection
    node=section_node, match_titles=True)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 282, in nested_parse
    node=node, match_titles=match_titles)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 196, in run
    results = StateMachineWS.run(self, input_lines, input_offset)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2753, in underline
    self.section(title, source, style, lineno - 1, messages)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 327, in section
    self.new_subsection(title, lineno, messages)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 395, in new_subsection
    node=section_node, match_titles=True)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 282, in nested_parse
    node=node, match_titles=match_titles)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 196, in run
    results = StateMachineWS.run(self, input_lines, input_offset)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 1150, in indent
    elements = self.block_quote(indented, line_offset)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 1165, in block_quote
    self.nested_parse(blockquote_lines, line_offset, blockquote)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 282, in nested_parse
    node=node, match_titles=match_titles)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 196, in run
    results = StateMachineWS.run(self, input_lines, input_offset)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2326, in explicit_markup
    nodelist, blank_finish = self.explicit_construct(match)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2338, in explicit_construct
    return method(self, expmatch)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2081, in directive
    directive_class, match, type_name, option_presets)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2130, in run_directive
    result = directive_instance.run()
  File "/home/adina/repos/autorunrecord/sphinxcontrib/autorunrecord.py", line 56, in run
    self.capture_output(capture_file, work_dir)
  File "/home/adina/repos/autorunrecord/sphinxcontrib/autorunrecord.py", line 99, in capture_output
    stdout.decode(output_encoding),
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)

Encoding error:
'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
The full traceback has been saved in /tmp/sphinx-err-t91drlnv.log, if you want to report the issue to the developers.
make[1]: *** [Makefile:50: html] Error 2
make[1]: Leaving directory '/home/adina/repos/datalad-handbook/docs'
make: *** [Makefile:7: html] Error 2

Currently, I'm working around this with .. code-block:: bash.
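
Two possible workarounds at the level of the example itself, without touching autorunrecord (assuming GNU tree and a glibc system where the C.UTF-8 locale exists):

    $ tree --charset=ascii       # plain ASCII connectors instead of UTF-8 box-drawing characters
    $ LC_ALL=C.UTF-8 make html   # build under a UTF-8 locale, so output is not decoded as ASCII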

Import use-case docs/demos from other sources

Thoughts on making contributing more accessible

I'm sitting in a breakout session by @KirstieJane on their reproducible-science how-to book, The Turing Way. They have a wonderfully inviting and welcoming contributing culture, and I think we can learn from their approach. One key takeaway from the discussion is to have a label-based way of grouping issues into "levels", so that people interested in contributing can find work that fits their skill. I'm hopeful that we could have something like this as well, so that the book at some point will not be a three-person project anymore.
Content-wise it will obviously be a bit hard, given that writing a chapter will be impossible for people who haven't used DataLad. A few TODOs for interested contributors that I can think of right now are

  • There will always be a need for typo fixes
  • There are more advanced content TODOs such as the DataLad parable
  • It will be helpful to get feature requests about content
  • Something I realized while teaching at Neurohackademy is that it is very useful to have people with different OSes run through the commands of the chapters and check whether they work. Some tools I've been using are not preinstalled on macOS; they worked right away for me, but not for Mac users. Maybe we can encourage people to do that and report the outcome.
  • Especially during the alpha stage, we should have a very clear statement along the lines of "if things don't work for you as in the book, let us know". Having observed people trying to use the book but failing because wget is not installed, and giving up in frustration, I worry that we silently drive people away
  • Given that we try to make the book accessible for non-Git-users, we should include detailed steps for how to open pull requests in a CONTRIBUTING.md file, or in the contributing chapter

Narrative-related section headings

At some point we should go through and convert some of the rather technical section headings (like "Sharing datasets: Common File systems") into something that is immediately understandable with no technical background. In the above example, an alternative could be to anticipate the situation in which this kind of sharing would take place. So maybe: "Sharing datasets with friends and colleagues" -- building on the assumption that no two random people have access to the same machine.

Add front-page info on the state of things

Given that this repo and http://handbook.datalad.org are public, it would be sensible to have a brief statement on the state of things on the front pages. The question came up whether or not it is OK to point people to it and ask for feedback.

I personally think that it is always OK to point people to it when they ask for more info, but I don't think a general call for feedback (#12) is useful ATM, because many things are still unknown and much is in flux.

Potential book structures

Some thoughts emerging from an initial discussion with @loj:

  • modular approach: The book should not be a "read-from-start-to-end" resource. Users should be able to select modules/chapters/parts based on their knowledge and use case
  • "Build your own DataLad adventure": @loj proposes this as an educational strategy, and we agree that it is worthwhile to try to implement (if it fails, we end up with a modular structure anyway).
  • We find a "The DataLad Parable" section in the introduction incredibly relevant, and also helpful for us to get a grasp on DataLad ourselves. We've decided that we will start drafting one, building on domain-agnostic problems and staying shorter than the original Git parable if possible. To do this, we want to brainstorm domain-agnostic problems that DataLad solves.
  • Templates for (later) contributions by others for their individual examples with DataLad. We cannot anticipate all the ways in which DataLad can be used in different fields and contexts, so these contributions will be especially relevant. With the book, we want to provide the basic building blocks people can use to assemble their use cases, and the community can contribute back by telling us how they did so.
  • The introduction needs to contain an explanation of what DataLad is. Both of us have the feeling that we do not grasp that yet (nor does anyone else ;-) ). So our aim is to have an answer to that at the end, and to put it into a well-phrased summary.
  • have a "things that can go wrong" section where we can talk about frequent error messages or warnings and their meanings, and where we educate people about important bits and pieces, like that it's important to check the versions of datalad and git-annex, how to use datalad wtf, and where and how to seek help

Mention `git config --list --show-origin`

At present the chapters on Git configuration talk about the various files where information can be put, but they do not mention the most straightforward way to figure out where information actually is:

git config --list --show-origin

Personally, I use this quite a bit when dealing with configurations.
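
For illustration, the output looks roughly like this (the paths and values here are made up):

    $ git config --list --show-origin
    file:/etc/gitconfig          core.autocrlf=input
    file:/home/me/.gitconfig     user.name=Demo User
    file:.git/config             annex.uuid=00000000-0000-0000-0000-000000000000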

Have "when things go wrong" sections/boxes

These could be hidden by default, but nevertheless contain essential information. Example from #52:

Remember, if you run a datalad save without specifying a path, all untracked files and all file changes will be committed to the history together!

So what if that happens because I did not remember this, one time? There could be a box that just puts it out there that git reset HEAD~1 brings you back to the exact same state as before (within the current dataset), without necessarily having to explain all of git reset.

Name inspired by git-annex, e.g.: https://git-annex.branchable.com/walkthrough/removing_files__58___When_things_go_wrong/ https://git-annex.branchable.com/walkthrough/transferring_files__58___When_things_go_wrong/
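
Such a box could be as short as this sketch (following the git reset suggestion above; the commit message is made up):

    $ datalad save -m "my notes"   # oops: this also committed every other modification
    $ git reset HEAD~1             # undo the commit, keep all changes in the working tree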

Example dataset to play with

I want to make note of an idea @mih brought up and continue the discussion about it:

Supply a toy dataset that readers can install and learn with, together with book sections that follow a narrative based on this dataset.

There are some requirements:

  • it needs to be as domain-agnostic as possible. My initial attempt with studyforrest would confuse everyone who is not a neuroscientist
  • it should be comparatively small. Even if we only get single files in tutorial snippets, I wouldn't want to pollute readers' file systems with GBs of data should they accidentally do a datalad get .
  • it should be large enough, however, that its files could not live within Git
  • it needs to live somewhere everyone can install it from (tbh, I personally don't know how to publish a dataset in such a way that the data is accessible to everyone, but I would like to know how. Once we have a narrative and content, maybe @mih could show us in person how to do that)
  • a variety of operations should be possible on this dataset:
    • show super- and subdataset properties: It should have at least one subdataset with a bit of history, to demonstrate how a superdataset keeps track of the subdataset, how to work with the subdataset's history, and so forth. With a subdataset we could even demonstrate how to "update" a dataset, by including a subdataset in a not-most-recent state in the superdataset and at one point having readers pull the most recent changes.
    • add data (in a way that makes sense in the narrative we come up with)
    • change at least one larger data file
    • Demonstrate a datalad run on the dataset, also in a way that does not appear to be a completely random action. Content-wise it could be something simple that shows the principles of using and unlocking content, like renaming files with a shell or Python script to showcase how one can change existing files (with --input and --output flags; see the sketch after this list). We should also have a datalad run example that creates a completely new file.
    • it should come with a well-written commit history that is easy to explore and shows best practices (commit messages, commits consisting of changes that belong together rather than many unrelated changes, ...).
    • ...
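
A sketch of what such a run call could look like on a music-library dataset (the file names and converter command are hypothetical):

    $ datalad run -m "convert recording to mp3" \
        --input "recording.wav" --output "recording.mp3" \
        "ffmpeg -i recording.wav recording.mp3"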

Having such a dataset plus the narrative will make progress on command and workflow explanations much easier, I believe. One idea @loj and @mih proposed was a music library. This has the great advantage of easy, almost domain-agnostic narratives, and I think the requirements I came up with could be fulfilled with it. Does anyone have additional thoughts on this idea in general, other requirements for such a dataset, or different content/narrative ideas?

What and where of metadata

Recording an aspect that came up in an exchange with @adswa.

Datasets are full of metadata, and there are two essential types:

  1. (history of) dataset content identity and availability information (i.e. the Git repo)
  2. extracted metadata that is managed by datalad itself as (somewhat hidden) dataset-internal content

The big difference is that (1) is only available after a repository is locally available, while (2) can come with any superdataset. Additionally, both types cover different aspects: (1) is metadata geared towards transport and versioning logistics, while (2) is focused on the semantics/meaning of the managed data, although recent work in https://github.com/datalad/datalad-metalad aims to blur this boundary and tries to enable a more comprehensive dataset description.

We should be clear on what is referred to by "metadata", but proper terms have not been established yet.

Dedicated markup for "the lecturer says"

Those bits often represent summaries or abstract-type statements. It could be nice to have them presented with specific markup (a special icon in the margin of the PDF, like done in 'Discovering Statistics'). For that to become possible, we would need to start using a dedicated directive, and also have these bits universally serve this purpose.

@adswa how do you feel about this? Do you want to be this strict? Or do you prefer to keep it as is?

Restructure sections

Proposal (only the set of files is relevant; the within-chapter order would stay the same as it is now):

intro

introduction
installation
philosophy

basics

datasets
howto

usecases

remodnav

plus a logical chapter "appendix" (there is not necessarily an editable source file for each bit; some of it is generated by Sphinx)

glossary
index
search
contributing

@loj @adswa Are you OK with such a change? Would open a separate PR in that case.

Create and continously update a glossary

Writing this book will occasionally require specific vocabulary, or common vocabulary that needs to be interpreted in a specific context. I can't come up with many examples now, but I'm sure we will stumble across many in the course of writing -- dataset / superdataset, history, etc. might be words like this.
Whenever I read about new things with specific vocabulary and definitions, I keep a glossary for myself to remember the terms or be able to look them up again. Maybe we would want to populate such a glossary as well?
I think of it as a dictionary-like structure, which in alphabetical order lists terms and then states a concise definition (like, you know, a glossary). Populating this while writing will be doable; creating one at the very end would be tedious. I propose we start a glossary document which we add to whenever we write and use a term that someone might want to look up at a later time.

Call for contributions

Once the desired structure of the book is determined, and the first pieces are in a shape that communicates the target style and audience, it would be good to send out a call for contributions.

There are many small, but useful bits of information floating around that never had a place to live. Some of those might be a good fit for this effort.

Overview of existing and missing content

Based on the diagram in #3, and the DataLad documentation, here is a list of commands and arguments that should be demonstrated in the book. Please add commands or opinions on options you want to see in there.

Currently, I'm just copying the available options and arguments from the docs; just because a command is listed here, it does not necessarily need to go into the book. Cross out anything you deem unnecessary by surrounding the line with ~~ like this.

Everything that is ticked is demonstrated or at least referenced somewhere in the book already.

Eventually, this can become a bit of a guide about possible contributions by others.

Core local DataLad commands

  • datalad status

    • "bare" datalad status
      • explanations of datalad status content types (dataset, directory, file, symlink)
      • explanations of datalad status content states (clean, added, modified, untracked)
      • datalad status with path specification
    • datalad status --annex
      • datalad status --annex all
      • datalad status --annex availability
    • datalad status --recursive and datalad status --recursive --recursion-limit
    • datalad status --untracked
      • datalad status --untracked no
      • datalad status --untracked all
  • datalad create

    • "bare" datalad create
      • When things go wrong: Creation attempt in non-empty directory
    • datalad create --dataset
      • When things go wrong: No installed dataset found
    • datalad create PATH (not demonstrated, but talked about)
    • datalad create --description
    • configuration options
      • -c text2git
      • --no-annex
      • --nosave
      • --annex-version
      • --annex-backend
      • --native-metadata-type
  • datalad diff

    • "bare" datalad diff
      • explanations of datalad diff change states (added, copied, deleted, modified, renamed, typechange, unmerged, untracked)
    • datalad diff --staged
    • datalad diff --revision
    • datalad diff --ignore-subdatasets
    • datalad diff --report-untracked
    • datalad diff --recursive and --recursion-limit
  • datalad save

    • "bare" datalad save
      • When things go wrong: Explain that save saves all modifications and untracked content
    • datalad save -m
      • When things go wrong: Forgot the commit message
      • -F/--message-file as an alternative
    • datalad save PATH
    • datalad save -u
    • datalad save -S/--super-datasets
    • datalad save --version-tag
    • datalad save --recursive and --recursion-limit
  • datalad run

    • "bare" datalad run
      • with a script
      • with bash command
    • datalad run -m
    • datalad run options
      • datalad run --input
      • datalad run --output
      • datalad run --explicit
      • datalad run --sidecar
      • datalad run --expand
  • datalad containers-run

    • datalad containers-add
    • datalad containers-run
    • datalad containers-remove
    • datalad containers-list

Advanced local DataLad commands

  • datalad rerun

    • "bare" datalad rerun
      • with a script
      • with bash command
    • datalad rerun -m
    • datalad rerun options
      • datalad rerun --since
      • datalad rerun --onto
      • datalad rerun --branch
      • datalad rerun --report --script
  • datalad run-procedure

    • datalad run-procedure
      • which ones actually exist? What do they do? How to get help on a procedure?
      • datalad run-procedure --discover
  • datalad uninstall

    • "bare" datalad uninstall subdataset
      • When things go wrong: specify nothing or non-dataset
    • datalad uninstall --nocheck
    • datalad uninstall --if-dirty options
    • datalad uninstall --recursive
  • datalad remove

    • "bare" datalad remove
      • Files
      • (non)empty directories
      • When things go wrong: no remote copies
    • datalad remove -m
    • datalad remove --nocheck
    • datalad remove --nosave options
    • datalad remove --if-dirty options
    • datalad remove --recursive
  • datalad publish

    • "bare" datalad publish --to
    • explain --transfer-data {all|auto|none}
    • recursive and recursion-limit

[...to be continued]

Advanced distributed DataLad commands

!!! datalad install was replaced with datalad clone in #321 and #326 !!!

  • datalad install

    • "bare" datalad install
      • with path
      • without path
    • datalad install -d as subdataset
      • with path
      • without path
    • datalad install --recursive --recursion-limit
    • datalad install --nosave
    • datalad install --reckless
    • datalad install --get-data
    • datalad install --jobs
  • datalad clone

    • "bare" datalad clone
      • with path
      • without path
    • datalad clone -d as subdataset
      • with path
      • without path
  • datalad get

    • "bare" datalad get
      • with specific path
      • with . (mentioned)
      • files
      • directories
      • subdatasets
      • datalad get --recursive --recursion-limit
    • When things go wrong
      • timeouts
      • all kinds of error messages this thing can throw
      • datalad get --verbose
    • datalad get --no-data
    • datalad get --reckless
    • datalad get --jobs
  • datalad drop

    • "bare" datalad drop
    • datalad drop --recursive --recursion-limit
    • datalad drop --nocheck
    • datalad drop --if-dirty
    • When things go wrong: If file content is not available elsewhere
  • datalad siblings

    • Concept: What is a sibling
    • Actions: query, add, remove, configure
    • there are many many options...
  • datalad update

    • "bare" datalad update
      • talk about siblings and sibling option
      • datalad update --merge
      • datalad update --reobtain-data
      • datalad update --recursive --recursion-limit
  • datalad publish

    • minimal datalad publish

Git specific commands and concepts

  • What is HEAD?
    • DETACHED HEAD STATE
  • git log
    • mention tig
  • branches
    • Explain
    • how to work with them, commands
  • git diff
  • History
    • git commit --amend
    • git revert
    • git checkout
    • git rebase
    • git reset
  • git status
  • git commit
  • git add
  • git config, and configuration(file)s
  • environment variables for configurations

Git-annex specific commands

  • Git repository version
  • the object tree
  • git-annex fsck
  • git-annex fix
  • What is the git-annex branch, what to do and what not to do
  • Git annex v5/v6/v7 repositories

Misc

  • how to get help on any command with -h/--help
  • man pages
  • datalad wtf, datalad --version
  • general explanation of the -d/--dataset option
  • The datalad superdataset ///
  • Showcase how changes in subdataset look in superdataset

Have tagged DataLad-101 repo available in the future

Because we're continuously building up content, if people get lost once or fail to execute even one command for whatever reason, in the worst case they can't follow along anymore.

Once the basics are done, I think it would be useful to provide a complete dataset with tags corresponding to the different sections/chapters. Readers would then have a chance to jump into any part of the book and follow along as well.
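
A sketch of how this could work (the tag names are hypothetical):

    # in the finished DataLad-101 dataset, tag the state at the end of each section
    $ git tag -m "state after 'Create a dataset'" section-101-101
    # readers who get lost check out the state they need
    $ git checkout section-101-101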

Construct all examples to always use `datalad save` without path args

This should communicate that one can employ a "clean-desk" philosophy when working with datasets. A (later) dedicated section can detail how one can also survive in a messy world, and when this might be necessary. But otherwise it just complicates the world -- for no good reason, IMHO.
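
In practice, examples would then follow this pattern (the commit message is made up):

    $ datalad status                            # verify that only intended modifications are present
    $ datalad save -m "add notes on create"     # no path argument: saves everything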

Windows native installation/usage exploration

FTR: This is what I find to be the simplest installation procedure on Win10, after some trial and error:

Install Conda

  • pick latest PY3 installer: https://docs.conda.io/en/latest/miniconda.html
  • keep everything on default (do not add to PATH)
  • this will also install Python
  • from now on any further action must take place in the "Anaconda Prompt" (a preconfigured terminal shell)

Install Git

  • conda install -c conda-forge git (must be from conda-forge, anaconda version does not provide cp)

Install git-annex

Install datalad

  • it is possible to install datalad via conda too (conda install -c conda-forge datalad) but it only supports the 0.11 version, hence is irrelevant in this context
  • install datalad via pip as usual

How functional datalad actually is on Win10 if installed this way is subject to further exploration.
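
Condensed, the whole procedure in the Anaconda Prompt looks roughly like this (a sketch; the git-annex step above is still open):

    conda install -c conda-forge git
    pip install datalad
    datalad wtf    # inspect the resulting setup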

(Confirmed) features

Approach to sub-part/multi-chapter structure?

With #52 we get something like this:

[screenshot: rendered table of contents of the "Basics" part]

All of this content is in the book part "Basics", but the "Starting from scratch" chapters clearly aim to be more than just subsequent chapters, without technically being a part.

Two alternative approaches that I can see ATM:

  1. 'Starting from scratch' becomes a part
  2. There will be a single 'Starting from scratch'.rst with some kind of a preamble and a subordinate TOCtree that lists the actual chapters (create, populate, modify). This way, each of them stays its own chapter, but the hierarchical structure is made clear(er).

or 3): it stays implicit.

None of this has to be dealt with any time soon. I am only collecting issues that might have implications for the desired structure of the book.

Explore constraints for literalblock

There will be a lot of "listings", and they should be readable in all output formats. HTML and PDF look good, but EPUB is somewhat unusable, because only 57 characters per line are visible (at least in some viewers).

Consistency in Language

We should be consistent with how we capitalize/separate/hyphenate terms. I think it makes sense to follow the language used on the DataLad website and in the docs. However, I don't think we need to be beholden to how it's currently done. This can be an opportunity to assess if language should be tweaked, and later bring the other resources to the same standard.

Some inconsistencies I've noticed so far are:

  • DataLad (vs Datalad or datalad)
  • dataset (vs data set)
  • subdataset (vs sub-dataset)
  • metadata (vs meta data)

This doesn't need to be addressed now, and can be cleaned up as the handbook reaches completion. But, I wanted to start the conversation now to develop a running list so we can use these terms intentionally.

Build your own DataLad adventure

Treat this issue as [WIP] - I'm essentially just taking notes to disentangle my brain.

Following up on a suggestion by @loj in #8, I'm brainstorming potential "DataLad adventures" we can let readers "choose" from. If we can find a way to make these "adventures" work, they will be the "Basics" section of the handbook (between the introductory sections and the use cases). Following up on the idea of a general example dataset, we could keep whatever example dataset we come up with as the theme of each adventure (but we may want to keep in mind that it should not be a read-from-start-to-end book).

I believe the general idea behind this is to not mimic a "documentation" structure in which we describe every command sequentially and in detail, but rather to identify common, generic workflows. (But I'm not sure whether I am conflating "the basics", meant as building blocks for such an adventure, too much with use cases here. Plus, I'm currently doubting whether this is the modular structure we are also aiming for...)

One potential problem I'm seeing while writing this is that we would have a lot of duplication of commands between the different "adventures". Also, while simple now with only stable commands, this might become a mess as more complex commands get included. However, it might be more helpful to see commands joined together in different workflows instead of having them stand next to each other, unconnected but without duplication.

The tool to visualize this (general idea, not necessarily what I'm sketching out here) likely is this: http://blockdiag.com/en/index.html.

  • A1 "I'm a new player (no experience in the terminal) and need the pre-game tutorial"
    • Chapter 0: General prerequisites
  • A2 "I want to version control a new project (local only)"
    • Chapter 1: Local workflows (Section 1: Starting a new dataset)
      • could include: create, status, diff, save
  • A3 "I want to start version controlling an existing project (local only)"
    • Chapter 1: Local workflows (Section 2: Existing projects)
      • could include: install + create, status, diff, save
  • A4 "I want to start a collaborative project with DataLad"
    • Chapter 2: Shared workflows
      • could include: Read A1, + publish (+ update?)
  • A5 "I want to use DataLad in an existing collaborative project"
    • Chapter 2: Read A2, + publish (+ update?)
  • A??
    • Chapter XY
        • run
  • A?? "I want to share data"
    • Chapter XY: ??
  • A42 "I ran into troubles. How can I help myself?"
    • Chapter XY: When things go wrong
  • A0 "I need a quick reference about the commands covered in this book!"
    • maybe a "quicklinks" chapter, with all commands, a 1-2 sentence description what they can do, and links to the chapters they are covered in. This can be the most "documentation-like" section and will be helpful for navigation and overview for people who might just want to re-read things.

I will continue developing my messy thoughts on this tomorrow.

Use-case contents

Lets start a collection of potentially interesting use-cases here. We can continuously update this list.

  • REMoDNaV/"writing a reproducible paper"
  • sharing data
  • data curation: prepare non-datalad resources as datasets, adoption of formatting standards, metadata "capture"
  • studyforrest web-catalogue
  • ...

Brainstorm templates

I'm just brainstorming bits and pieces, but something that came to my mind is that recurrent "types" of content -- specifically, I am thinking about use cases -- are most easily digested, IMO, if they follow a common outline/template/structure.

I'm currently thinking of something like

  • summary: what is the context, what will you learn from reading this
  • problem: a relatable, pointed description of a problem that we aim to solve
  • solution/utopia: a description of the outcome state after applying the necessary DataLad procedures, to basically awaken the demand to know how to get to that state, and to give a quick overview, beyond the summary, of whether the following recipe is worth the reader's time given their interest/motivation
  • recipe: a step-by-step tutorial on how to get from problem to solution, as short as possible but with as many steps as necessary, and with pointers to more detailed resources (within the book) for particular commands.
  • additional resources, links, etc for further reading

but that is a preliminary line of thought that could very well not be applicable at all - I just want to keep my thoughts somewhere. I will write a REMoDNaV use case and try to find a sensible structure (probably very iteratively, over and over again), and maybe we are able to find a generic template structure that is applicable to all kinds of use cases. By supplying a template for that structure, future contributions by others could be made easy as well.

Discover neuroimaging data

Discovering and obtaining scans that match certain criteria is difficult and requires many working pieces: good metadata, query capabilities, and data access procedures. From my POV it would be desirable to showcase how connecting all this within a DataLad context enables such a use case.

In contrast to other unwritten chapters, this would require some parallel development to smoothly connect the pieces.
