
The DataLad handbook 📙

This is a living resource on why and, more importantly, how to use DataLad. The rendered version lives at https://handbook.datalad.org and is currently under initial development.

The handbook is a practical, hands-on crash course to learn and experience DataLad. You do not need to be a programmer, computer scientist, or Linux-crank. If you have never touched your computer's shell before, you will be fine. Regardless of your background and personal use cases for DataLad, the handbook will show you the principles of DataLad, and from chapter 1 onwards you will be using them.

Find more general information about the idea behind the handbook in the poster presented at OHBM 2020, or dive straight into your DataLad adventure.

Contributing

Contributions in any form - pull requests, issues, content requests/ideas, ... - are always welcome. If you are using the handbook and find that something does not work, please let us know. Likewise, if you are using DataLad for your individual project, consider contributing by telling us about your use case. You can find out more about how to contribute here; a list of all contributors so far is below, in CONTRIBUTORS.md, and in .zenodo.json.

Notes for Instructors

The book is the basis for workshops and lectures on DataLad and data management. The handbook's course repository contains, among other things, live casts of the code examples in this book and slides. It is constantly growing, and everyone is free to use the material under the license terms below. Contributions and feedback are very welcome.

License

CC-BY-SA: You are free to

  • share - copy and redistribute the material in any medium or format
  • adapt - remix, transform, and build upon the material for any purpose, even commercially

under the following terms:

  1. Attribution - You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  2. ShareAlike - If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Contributors ✨

Thanks go to these wonderful people (emoji key):

  • Adina S. Wagner: 💻 🖋 📖 🎨 🤔 🚇 🚧 📆 👀 📓 📢 ⚠️ 🐛 💡 💬 ♿️
  • Laura Waite: 🤔 🚧 👀 📢 💬 🖋
  • Michael Hanke: 💬 🐛 💻 🖋 📖 🎨 💡 🤔 🚇 🚧 🔌 📆 👀 🔧 ⚠️ 📢 📓 ♿️
  • Kyle Meyer: 🐛 👀 💬 🖋 🤔
  • Marisa Heckner: 🤔 📓 🐛 🖋
  • Benjamin Poldrack: 💬 🤔 💡 ✅
  • Yaroslav Halchenko: 👀 🖋 🤔 🐛
  • Chris Markiewicz: 🐛
  • Pattarawat Chormai: 🐛 💻
  • Lisa N. Mochalski: 🐛 🖋 💡 🤔
  • Lisa Wiersch: 🐛
  • Jean-Baptiste Poline: 🖋
  • Nevena Kraljevic: 📓
  • Alex Waite: 👀 🐛 🤔
  • Lya K. Paas: 🐛 💻
  • Niels Reuter: 🖋
  • Peter Vavra: 🤔 📓
  • Tobias Kadelka: 📓
  • Peer Herholz: 🤔
  • Alexandre Hutton: 🖋 🐛
  • Sarah Oliveira: 👀 🤔
  • Dorian Pustina: 🤔
  • Hamzah Hamid Baagil: 📓 🐛
  • Tristan Glatard: 🐛 🖋
  • Giulia Ippoliti: 🖋 💡
  • Christian Mönch: 🖋
  • Togaru Surya Teja: 🖋
  • Dorien Huijser: 🐛 📓
  • Ariel Rokem: 🐛
  • Remi Gau: 🐛 🤔 🚧 👀 🚇 💻 🎨
  • Judith Bomba: 🐛
  • Konrad Hinsen: 🐛
  • Wu Jianxiao: 🐛
  • Małgorzata Wierzba: 📓 👀 ✅
  • Stefan Appelhoff: 🚇 🔧 🐛
  • Michael Joseph: 🤔 🖋 🐛
  • Tamara Cook: 👀 🚇
  • Stephan Heunis: 🐛 🚧 🖋 💡 👀
  • Joerg Stadler: 🐛
  • Sin Kim: 🐛 🖋 👀
  • Oscar Esteban: 🐛
  • Michał Szczepanik: 👀 🐛 🖋
  • eort: 🐛
  • Myrskyta: 🐛
  • Thomas Guiot: 🐛
  • jhpb7: 🐛
  • Ikko Ashimine: 🐛
  • Arshitha Basavaraj: 🖋 🐛 🚧
  • Anthony J Veltri: 📓
  • Isil Bilgin: 🐛 🚧
  • Julian Kosciessa: 🖋
  • Isaac To: 🚧 🖋 🐛
  • Austin Macdonald: 🐛
  • Christopher S. Hall: 🐛
  • jcf2: 🐛
  • Julien Colomb: 🖋
  • Danny Garside: 🐛 🚧
  • Justus Kuhlmann: 🖋
  • melanieganz: 🐛
  • Damien François: 🐛 🖋
  • Tosca Heunis: 🐛 📓
  • Jeremy Magland: 🐛

This project follows the all-contributors specification. Contributions of any kind welcome!


book's Issues

Cross-reference technical docs

#79 reminded me that it would be good to have a standard way to cross-reference the book content with the rest of the technical docs. For example, every command has a "manpage" that can be found via this pattern (here for create):

http://docs.datalad.org/generated/man/datalad-create.html

and a corresponding page with analog information for Python API users:

http://docs.datalad.org/generated/datalad.api.create.html

ATM when this command is introduced in http://handbook.datalad.org/basics/101-101-create.html no reference to this additional documentation is made.

I see at least two possible approaches:

  1. we make an immediate reference whenever a command first appears (in some kind of a note)

  2. we maintain a proper list of all described commands and their functionality (much like #79 is doing) and use proper index markup (http://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html?highlight=index%20generating#index-generating-markup) to cross-reference this list with the book content, and only place the cross-references to the other technical docs in this list.

I haven't played with (2) yet, but I somewhat feel that it would be best to do this in a single place, rather than all over the book.

Windows WSL2 installation/usage exploration

On a fresh Win10, immediately after install:

Enable WSL

At the moment, one needs to join the Windows Insider Program to get access to a build version that has WSL2.

Start PowerShell as an Administrator.
Run both commands; restart only after the second one (despite being prompted already after the first).

Enable-WindowsOptionalFeature -Online -FeatureName VirtualMachinePlatform
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux

Install Debian app

Same procedure as for WSL1

Enable WSL2 for the Debian app

Could not continue here, because I tried in a VirtualBox VM, and Win10 in the VM needs proper hardware virtualization for WSL2, which VirtualBox cannot provide via nested virtualization on Intel CPUs (AMD would be fine).

Next attempt with bnbdatalad:

To set the WSL version to WSL2, run this command as an administrator in PowerShell:

wsl --set-default-version 2

In the Microsoft store, search for a Debian distribution. Download and install it, then start it. Pick a user name, and a root password (repeat the password when prompted).

  • Configure the distro to use WSL 2, and verify it: run this command as an administrator in PowerShell:
wsl -l -v

This should say something like this (the important detail is VERSION 2):

      NAME      STATE      VERSION
    * Debian    Running    2
  • To enable the NeuroDebian repository and install datalad with apt-get...

    • In your Debian distribution, find out which version you are running -- it should be buster (e.g. with cat /etc/*release)
    • install a few missing tools: sudo apt-get install wget and sudo apt-get install gnupg (or gnupg2)
    • enable the NeuroDebian repository
    • sudo apt-get update, sudo apt-get upgrade
    • (one could do sudo apt-get install datalad now, but that would not be DataLad 0.12)
  • To install datalad with pip...

    • install python3-pip with sudo apt-get install python3-pip
    • to install DataLad 0.12 from master: pip3 install --user git+https://github.com/datalad/datalad.git#egg=datalad (this requires a prior sudo apt-get install git)
    • add ~/.local/bin to the PATH
    • to install git-annex: sudo apt-get install git-annex
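
For convenience, a condensed sketch of the pip route described above (run inside the Debian distribution; package names as given in the steps):

    $ sudo apt-get update
    $ sudo apt-get install python3-pip git git-annex
    $ pip3 install --user git+https://github.com/datalad/datalad.git#egg=datalad
    $ export PATH="$HOME/.local/bin:$PATH"   # make the command-line entry points available
    $ datalad --version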

🎉

DICOM to BIDS conversion

This is likely the most common use case in neuroimaging, so it would be highly desirable to have this covered. @bpoldrack's docs for datalad-hirni could serve as a starting point.

In terms of connecting this with the primary content of the book, we would need to introduce the concept of a datalad extension package, because hirni uses quite a few of them.
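
As a minimal sketch, introducing an extension could start from its installation (assuming the datalad-hirni package on PyPI; the specific commands the extension provides are left out here):

    $ pip install --user datalad-hirni
    $ datalad --help   # the extension's commands become part of the datalad CLI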

Adopt DataLad colors

ATM the Sphinx theme used for HTML and PDF output has the default colors. However, even now there is a figure included that uses the color scheme we have adopted for DataLad elsewhere (website, posters: https://f1000research.com/posters/7-1965).

It would make sense to me to adopt the same colors for the book as well (notes, todos, highlights, etc.). Let me know, and I'll give it a go.

Uniform user environment for the book examples

At the moment, the current system user environment (i.e. us) is used. That means that when I re-run and replace @adswa's output, there will always be changes.

We could include a helper script that creates a dedicated user environment on our machines that we then use to build the book examples.
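
A minimal sketch of such a helper (user name and Git identity are hypothetical placeholders):

    # create a throwaway build user with a fixed Git identity
    $ sudo useradd --create-home handbookdemo
    $ sudo -u handbookdemo git config --global user.name "Handbook Demo"
    $ sudo -u handbookdemo git config --global user.email demo@example.com
    # build the book examples as that user
    $ sudo -u handbookdemo make html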

Fixed sidebars

@marisaheckner noted that it is inconvenient to scroll all the way up to the side bar again to switch sections, and I agree. Will PR a fix in a minute.

Dedicated markup for further reading

Turn constructs like the following into dedicated directives that indicate their purpose:

.. container:: toggle

   .. container:: header

      **Addition: More on this...**

Candidate name: `moreinfo`

It would be a toggle box in HTML, and possibly a margin note in LaTeX.

Ideas on how to balance file length and recreation of datasets in individual workdirs

When recording code and its output, every single .rst file gets an individual workdir associated with it in docs/_build/wdirs/.

The dataset I'm continuously building up and extending will need to be present in every one of these workdirs (as many as there are .rst files) in order for the code to work. I don't think it is feasible to create potentially dozens of datasets, slightly increasing in size between different .rst files. But if I write everything down in a single file, I'm creating - at least for the web version of the book - a really, really long page; in principle, I'd like to chunk it up into individual pages, and also have TOC entries for each of these pages.

Does anyone have an idea on how to balance these conflicting demands? Is there a way to "pagebreak" single .rst files and create toc entries referencing specific sections in files? Or, alternatively, specify a common workdir manually?

Windows WSL installation/usage exploration

Conclusion

Files obtained via datalad under WSL1 onto the Windows filesystem are only accessible when unlocked. This dramatically limits usability. WSL1 does not allow GUI tools to run, and Windows apps will only be able to access files that are real files (not symlinks) in the WSL1 filesystem.

Moreover, the v7 adjusted branch (unlock) does not work under WSL1 (datalad/datalad#3608).

I think we cannot recommend WSL1.

A few issues and suggestions

[only relevant if WSL1 is still considered a viable deployment target]

  • When installing the Debian app, the default release is now buster, and the NeuroDebian config selection needs to be adjusted
  • It may be better to recommend that people copy and paste the exact snippet that is given on the NeuroDebian website. To make it work, wget and gnupg have to be installed.
  • apt-get install datalad will yield a version of datalad that is not compatible with the book (0.11 instead of 0.12). Either people need to follow the pip route, or DataLad 0.12 must be packaged to fix this
  • for pip to work, it needs python3-pip
  • was there any need for the first update/upgrade round? If not, it could be stripped, because it may cause download and installation of package versions that are subsequently replaced by updates from NeuroDebian
  • to install DataLad 0.12 from master: pip3 install --user git+https://github.com/datalad/datalad.git#egg=datalad (this requires a prior sudo apt-get install git)
  • if installed via pip install --user, the path ~/.local/bin needs to be added to $PATH to make the command-line entry points usable. It will also need a manual install of git-annex-standalone.

Autorunrecord can't output tree

autorunrecord crashes when trying to display the characters the tree command produces.

Example:

.. runrecord:: _examples/DL-101-5
   :language: console
   :realcommand: cd DataLad-101 && tree

   $ tree

leads to this traceback:

Traceback (most recent call last):
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/cmd/build.py", line 284, in build_main
    app.build(args.force_all, filenames)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/application.py", line 345, in build
    self.builder.build_update()
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 319, in build_update
    len(to_build))
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 332, in build
    updated_docnames = set(self.read())
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 438, in read
    self._read_serial(docnames)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 460, in _read_serial
    self.read_doc(docname)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 504, in read_doc
    doctree = read_doc(self.app, self.env, self.env.doc2path(docname))
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/io.py", line 325, in read_doc
    pub.publish()
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/core.py", line 217, in publish
    self.settings)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/io.py", line 113, in read
    self.parse()
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/readers/__init__.py", line 78, in parse
    self.parser.parse(self.input, document)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/sphinx/parsers.py", line 94, in parse
    self.statemachine.run(inputlines, document, inliner=self.inliner)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 171, in run
    input_source=document['source'])
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2753, in underline
    self.section(title, source, style, lineno - 1, messages)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 327, in section
    self.new_subsection(title, lineno, messages)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 395, in new_subsection
    node=section_node, match_titles=True)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 282, in nested_parse
    node=node, match_titles=match_titles)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 196, in run
    results = StateMachineWS.run(self, input_lines, input_offset)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2753, in underline
    self.section(title, source, style, lineno - 1, messages)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 327, in section
    self.new_subsection(title, lineno, messages)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 395, in new_subsection
    node=section_node, match_titles=True)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 282, in nested_parse
    node=node, match_titles=match_titles)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 196, in run
    results = StateMachineWS.run(self, input_lines, input_offset)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 1150, in indent
    elements = self.block_quote(indented, line_offset)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 1165, in block_quote
    self.nested_parse(blockquote_lines, line_offset, blockquote)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 282, in nested_parse
    node=node, match_titles=match_titles)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 196, in run
    results = StateMachineWS.run(self, input_lines, input_offset)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2326, in explicit_markup
    nodelist, blank_finish = self.explicit_construct(match)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2338, in explicit_construct
    return method(self, expmatch)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2081, in directive
    directive_class, match, type_name, option_presets)
  File "/home/adina/env/handbook/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2130, in run_directive
    result = directive_instance.run()
  File "/home/adina/repos/autorunrecord/sphinxcontrib/autorunrecord.py", line 56, in run
    self.capture_output(capture_file, work_dir)
  File "/home/adina/repos/autorunrecord/sphinxcontrib/autorunrecord.py", line 99, in capture_output
    stdout.decode(output_encoding),
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)

Encoding error:
'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
The full traceback has been saved in /tmp/sphinx-err-t91drlnv.log, if you want to report the issue to the developers.
make[1]: *** [Makefile:50: html] Error 2
make[1]: Leaving directory '/home/adina/repos/datalad-handbook/docs'
make: *** [Makefile:7: html] Error 2

Currently, I'm working around this with .. code-block:: bash.
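
Two possible workarounds at the level of the example itself, without touching autorunrecord (assuming GNU tree and a glibc system where the C.UTF-8 locale exists):

    $ tree --charset=ascii       # plain ASCII connectors instead of UTF-8 box-drawing characters
    $ LC_ALL=C.UTF-8 make html   # build under a UTF-8 locale, so output is not decoded as ASCII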

Import use-case docs/demos from other sources

Thoughts on making contributing more accessible

I'm sitting in a breakout session by @KirstieJane on their reproducible-science how-to book, The Turing Way. They have a wonderfully inviting and welcoming contributing culture, and I think we can learn from their approach. One key takeaway from the discussion is to have a label-based way of grouping issues into "levels", so that people interested in contributing can find work that fits their skill. I'm hopeful that we could have something like this as well, so that the book at some point will not be a three-person project anymore.
Content-wise it will obviously be a bit hard, given that writing a chapter will be impossible for people who haven't used DataLad. A few TODOs for interested contributors that I can think of right now are

  • There will always be a need for typo fixes
  • There are more advanced content TODOs such as the DataLad parable
  • It will be helpful to get feature requests about content
  • Something I realized while teaching at Neurohackademy is that it is very useful to have people with different OSes run through the commands of the chapters and check whether they work. Some tools I've been using are not preinstalled on macOS; they worked right away for me, but not for Mac users. Maybe we can encourage people to do that and report the outcome.
  • Especially during the alpha stage, we should have a very clear statement along the lines of "if things don't work for you as in the book, let us know". Having observed people trying to use the book but failing because wget is not installed, and giving up in frustration, I worry that we silently drive people away
  • Given that we try to make the book accessible for non-Git-users, we should include detailed steps for how to open pull requests in a CONTRIBUTING.md file, or in the contributing chapter

Narrative-related section headings

At some point we should go through and convert some of the rather technical section headings (like "Sharing datasets: Common File systems") into something that is immediately understandable with no technical background. In the above example, an alternative could be to anticipate the situation in which this kind of sharing would take place. So maybe: "Sharing datasets with friends and colleagues" -- building on the assumption that no two random people have access to the same machine.

Add front-page info on the state of things

Given that this repo and http://handbook.datalad.org are public, it would be sensible to have a brief statement on the state of things on the front pages. The question came up whether or not it is OK to point people to it and ask for feedback.

I personally think that it is always OK to point people to it when they ask for more info, but I don't think a general call for feedback (#12) is useful ATM, because many things are still unknown and much is in flux.

Potential book structures

Some thoughts emerging from an initial discussion with @loj:

  • modular approach: The book should not be a "read-from-start-to-end" resource. Users should be able to select modules/chapters/parts based on their knowledge and use case
  • "Build your own DataLad adventure": @loj proposes this as an educational strategy, and we agree that it is worthwhile to try to implement (if it fails, we end up with a modular structure anyway).
  • We find a "The DataLad Parable" section in the introduction incredibly relevant, and also helpful for us to get a grasp on DataLad ourselves. We've decided that we will start drafting one, building on domain-agnostic problems and staying shorter than the original Git parable if possible. To do this, we want to brainstorm domain-agnostic problems that DataLad solves.
  • Templates for (later) contributions by others for their individual examples with DataLad. We cannot anticipate all the ways in which DataLad can be used in different fields and contexts, so these contributions will be especially relevant. With the book, we want to provide the basic building blocks people can use to assemble their use cases, and the community can contribute back by telling us how they did so.
  • The introduction needs to contain an explanation of what DataLad is. Both of us have the feeling that we do not grasp that yet (nor does anyone else ;-) ). So our aim is to have an answer to that at the end, and to put it into a well-phrased summary.
  • have a "things that can go wrong" section where we can talk about frequent error messages or warnings and their meanings, and where we educate people about important bits and pieces, like that it's important to check the versions of datalad and git-annex, how to use datalad wtf, and where and how to seek help

Mention `git config --list --show-origin`

At present the chapters on Git configuration talk about the various files where information can be put, but they do not mention the most straightforward way to figure out where information actually is:

git config --list --show-origin

Personally, I use this quite a bit when dealing with configurations.
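
For illustration, the output looks roughly like this (the paths and values here are made up):

    $ git config --list --show-origin
    file:/etc/gitconfig          core.autocrlf=input
    file:/home/me/.gitconfig     user.name=Demo User
    file:.git/config             annex.uuid=00000000-0000-0000-0000-000000000000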

Have "when things go wrong" sections/boxes

These could be hidden by default, but nevertheless contain essential information. Example from #52:

Remember, if you run a datalad save without specifying a path, all untracked files and all file changes will be committed to the history together!

So what if that happens because I did not remember this, one time? There could be a box that just puts it out there that git reset HEAD~1 brings you back to the exact same state as before (within the current dataset), without necessarily having to explain all of git reset.

Name inspired by git-annex, e.g.: https://git-annex.branchable.com/walkthrough/removing_files__58___When_things_go_wrong/ https://git-annex.branchable.com/walkthrough/transferring_files__58___When_things_go_wrong/
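
Such a box could be as short as this sketch (following the git reset suggestion above; the commit message is made up):

    $ datalad save -m "my notes"   # oops: this also committed every other modification
    $ git reset HEAD~1             # undo the commit, keep all changes in the working tree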

Example dataset to play with

I want to make note of an idea @mih brought up and continue the discussion about it:

Supply a toy dataset that readers can install and learn with, together with book sections that follow a narrative based on this dataset.

There are some requirements:

  • it needs to be as domain-agnostic as possible. My initial attempt with studyforrest would confuse everyone who is not a neuroscientist
  • it should be comparatively small. Even if we only get single files in tutorial snippets, I wouldn't want to pollute readers' file systems with GBs of data should they accidentally do a datalad get .
  • it should be large enough, however, that its files could not live within Git
  • it needs to live somewhere everyone can install it from (tbh, I personally don't know how to publish a dataset in such a way that the data is accessible to everyone, but I would like to know how. Once we have a narrative and content, maybe @mih could show us in person how to do that)
  • a variety of operations should be possible on this dataset:
    • show super- and subdataset properties: It should have at least one subdataset with a bit of history, to demonstrate how a superdataset keeps track of the subdataset, how to work with the subdataset's history, and so forth. With a subdataset we could even demonstrate how to "update" a dataset, by including a subdataset in a not-most-recent state in the superdataset and at one point having readers pull the most recent changes.
    • add data (in a way that makes sense in the narrative we come up with)
    • change at least one larger data file
    • Demonstrate a datalad run on the dataset, also in a way that does not appear to be a completely random action. Content-wise it could be something simple that shows the principles of using and unlocking content, like renaming files with a shell or Python script to showcase how one can change existing files (with --input and --output flags; see the sketch after this list). We should also have a datalad run example that creates a completely new file.
    • it should come with a well-written commit history that is easy to explore and shows best practices (commit messages, commits consisting of changes that belong together rather than many unrelated changes, ...).
    • ...
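
A sketch of what such a run call could look like on a music-library dataset (the file names and converter command are hypothetical):

    $ datalad run -m "convert recording to mp3" \
        --input "recording.wav" --output "recording.mp3" \
        "ffmpeg -i recording.wav recording.mp3"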

Having such a dataset plus the narrative will make progress on command and workflow explanations much easier, I believe. One idea @loj and @mih proposed was a music library. This has the great advantage of easy, almost domain-agnostic narratives, and I think the requirements I came up with could be fulfilled with it. Does anyone have additional thoughts on this idea in general, other requirements for such a dataset, or different content/narrative ideas?

What and where of metadata

Recording an aspect that came up in an exchange with @adswa.

Datasets are full of metadata, and there are two essential types:

  1. (history of) dataset content identity and availability information (i.e. the Git repo)
  2. extracted metadata that is managed by datalad itself as (somewhat hidden) dataset-internal content

The big difference is that (1) is only available after a repository is locally available, while (2) can come with any superdataset. Additionally, both types cover different aspects: (1) is metadata geared towards transport and versioning logistics, while (2) is focused on the semantics/meaning of the managed data, although recent work in https://github.com/datalad/datalad-metalad aims to blur this boundary and tries to enable a more comprehensive dataset description.

We should be clear on what is referred to by "metadata", but proper terms have not been established yet.

Dedicated markup for "the lecturer says"

Those bits often represent summaries or abstract-type statements. It could be nice to have them presented with specific markup (a special icon in the margin of the PDF, like done in 'Discovering Statistics'). For that to become possible, we would need to start using a dedicated directive, and also have these bits universally serve this purpose.

@adswa how do you feel about this? Do you want to be this strict? Or do you prefer to keep it as is?

Restructure sections

Proposal (only the set of files is relevant; the within-chapter order would stay the same as it is now):

intro

introduction
installation
philosophy

basics

datasets
howto

usecases

remodnav

plus a logical chapter "appendix" (there is not necessarily an editable source file for each bit; some of it is generated by Sphinx)

glossary
index
search
contributing

@loj @adswa Are you OK with such a change? Would open a separate PR in that case.

Create and continously update a glossary

Writing this book will occasionally require specific vocabulary, or common vocabulary that needs to be interpreted in a specific context. I can't come up with many examples now, but I'm sure we will stumble across many in the course of writing -- dataset / superdataset, history, etc. might be words like this.
Whenever I read about new things with specific vocabulary and definitions, I keep a glossary for myself to remember the terms or be able to look them up again. Maybe we would want to populate such a glossary as well?
I think of it as a dictionary-like structure, which in alphabetical order lists terms and then states a concise definition (like, you know, a glossary). Populating this while writing will be doable; creating one at the very end would be tedious. I propose we start a glossary document which we add to whenever we write and use a term that someone might want to look up at a later time.

Call for contributions

Once the desired structure of the book is determined, and the first pieces are in a shape that communicates the target style and audience, it would be good to send out a call for contributions.

There are many small, but useful bits of information floating around that never had a place to live. Some of those might be a good fit for this effort.

Overview of existing and missing content

Based on the diagram in #3, and the DataLad documentation, here is a list of commands and arguments that should be demonstrated in the book. Please add commands or opinions on options you want to see in there.

Currently, I'm just copying the available options and arguments from the docs; just because a command is listed here, it does not necessarily need to go into the book. Cross out anything you deem unnecessary by surrounding the line with ~~ like this.

Everything that is ticked is demonstrated or at least referenced somewhere in the book already.

Eventually, this can become a bit of a guide about possible contributions by others.

Core local DataLad commands

  • datalad status

    • "bare" datalad status
      • explanations of datalad status content types (dataset, directory, file, symlink)
      • explanations of datalad status content states (clean, added, modified, untracked)
      • datalad status with path specification
    • datalad status --annex
      • datalad status --annex all
      • datalad status --annex availability
    • datalad status --recursive and datalad status --recursive --recursion-limit
    • datalad status --untracked
      • datalad status --untracked no
      • datalad status --untracked all
  • datalad create

    • "bare" datalad create
      • When things go wrong: Creation attempt in non-empty directory
    • datalad create --dataset
      • When things go wrong: No installed dataset found
    • datalad create PATH (not demonstrated, but talked about)
    • datalad create --description
    • configuration options
      • -c text2git
      • --no-annex
      • --nosave
      • --annex-version
      • --annex-backend
      • --native-metadata-type
  • datalad diff

    • "bare" datalad diff
      • explanations of datalad diff change states (added, copied, deleted, modified, renamed, typechange, unmerged, untracked)
    • datalad diff --staged
    • datalad diff --revision
    • datalad diff --ignore-subdatasets
    • datalad diff --report-untracked
    • datalad diff --recursive and --recursion-limit
  • datalad save

    • "bare" datalad save
      • When things go wrong: Explain that save saves all modifications and untracked content
    • datalad save -m
      • When things go wrong: Forgot the commit message
      • -F/--message-file as an alternative
    • datalad save PATH
    • datalad save -u
    • datalad save -S/--super-datasets
    • datalad save --version-tag
    • datalad save --recursive and --recursion-limit
  • datalad run

    • "bare" datalad run
      • with a script
      • with bash command
    • datalad run -m
    • datalad run options
      • datalad run --input
      • datalad run --output
      • datalad run --explicit
      • datalad run --sidecar
      • datalad run --expand
  • datalad containers-run

    • datalad containers-add
    • datalad containers-run
    • datalad containers-remove
    • datalad containers-list

Advanced local DataLad commands

  • datalad rerun

    • "bare" datalad rerun
      • with a script
      • with bash command
    • datalad rerun -m
    • datalad rerun options
      • datalad rerun --since
      • datalad rerun --onto
      • datalad rerun --branch
      • datalad rerun --report --script
  • datalad run-procedure

    • datalad run-procedure
      • which ones actually exist? What do they do? How to get help on a procedure?
      • datalad run-procedure --discover
  • datalad uninstall

    • "bare" datalad uninstall subdataset
      • When things go wrong: specify nothing or non-dataset
    • datalad uninstall --nocheck
    • datalad uninstall --if-dirty options
    • datalad uninstall --recursive
  • datalad remove

    • "bare" datalad remove
      • Files
      • (non)empty directories
      • When things go wrong: no remote copies
    • datalad remove -m
    • datalad remove --nocheck
    • datalad remove --nosave options
    • datalad remove --if-dirty options
    • datalad remove --recursive
  • datalad publish

    • "bare" datalad publish --to
    • explain --transfer-data {all|auto|none}
    • recursive and recursion-limit

[...to be continued]

Advanced distributed DataLad commands

!!! datalad install was replaced with datalad clone in #321 and #326 !!!

  • datalad install

    • "bare" datalad install
      • with path
      • without path
    • datalad install -d as subdataset
      • with path
      • without path
    • datalad install --recursive --recursion-limit
    • datalad install --nosave
    • datalad install --reckless
    • datalad install --get-data
    • datalad install --jobs
  • datalad clone

    • "bare" datalad clone
      • with path
      • without path
    • datalad clone -d as subdataset
      • with path
      • without path
  • datalad get

    • "bare" datalad get
      • with specific path
      • with . (mentioned)
      • files
      • directories
      • subdatasets
      • datalad get --recursive --recursion-limit
    • When things go wrong
      • timeouts
      • all kinds of error messages this thing can throw
      • datalad get --verbose
    • datalad get --no-data
    • datalad get --reckless
    • datalad get --jobs
  • datalad drop

    • "bare" datalad drop
    • datalad drop --recursive --recursion-limit
    • datalad drop --nocheck
    • datalad drop --if-dirty
    • When things go wrong: If file content is not available elsewhere
  • datalad siblings

    • Concept: What is a sibling
    • Actions: query, add, remove, configure
    • there are many many options...
  • datalad update

    • "bare" datalad update
      • talk about siblings and sibling option
      • datalad update --merge
      • datalad update --reobtain-data
      • datalad update --recursive --recursion-limit
  • datalad publish

    • minimal datalad publish

Git specific commands and concepts

  • What is HEAD?
    • DETACHED HEAD STATE
  • git log
    • mention tig
  • branches
    • Explain
    • how to work with them, commands
  • git diff
  • History
    • git commit --amend
    • git revert
    • git checkout
    • git rebase
    • git reset
  • git status
  • git commit
  • git add
  • git config, and configuration(file)s
  • environment variables for configurations

Git-annex specific commands

  • Git repository version
  • the object tree
  • git-annex fsck
  • git-annex fix
  • What is the git-annex branch, what to do and what not to do
  • Git annex v5/v6/v7 repositories

Misc

  • how to get help on any command with -h/--help
  • man pages
  • datalad wtf, datalad --version
  • general explanation of the -d/--dataset option
  • The datalad superdataset ///
  • Showcase how changes in subdataset look in superdataset

Have tagged DataLad-101 repo available in the future

Because we're continuously building up content, if people get lost once or fail to execute even one command for whatever reason, in the worst case they can't follow along anymore.

Once the basics are done, I think it would be useful to provide a complete dataset with tags corresponding to the different sections/chapters. Readers would then have a chance to jump into any part of the book and follow along as well.
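
A sketch of how this could work (the tag names are hypothetical):

    # in the finished DataLad-101 dataset, tag the state at the end of each section
    $ git tag -m "state after 'Create a dataset'" section-101-101
    # readers who get lost check out the state they need
    $ git checkout section-101-101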

Construct all examples to always use `datalad save` without path args

This should communicate that one can employ a "clean-desk" philosophy when working with datasets. A (later) dedicated section can detail how one can also survive in a messy world, and when this might be necessary. But otherwise it just complicates the world -- for no good reason, IMHO.
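
In practice, examples would then follow this pattern (the commit message is made up):

    $ datalad status                            # verify that only intended modifications are present
    $ datalad save -m "add notes on create"     # no path argument: saves everything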

Windows native installation/usage exploration

FTR: This is what I find to be the simplest installation procedure on Win10, after some trial and error:

Install Conda

  • pick latest PY3 installer: https://docs.conda.io/en/latest/miniconda.html
  • keep everything on default (do not add to PATH)
  • this will also install Python
  • from now on any further action must take place in the "Anaconda Prompt" (a preconfigured terminal shell)

Install Git

  • conda install -c conda-forge git (must be from conda-forge, anaconda version does not provide cp)

Install git-annex

Install datalad

  • it is possible to install datalad via conda too (conda install -c conda-forge datalad) but it only supports the 0.11 version, hence is irrelevant in this context
  • install datalad via pip as usual

How functional datalad actually is on Win10 if installed this way is subject to further exploration.
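
Condensed, the whole procedure in the Anaconda Prompt looks roughly like this (a sketch; the git-annex step above is still open):

    conda install -c conda-forge git
    pip install datalad
    datalad wtf    # inspect the resulting setup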

(Confirmed) features

Approach to sub-part/multi-chapter structure?

With #52 we get something like this:

[screenshot: rendered table of contents of the "Basics" part]

All of this content is in the book part "Basics", but the "Starting from scratch" chapters clearly aim to be more than just subsequent chapters, without technically being a part.

Two alternative approaches that I can see ATM:

  1. 'Starting from scratch' becomes a part
  2. There will be a single 'Starting from scratch'.rst with some kind of a preamble and a subordinate TOCtree that lists the actual chapters (create, populate, modify). This way, each of them stays its own chapter, but the hierarchical structure is made clear(er).

or 3): it stays implicit.

None of this has to be dealt with any time soon. I am only collecting issues that might have implications for the desired structure of the book.

Explore constraints for literalblock

There will be a lot of "listings", and they should be readable in all output formats. HTML and PDF look good, but EPUB is somewhat unusable, because only 57 characters per line are visible (at least in some viewers).

Consistency in Language

We should be consistent with how we capitalize/separate/hyphenate terms. I think it makes sense to follow the language used on the DataLad website and in the docs. However, I don't think we need to be beholden to how it's currently done. This can be an opportunity to assess if language should be tweaked, and later bring the other resources to the same standard.

Some inconsistencies I've noticed so far are:

  • DataLad (vs Datalad or datalad)
  • dataset (vs data set)
  • subdataset (vs sub-dataset)
  • metadata (vs meta data)

This doesn't need to be addressed now, and can be cleaned up as the handbook reaches completion. But, I wanted to start the conversation now to develop a running list so we can use these terms intentionally.

Build your own DataLad adventure

Treat this issue as [WIP] - I'm essentially just taking notes to disentangle my brain.

Following up on a suggestion by @loj in #8, I'm brainstorming potential "DataLad adventures" we can let readers "choose" from. If we can find a way to make these "adventures" work, they will be the "Basics" section of the handbook (between the introductory sections and the use cases). Following up on the idea of a general example dataset, we could keep whatever example dataset we come up with as the theme of each adventure (but we may want to keep in mind that it should not be a read-from-start-to-end book).

I believe the general idea behind this is to not mimic a "documentation" structure in which we describe every command sequentially and in detail, but rather to identify common, generic workflows. (But I'm not sure whether I am conflating "the basics", meant as building blocks for such an adventure, too much with use cases here. Plus, I'm currently doubting whether this is the modular structure we are also aiming for...)

One potential problem I'm seeing while writing this is that we would have a lot of duplication of commands between the different "adventures". Also, while simple now with only stable commands, this might become a mess as more complex commands get included. However, it might be more helpful to see commands joined together in different workflows instead of having them stand next to each other, unconnected but without duplication.

The tool to visualize this (general idea, not necessarily what I'm sketching out here) likely is this: http://blockdiag.com/en/index.html.

  • A1 "I'm a new player (no experience in the terminal) and need the pre-game tutorial"
    • Chapter 0: General prerequisites
  • A2 "I want to version control a new project (local only)"
    • Chapter 1: Local workflows (Section 1: Starting a new dataset)
      • could include: create, status, diff, save
  • A3 "I want to start version controlling an existing project (local only)"
    • Chapter 1: Local workflows (Section 2: Existing projects)
      • could include: install + create, status, diff, save
  • A4 "I want to start a collaborative project with DataLad"
    • Chapter 2: Shared workflows
      • could include: Read A1, + publish (+ update?)
  • A5 "I want to use DataLad in an existing collaborative project"
    • Chapter 2: Read A2, + publish (+ update?)
  • A??
    • Chapter XY
        • run
  • A?? "I want to share data"
    • Chapter XY: ??
  • A42 "I ran into troubles. How can I help myself?"
    • Chapter XY: When things go wrong
  • A0 "I need a quick reference about the commands covered in this book!"
    • maybe a "quicklinks" chapter, with all commands, a 1-2 sentence description what they can do, and links to the chapters they are covered in. This can be the most "documentation-like" section and will be helpful for navigation and overview for people who might just want to re-read things.

I will continue developing my messy thoughts on this tomorrow.

Use-case contents

Lets start a collection of potentially interesting use-cases here. We can continuously update this list.

  • REMoDNaV/"writing a reproducible paper"
  • sharing data
  • data curation: prepare non-datalad resources as datasets, adoption of formatting standards, metadata "capture"
  • studyforrest web-catalogue
  • ...

Brainstorm templates

I'm just brainstorming bits and pieces, but something that came to my mind is that recurrent "types" of content -- specifically, I am thinking about use cases -- are most easily digested, IMO, if they follow a common outline/template/structure.

I'm currently thinking of something like

  • summary: what is the context, what will you learn from reading this
  • problem: a relatable, pointed description of a problem that we aim to solve
  • solution/utopia: a description of the outcome state after applying the necessary DataLad procedures, to basically awaken the demand to know how to get to that state, and to give a quick overview, beyond the summary, of whether the following recipe is worth the reader's time given their interest/motivation
  • recipe: a step-by-step tutorial on how to get from problem to solution, as short as possible but with as many steps as necessary, and with pointers to more detailed resources (within the book) for particular commands.
  • additional resources, links, etc for further reading

but that is a preliminary line of thought that could very well not be applicable at all - I just want to keep my thoughts somewhere. I will write a REMoDNaV use case and try to find a sensible structure (probably very iteratively, over and over again), and maybe we are able to find a generic template structure that is applicable to all kinds of use cases. By supplying a template for that structure, future contributions by others could be made easy as well.

Discover neuroimaging data

Discovering and obtaining scans that match certain criteria is difficult and requires many working pieces: good metadata, query capabilities, and data access procedures. From my POV it would be desirable to showcase how connecting all this within a DataLad context enables such a use case.

In contrast to other unwritten chapters, this would require some parallel development to smoothly connect the pieces.
