datalad-handbook / course

Talks and materials for workshops based on the DataLad handbook

License: Other

course's Introduction

The DataLad handbook 📙

This is a living resource on why and, more importantly, how to use DataLad. The rendered version is available at https://handbook.datalad.org and is currently under initial development.

The handbook is a practical, hands-on crash course to learn and experience DataLad. You do not need to be a programmer, computer scientist, or Linux-crank. If you have never touched your computer's shell before, you will be fine. Regardless of your background and personal use cases for DataLad, the handbook will show you the principles of DataLad, and from chapter 1 onwards you will be using them.

Find more general information about the idea behind the handbook in the poster presented at the 2020 OHBM or dive straight into your DataLad adventure.

Contributing

Contributions in any form (pull requests, issues, content requests/ideas, ...) are always welcome. If you are using the handbook and find that something does not work, please let us know. Likewise, if you are using DataLad for your individual project, consider contributing by telling us about your use case. You can find out more on how to contribute here, and a list of all contributors so far below, in CONTRIBUTORS.md, and in .zenodo.json.

Notes for Instructors

The book is the basis for workshops and lectures on DataLad and data management. Among other things, the handbook's course repository contains slides and live casts of the code examples in this book. It is constantly growing, and everyone is free to use the material under the license terms below. Contributions and feedback are very welcome.

License

CC-BY-SA: You are free to

share - copy and redistribute the material in any medium or format
adapt - remix, transform, and build upon the material for any purpose, even commercially

under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Contributors ✨

Thanks goes to these wonderful people (emoji key):

_{Adina S. Wagner} 💻 🖋 📖 🎨 🤔 🚇 🚧 📆 👀 📓 📢 ⚠️ 🐛 💡 💬 ️️️️♿️	_{Laura Waite} 🤔 🚧 👀 📢 💬 🖋	_{Michael Hanke} 💬 🐛 💻 🖋 📖 🎨 💡 🤔 🚇 🚧 🔌 📆 👀 🔧 ⚠️ 📢 📓 ️️️️♿️	_{Kyle Meyer} 🐛 👀 💬 🖋 🤔	_{Marisa Heckner} 🤔 📓 🐛 🖋	_{Benjamin Poldrack} 💬 🤔 💡 ✅	_{Yaroslav Halchenko} 👀 🖋 🤔 🐛
_{Chris Markiewicz} 🐛	_{Pattarawat Chormai} 🐛 💻	_{Lisa N. Mochalski} 🐛 🖋 💡 🤔	_{Lisa Wiersch} 🐛	_{Jean-Baptiste Poline} 🖋	_{Nevena Kraljevic} 📓	_{Alex Waite} 👀 🐛 🤔
_{Lya K. Paas} 🐛 💻	_{Niels Reuter} 🖋	_{Peter Vavra} 🤔 📓	_{Tobias Kadelka} 📓	_{Peer Herholz} 🤔	_{Alexandre Hutton} 🖋 🐛	_{Sarah Oliveira} 👀 🤔
_{Dorian Pustina} 🤔	_{Hamzah Hamid Baagil} 📓 🐛	_{Tristan Glatard} 🐛 🖋	_{Giulia Ippoliti} 🖋 💡	_{Christian Mönch} 🖋	_{Togaru Surya Teja} 🖋	_{Dorien Huijser} 🐛 📓
_{Ariel Rokem} 🐛	_{Remi Gau} 🐛 🤔 🚧 👀 🚇 💻 🎨	_{Judith Bomba} 🐛	_{Konrad Hinsen} 🐛	_{Wu Jianxiao} 🐛	_{Małgorzata Wierzba} 📓 👀 ✅	_{Stefan Appelhoff} 🚇 🔧 🐛
_{Michael Joseph} 🤔 🖋 🐛	_{Tamara Cook} 👀 🚇	_{Stephan Heunis} 🐛 🚧 🖋 💡 👀	_{Joerg Stadler} 🐛	_{Sin Kim} 🐛 🖋 👀	_{Oscar Esteban} 🐛	_{Michał Szczepanik} 👀 🐛 🖋
_eort 🐛	_Myrskyta 🐛	_{Thomas Guiot} 🐛	_jhpb7 🐛	_{Ikko Ashimine} 🐛	_{Arshitha Basavaraj} 🖋 🐛 🚧	_{Anthony J Veltri} 📓
_{Isil Bilgin} 🐛 🚧	_{Julian Kosciessa} 🖋	_{Isaac To} 🚧 🖋 🐛	_{Austin Macdonald} 🐛	_{Christopher S. Hall} 🐛	_jcf2 🐛	_{Julien Colomb} 🖋
_{Danny Garside} 🐛 🚧	_{Justus Kuhlmann} 🖋	_melanieganz 🐛	_{Damien François} 🐛 🖋	_{Tosca Heunis} 🐛 📓	_{Jeremy Magland} 🐛	_{Matthias Riße} 🐛
_{David Nicholson} 🐛 🖋

This project follows the all-contributors specification. Contributions of any kind welcome!

course's Issues

public url for pics/slides?

I got interested in sandwhich03.svg, which comes from the pics/slides submodule, but that submodule has an SSH URL registered for it:

(git)lena:~datalad/datalad-handbook/course[master]git
$> datalad -f json_pp subdatasets pics/slides
{
  "action": "subdataset",
  "gitmodule_name": "pics/slides",
  "gitmodule_url": "kumo.ovgu.de:/home/mih/public_html/datalad/slides",
  "gitshasum": "76882e01a9194444b507491889e7d9f6d6dcb6b2",
  "parentds": "/home/yoh/proj/datalad/datalad-handbook/course",
  "path": "/home/yoh/proj/datalad/datalad-handbook/course/pics/slides",
  "refds": "/home/yoh/proj/datalad/datalad-handbook/course",
  "state": "absent",
  "status": "ok",
  "type": "dataset"
}
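A record like the one above can also be checked programmatically. The following is a minimal sketch in Python, assuming only the field names visible in that output. The helper `needs_public_url` is hypothetical, and its "http(s) only" rule is a heuristic chosen for illustration, not something DataLad defines.

```python
import json
import re

# One record as printed by `datalad -f json_pp subdatasets`
# (copied from the issue, trimmed to the fields this sketch uses).
record_text = """
{
  "action": "subdataset",
  "gitmodule_name": "pics/slides",
  "gitmodule_url": "kumo.ovgu.de:/home/mih/public_html/datalad/slides",
  "state": "absent",
  "status": "ok",
  "type": "dataset"
}
"""


def needs_public_url(url: str) -> bool:
    """Hypothetical heuristic: treat anything that is not http(s) as a URL
    that anonymous users probably cannot clone from (scp-style host:path,
    ssh://, local paths)."""
    return re.match(r"^https?://", url) is None


record = json.loads(record_text)
flagged = needs_public_url(record["gitmodule_url"])
if flagged:
    print(f"{record['gitmodule_name']}: no public URL ({record['gitmodule_url']})")
```

Run over all subdataset records of a repository, a check of this kind would list every submodule that outside readers are unlikely to be able to install.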

1.5 day workshop in Lucca

@mih and I will be giving a workshop on DataLad in Lucca on March 23rd-24th. This issue lists the TODOs and acts as a progress tracker.
Please extend and edit as necessary. :)

Logistics

await Feedback from Lucca on dates
await Feedback from Lucca on GDrive account
figure out travel
- ~~@adswa (I will likely take a train. Depending on when we plan to arrive, there is a nice one overnight, arriving at 7 something in the morning)~~ EDIT: both of us will go to Pisa from Montreal
- @mih

Software

write a custom wrapper around a special remote for gdrive.
- Figure out which software to base it on. Rclone seems to work, but there also seems to be git-annex-remote-googledrive, listed under "gitannex/tips", and directly linked as a specialized service.

Teaching

A Basics layout has been proposed by @mih and awaits feedback from Lucca

Datalad concepts and principles
Basics of local data/code version control
- Hands on: tasks to exercise basic building blocks
Modular data management for reproducible science
- Hands on: implement sketch of a reproducible paper
Data management for collaborative science
- Hands on: Using your infrastructure (Gdrive) to collaborate on a demo project
Data publication
- Hands on: Publish data on "GitHub"
Outlook (what else is possible, resources, use cases)
Potential group work: Small groups of people are given problems to solve with DataLad and present their solutions

This is currently structured like this:
Monday 23 Morning session
1 Datalad concepts and principles
2 Basics of local data/code version control + Hands on: tasks to exercise basic building blocks

Monday 23 Afternoon session
1 Modular data management for reproducible science + Hands on: implement sketch of a reproducible paper
2 Data management for collaborative science + Hands on: Using your infrastructure (Gdrive) to collaborate on a demo project

Tuesday 24 Morning session
1 Data publication + Hands on: Publish data on "GitHub"
2 Outlook (what else is possible, resources, use cases)

Resources to create

rclone GDrive wrapper (started here datalad/datalad#4162)
slides
code lists
sketches of a LaTeX (?) skeleton for a reproducible paper. @adswa could potentially use resources she will help to improve at the Turing Way book dash.
Data to use for examples and to publish to Gdrive
Optional/Wishlist: Some sort of audience response system (EduVote, browser-based, Google Forms, ...?), e.g., in the form of: "How confident are you using ..." --> rating scale
Workshop feedback (potentially pre-post, to learn about attendees' expectations before and after the course, and their knowledge gain). Also remember to collect feedback on DataLad

Educating for a FAIR future talk at the NWG, due Feb 22nd

10 minute video, prerecorded - young investigator presentation
live discussion virtually, March 28th, evening

abstract:
With a growing awareness of the role of sample size and replicable results (Button et al., 2013; Turner et al., 2018), a rise of platforms, tools, and standards that aim to facilitate data sharing and management (Wiener et al., 2016), unprecedented sample sizes (e.g., UKBiobank; Bzdok & Yeo, 2017), and increasingly complex data analyses (e.g., Glasser et al., 2013; Alfaro-Almagro et al., 2018), research data management (RDM) is essential to put open and FAIR neuroimaging research into effect. But just as FAIRness and RDM cannot be an afterthought in any given scientific project, they also shouldn’t be an afterthought in the training and education of current and future generations of neuroscientists. This training has to fulfill the demands of different stakeholders in science: 1) researchers who apply RDM in their scientific projects, 2) PIs and similar personnel with management tasks who need to set out and justify plans for the implementation of RDM and FAIR principles, and 3) trainers, such as librarians or data managers, who educate users on tools and practices for FAIR science (Fothergill et al., 2019; Grisham et al., 2016). Researchers of any career level and of any background need accessible tutorial-like educational content and documentation for relevant tools and concepts to apply FAIR RDM from the get-go. Planners need high-level, non-technical information in order to make informed yet efficient decisions on whether a tool fulfils their needs. And trainers need reliable, open teaching material.
A user-driven alternative to scientific software documentation by software developers, “Documentation Crowdsourcing”, has been successfully employed by the NumPy project (Oliphant, 2006; Pawlik et al., 2015). Extending this concept beyond documentation, we have created the DataLad handbook (handbook.datalad.org) as a free & open-source, user-driven and -focused educational instrument and resource for trainers, users, and planners for (research) data management, independent of their background and skill level (Wagner et al., 2020). Drawing from the experiences of creating more than 400 pages of educational material, with almost 40 independent contributors from around the world, and nearly 2 years of in-person and virtual teaching based on the handbook, I want to highlight the unique challenges of RDM training as well as its opportunities for the field of neuroscience.

DebConf Talk on DataLad, due August 15th

The DebConf talk proposal was accepted.
Here is the abstract:

Title: DataLad - Decentralized Management of Digital Objects for Open Science

With a general awareness of a reproducibility crisis in many scientific areas and increasing importance of research data management in science and policy making, data-driven fields require convenient and scalable data management solutions. Standing on the shoulders of Git and git-annex (git-annex.branchable.com/, Joey Hess), DataLad provides a decentralized solution that enables the joint management of code, data, and complete containerized computational environments in a scalable and distributed fashion. With features such as unambiguous version control, a wide spectrum of data transport mechanisms, convenient provenance capture, and re-execution for verification or as an alternative to storage and transport, it enables and facilitates many aspects of open and reproducible science: collaboration, sharing, analytical transparency, computational reproducibility of digital research objects, and disk-space aware storage and computing workflows on infrastructure that ranges from personal laptops up to supercomputers.

In this talk, we will introduce DataLad, present its main features which should be of interest to the audience regardless of their relation to any field of science, and share the process and status of its adoption in the neuroimaging community.

Recording tips: https://debconf-video-team.pages.debian.net/docs/advice_for_recording.html

some figures used in the talks are missing

I was just trying to get a glimpse of https://github.com/datalad-handbook/course/blob/0b26cb6ac9a5d6c2d5bd5473a92d0284d959ec79/talks/hhu.html but it seems that most of the figures, e.g. the one referenced in talks/hhu.html as <img height="850" class="fragment fade-in" src="../pics/ukb_datasets.svg">, are nowhere to be found.
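A report like this could be turned into a repeatable check. The sketch below is one possible way to do it, using only the Python standard library: it collects the `src` of every `<img>` tag in a talk's HTML and lists those that do not resolve to an existing file. The names `ImgSrcCollector` and `missing_images` are made up for this illustration and are not part of the repository.

```python
from html.parser import HTMLParser
from pathlib import Path


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def missing_images(html_text, html_dir):
    """Return the image paths referenced in html_text that do not exist
    when resolved relative to html_dir. Remote and inline sources are skipped."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    missing = []
    for src in collector.sources:
        if src.startswith(("http://", "https://", "data:")):
            continue
        if not (Path(html_dir) / src).exists():
            missing.append(src)
    return missing
```

For the talk mentioned above, a call would look like `missing_images(Path("talks/hhu.html").read_text(), "talks")`. Note that `Path.exists()` returns `False` for broken symlinks, so annexed figures whose content has not been retrieved would be reported as missing too.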

Useful free tool for simple audience polling: https://www.directpoll.com

This tool is very useful:

create questions in advance (expires after 30 days unless you "save" it again)
embed the live results into the presentation (using an <iframe></iframe> tag):

    <iframe src="https://directpoll.com/r?XDbzPBdJ2bAX0ZEC2YlWLumm6WtYBkChGSFh5Vwe4W"
    title="This is my poll" width="900" height="900"></iframe>

Book vs course

The goal is to develop a course based on the book while minimizing the amount of disconnected material, thereby making it easier to evolve book and course together with the evolution of DataLad.

the course and the book share the exact same content, but the former is performed, while the latter serves as the syllabus
code examples in the book are actually executable. we use this feature to turn them into "cast" scripts. once in that form, we can use the cast_live tools from DataLad to demo them in a course installment
each code example in the book needs to be equipped with a "caption" that can then serve as a narrative cue in the cast script. The caption could then also be displayed in the book itself.
each code example in the book needs to get a tag or label that can be used to subselect examples that make up a shorter, but still internally consistent narrative -- this aids the generation of shorter course installments
initially the slides of the course material are based on the "summary" components of each chapter, plus relevant key figures. once tailored to and validated by teaching the course, their content is fed back into the book (possibly using a new dedicated markup). Each slide contains a link to the respective part of the book, where more details are available. The link is possibly implemented as a QR code.
the order of topics in the course matches the order in the book. if it turns out that this order is suboptimal it needs to be adjusted in both book and course. consequently, the course starts with basics and a uniform narrative, and ends with more standalone scenario descriptions.
the course starts with, or follows, a "pitch" that outlines an attractive take-away for the respective target audience. Candidate pitches are any "use case" chapter.
slide decks for course installments are based on reveal.js, and are more or less fully generated using the book sources and a (set of) templates. Each chapter has its own slide deck.
analogous to the book, each session/chapter (and in particular the early ones) must communicate, in a self-evident fashion, why its content/objective is important and applicable to practical problems a target audience can relate to.
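The caption and tag ideas from the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CodeExample` structure, the tag names, and the function `build_cast_script` are hypothetical, and the output assumes a cast script made of `say` lines (narrative cue) and `run` lines (command), which should be checked against the cast_live tools actually in use.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class CodeExample:
    caption: str  # narrative cue, also usable as a caption in the book
    command: str  # shell command taken from the book
    tags: set = field(default_factory=set)  # labels used to subselect examples


def build_cast_script(examples, tag):
    """Emit one `say` and one `run` line for every example carrying `tag`,
    keeping the order in which the examples appear in the book."""
    lines = []
    for example in examples:
        if tag not in example.tags:
            continue
        lines.append(f"say {shlex.quote(example.caption)}")
        lines.append(f"run {shlex.quote(example.command)}")
    return "\n".join(lines)


# Hypothetical excerpt of a chapter: two examples, one of them also tagged
# for a shorter course installment.
examples = [
    CodeExample("Create a new dataset", "datalad create DataLad-101", {"basics", "short"}),
    CodeExample("Look at the history", "git log --oneline", {"basics"}),
]
short_script = build_cast_script(examples, "short")
```

Selecting by tag keeps the book as the single source: a shorter installment is just a different tag applied to the same ordered list of examples.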

Content (based on current book)

1. Setup: Git ID, installation, what is a terminal
2. Datasets (create, save, install, nesting): basic local version control, manual log keeping
3. Run: basic provenance tracking, automatic log keeping
4. Git-annex basics: disaster recovery (needs merge of the currently disjoint chapters "git-annex" and "help yourself")
5. Collaboration: yes!
6. YODA: using the conceptual pieces optimally for maximum practical benefits -- this will be and is a mostly conceptual part

Each of these "basics" chapters is handled in a 90min installment.

After the initial sessions on "basics", a number of use case descriptions can follow.

For the initial run at INM7, we will have a dedicated "How to work with the local infrastructure" session that could take place any time after (3). This will then also turn into a use case chapter in the book.

Instead of a weekly or biweekly frequency, this course can also be taught as a 2-day block event, with the basics on day 1, and a re-cap + use cases on a (shorter) day 2.

Cast_live should log into brainbfast and execute commands there...

... instead of executing everything on my machine.

ABCD-ReproNim Course

Date: Jan 22nd 2020
Tentative schedule:

ReproNim: Data Versioning and Transformation with DataLad
Instructor: Adina Wagner*, Institute of Neuroscience and Medicine (INM-7)
Why Should Data Be Versioned?
Simple DataLad Transform: Retrieve, Compute, Store Results
Create a Dataset
Using DataLad with Containers on the Dataset
Rerunning and Checking Analysis Differences

Submission due: Dec. 15th

Todos:

Pre-record your lecture (details to be provided separately) by September 15th/December 15th (depending on your 'session'; see syllabus);
Be available for your 1-hour question and answer period with the students on the Friday at 1pm EST/10am PST as indicated in the syllabus;
Provide 1-2 readings/watchings (~30 minutes) you would like to assign prior to your lecture;
Review the homework assignment generated by the TA team before distribution to the students;
(Optional) Attend the "virtual" workshop March 8-12, 2021.

Handbook2livecasts: Todos for cast_live and automatically created casts

This is to document how to turn the handbook into cast_live scripts.

Create a cast with annotated code snippets in the handbook (see datalad-handbook/book#217 for insights on how to do this)
Use a custom version of DataLad's cast_live to "play" it

TODO:

update the cast_live tools to run without obscure failure (XGetWindowProperty[_NET_WM_DESKTOP] failed (code=1))
- the command that fails is xdotool windowactivate --sync $(xdotool getwindowfocus)
create a copy of appropriately customized cast live tools in this repo
add the casts (as soon as they are created)
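One possible way to keep a cast from aborting on the failure mentioned above is to treat the window activation as best effort. The sketch below is an assumption about a workaround, not the fix that was applied to the cast_live tools. The helper names `run_optional` and `activate_focused_window` are invented for this example, and only the two xdotool invocations quoted in the TODO are used.

```python
import subprocess


def run_optional(cmd):
    """Run cmd (a list of arguments). Return its stripped stdout, or None if
    the program is not installed or exits with a non-zero status."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def activate_focused_window():
    """Best-effort equivalent of
    `xdotool windowactivate --sync $(xdotool getwindowfocus)`.
    Returns True on success and False otherwise, without raising."""
    window_id = run_optional(["xdotool", "getwindowfocus"])
    if not window_id:
        return False
    return run_optional(["xdotool", "windowactivate", "--sync", window_id]) is not None
```

A caller can then decide whether a `False` result is worth a warning or can be ignored, which may be acceptable on window managers that do not set `_NET_WM_DESKTOP`.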

IRTG Workshop Aachen

When: November 26th, 2019, 4pm
Where: Same library seminar room as before
Duration: 2 hours
Participants: 25 grad students, various backgrounds (neuroscience, psych, bio, physics, engineering, medicine), workshop will be made compulsory

Communicated expectations on content:

DataLad
BIDS

TODO

Business trip request (Dienstreiseantrag)
Short description/overview to distribute in advance
Slides/casts
Code/materials for participants

Own thoughts

The time is extremely limited: The workshop needs to get them motivated to learn the tools (e.g., start with a reproducible paper teaser, and for BIDS maybe show brainlife.io), give a brief introduction to the basic principles (prob. Dataset basics and a shortened Reproducible execution session), and on top of that contain pointers to everything that is relevant for subsequent self-study.
Based on the conversation with Julia and HanGue, students don't seem to know about version control/Git, BIDS or any standard structure. Teaching them the very basics alone will already make a large difference to their workflows.
possibly: collate a sheet with a collection of useful links.

HCP data related course for the MPI CBS

Some time in September,
remote
30min concepts, 30-90min hands-on.
centered around how to get and analyze HCP data.

This will be cool!

It would be useful to have an interactive run session (e.g., datalad run nano).
Building up the command by trial and error as in the book doesn't work as well in a workshop session: it is hard to motivate why we run into all of these errors, and easy to lose track of what we're trying to achieve.