psychoinformatics-de / datalad-hirni

This project forked from datalad/datalad-extension-template

DataLad extension for (semi-)automated, reproducible processing of (medical/neuro)imaging data

Home Page: http://datalad.org

License: Other

Python 90.93% Shell 0.73% Makefile 0.44% CSS 0.02% HTML 7.76% Batchfile 0.12%

datalad-hirni's Introduction

Datalad-Hirni

This project is closed for now, due to a lack of capacity to work on it. If anyone wants to take over, I'm happy to help get started, outline ideas for where to go from its current state, and so on, but I won't be able to actually work on it for the foreseeable future. Hence, although I have some hope of getting back to it at some point, I'm archiving it.

This extension enhances DataLad (http://datalad.org) with support for (semi-)automated, reproducible processing of (medical/neuro)imaging data. Please see the extension documentation for a description of the additional commands and functionality.

For general information on how to use or contribute to DataLad (and this extension), please see the DataLad website or the main GitHub project page.

Installation

Before you install this package, please make sure that you install a recent version of git-annex. Afterwards, install the latest version of datalad-hirni from PyPI. It is recommended to use a dedicated virtualenv:

# create and enter a new virtual environment (optional)
virtualenv --system-site-packages --python=python3 ~/env/datalad
. ~/env/datalad/bin/activate

# install from PyPI
pip install datalad_hirni

# alternative: install the latest development version from GitHub
pip install git+https://github.com/psychoinformatics-de/datalad-hirni.git#egg=datalad_hirni
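
To verify that the extension is picked up, any of its commands should respond to --help (a quick, hedged check; the exact output depends on the installed datalad and hirni versions):

datalad hirni-create-study --help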

Support

The documentation of this project can be found at http://docs.datalad.org/projects/hirni. It is built from this repository's files under docs/source, so you can contribute to the docs by opening a pull request, just like you'd contribute to the code itself.

All bugs, concerns and enhancement requests for this software can be submitted here: https://github.com/psychoinformatics-de/datalad-hirni/issues

If you have a problem or would like to ask a question about how to use DataLad, please submit a question to NeuroStars.org with a datalad tag. NeuroStars.org is a platform similar to StackOverflow but dedicated to neuroinformatics.

All previous DataLad questions are available here: http://neurostars.org/tags/datalad/

Acknowledgements

The initial development of this extension was funded by the German federal state of Saxony-Anhalt and the European Regional Development Fund (ERDF), Project: Center for Behavioral Brain Sciences, Imaging Platform. Continued development is supported by the German Federal Ministry of Education and Research (BMBF 01GQ1905).

datalad-hirni's People

Contributors

adswa, aqw, bpoldrack, kyleam, loj, manuelakuhn, mih

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

datalad-hirni's Issues

studyspec validator

It is easy to end up with studyspec property combinations that will lead to BIDS conversion issues. We need to be able to validate a joint spec (across all sessions and subjects).

Concrete case: four acquisitions per subject. The T1 had no "acq" property, hence spec2bids failed on the second attempt.
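
A hedged sketch of such a check, assuming the specs are JSON streams with the BIDS-relevant fields stored as plain values (the real layout may differ) and that jq is available: list property combinations that occur more than once across all snippets.

cat */studyspec.json \
  | jq -r 'select(.type == "dicomseries")
           | [.subject, .bids_session, .bids_acquisition, .bids_modality, .bids_task, .bids_run]
           | @tsv' \
  | sort | uniq -d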

Heuristic file referenced in non-reproducible way in run record

Example record below; notice the full path to the heuristic file. This could be fixed by having heudiconv ship a shim that only imports the real heuristic shipped by hirni.

{
  "chain": [],
  "cmd": "singularity exec --bind {pwd} .datalad/environments/conversion/image heudiconv -f /home/mih/hacking/hirni/datalad_hirni/support/hirni_heuristic.py -s XX -c dcm2niix -o .git/hirni-tmp-70f4s1j8 -b -a '{dspath}' -l '' --minmeta --files sourcedata/XX/dicoms",
  "dsid": "5b1081d6-84d7-11e8-b00a-a0369fb55db0",
  "exit": 0,
  "inputs": [
    "sourcedata/XX/dicoms",
    "sourcedata/XX/studyspec.json",
    ".datalad/environments/conversion/image"
  ],
  "outputs": [
    "."
  ],
  "pwd": "."
}

spec2bids run record not reproducible

This is what it looks like. Many values are absolute paths; we can only use relative paths.

    [DATALAD RUNCMD] DICOM conversion of session /home/data/psyinf/scratch/multires3t/bids/sourcedata/REDACTED.
    
    === Do not change lines below ===
    {
     "outputs": [
      "/home/data/psyinf/scratch/multires3t/bids"
     ],
     "exit": 0,
     "cmd": [
      "singularity",
      "exec",
      "--bind",
      "/home/data/psyinf/scratch/multires3t/bids",
      ".datalad/environments/conversion/image",
      "heudiconv",
      "-f",
      "/home/mih/datalad-hirni/datalad_hirni/support/hirni_heuristic.py",
      "-s",
      "REDACTED",
      "-c",
      "dcm2niix",
      "-o",
      "/home/data/psyinf/scratch/multires3t/bids/.git/stupid/REDACTED",
      "-b",
      "-a",
      "/home/data/psyinf/scratch/multires3t/bids",
      "-l",
      "",
      "--minmeta",
      "--files",
      "/home/data/psyinf/scratch/multires3t/bids/sourcedata/REDACTED/dicoms"
     ],
     "pwd": ".",
     "dsid": "0470a5e4-625f-11e8-b78c-a0369f7c647e",
     "inputs": [
      "/home/data/psyinf/scratch/multires3t/bids/sourcedata/REDACTED/dicoms",
      "/home/data/psyinf/scratch/multires3t/bids/sourcedata/REDACTED/studyspec.json",
      ".datalad/environments/conversion/image"
     ],
     "chain": []

See datalad/datalad-container#34 for one side-aspect of this situation.

Protocol of the creation of a new study dataset

Create a new dataset

datalad hirni-create-study raw
cd raw

Import a DICOM tarball. Importantly, two subject IDs can be specified. This enables conversion into an anonymized BIDS dataset later on, without forcing a naming convention onto the raw dataset.
(Everything other than the DICOM tarball path is likely going to change.)

datalad hirni-import-dcm \
   --subject abcd1234 \
   --anon-subject 52 \
   --session r14 \
   /path/to/dicom/tarball.tar.gz

Repeat this process for as many DICOM tarballs as needed. Each acquisition tarball will become a directory in the dataset, just under the top level. The DICOMs will be extracted one level deeper into a dicoms subdirectory, which is a subdataset -- one for each acquisition.
This allows for keeping DICOMs around in a structure that is independent of an individual study (i.e. imaging center data storage, etc.).

Once imported, we can drop all extracted DICOMs and compact the DICOM subdatasets. This saves quite a bit of disk space. After this, DataLad still has the original tarballs and can provide extracted DICOMs automatically on request, whenever needed.

datalad drop */dicoms/* 
for i in $(datalad -f '{path}' subdatasets -r); do git -C $i gc; done

With the DICOMs imported, we can edit the auto-generated studyspec.json files for each acquisition (in the acquisition directory) to add arbitrary information that was not or could not be extracted from the DICOMs. For example, a label for the task a participant was performing during a run. This will be facilitated by a browser-based editing form coming in #51, but can equally be done via hand-editing or programmatic changes to the JSON files (JSON stream format).
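
As a hedged sketch of such a programmatic change (assuming jq is available and that fields are stored as plain values; the actual snippet layout may differ, and the acquisition directory name is made up), one could add a task label to every BOLD series in one acquisition's spec:

jq -c 'if (.description // "" | test("bold")) then . + {"bids_task": "rest"} else . end' \
  acq1/studyspec.json > studyspec.tmp && mv studyspec.tmp acq1/studyspec.json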

At this point additional files could be added to the dataset. Information corresponding to any of the MR data acquisitions must be added to the respective directory. This can be stimulation logs, behavioral response logs, or other simultaneously acquired data. Let's say we have put PsychoPy log files into a psychopylog subdirectory for each acquisition. We can automatically create spec snippets for these data components, and identify a custom conversion script that transforms the original logs into BIDS-compatible files -- in this case events.tsv files. However, the conversion type, the specific call, and the kind of file(s) produced are completely flexible.

datalad hirni-spec4anything \
  --properties '{"converter_path": "../code/psychopylog2events.py", "converter": "{_hs[converter_path]} {_hs[bids_subject]} {_hs[bids_acquisition]} {_hs[location]}"}' \
  */psychopylog/*.log

The above command creates a specification snippet for each log file given, and additionally updates the specification with the properties given via --properties. Those are included verbatim in the spec. These specific properties utilize a placeholder/template language similar to that of datalad run. The main difference is that, via the _hs symbol, arbitrary other properties from the same specification snippet can be referenced. For example: {_hs[bids_subject]} will be replaced with the subject identifier (original or anonymized, depending on configuration elsewhere). This way it is straightforward to compose arbitrary converter calls. Path specifications must be relative to the acquisition folder (i.e. the location of the studyspec.json).
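
For illustration only (the subject, acquisition, and log file path are hypothetical values), the converter template above would expand to something like:

../code/psychopylog2events.py 02 acq1 psychopylog/run1.log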

Once a converter is configured for each desired data component, the study raw dataset is complete and can be archived.

To generate a BIDS-compatible dataset from this raw dataset, please follow the procedure described in #54.

Have demo special remote implementation to interface institutional storage

Assuming they store DICOM tarballs somewhere and can locate them via a study/acquisition identifier, we should initially pull in tarballs via such a special remote (analogous to the datalad special remote) that could be configured to use different ID->URL resolvers depending on the machine it is running on.

That way we can resolve to a directory with the incoming tarballs on the machine next to a scanner, but later on against some institutional query API when running on an arbitrary machine. This way, access and permission management can be handled completely by the institution.
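
A hedged sketch of the resolver idea, with invented paths and URLs: the same acquisition identifier resolves to a local incoming directory next to the scanner, or to an institutional query API elsewhere.

# hypothetical ID->URL resolver (all locations are illustrative)
resolve_dicom_url () {
    acq_id="$1"
    if [ -d /scanner/incoming ]; then
        echo "file:///scanner/incoming/${acq_id}.tar.gz"
    else
        echo "https://imaging.example.org/api/dicoms/${acq_id}"
    fi
}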

Kill `spec2bids --target-dir`

I cannot see a clear use case that this would help with (other than the abstract "but if someone wants to output into an arbitrary directory"). OTOH it causes real problems with information leakage into run records, and their portability.

Unless we have a real use case that is worth this cost, we should kill it IMHO.

Protocol of a conversion from raw study dataset to a BIDS dataset

# new dataset to receive BIDS compliant content
datalad create public
cd public

# apply default setup for BIDS (README in Git, ...)
datalad run-procedure setup_bids_dataset

# grab a container with a known version of heudiconv to perform all the tricky bits
# the --call-fmt option is not needed in most cases, but on this specific machine it
# is required for singularity to mount the drive the dataset is on (which is not $HOME)
datalad containers-add conversion \
    -u shub://mih/ohbm2018-training:heudiconv \
    --call-fmt 'singularity exec --bind {{pwd}} {img} {cmd}'

# current behavior of heudiconv requires that the raw dataset has a README file in
# the root!
# if not, it will force-place one and thereby modify the input subdataset
# ... not funny...
datalad install -d . -s ../raw sourcedata

# let it process all the acquisition specification snippets in the raw dataset
# outcome will be one commit for each of them, yielding a complete BIDS
# dataset
datalad hirni-spec2bids --anonymize sourcedata/*/studyspec.json

# drop all inputs of the conversion and leave only the BIDS dataset components
datalad uninstall -d . -r --nocheck sourcedata

The outcome is a valid BIDS dataset according to the official BIDS validator.
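
To double-check a given conversion, the official validator can be run against the resulting dataset (a hedged example; bids-validator has to be installed separately, e.g. via npm):

bids-validator .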

`spec2bids` needs option to force-drop "unused" content

Rationale: such content can be non-anonymized earlier versions of data files that must not be leaked. Immediately before spec2bids finishes, it should perform this step by default (but with the ability to disable it when necessary, e.g. when fixing up a dataset with prior history and additional branches).
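
Until such an option exists, a hedged manual approximation is to let git-annex identify content that is no longer referenced by any branch or tag and force-drop it (destructive; review the reported list before dropping):

git annex unused
git annex dropunused --force all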

Runtime info

FYI @cni-md

For a Siemens Prisma-type dataset with ~1 h of BOLD scans, it takes 1 min to import it into a study dataset. This includes metadata aggregation from the DICOMs.

Seemingly superfluous addition of a submodule on `import-dcm`

The subdataset gets a container added as a submodule, but I do not see why this needs to be done. Possibly, this should go into the primary study dataset instead.

Moreover, there seems to be something going wrong while setting this up:

<snip>
[INFO   ] Cloning http://psydata.ovgu.de/cbbs-imaging/conv-container/.git to '/home/data/psyinf/scratch/multires3t/raw/datalad_hirni_import/dicoms/.datalad/environments/import-container' 
[INFO   ] access to dataset sibling "psydata-store" not auto-enabled, enable with:           
|               datalad siblings -d "/home/data/psyinf/scratch/multires3t/raw/datalad_hirni_import/dicoms/.datalad/environments/import-container" enable -s psydata-store 
[INFO   ] Aggregate metadata for dataset /home/data/psyinf/scratch/multires3t/raw/REDACTED/dicoms 
[WARNING] Running find resulted in stderr output: git-annex: ds not found
git-annex: find: 1 failed
<snip>

Duplicate import-dcm crashes

When I run the same import command twice in a row (where the first run worked just fine):

% datalad hirni-import-dcm <tarball>
<snip>
[INFO   ] Attempting to save 6 files/datasets                                                                                
[ERROR  ] [Errno 17] File exists: '/home/mih/pd/scratch/multires3t/raw/REDACTED_0971' [os.py:makedirs:241] (FileExistsError)     

We should state clearly what the issue is instead of crashing on a false assumption.

Run records of the conversion must not contain identifying info

At the moment, things like the center's subject ID are likely to end up in path names or other arguments. We need to ensure that the run records that end up in the BIDS dataset do not contain any identifying information.

That means that we likely cannot use the current -s/--spec-file argument strategy to identify a path to the to-be-converted data.

I think it would be doable to add an ID->anonID mapping in the study dataset and make spec2bids look up all paths based on this mapping. But I have no idea yet how to avoid input path values in the run record.

Support arbitrary/custom converters

Use case: people have a custom raw data file format, but they can include a converter script in the study dataset.

We can use the run-procedure interface for that:

  1. temporarily configure a directory for converter procedures that are stored in the study dataset
  2. refer to such a converter via a dedicated prefix (maybe local:) in the studyspec
  3. run the procedure on the BIDS dataset

Such a converter procedure would need an API. Proposal:

  • arg1: path to BIDS dataset (all procedures do that)
  • arg2: location (from the studyspec)
  • arg3: output filename (provided by hirni, based on the studyspec)
  • remaining args: list of additional arguments that hirni pulls from an optional field "converter_args" in the study spec

Any converter is free to ignore arg3 (e.g. if it needs to do some "one-input-multiple-outputs" conversion). Possibly it would be useful to pass in more info about the conversion and not just the output filename, but we could also use placeholders + "converter_args" to achieve that.
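
A hedged skeleton of a converter procedure following the proposed API (names and the placeholder conversion are illustrative only, not an existing hirni interface):

#!/bin/sh
bids_ds="$1"    # arg1: path to the BIDS dataset
location="$2"   # arg2: 'location' from the studyspec
out_file="$3"   # arg3: output filename suggested by hirni
shift 3         # remaining: optional "converter_args" from the spec, now in "$@"

# placeholder conversion: just copy the input to the suggested output location
cp "$location" "$bids_ds/$out_file"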

Implement "additional data" importer

Use case: We have already imported DICOMs for an acquisition, and now we need to add more data (e.g. stimulus timing info) for that known acquisition. We do not want to ask for tons of duplicate info again. Instead, we want to point to the imported session and say "and now more for this one", so we can inherit things like the subject ID, etc.

Implement copy "converter"

Use case: People have created files in a format that is already perfect for further processing (e.g. BIDS conversion), so it just needs to be copied to the right location with the correct name. This should be possible with minimal effort.

singularity exec fails when dataset is in directory that contains a symlink

containers-run output:

[INFO   ] == Command start (output follows) =====                                            
INFO: Running heudiconv version 0.5.dev1
Traceback (most recent call last):
  File "/usr/local/bin/heudiconv", line 11, in <module>
    load_entry_point('heudiconv==0.5.dev1', 'console_scripts', 'heudiconv')()
  File "/usr/local/lib/python3.5/dist-packages/heudiconv/cli/run.py", line 120, in main
    process_args(args)
  File "/usr/local/lib/python3.5/dist-packages/heudiconv/cli/run.py", line 244, in process_args
    args.subjs, grouping=args.grouping)
  File "/usr/local/lib/python3.5/dist-packages/heudiconv/parser.py", line 164, in get_study_sessions
    for _, files_ex in get_extracted_dicoms(files):
  File "/usr/local/lib/python3.5/dist-packages/heudiconv/parser.py", line 84, in get_extracted_dicoms
    if not tarfile.is_tarfile(t):
  File "/usr/lib/python3.5/tarfile.py", line 2450, in is_tarfile
    t = open(name)
  File "/usr/lib/python3.5/tarfile.py", line 1559, in open
    return func(name, "r", fileobj, **kwargs)
  File "/usr/lib/python3.5/tarfile.py", line 1685, in xzopen
    fileobj = lzma.LZMAFile(fileobj or name, mode, preset=preset)
  File "/usr/lib/python3.5/lzma.py", line 118, in __init__
    self._fp = builtins.open(filename, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/home/mih/pd/scratch/multires3t/bids/sourcedata/REDACTED/dicoms'
[INFO   ] == Command exit (modification check follows) ===== 

pd is a symlink.

I tried the following (which works on the command line), but it causes some yet-to-be-determined issue:

diff --git a/.datalad/config b/.datalad/config
index b139e7b..f7f7752 100644
--- a/.datalad/config
+++ b/.datalad/config
@@ -3,4 +3,4 @@
 [datalad "containers.conversion"]
        updateurl = shub://mih/ohbm2018-training:heudiconv
        image = .datalad/environments/conversion/image
-       cmdexec = [\"singularity\", \"exec\", \"{img}\", \"{cmd}\"]
+       cmdexec = [\"singularity\", \"exec\", \"--bind\", \"$(readlink -f $(pwd))\", \"{img}\", \"{cmd}\"]
% datalad hirni-spec2bids -s sourcedata/REDACTED
[INFO   ] == Command start (output follows) ===== 
ERROR  : Target /var/lib/singularity/mnt/final/$(readlink -f $(pwd)) doesn't exist
ABORT  : Retval = 255
[INFO   ] == Command exit (modification check follows) ===== 
[INFO   ] The command had a non-zero exit code. If this is expected, you can save the changes with 'datalad save -r -F.git/COMMIT_EDITMSG .' 
Failed to run ['singularity', 'exec', '--bind', '$(readlink -f $(pwd))', '.datalad/environments/conversion/image', 'heudiconv', '-f', '/home/mih/datalad-hirni/datalad_hirni/support/hirni_heuristic.py', '-s', 'REDACTED', '-c', 'dcm2niix', '-o', '/home/mih/pd/scratch/multires3t/bids/.git/stupid/REDACTED', '-b', '-a', '/home/mih/pd/scratch/multires3t/bids', '-l', '', '--minmeta', '--files', '/home/mih/pd/scratch/multires3t/bids/sourcedata/REDACTED/dicoms'] under '/home/mih/pd/scratch/multires3t/bids'. Exit code=255. out= err=%                               

(also: there is "stupid" in the path...)
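
A hedged workaround sketch (untested): the shell substitution above is stored literally and never expanded, so one option is to resolve the symlinked path once, at configuration time, and bake the real path into the call format. This reintroduces a machine-specific path into the config, so it is only a local stop-gap.

datalad containers-add conversion \
    -u shub://mih/ohbm2018-training:heudiconv \
    --call-fmt "singularity exec --bind $(readlink -f "$PWD") {img} {cmd}"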

Remove reference volumes

While converting studyforrest phase1 it became clear that the current conversion via heudiconv includes the reference volume, while the existing conversion via mcverter doesn't.
We probably should provide some routine to remove those automatically. We could even detect this by comparing first volumes across runs.
This could be done in a post-conversion procedure or from within spec2bids via a switch. Since it's actually a post-conversion processing step, I'd separate it from spec2bids. Considering that we also want some post-conversion routine for defacing, we might even want to combine those two into some hirni standard post-processing routine.

Missing import

[ERROR  ] 'Dataset' object has no attribute 'hirni_dicom2spec' [import_dicoms.py:__call__:211] (AttributeError) 

studyspec.json not diff-friendly

It is better to store it sorted by key (done already), and one item per line (like in aggregate.json). Otherwise diffs are incomprehensible.
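
A hedged one-liner that produces exactly this layout from an existing spec, assuming jq is available (-S sorts keys, -c emits one compact object per line):

jq -cS . studyspec.json > studyspec.tmp && mv studyspec.tmp studyspec.json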

Support notion of subject ID anonymization in studyspec

ATM there is only one ID, the one extracted from the DICOMs. On conversion, we need to be able to map this into a new anonymized subject ID.

It would be good to be able to assign such an ID either in studyspec.json or on spec2bids.

Resolve naming confusion

import-dcm takes a SESSION argument, to decide where to put the imported data. In the context of BIDS, a session is one measurement session that usually gets a name like 'pre' or 'post' and does not work as a unique identifier of an acquisition (one measurement of one subject for one study session type).

Maybe call the SESSION arg "acquisition_id",

and expose the 'session' in the BIDS sense as a command-line arg, similar to 'subject'.

Notes on BIDS conversion

# fresh dataset
datalad create bids
cd bids
# add container with converter
datalad containers-add conversion -u shub://mih/ohbm2018-training:heudiconv --call-fmt 'singularity exec --bind {{pwd}} {img} {cmd}'
# grab raw DICOM dataset with all subjects and acquisition sessions
datalad install -d . -s ../raw sourcedata --reckless
# convert a single acquisition to BIDS
datalad hirni-spec2bids -a sourcedata/REDACTED --anonymize

`run` property default could be more clever

Scenario with such descriptions:

...
epi2d_bold_1.4iso_03
epi2d_bold_1.4iso_04
...

Clearly the last element is a counter. If there is a common prefix and the last element is numeric, it could be used as a run counter default.
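
A hedged shell sketch of that default (the description value is taken from the scenario above; where exactly such a rule would live in the spec-generation code is left open):

desc="epi2d_bold_1.4iso_04"
run="${desc##*_}"           # keep only the part after the last underscore -> 04
case "$run" in
  *[!0-9]*|"") run="" ;;    # not purely numeric: no run default
esac
echo "${run:-<no default>}"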

Document purpose of studyspec fields

Please edit: for each field, document the possible values (if choices are limited) and what it is used for. An illustrative snippet follows the list.

  • anon_subject
    anonymized subject ID. To be used for conversion if spec2bids is called with --anonymize
  • bids_modality
    modality according to BIDS. To be used for conversion and therefore for the naming scheme of converted files
  • bids_run
    run according to BIDS. To be used for conversion and therefore for the naming scheme of converted files
  • bids_session
    session according to BIDS. To be used for conversion and therefore for the naming scheme of converted files
  • bids_task
    task according to BIDS. To be used for conversion and therefore for the naming scheme of converted files
  • comment
    free to use. Intended to ease maintenance and specification editing by humans
  • converter
    converter to be used. ATM just either heudiconv or ignore (to not convert at all)
  • dataset_id
    datalad dataset ID of the (DICOM-)dataset
  • dataset_refcommit
    commit of the (DICOM-) dataset this specification is referring to
  • description
    human readable description of the image series
  • id
    human readable id of the image series. Defaults to SeriesNumber from DICOM metadata
  • location
    location of the DICOMs containing this image series
  • status
    unused ATM
  • subject
    subject ID. Either specified by the user or guessed from DICOM metadata. This is the non-anonymized subject ID
  • type
    type of the data this specification is about. Kind of specification class. Most important ATM is dicomseries, which denotes a specification of a DICOM image series
  • uid
    unique identifier of an image series. This is the SeriesInstanceUID from DICOM headers. Used during conversion by spec2bids to match image series data as found by heudiconv with this specification
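
As a hedged illustration (all values are invented, and the real snippet layout may wrap values in additional structure), a single dicomseries snippet using a subset of these fields could look roughly like this, shown pretty-printed although specs are stored one object per line:

{
  "type": "dicomseries",
  "subject": "abcd1234",
  "anon_subject": "52",
  "bids_session": "r14",
  "bids_modality": "bold",
  "bids_task": "rest",
  "bids_run": "1",
  "converter": "heudiconv",
  "description": "epi2d_bold_1.4iso_03",
  "id": "3",
  "location": "dicoms",
  "uid": "<SeriesInstanceUID from DICOM>"
}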
