psychoinformatics-de / datalad-hirni

This project forked from datalad/datalad-extension-template

DataLad extension for (semi-)automated, reproducible processing of (medical/neuro)imaging data

Home Page: http://datalad.org

License: Other

Python 90.93% Shell 0.73% Makefile 0.44% CSS 0.02% HTML 7.76% Batchfile 0.12%

datalad-hirni's Introduction

Datalad-Hirni

This project is closed for now, due to a lack of capacity to work on it. If anyone wants to take over, I'm happy to help get started, outline ideas for where to go from its current state, and so on, but I won't be able to actually work on it for the foreseeable future. Hence, although I have some hope of getting back to it at some point, I'm archiving it.

This extension enhances DataLad (http://datalad.org) with support for (semi-)automated, reproducible processing of (medical/neuro)imaging data. Please see the extension documentation for a description of the additional commands and functionality.

For general information on how to use or contribute to DataLad (and this extension), please see the DataLad website or the main GitHub project page.

Installation

Before you install this package, please make sure that you install a recent version of git-annex. Afterwards, install the latest version of datalad-hirni from PyPI. It is recommended to use a dedicated virtualenv:

# create and enter a new virtual environment (optional)
virtualenv --system-site-packages --python=python3 ~/env/datalad
. ~/env/datalad/bin/activate

# install from PyPI
pip install datalad_hirni

# alternative: install the latest development version from GitHub
pip install git+https://github.com/psychoinformatics-de/datalad-hirni.git#egg=datalad_hirni
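
To verify that the extension is picked up, any of its commands should respond to --help (a quick, hedged check; the exact output depends on the installed datalad and hirni versions):

datalad hirni-create-study --help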

Support

The documentation of this project can be found at http://docs.datalad.org/projects/hirni. It is built from this repository's files under docs/source, so you can contribute to the docs by opening a pull request, just like you'd contribute to the code itself.

All bugs, concerns and enhancement requests for this software can be submitted here: https://github.com/psychoinformatics-de/datalad-hirni/issues

If you have a problem or would like to ask a question about how to use DataLad, please submit a question to NeuroStars.org with a datalad tag. NeuroStars.org is a platform similar to StackOverflow but dedicated to neuroinformatics.

All previous DataLad questions are available here: http://neurostars.org/tags/datalad/

Acknowledgements

The initial development of this extension was funded by the German federal state of Saxony-Anhalt and the European Regional Development Fund (ERDF), Project: Center for Behavioral Brain Sciences, Imaging Platform. Continued development is supported by the German Federal Ministry of Education and Research (BMBF 01GQ1905).

datalad-hirni's People

Contributors

adswa, aqw, bpoldrack, kyleam, loj, manuelakuhn, mih

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

datalad-hirni's Issues

studyspec validator

It is easy to end up with studyspec property combinations that will lead to BIDS conversion issues. We need to be able to validate a joint spec (across all sessions and subjects).

Concrete case: four acquisitions per subject. The T1 had no "acq" property, hence spec2bids failed on the second attempt.
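
A hedged sketch of such a check, assuming the specs are JSON streams with the BIDS-relevant fields stored as plain values (the real layout may differ) and that jq is available: list property combinations that occur more than once across all snippets.

cat */studyspec.json \
  | jq -r 'select(.type == "dicomseries")
           | [.subject, .bids_session, .bids_acquisition, .bids_modality, .bids_task, .bids_run]
           | @tsv' \
  | sort | uniq -d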

Heuristic file referenced in non-reproducible way in run record

Example record below; notice the full path to the heuristic file. This could be fixed by having heudiconv ship a shim that only imports the real heuristic shipped by hirni.

{
  "chain": [],
  "cmd": "singularity exec --bind {pwd} .datalad/environments/conversion/image heudiconv -f /home/mih/hacking/hirni/datalad_hirni/support/hirni_heuristic.py -s XX -c dcm2niix -o .git/hirni-tmp-70f4s1j8 -b -a '{dspath}' -l '' --minmeta --files sourcedata/XX/dicoms",
  "dsid": "5b1081d6-84d7-11e8-b00a-a0369fb55db0",
  "exit": 0,
  "inputs": [
    "sourcedata/XX/dicoms",
    "sourcedata/XX/studyspec.json",
    ".datalad/environments/conversion/image"
  ],
  "outputs": [
    "."
  ],
  "pwd": "."
}

spec2bids run record not reproducible

This is what it looks like. Many values are absolute paths; we can only use relative paths.

    [DATALAD RUNCMD] DICOM conversion of session /home/data/psyinf/scratch/multires3t/bids/sourcedata/REDACTED.
    
    === Do not change lines below ===
    {
     "outputs": [
      "/home/data/psyinf/scratch/multires3t/bids"
     ],
     "exit": 0,
     "cmd": [
      "singularity",
      "exec",
      "--bind",
      "/home/data/psyinf/scratch/multires3t/bids",
      ".datalad/environments/conversion/image",
      "heudiconv",
      "-f",
      "/home/mih/datalad-hirni/datalad_hirni/support/hirni_heuristic.py",
      "-s",
      "REDACTED",
      "-c",
      "dcm2niix",
      "-o",
      "/home/data/psyinf/scratch/multires3t/bids/.git/stupid/REDACTED",
      "-b",
      "-a",
      "/home/data/psyinf/scratch/multires3t/bids",
      "-l",
      "",
      "--minmeta",
      "--files",
      "/home/data/psyinf/scratch/multires3t/bids/sourcedata/REDACTED/dicoms"
     ],
     "pwd": ".",
     "dsid": "0470a5e4-625f-11e8-b78c-a0369f7c647e",
     "inputs": [
      "/home/data/psyinf/scratch/multires3t/bids/sourcedata/REDACTED/dicoms",
      "/home/data/psyinf/scratch/multires3t/bids/sourcedata/REDACTED/studyspec.json",
      ".datalad/environments/conversion/image"
     ],
     "chain": []

See datalad/datalad-container#34 for one side-aspect of this situation.

Protocol of the creation of a new study dataset

Create a new dataset

datalad hirni-create-study raw
cd raw

Import a DICOM tarball. Importantly, two subject IDs can be specified. This enables conversion into an anonymized BIDS dataset later on, without forcing a naming convention onto the raw dataset.
(Everything other than the DICOM tarball path is likely going to change.)

datalad hirni-import-dcm \
   --subject abcd1234 \
   --anon-subject 52 \
   --session r14 \
   /path/to/dicom/tarball.tar.gz

Repeat this process for as many DICOM tarballs as needed. Each acquisition tarball will become a directory in the dataset, just under the top level. The DICOMs will be extracted one level deeper into a dicoms subdirectory, which is a subdataset -- one for each acquisition.
This allows for keeping DICOMs around in a structure that is independent of an individual study (i.e. imaging center data storage, etc.).

Once imported, we can drop all extracted DICOMs and compact the DICOM subdatasets. This saves quite a bit of disk space. After this, DataLad still has the original tarballs and can provide extracted DICOMs automatically on request, whenever needed.

datalad drop */dicoms/* 
for i in $(datalad -f '{path}' subdatasets -r); do git -C $i gc; done

With the DICOMs imported, we can edit the auto-generated studyspec.json files for each acquisition (in the acquisition directory) to add arbitrary information that was not or could not be extracted from the DICOMs. For example, a label for the task a participant was performing during a run. This will be facilitated by a browser-based editing form coming in #51, but can equally be done via hand-editing or programmatic changes to the JSON files (JSON stream format).
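
As a hedged sketch of such a programmatic change (assuming jq is available and that fields are stored as plain values; the actual snippet layout may differ, and the acquisition directory name is made up), one could add a task label to every BOLD series in one acquisition's spec:

jq -c 'if (.description // "" | test("bold")) then . + {"bids_task": "rest"} else . end' \
  acq1/studyspec.json > studyspec.tmp && mv studyspec.tmp acq1/studyspec.json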

At this point additional files could be added to the dataset. Information corresponding to any of the MR data acquisitions must be added to the respective directory. This can be stimulation logs, behavioral response logs, or other simultaneously acquired data. Let's say we have put PsychoPy log files into a psychopylog subdirectory for each acquisition. We can automatically create spec snippets for these data components, and identify a custom conversion script that transforms the original logs into BIDS-compatible files -- in this case events.tsv files. However, the conversion type, the specific call, and the kind of file(s) produced are completely flexible.

datalad hirni-spec4anything \
  --properties '{"converter_path": "../code/psychopylog2events.py", "converter": "{_hs[converter_path]} {_hs[bids_subject]} {_hs[bids_acquisition]} {_hs[location]}"}' \
  */psychopylog/*.log

The above command creates a specification snippet for each log file given, and additionally updates the specification with the properties given via --properties. Those are included verbatim in the spec. These specific properties utilize a placeholder/template language similar to that of datalad run. The main difference is that, via the _hs symbol, arbitrary other properties from the same specification snippet can be referenced. For example: {_hs[bids_subject]} will be replaced with the subject identifier (original or anonymized, depending on configuration elsewhere). This way it is straightforward to compose arbitrary converter calls. Path specifications must be relative to the acquisition folder (i.e. the location of the studyspec.json).
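
For illustration only (the subject, acquisition, and log file path are hypothetical values), the converter template above would expand to something like:

../code/psychopylog2events.py 02 acq1 psychopylog/run1.log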

Once a converter is configured for each desired data component, the study raw dataset is complete and can be archived.

To generate a BIDS-compatible dataset from this raw dataset, please follow the procedure described in #54.

Have demo special remote implementation to interface institutional storage

Assuming they store DICOM tarballs somewhere and can locate them via a study/acquisition identifier, we should initially pull in tarballs via such a special remote (analogous to the datalad special remote) that could be configured to use different ID->URL resolvers depending on the machine it is running on.

That way we can resolve to a directory with the incoming tarballs on the machine next to a scanner, but later on against some institutional query API when running on an arbitrary machine. This way, access and permission management can be handled completely by the institution.
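
A hedged sketch of the resolver idea, with invented paths and URLs: the same acquisition identifier resolves to a local incoming directory next to the scanner, or to an institutional query API elsewhere.

# hypothetical ID->URL resolver (all locations are illustrative)
resolve_dicom_url () {
    acq_id="$1"
    if [ -d /scanner/incoming ]; then
        echo "file:///scanner/incoming/${acq_id}.tar.gz"
    else
        echo "https://imaging.example.org/api/dicoms/${acq_id}"
    fi
}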

Kill `spec2bids --target-dir`

I cannot see a clear use case that this would help with (other than the abstract "but if someone wants to output into an arbitrary directory"). OTOH it causes real problems with information leakage into run records, and their portability.

Unless we have a real use case that is worth this cost, we should kill it IMHO.

Protocol of a conversion from raw study dataset to a BIDS dataset

# new dataset to receive BIDS compliant content
datalad create public
cd public

# apply default setup for BIDS (README in Git, ...)
datalad run-procedure setup_bids_dataset

# grab a container with a known version of heudiconv to perform all the tricky bits
# the --call-fmt option is not needed in most cases, but on this specific machine it
# is required for singularity to mount the drive the dataset is on (which is not $HOME)
datalad containers-add conversion \
    -u shub://mih/ohbm2018-training:heudiconv \
    --call-fmt 'singularity exec --bind {{pwd}} {img} {cmd}'

# current behavior of heudiconv requires that the raw dataset has a README file in
# the root!
# if not, it will force-place one and thereby modify the input subdataset
# ... not funny...
datalad install -d . -s ../raw sourcedata

# let it process all the acquisition specification snippets in the raw dataset
# outcome will be one commit for each of them, yielding a complete BIDS
# dataset
datalad hirni-spec2bids --anonymize sourcedata/*/studyspec.json

# drop all inputs of the conversion and leave only the BIDS dataset components
datalad uninstall -d . -r --nocheck sourcedata

The outcome is a valid BIDS dataset according to the official BIDS validator.
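
To double-check a given conversion, the official validator can be run against the resulting dataset (a hedged example; bids-validator has to be installed separately, e.g. via npm):

bids-validator .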

`spec2bids` needs option to force-drop "unused" content

Rationale: such content can be non-anonymized earlier versions of data files that must not be leaked. Immediately before spec2bids finishes, it should perform this step by default (but with the ability to disable it when necessary, e.g. when fixing up a dataset with prior history and additional branches).
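
Until such an option exists, a hedged manual approximation is to let git-annex identify content that is no longer referenced by any branch or tag and force-drop it (destructive; review the reported list before dropping):

git annex unused
git annex dropunused --force all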

Runtime info

FYI @cni-md

For a Siemens Prisma-type dataset with ~1 h of BOLD scans, it takes 1 min to import it into a study dataset. This includes metadata aggregation from the DICOMs.

Seemingly superfluous addition of a submodule on `import-dcm`

The subdataset gets a container added as a submodule, but I do not see why this needs to be done. Possibly, this should go into the primary study dataset instead.

Moreover, there seems to be something going wrong while setting this up:

<snip>
[INFO   ] Cloning http://psydata.ovgu.de/cbbs-imaging/conv-container/.git to '/home/data/psyinf/scratch/multires3t/raw/datalad_hirni_import/dicoms/.datalad/environments/import-container' 
[INFO   ] access to dataset sibling "psydata-store" not auto-enabled, enable with:           
|               datalad siblings -d "/home/data/psyinf/scratch/multires3t/raw/datalad_hirni_import/dicoms/.datalad/environments/import-container" enable -s psydata-store 
[INFO   ] Aggregate metadata for dataset /home/data/psyinf/scratch/multires3t/raw/REDACTED/dicoms 
[WARNING] Running find resulted in stderr output: git-annex: ds not found
git-annex: find: 1 failed
<snip>

Duplicate import-dcm crashes

When I run the same import command twice in a row (where the first run worked just fine):

% datalad hirni-import-dcm <tarball>
<snip>
[INFO   ] Attempting to save 6 files/datasets                                                                                
[ERROR  ] [Errno 17] File exists: '/home/mih/pd/scratch/multires3t/raw/REDACTED_0971' [os.py:makedirs:241] (FileExistsError)     

We should state clearly what the issue is instead of crashing on a false assumption.

Run records of the conversion must not contain identifying info

At the moment, things like the center's subject ID are likely to end up in path names or other arguments. We need to ensure that the run records that end up in the BIDS dataset do not contain any identifying information.

That means that we likely cannot use the current -s/--spec-file argument strategy to identify a path to the to-be-converted data.

I think it would be doable to add an ID->anonID mapping in the study dataset and make spec2bids look up all paths based on this mapping. But I have no idea yet how to avoid input path values in the run record.

Support arbitrary/custom converters

Use case: people have a custom raw data file format, but they can include a converter script in the study dataset.

We can use the run-procedure interface for that:

  1. temporarily configure a directory for converter procedures that are stored in the study dataset
  2. refer to such a converter via a dedicated prefix (maybe local:) in the studyspec
  3. run the procedure on the BIDS dataset

Such a converter procedure would need an API. Proposal:

  • arg1: path to BIDS dataset (all procedures do that)
  • arg2: location (from the studyspec)
  • arg3: output filename (provided by hirni, based on the studyspec)
  • remaining args: list of additional arguments that hirni pulls from an optional field "converter_args" in the study spec

Any converter is free to ignore arg3 (e.g. if it needs to do some "one-input-multiple-outputs" conversion). Possibly it would be useful to pass in more info about the conversion and not just the output filename, but we could also use placeholders + "converter_args" to achieve that.
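
A hedged skeleton of a converter procedure following the proposed API (names and the placeholder conversion are illustrative only, not an existing hirni interface):

#!/bin/sh
bids_ds="$1"    # arg1: path to the BIDS dataset
location="$2"   # arg2: 'location' from the studyspec
out_file="$3"   # arg3: output filename suggested by hirni
shift 3         # remaining: optional "converter_args" from the spec, now in "$@"

# placeholder conversion: just copy the input to the suggested output location
cp "$location" "$bids_ds/$out_file"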

Implement "additional data" importer

Use case: We have already imported DICOMs for an acquisition, and now we need to add more data (e.g. stimulus timing info) for that known acquisition. We do not want to ask for tons of duplicate info again. Instead, we want to point to the imported session and say "and now more for this one", so we can inherit things like the subject ID, etc.

Implement copy "converter"

Use case: People have created files in a format that is already perfect for further processing (e.g. BIDS conversion), so it just needs to be copied to the right location with the correct name. This should be possible with minimal effort.

singularity exec fails when dataset is in directory that contains a symlink

containers-run output:

[INFO   ] == Command start (output follows) =====                                            
INFO: Running heudiconv version 0.5.dev1
Traceback (most recent call last):
  File "/usr/local/bin/heudiconv", line 11, in <module>
    load_entry_point('heudiconv==0.5.dev1', 'console_scripts', 'heudiconv')()
  File "/usr/local/lib/python3.5/dist-packages/heudiconv/cli/run.py", line 120, in main
    process_args(args)
  File "/usr/local/lib/python3.5/dist-packages/heudiconv/cli/run.py", line 244, in process_args
    args.subjs, grouping=args.grouping)
  File "/usr/local/lib/python3.5/dist-packages/heudiconv/parser.py", line 164, in get_study_sessions
    for _, files_ex in get_extracted_dicoms(files):
  File "/usr/local/lib/python3.5/dist-packages/heudiconv/parser.py", line 84, in get_extracted_dicoms
    if not tarfile.is_tarfile(t):
  File "/usr/lib/python3.5/tarfile.py", line 2450, in is_tarfile
    t = open(name)
  File "/usr/lib/python3.5/tarfile.py", line 1559, in open
    return func(name, "r", fileobj, **kwargs)
  File "/usr/lib/python3.5/tarfile.py", line 1685, in xzopen
    fileobj = lzma.LZMAFile(fileobj or name, mode, preset=preset)
  File "/usr/lib/python3.5/lzma.py", line 118, in __init__
    self._fp = builtins.open(filename, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/home/mih/pd/scratch/multires3t/bids/sourcedata/REDACTED/dicoms'
[INFO   ] == Command exit (modification check follows) ===== 

pd is a symlink.

I tried the following (which works on the command line), but it causes some yet-to-be-determined issue:

diff --git a/.datalad/config b/.datalad/config
index b139e7b..f7f7752 100644
--- a/.datalad/config
+++ b/.datalad/config
@@ -3,4 +3,4 @@
 [datalad "containers.conversion"]
        updateurl = shub://mih/ohbm2018-training:heudiconv
        image = .datalad/environments/conversion/image
-       cmdexec = [\"singularity\", \"exec\", \"{img}\", \"{cmd}\"]
+       cmdexec = [\"singularity\", \"exec\", \"--bind\", \"$(readlink -f $(pwd))\", \"{img}\", \"{cmd}\"]
% datalad hirni-spec2bids -s sourcedata/REDACTED
[INFO   ] == Command start (output follows) ===== 
ERROR  : Target /var/lib/singularity/mnt/final/$(readlink -f $(pwd)) doesn't exist
ABORT  : Retval = 255
[INFO   ] == Command exit (modification check follows) ===== 
[INFO   ] The command had a non-zero exit code. If this is expected, you can save the changes with 'datalad save -r -F.git/COMMIT_EDITMSG .' 
Failed to run ['singularity', 'exec', '--bind', '$(readlink -f $(pwd))', '.datalad/environments/conversion/image', 'heudiconv', '-f', '/home/mih/datalad-hirni/datalad_hirni/support/hirni_heuristic.py', '-s', 'REDACTED', '-c', 'dcm2niix', '-o', '/home/mih/pd/scratch/multires3t/bids/.git/stupid/REDACTED', '-b', '-a', '/home/mih/pd/scratch/multires3t/bids', '-l', '', '--minmeta', '--files', '/home/mih/pd/scratch/multires3t/bids/sourcedata/REDACTED/dicoms'] under '/home/mih/pd/scratch/multires3t/bids'. Exit code=255. out= err=%                               

(also: there is "stupid" in the path...)
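
A hedged workaround sketch (untested): the shell substitution above is stored literally and never expanded, so one option is to resolve the symlinked path once, at configuration time, and bake the real path into the call format. This reintroduces a machine-specific path into the config, so it is only a local stop-gap.

datalad containers-add conversion \
    -u shub://mih/ohbm2018-training:heudiconv \
    --call-fmt "singularity exec --bind $(readlink -f "$PWD") {img} {cmd}"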

Remove reference volumes

While converting studyforrest phase1 it became clear that the current conversion via heudiconv includes the reference volume, while the existing conversion via mcverter doesn't.
We probably should provide some routine to remove those automatically. We could even detect this by comparing first volumes across runs.
This could be done in a post-conversion procedure or from within spec2bids via a switch. Since it's actually a post-conversion processing step, I'd separate it from spec2bids. Considering that we also want some post-conversion routine for defacing, we might even want to combine those two into some hirni standard post-processing routine.

Missing import

[ERROR  ] 'Dataset' object has no attribute 'hirni_dicom2spec' [import_dicoms.py:__call__:211] (AttributeError) 

studyspec.json not diff-friendly

It is better to store it sorted by key (done already), and one item per line (like in aggregate.json). Otherwise diffs are incomprehensible.
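
A hedged one-liner that produces exactly this layout from an existing spec, assuming jq is available (-S sorts keys, -c emits one compact object per line):

jq -cS . studyspec.json > studyspec.tmp && mv studyspec.tmp studyspec.json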

Support notion of subject ID anonymization in studyspec

ATM there is only one ID, the one extracted from the DICOMs. On conversion, we need to be able to map this into a new anonymized subject ID.

It would be good to be able to assign such an ID either in studyspec.json or on spec2bids.

Resolve naming confusion

import-dcm takes a SESSION argument, to decide where to put the imported data. In the context of BIDS, a session is one measurement session that usually gets a name like 'pre' or 'post' and does not work as a unique identifier of an acquisition (one measurement of one subject for one study session type).

Maybe call the SESSION arg "acquisition_id",

and expose the 'session' in the BIDS sense as a command-line arg, similar to 'subject'.

Notes on BIDS conversion

# fresh dataset
datalad create bids
cd bids
# add container with converter
datalad containers-add conversion -u shub://mih/ohbm2018-training:heudiconv --call-fmt 'singularity exec --bind {{pwd}} {img} {cmd}'
# grab raw DICOM dataset with all subjects and acquisition sessions
datalad install -d . -s ../raw sourcedata --reckless
# convert a single acquisition to BIDS
datalad hirni-spec2bids -a sourcedata/REDACTED --anonymize

`run` property default could be more clever

Scenario with such descriptions:

...
epi2d_bold_1.4iso_03
epi2d_bold_1.4iso_04
...

Clearly the last element is a counter. If there is a common prefix and the last element is numeric, it could be used as a run counter default.
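
A hedged shell sketch of that default (the description value is taken from the scenario above; where exactly such a rule would live in the spec-generation code is left open):

desc="epi2d_bold_1.4iso_04"
run="${desc##*_}"           # keep only the part after the last underscore -> 04
case "$run" in
  *[!0-9]*|"") run="" ;;    # not purely numeric: no run default
esac
echo "${run:-<no default>}"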

Document purpose of studyspec fields

Please edit: for each field, document the possible values (if choices are limited) and what it is used for. An illustrative snippet follows the list.

  • anon_subject
    anonymized subject ID. To be used for conversion if spec2bids is called with --anonymize
  • bids_modality
    modality according to BIDS. To be used for conversion and therefore for the naming scheme of converted files
  • bids_run
    run according to BIDS. To be used for conversion and therefore for the naming scheme of converted files
  • bids_session
    session according to BIDS. To be used for conversion and therefore for the naming scheme of converted files
  • bids_task
    task according to BIDS. To be used for conversion and therefore for the naming scheme of converted files
  • comment
    free to use. Intended to ease maintenance and specification editing by humans
  • converter
    converter to be used. ATM just either heudiconv or ignore (to not convert at all)
  • dataset_id
    datalad dataset ID of the (DICOM-)dataset
  • dataset_refcommit
    commit of the (DICOM-) dataset this specification is referring to
  • description
    human readable description of the image series
  • id
    human readable id of the image series. Defaults to SeriesNumber from DICOM metadata
  • location
    location of the DICOMs containing this image series
  • status
    unused ATM
  • subject
    subject ID. Either specified by the user or guessed from DICOM metadata. This is the non-anonymized subject ID
  • type
    type of the data this specification is about. Kind of specification class. Most important ATM is dicomseries, which denotes a specification of a DICOM image series
  • uid
    unique identifier of an image series. This is the SeriesInstanceUID from DICOM headers. Used during conversion by spec2bids to match image series data as found by heudiconv with this specification
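
As a hedged illustration (all values are invented, and the real snippet layout may wrap values in additional structure), a single dicomseries snippet using a subset of these fields could look roughly like this, shown pretty-printed although specs are stored one object per line:

{
  "type": "dicomseries",
  "subject": "abcd1234",
  "anon_subject": "52",
  "bids_session": "r14",
  "bids_modality": "bold",
  "bids_task": "rest",
  "bids_run": "1",
  "converter": "heudiconv",
  "description": "epi2d_bold_1.4iso_03",
  "id": "3",
  "location": "dicoms",
  "uid": "<SeriesInstanceUID from DICOM>"
}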
