datalad / datalad-neuroimaging

DataLad extension for neuroimaging research

Home Page: http://datalad.org

License: Other

datalad neuroimaging

datalad-neuroimaging's Introduction

 ____          _           _                 _
|  _ \   __ _ | |_   __ _ | |      __ _   __| |
| | | | / _` || __| / _` || |     / _` | / _` |
| |_| || (_| || |_ | (_| || |___ | (_| || (_| |
|____/  \__,_| \__| \__,_||_____| \__,_| \__,_|
                                   Neuroimaging


This extension enhances DataLad (http://datalad.org) for working with neuroimaging data and workflows. Please see the extension documentation for a description of the additional commands and functionality.

For general information on how to use or contribute to DataLad (and this extension), please see the DataLad website or the main GitHub project page.

Installation

Before you install this package, please make sure you have a recent version of git-annex installed. Afterwards, install the latest version of datalad-neuroimaging from PyPI. It is recommended to use a dedicated virtualenv:

# create and enter a new virtual environment (optional)
virtualenv --system-site-packages --python=python3 ~/env/dataladni
. ~/env/dataladni/bin/activate

# install from PyPI
pip install datalad_neuroimaging

There is also a Singularity container available. The latest release version can be obtained by running:

singularity pull shub://datalad/datalad-neuroimaging

Support

The documentation for this project can be found here: http://docs.datalad.org/projects/neuroimaging

All bugs, concerns and enhancement requests for this software can be submitted here: https://github.com/datalad/datalad-neuroimaging/issues

If you have a problem or would like to ask a question about how to use DataLad, please submit a question to NeuroStars.org with a datalad tag. NeuroStars.org is a platform similar to StackOverflow but dedicated to neuroinformatics.

All previous DataLad questions are available here: http://neurostars.org/tags/datalad/

Acknowledgements

DataLad development is supported by a US-German collaboration in computational neuroscience (CRCNS) project "DataGit: converging catalogues, warehouses, and deployment logistics into a federated 'data distribution'" (Halchenko/Hanke), co-funded by the US National Science Foundation (NSF 1429999) and the German Federal Ministry of Education and Research (BMBF 01GQ1411). Additional support is provided by the German federal state of Saxony-Anhalt and the European Regional Development Fund (ERDF), Project: Center for Behavioral Brain Sciences, Imaging Platform. This work is further facilitated by the ReproNim project (NIH 1P41EB019936-01A1).

datalad-neuroimaging's People

Contributors

adswa, bpoldrack, christian-monch, jsheunis, jwodder, kyleam, loj, mih, mslw, remi-gau, yarikoptic


datalad-neuroimaging's Issues

Replace `simplejson` with `json`

Because simplejson will be removed from datalad-core, the datalad-metalad code has to be adapted. All uses of simplejson should be replaced by the use of json.
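For most call sites the swap should be mechanical; a minimal sketch of the typical loads/dumps usage (the example record is made up):

import json  # was: import simplejson

record = json.loads('{"subject": "01"}')   # simplejson.loads -> json.loads
text = json.dumps(record, indent=2)        # simplejson.dumps -> json.dumps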

ENH: run-procedure for BIDS dataset configuration

I'm wondering if it would be useful to add a run-procedure to this extension to configure BIDS+datalad datasets such that all files in the root BIDS directory are committed to git while all the rest of the files, irrespective of type, go to the annex?

I'm thinking of use-cases related to distributed dataset-level metadata extraction and catalog generation. Data in the annex (typically all subfolders of the root BIDS directory) would need to be protected because of data privacy concerns, while data in the root directory (participants.tsv, dataset_description.json, any json sidecar files defined at the root level, any additional dataset-level metadata added at root level) are typically considered non-sensitive or have specifically been edited to be so, and can therefore be considered safe to commit to git.

Configuring a dataset like that (as opposed to annexing all files in the dataset) would allow sufficient metadata extraction on any clone without requiring access to the annex.

The run procedure would add something like this to .gitattributes:

* annex.largefiles=anything
/* annex.largefiles=nothing

The procedure (let's call it rootfiles2git) would be available in this extension because it seems (to me) like it could be generally applicable to BIDS datasets collected in the EU (because of GDPR).
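A minimal sketch of what such a procedure could look like, assuming the rootfiles2git name proposed above and the usual DataLad procedure calling convention (dataset path as first argument):

#!/usr/bin/env python3
# Hypothetical rootfiles2git procedure: annex everything by default,
# but commit files in the dataset root directly to git.
import sys
from datalad.distribution.dataset import require_dataset

ds = require_dataset(sys.argv[1], check_installed=True,
                     purpose='BIDS root-files-to-git configuration')
with open(ds.pathobj / '.gitattributes', 'a') as f:
    f.write('* annex.largefiles=anything\n')
    f.write('/* annex.largefiles=nothing\n')
ds.save(path='.gitattributes', message='Keep root-level files in git')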

WDYT @yarikoptic @bpoldrack @mslw @CPernet @loj

bids extractor rejects any dataset without dataset_description.json

But those are still compliant, and pybids can yield loads of useful info on them.

The rationale was:

        # (I think) we need a cheap test to see if there is anything, otherwise
        # pybids will try to parse any size of directory hierarchy in full
        if not exists(opj(self.ds.path, self._dsdescr_fname)):
            return {}, []

But I think this is invalid, as any extractor is enabled explicitly for a dataset.

General upkeep and updates to `datalad-neuroimaging` + compatibility with `datalad-metalad`

I'm interested in spending some time on general maintenance of this extension, and eventually updating all of the extractors to be compatible with the first release of datalad-metalad. This interest comes mainly from the need to support a seamless pipeline for metadata extraction and catalog generation. Based on our current user profile, supporting metadata extraction and catalog generation for neuroimaging data seems very applicable.

There are a bunch of open PRs, some quite old. I'll start by commenting on them and asking the relevant contributors for input where needed to merge/close. I'll do the same with open issues.

At some point, perhaps in a dev call, we should probably discuss if a coordinated release is necessary, and how to approach it if so.

@datalad/developers comments welcome, and please let me know if you have other/better ideas for approaching this.

get_bids_dataset is broken

Ran into it via datalad-hirni's Travis setup (see psychoinformatics-de/datalad-hirni#93).

On a fresh system (no cached versions of the test datasets that are submodules of datalad-neuroimaging):

from datalad_neuroimaging.tests.utils import get_bids_dataset  

ds = get_bids_dataset()

<snip>


[INFO   ] == Command exit (modification check follows) ===== 
INFO   : == Command exit (modification check follows) =====
[ERROR  ] dataset containing given paths is not underneath the reference dataset <Dataset path=/home/ben/.cache/datalad/datalad_neuroimaging_srcrepo/datalad_neuroimaging/tests/data/bids>: [PosixPath('/home/ben/.cache/datalad/datalad_neuroimaging_srcrepo/datalad_neuroimaging/tests/data/bids')] [status(/home/ben/.cache/datalad/datalad_neuroimaging_srcrepo)] 
ERROR  : dataset containing given paths is not underneath the reference dataset <Dataset path=/home/ben/.cache/datalad/datalad_neuroimaging_srcrepo/datalad_neuroimaging/tests/data/bids>: [PosixPath('/home/ben/.cache/datalad/datalad_neuroimaging_srcrepo/datalad_neuroimaging/tests/data/bids')] [status(/home/ben/.cache/datalad/datalad_neuroimaging_srcrepo)]
---------------------------------------------------------------------------
IncompleteResultsError                    Traceback (most recent call last)
<ipython-input-3-7f9e1a26998f> in <module>
----> 1 ds = get_bids_dataset()

<snip>

This is the same that can be found here: https://travis-ci.org/psychoinformatics-de/datalad-hirni/jobs/518791608

bids extractor - sidecar .json files have no bids info

with current master 0.2.0-11-gb3d8897 (and the new reworked metadata handling in revolution 0.9.0-183-g8765949; pybids seems to be 0.8.0) I saw that sidecar files have only datalad_core extractor records:

{
  "datalad_core": {
    "@id": "SHA1-s1835--d1bd1ee3b7131c44ad2190558c428fda02b630b6",
    "contentbytesize": 1835
  },
  "path": "sub-9013/ses-2/func/sub-9013_ses-2_task-sleepiness_bold.json"
}

Ideally they should carry bids extractor metadata such as subject/session/task. I didn't look into whether this is something to address on the pybids side or on ours.

enhance DICOM metadata extractor

As far as I can see, there is currently no information about the number of images/volumes in a series available from the metadata. Looking at the extractor I'm still a bit confused about some details, but it should be possible to make that information available. (I ran into this while trying to convert with datalad-hirni, where that number would be a criterion for whether or not to convert a series at all.)
I am not sure yet where to put that information in an image series' record, though. At some point I noticed DICOMDIR files somewhere. If those (when present) provide such information on image series, we should be consistent in field naming, I guess.
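A minimal sketch of how the count could be derived, assuming plain pydicom and one file per image (the directory layout and .dcm extension are assumptions):

from collections import Counter
from pathlib import Path
import pydicom

def images_per_series(dicom_dir):
    # count files per SeriesInstanceUID without reading pixel data
    counts = Counter()
    for f in Path(dicom_dir).rglob('*.dcm'):
        d = pydicom.dcmread(f, stop_before_pixels=True)
        counts[str(d.SeriesInstanceUID)] += 1
    return counts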

participants.tsv information is silently not embedded?

What is the problem?

on ds000201, which is a bit "damaged" by the crawling (some tarball content seemingly should have gone under derivatives -- yet to fix), but which has a very rich participants.tsv (without Age, but with AgeGroup) etc. Thanks to #2149 I quickly found out that no gender is reported for a sample subject although it is specified:

$> datalad -f json_pp plugin extract_metadata type=bids file=sub-9001/ses-1/anat/sub-9001_ses-1_T1w.nii.gz | grep -i Gender

$> grep 9001 participants.tsv                                                                                           
9001	Male	Young	19,78997095	19,78997095	Studerar för närvarande på universitet/högskola	0	1	12	4,75	4,666666667	6	5,6	10:30:00	11:30:00	06:10:00	08:00:00	5 Nästan aldrig	00:00:00	2 Stämmer inte så bra	1 Stämmer inte alls	1 Stämmer inte alls	2 Stämmer inte så bra	3 Stämmer ganska bra	4 Ja, i stort sett tillräckligt	5 Mycket litet problem	08:00:00	3 Varken bra eller dåligt	4	3,571428571	4,142857143	1,285714286	2,428571429	55	14	14	27	31	26	39	42	19	46	41	17	16	6	5	5	28	NA	1,533333333	1,625	1,272727273	1,727272727	Male	Male	SurelyNot	Surely	No	NA	No		Early	Early			Interesting	Good	NA	NA	NA	NA	No	No, but lost focus		5	4	32	11	18	6	5	30	11	16	38	29	21	30	51	43	40	30	37	10	11	29	16	6	2	2	6	0	2	B

note: this was run with #2151 merged in for reading unicode participants.tsv... I haven't checked yet where the problem lies, or whether that change is to blame -- too much DEBUG output, and no warning if aggregate-metadata is run without an increased logging level. Just reporting for now.

BIDS extractor outdated

Changes in pybids (see bids-standard/pybids@025341a) lead to tests currently failing with something like:

datalad.metadata.metadata: INFO: Engage bids metadata extractor
datalad.metadata.metadata: ERROR  : Failed to get dataset metadata (bids): __init__() got an unexpected keyword argument 'config' [bids_layout.py:__init__:116]

extractors/bids.py needs adapting in how it instantiates BIDSLayout.
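A sketch of the adaptation, assuming the failing call passed the now-removed config keyword (the ds_path value is a placeholder):

from bids import BIDSLayout

ds_path = '/path/to/bids/dataset'  # root of the BIDS dataset
# before (fails with "unexpected keyword argument 'config'"):
#   layout = BIDSLayout(ds_path, config=...)
# after: construct the layout from the dataset root alone
layout = BIDSLayout(ds_path)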

Read the Docs builds failing

The Read the Docs build history is a long list of failures.

#100 introduced readthedocs.yaml to bump the Python version used on RtD, and then it worked. Apparently something changed, and now if we use readthedocs.yaml we also need to specify explicitly that this extension must be installed (it is needed by our build process) to build the docs -- discovered in datalad/datalad-redcap#17.
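A hedged sketch of a readthedocs.yaml covering both aspects -- the Python version pin and installing the extension itself; the OS and version values here are assumptions:

version: 2
build:
  os: ubuntu-22.04
  tools:
    python: "3.10"
python:
  install:
    # install this extension so the docs build can import it
    - method: pip
      path: .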

AFNI dereferences files while accessing them

and is thus unable to open .BRIKs, since the .HEADs end up in another key directory:

$> afni 0back_SD_tbi-hc+tlrc.BRIK  

Thanks go to J Haxby for much encouragement

Initializing: X11.. Widgets...... Input files:ONLY ATLASES
** Couldn't open .../.git/annex/objects/fW/24/SHA256E-s2280000--b3b0....BRIK as session OR as dataset!

Workaround -- use AFNI_NOREALPATH=yes. I guess we might need to research whether this could be a better default, and then campaign for it!
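For the record, the workaround applied on the command line:

$> AFNI_NOREALPATH=yes afni 0back_SD_tbi-hc+tlrc.BRIK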

test_bids2scidata AssertionError

daily run
https://github.com/datalad/datalad-extensions/runs/603788377

======================================================================
FAIL: datalad_neuroimaging.tests.test_bids2scidata.test_real_ds
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.6/x64/lib/python3.7/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/opt/hostedtoolcache/Python/3.7.6/x64/lib/python3.7/site-packages/datalad/tests/utils.py", line 691, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "/opt/hostedtoolcache/Python/3.7.6/x64/lib/python3.7/site-packages/datalad/tests/utils.py", line 691, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "/opt/hostedtoolcache/Python/3.7.6/x64/lib/python3.7/site-packages/datalad_neuroimaging/tests/test_bids2scidata.py", line 172, in test_real_ds
    ['a_mri_bold.txt', 'a_mri_t1w.txt', 'i_Investigation.txt', 's_study.txt'])
AssertionError: ['a_mri_t1w.txt', 'i_Investigation.txt', 's_study.txt'] != ['a_mri_bold.txt', 'a_mri_t1w.txt', 'i_Investigation.txt', 's_study.txt']

NIDM results extractor and demo

Metadata structure overview:
http://nidm.nidash.org/specs/nidm-results_130.html#fig-nidmresults-core-uml

bids extractor "fails" on benign CHANGES

What is the problem?

I made the mistake of running it with datalad.runtime.raiseonerror=1, I guess, so an "internal" error caused pdb to kick in... I am not sure what it is whining about, since IIRC CHANGES is part of the BIDS spec

[bids.py:_get_cnmeta:170,bids_layout.py:get_metadata:161,bids_layout.py:_get_nearest_helper:129]
Traceback (most recent call last):
  File "/home/yoh/proj/datalad/datalad-neuroimaging/venvs/dev/bin/datalad", line 8, in <module>
    main()
  File "/home/yoh/proj/datalad/datalad-master/datalad/cmdline/main.py", line 507, in main
    ret = cmdlineargs.func(cmdlineargs)
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/base.py", line 424, in call_from_parser
    ret = list(ret)
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 421, in generator_func
    result_renderer, result_xfm, _result_filter, **_kwargs):
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 490, in _process_results
    for res in results:
  File "/home/yoh/proj/datalad/datalad-master/datalad/metadata/aggregate.py", line 878, in __call__
    force_extraction)
  File "/home/yoh/proj/datalad/datalad-master/datalad/metadata/aggregate.py", line 253, in _dump_extracted_metadata
    subds_relpaths)
  File "/home/yoh/proj/datalad/datalad-master/datalad/metadata/aggregate.py", line 327, in _extract_metadata
    paths=relevant_paths)
  File "/home/yoh/proj/datalad/datalad-master/datalad/metadata/metadata.py", line 539, in _get_metadata
    for loc, meta in contentmeta_t or {}:
  File "/home/yoh/proj/datalad/datalad-neuroimaging/datalad_neuroimaging/extractors/bids.py", line 170, in _get_cnmeta
    include_entities=True).items()
  File "/home/yoh/proj/datalad/datalad-neuroimaging/venvs/dev/local/lib/python2.7/site-packages/bids/grabbids/bids_layout.py", line 161, in get_metadata
    potentialJSONs = self._get_nearest_helper(path, '.json', **kwargs)
  File "/home/yoh/proj/datalad/datalad-neuroimaging/venvs/dev/local/lib/python2.7/site-packages/bids/grabbids/bids_layout.py", line 129, in _get_nearest_helper
    "likely because it is not a valid BIDS file." % path
ValueError: File '/mnt/btrfs/datasets/datalad/crawl/dbic/QA/CHANGES' does not have a valid type definition, most likely because it is not a valid BIDS file.
()
> /home/yoh/proj/datalad/datalad-neuroimaging/venvs/dev/local/lib/python2.7/site-packages/bids/grabbids/bids_layout.py(129)_get_nearest_helper()
-> "likely because it is not a valid BIDS file." % path

support DICOMs tarballs

It is quite common to have DICOMs (e.g. for a single sequence) tarred or zipped up. I wondered if we could/should make it possible to extract/aggregate metadata from those. I could see it done

  • within dicom extractor
  • in a dedicated dicom-tarballs extractor
  • a generic "extractor helper" (e.g. called "balls") which could then be used to prepare (extract) data for other extractors to munch on

The question is how to "represent" that metadata:

  • tarball could be considered as a "subdataset" of a kind, and thus we could extract/keep it similarly to how we deal with subdatasets
  • files within the tarball could be considered a "continuation" of the file's path, e.g. for a file bu.dcm within a/b/bu.tar it could be path a/b/bu.tar/bu.dcm, or have some more explicitly defined boundary like a/b/bu.tar//bu.dcm or a/b/bu.tar#bu.dcm, or even a/b/bu.tar#path=bu.dcm to be in line with how we deal with referencing files in tarballs within our special remote
  • we could extract/contain only fields common and identical to all files in the tarball, and thus associate with the tarball itself
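A minimal sketch of the "extractor helper" idea, using the # boundary convention from above (the helper name and the .dcm suffix filter are assumptions):

import tarfile
from io import BytesIO
import pydicom

def iter_tarball_dicoms(tar_path):
    # stream DICOM members out of a tarball without unpacking to disk
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.isfile() or not member.name.endswith('.dcm'):
                continue
            data = tar.extractfile(member).read()
            hdr = pydicom.dcmread(BytesIO(data), stop_before_pixels=True)
            # '#' marks the archive/member boundary, one of the options above
            yield '%s#%s' % (tar_path, member.name), hdr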

Extracting JSON encodable text data from reStructuredText documents

For the updated BIDS extractor I'm reading information intended for a generic description field from any README files in the datalad dataset. This would be for example: README.md, README.rst, README.txt or just README.

Currently I'm doing:

with open(README_fname, 'rb') as f:
    desc = assure_unicode(f.read()).strip()

(with from datalad.utils import assure_unicode) which gets a string from, e.g., the RST doc. Here's an example from the README of a studyforrest subdataset:

"An Extension of studyforrest.org Dataset\n****************************************\n\n|license| |access| |doi|\n\nSimultaneous fMRI/eyetracking while movie watching, plus visual localizers\n==========================================================================\n\nThis is an extension of the studyforrest project, all participants previously\nvolunteered for the audio-only Forrest Gump study. The datset is structured in\nBIDS format, details of the files and metadata can be found at:\n\n     Ayan Sengupta, Falko R. Kaule, J. Swaroop Guntupalli, Michael B. Hoffmann,\n     Christian H\u00e4usler, J\u00f6rg Stadler, Michael Hanke. `An extension of the\n     studyforrest dataset for vision research\n     <http://biorxiv.org/content/early/2016/03/31/046573>`_. (submitted for\n     publication)\n\n     Michael Hanke, Nico Adelh\u00f6fer, Daniel Kottke, Vittorio Iacovella,\n     Ayan Sengupta, Falko R. Kaule, Roland Nigbur, Alexander Q. Waite,\n     Florian J. Baumgartner & J\u00f6rg Stadler. `Simultaneous fMRI and eye gaze\n     recordings during prolonged natural stimulation \u2013 a studyforrest extension\n     <http://biorxiv.org/content/early/2016/03/31/046581>`_. (submitted for\n     publication)\n\nFor more information about the project visit: http://studyforrest.org\n\n\nHow to obtain the dataset\n-------------------------\n\nThe dataset is available for download from `OpenFMRI (accession number\nds000113d) <https://www.openfmri.org/dataset/ds000113d>`_.\n\nAlternatively, the `studyforrest phase 2 repository on GitHub\n<https://github.com/psychoinformatics-de/studyforrest-data-phase2>`_ provides\naccess as a DataLad dataset.\n\nDataLad datasets and how to use them\n------------------------------------\n\nThis repository is a `DataLad <https://www.datalad.org/>`__ dataset. It provides\nfine-grained data access down to the level of individual files, and allows for\ntracking future updates up to the level of single files. In order to use\nthis repository for data retrieval, `DataLad <https://www.datalad.org>`_ is\nrequired. It is a free and open source command line tool, available for all\nmajor operating systems, and builds up on Git and `git-annex\n<https://git-annex.branchable.com>`__ to allow sharing, synchronizing, and\nversion controlling collections of large files. You can find information on\nhow to install DataLad at `handbook.datalad.org/en/latest/intro/installation.html\n<http://handbook.datalad.org/en/latest/intro/installation.html>`_.\n\nGet the dataset\n^^^^^^^^^^^^^^^\n\nA DataLad dataset can be ``cloned`` by running::\n\n   datalad clone <url>\n\nOnce a dataset is cloned, it is a light-weight directory on your local machine.\nAt this point, it contains only small metadata and information on the\nidentity of the files in the dataset, but not actual *content* of the\n(sometimes large) data files.\n\nRetrieve dataset content\n^^^^^^^^^^^^^^^^^^^^^^^^\n\nAfter cloning a dataset, you can retrieve file contents by running::\n\n   datalad get <path/to/directory/or/file>\n\nThis command will trigger a download of the files, directories, or\nsubdatasets you have specified.\n\nDataLad datasets can contain other datasets, so called *subdatasets*. If you\nclone the top-level dataset, subdatasets do not yet contain metadata and\ninformation on the identity of files, but appear to be empty directories. 
In\norder to retrieve file availability metadata in subdatasets, run::\n\n   datalad get -n <path/to/subdataset>\n\nAfterwards, you can browse the retrieved metadata to find out about\nsubdataset contents, and retrieve individual files with ``datalad get``. If you\nuse ``datalad get <path/to/subdataset>``, all contents of the subdataset will\nbe downloaded at once.\n\nStay up-to-date\n^^^^^^^^^^^^^^^\n\nDataLad datasets can be updated. The command ``datalad update`` will *fetch*\nupdates and store them on a different branch (by default\n``remotes/origin/master``). Running::\n\n   datalad update --merge\n\nwill *pull* available updates and integrate them in one go.\n\nMore information\n^^^^^^^^^^^^^^^^\n\nMore information on DataLad and how to use it can be found in the DataLad Handbook at\n`handbook.datalad.org <http://handbook.datalad.org/en/latest/index.html>`_. The\nchapter \"DataLad datasets\" can help you to familiarize yourself with the\nconcept of a dataset.\n\n\n.. _Git: http://www.git-scm.com\n\n.. _git-annex: http://git-annex.branchable.com/\n\n.. |license|\n   image:: https://img.shields.io/badge/license-PDDL-blue.svg\n    :target: http://opendatacommons.org/licenses/pddl/summary\n    :alt: PDDL-licensed\n\n.. |access|\n   image:: https://img.shields.io/badge/data_access-unrestricted-green.svg\n    :alt: No registration or authentication required\n\n.. |doi|\n   image:: https://zenodo.org/badge/14167/psychoinformatics-de/studyforrest-data-phase2.svg\n    :target: https://zenodo.org/badge/latestdoi/14167/psychoinformatics-de/studyforrest-data-phase2\n    :alt: DOI"

However, when I process this field as part of a larger JSON object with jq, I get an error:

parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 148, column 14

It looks like the assure_unicode function did not succeed in properly escaping the unicode expressions?

If I first convert the rst doc to md using pandoc:

pandoc ../Data/studyforrest-data/original/phase2/README.rst -f rst -t markdown -o READMEPHAS2.md

and then read it in the same way as before, I get:

"'# An Extension of studyforrest.org Dataset\n\n## Simultaneous fMRI/eyetracking while movie watching, plus visual localizers\n\nThis is an extension of the studyforrest project, all participants\npreviously volunteered for the audio-only Forrest Gump study. The datset\nis structured in BIDS format, details of the files and metadata can be\nfound at:\n\n> Ayan Sengupta, Falko R. Kaule, J. Swaroop Guntupalli, Michael B.\n> Hoffmann, Christian Häusler, Jörg Stadler, Michael Hanke. [An\n> extension of the studyforrest dataset for vision\n> research](http://biorxiv.org/content/early/2016/03/31/046573).\n> (submitted for publication)\n>\n> Michael Hanke, Nico Adelhöfer, Daniel Kottke, Vittorio Iacovella, Ayan\n> Sengupta, Falko R. Kaule, Roland Nigbur, Alexander Q. Waite, Florian\n> J. Baumgartner & Jörg Stadler. [Simultaneous fMRI and eye gaze\n> recordings during prolonged natural stimulation -- a studyforrest\n> extension](http://biorxiv.org/content/early/2016/03/31/046581).\n> (submitted for publication)\n\nFor more information about the project visit: <http://studyforrest.org>\n\n### How to obtain the dataset\n\nThe dataset is available for download from [OpenFMRI (accession number\nds000113d)](https://www.openfmri.org/dataset/ds000113d).\n\nAlternatively, the [studyforrest phase 2 repository on\nGitHub](https://github.com/psychoinformatics-de/studyforrest-data-phase2)\nprovides access as a DataLad dataset.\n\n### DataLad datasets and how to use them\n\nThis repository is a [DataLad](https://www.datalad.org/) dataset. It\nprovides fine-grained data access down to the level of individual files,\nand allows for tracking future updates up to the level of single files.\nIn order to use this repository for data retrieval,\n[DataLad](https://www.datalad.org) is required. It is a free and open\nsource command line tool, available for all major operating systems, and\nbuilds up on Git and [git-annex](https://git-annex.branchable.com) to\nallow sharing, synchronizing, and version controlling collections of\nlarge files. You can find information on how to install DataLad at\n[handbook.datalad.org/en/latest/intro/installation.html](http://handbook.datalad.org/en/latest/intro/installation.html).\n\n#### Get the dataset\n\nA DataLad dataset can be `cloned` by running:\n\n    datalad clone <url>\n\nOnce a dataset is cloned, it is a light-weight directory on your local\nmachine. At this point, it contains only small metadata and information\non the identity of the files in the dataset, but not actual *content* of\nthe (sometimes large) data files.\n\n#### Retrieve dataset content\n\nAfter cloning a dataset, you can retrieve file contents by running:\n\n    datalad get <path/to/directory/or/file>\n\nThis command will trigger a download of the files, directories, or\nsubdatasets you have specified.\n\nDataLad datasets can contain other datasets, so called *subdatasets*. If\nyou clone the top-level dataset, subdatasets do not yet contain metadata\nand information on the identity of files, but appear to be empty\ndirectories. In order to retrieve file availability metadata in\nsubdatasets, run:\n\n    datalad get -n <path/to/subdataset>\n\nAfterwards, you can browse the retrieved metadata to find out about\nsubdataset contents, and retrieve individual files with `datalad get`.\nIf you use `datalad get <path/to/subdataset>`, all contents of the\nsubdataset will be downloaded at once.\n\n#### Stay up-to-date\n\nDataLad datasets can be updated. 
The command `datalad update` will\n*fetch* updates and store them on a different branch (by default\n`remotes/origin/master`). Running:\n\n    datalad update --merge\n\nwill *pull* available updates and integrate them in one go.\n\n#### More information\n\nMore information on DataLad and how to use it can be found in the\nDataLad Handbook at\n[handbook.datalad.org](http://handbook.datalad.org/en/latest/index.html).\nThe chapter \\"DataLad datasets\\" can help you to familiarize yourself\nwith the concept of a dataset.'"

It looks like unicode characters render correctly.

Then, when I process this field as part of a larger JSON object with jq, I get a different error:

parse error: Invalid numeric literal at line 36, column 3942

which points to this part of the string: \\"DataLad datasets\\".

I'm not sure what would be the best way of handling this text extraction such that it can be encoded/decoded in JSON without errors. Any thoughts?
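One option, assuming the goal is simply a JSON-safe description field: let json.dumps produce the escaping instead of splicing the raw string into a larger JSON document by hand:

import json

README_fname = 'README.rst'  # any of the README flavours mentioned above

with open(README_fname, 'rb') as f:
    desc = f.read().decode('utf-8', errors='replace').strip()

# json.dumps escapes control characters (newlines, tabs) and embedded
# quotes, so the result parses cleanly in jq
print(json.dumps({'description': desc}))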

pybids pandas complaining non-descriptively

[INFO   ] Aggregate metadata for dataset /mnt/btrfs/datasets/datalad/crawl/openfmri/ds000109                                                            
Metadata extraction: 100%|██████████████████████████████████████████████████████████████████████████████████| 5.00/5.00 [00:19<00:00, 4.85s/ extractors][WARNING] Failed to load participants info due to: 'subject' [hashtable_class_helper.pxi:pandas._libs.hashtable.PyObjectHashTable.get_item:1500]. Skipping the rest of file 

the file is missing the header:

(git-annex)hopa:~/datalad/openfmri/ds000109[master]
$> cat participants.tsv 
sub-01  M	18
sub-02	F	23
sub-03	F	21

I wondered if we should do anything (e.g. emit a more sensible warning) in such cases?
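A sketch of what a more sensible warning could check, assuming the mandatory participant_id column as the probe (the file path mirrors the one shown above):

import pandas as pd

df = pd.read_csv('participants.tsv', sep='\t', dtype=str)
if 'participant_id' not in df.columns:
    # the first data row was (mis)used as a header -- say so explicitly
    raise ValueError(
        "participants.tsv lacks a 'participant_id' header; the first "
        "line was parsed as the header: %r" % list(df.columns))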

pybids stumbles over `CHANGES` file

With #17 I see this locally:

======================================================================
ERROR: datalad_neuroimaging.tests.test_dicomconv.test_validate_bids_fixture
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/mih/hacking/datalad/datalad-neuroimaging/datalad_neuroimaging/tests/test_dicomconv.py", line 39, in test_validate_bids_fixture
    bids_ds = get_bids_dataset()
  File "/home/mih/hacking/datalad/datalad-neuroimaging/datalad_neuroimaging/tests/utils.py", line 94, in get_bids_dataset
    bids_ds.aggregate_metadata(recursive=False, incremental=True)
  File "/home/mih/env/datalad3-dev/lib/python3.6/site-packages/wrapt/wrappers.py", line 562, in __call__
    args, kwargs)
  File "/home/mih/hacking/datalad/git/datalad/distribution/dataset.py", line 439, in apply_func
    return f(**kwargs)
  File "/home/mih/env/datalad3-dev/lib/python3.6/site-packages/wrapt/wrappers.py", line 523, in __call__
    args, kwargs)
  File "/home/mih/hacking/datalad/git/datalad/interface/utils.py", line 470, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/home/mih/env/datalad3-dev/lib/python3.6/site-packages/wrapt/wrappers.py", line 523, in __call__
    args, kwargs)
  File "/home/mih/hacking/datalad/git/datalad/interface/utils.py", line 458, in return_func
    results = list(results)
  File "/home/mih/hacking/datalad/git/datalad/interface/utils.py", line 414, in generator_func
    result_renderer, result_xfm, _result_filter, **_kwargs):
  File "/home/mih/hacking/datalad/git/datalad/interface/utils.py", line 482, in _process_results
    for res in results:
  File "/home/mih/hacking/datalad/git/datalad/metadata/aggregate.py", line 721, in __call__
    to_save)
  File "/home/mih/hacking/datalad/git/datalad/metadata/aggregate.py", line 206, in _extract_metadata
    paths=relevant_paths)
  File "/home/mih/hacking/datalad/git/datalad/metadata/metadata.py", line 517, in _get_metadata
    for loc, meta in contentmeta_t or {}:
  File "/home/mih/hacking/datalad/datalad-neuroimaging/datalad_neuroimaging/extractors/bids.py", line 168, in _get_cnmeta
    include_entities=True).items()
  File "/home/mih/hacking/pybids/bids/grabbids/bids_layout.py", line 184, in get_metadata
    potentialJSONs = self._get_nearest_helper(path, '.json', **kwargs)
  File "/home/mih/hacking/pybids/bids/grabbids/bids_layout.py", line 152, in _get_nearest_helper
    "likely because it is not a valid BIDS file." % path
ValueError: File '/home/mih/hacking/datalad/datalad-neuroimaging/datalad_neuroimaging/tests/data/bids/CHANGES' does not have a valid type definition, most likely because it is not a valid BIDS file.

but, strangely, only within nose -- not when called like this:

python -c 'from datalad_neuroimaging.tests.utils import get_bids_dataset; get_bids_dataset()'

Update Appveyor config to use new codecov uploader

On June 9, codecov released a new version of their coverage uploader program to replace the old bash uploader, which will stop working on February 1 and is currently experiencing scheduled brownouts; see this blog post for more information. The Appveyor config must be updated to install the new codecov uploader; see the linked blog post for instructions.

cfg_bids -- template it out

I first wanted to check on resistance to, or desire for, such a change.
I wonder about moving the "templating" of some files (e.g. dataset_description.json, CHANGES, README) from https://github.com/nipy/heudiconv/blob/master/heudiconv/bids.py#L33 into the cfg_bids procedure here.

We could also make use of git config user.name for the Author. The other fields could be queried from the dataset config (e.g. bids-template.license, bids-template.acknowledgements), so account-wide config settings could be set up to uniformly populate them for new datasets.
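A minimal sketch of the idea; the bids-template.* option names are the hypothetical ones proposed above, and the BIDSVersion value is an assumption:

import json
from datalad.distribution.dataset import require_dataset

ds = require_dataset('.', check_installed=True)
desc = {
    'Name': ds.pathobj.name,
    'BIDSVersion': '1.6.0',
    # git identity and account-wide config fill in the remaining fields
    'Authors': [ds.config.get('user.name', 'TODO')],
    'License': ds.config.get('bids-template.license', 'TODO'),
    'Acknowledgements': ds.config.get('bids-template.acknowledgements', ''),
}
(ds.pathobj / 'dataset_description.json').write_text(json.dumps(desc, indent=2))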

fresh pydicom started to cause troubles?

first mentioned in PRs of main datalad, now showing up in the daily extensions builds:

  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/imp.py", line 171, in load_source
    module = _load(spec)
  File "<frozen importlib._bootstrap>", line 696, in _load
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/datalad_neuroimaging/extractors/tests/test_dicom.py", line 13, in <module>
    from datalad_neuroimaging.extractors.dicom import MetadataExtractor as DicomExtractor
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/datalad_neuroimaging/extractors/dicom.py", line 39, in <module>
    dcm.valuerep.PersonName3)
AttributeError: module 'pydicom.valuerep' has no attribute 'PersonName3'

e.g. on https://github.com/datalad/datalad-extensions/runs/722270660
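A sketch of a compatibility shim, assuming the failure stems from the pydicom 2.0 rename of PersonName3 to PersonName:

from pydicom import valuerep

try:
    _person_name_type = valuerep.PersonName    # pydicom >= 2.0
except AttributeError:
    _person_name_type = valuerep.PersonName3   # older pydicom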

Failed to load participants info due to: 'float' object has no attribute 'lower'

might be a non-issue - just want to record before I forget

[INFO   ] Aggregate metadata for dataset /mnt/btrfs/datasets-meta6-6/datalad/crawl/openfmri/ds000017
Metadata extraction: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.00/5.00 [01:29<00:00, 24.5s/ extractors][WARNING] Failed to load participants info due to: 'float' object has no attribute 'lower' [bids.py:yield_participant_info:202]. Skipping the rest of file    

even though the file seems to be kosher:

$> cat /mnt/btrfs/datasets-meta6-6/datalad/crawl/openfmri/ds000017/participants.tsv
participant_id  sex     age
sub-1   n/a     n/a
sub-2   M       44
sub-3   n/a     n/a
sub-4   M       49
sub-5   F       49
sub-6   M       25
sub-7   n/a     n/a
sub-8   M       50
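A minimal sketch of the likely culprit and a fix: pandas parses the n/a cells as float NaN, and a later .lower() on such a cell then fails; reading with keep_default_na=False keeps them as strings (the file path mirrors the one above):

import pandas as pd

df = pd.read_csv('participants.tsv', sep='\t', keep_default_na=False)
print(df['sex'].str.lower().tolist())  # no float NaN -> no AttributeError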

Pandas FutureWarning: frame.append deprecated, use pandas.concat instead

This message is from the test run with Pandas 1.4.0 on Python 3.9:
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.

This was deprecated in Pandas 1.4.0 less than a month ago, so there's no urgency.

FWIW, git grep shows 49 occurrences of "append" and it looks like few of them are dataframe operations.
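For whichever occurrences do turn out to be DataFrame operations, the mechanical replacement looks like this (df and row_dict are placeholders):

import pandas as pd

df = pd.DataFrame(columns=['subject', 'age'])
row_dict = {'subject': 'sub-01', 'age': 23}
# deprecated: df = df.append(row_dict, ignore_index=True)
df = pd.concat([df, pd.DataFrame([row_dict])], ignore_index=True)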

cfg_* procedure(s) for preferable .gitattributes for various known dataset types

ATM we have cfg_bids which

  • sets up .gitattributes to have some files directly in git
  • sets up metadata extraction configuration

But besides BIDS, I keep running into the need to establish .gitattributes for the following dataset types, where I think the following, analogous to the BIDS one, should be done

.feat and .gfeat FSL outputs

  • .gitattributes - maybe just use cfg_text2git?

on a sample .gfeat directory of 9GB, with a regular cfg_text2git I ended up with 260KB of .git/objects,

which allowed me to quickly install that dataset elsewhere and datalad get **/*.png

  • metadata
    • datalad: eventually might configure the extractor
    • git-annex: we might like to annotate file types with annex metadata, so that on shells without ** people could quickly get all the supplementary data files needed to browse the results

fmriprep

  • .gitattributes

    I had

*.md annex.largefiles=nothing
*.html annex.largefiles=nothing
*.json annex.largefiles=nothing
CITATION.* annex.largefiles=(not(mimetype=text/*))

which resulted in 32MB of .git/objects for a ~500GB dataset (~250 subjects).

  • metadata
    • configure extractors (nifti1, bids, maybe more when FreeSurfer etc. are supported)
    • an interesting use case, since the BIDS(-derivative) dataset is not at the top of this dataset, which has two directories -- fmriprep and freesurfer -- so the bids extractor should be told to look into fmriprep/

HOWTO

Pretty much all those scenarios are very similar and require only slightly different specifications. I see two implementation possibilities

breed cfg_* scripts

  • extract common code from cfg_bids into some cfg_common.py helper
  • reuse from within individual cfg_bids, cfg_feat, cfg_fmriprep

create (optionally parametrized) cfg_neuroimaging_dataset

which would sense the type of the dataset (or have it "forced" via an explicit parameter) and act accordingly (crashing if it cannot figure out the type and no explicit parameter such as "bids" is specified)
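A hypothetical sketch of the second option; the procedure name, the detection heuristic, and the rule table are assumptions drawn from the discussion above:

#!/usr/bin/env python3
# Hypothetical cfg_neuroimaging_dataset procedure sketch
import sys
from datalad.distribution.dataset import require_dataset

RULES = {
    'bids': '* annex.largefiles=anything\n/* annex.largefiles=nothing\n',
    'fmriprep': ('*.md annex.largefiles=nothing\n'
                 '*.html annex.largefiles=nothing\n'
                 '*.json annex.largefiles=nothing\n'
                 'CITATION.* annex.largefiles=(not(mimetype=text/*))\n'),
}

ds = require_dataset(sys.argv[1], check_installed=True)
dstype = sys.argv[2] if len(sys.argv) > 2 else None
if dstype is None and (ds.pathobj / 'dataset_description.json').exists():
    dstype = 'bids'  # crude sensing; .feat/.gfeat/fmriprep checks would follow
if dstype not in RULES:
    raise SystemExit('cannot determine dataset type; pass one explicitly')
with open(ds.pathobj / '.gitattributes', 'a') as f:
    f.write(RULES[dstype])
ds.save(path='.gitattributes', message='Configure %s .gitattributes' % dstype)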

"Support" MultiValue type from pydicom

I was surprised not to find the ImageType metadata field in the aggregated metadata, and discovered that it is provided by pydicom not as a list or tuple but as pydicom.multival.MultiValue, which is used for a few fields in a sample DICOM I have tortured. IMHO it should get the same treatment as list -- the distinctive feature of this class is that it assures uniform data typing.

*(Pdb) pprint(dict([(f, (getattr(d, f), type(getattr(d, f)))) for f in d.dir() if not isinstance(getattr(d, f), (int, float, string_types, dcm.valuerep.DSfloat, dcm.valuerep.IS, dcm.valuerep.PersonName3, list, tuple))]))
{'ImageOrientationPatient': (['0', '1', '0', '0', '0', '-1'],
                             <class 'pydicom.multival.MultiValue'>),
 'ImagePositionPatient': (['-101.60000151396', '-140', '130'],
                          <class 'pydicom.multival.MultiValue'>),
 'ImageType': (['ORIGINAL', 'PRIMARY', 'M', 'ND', 'NORM'],
               <class 'pydicom.multival.MultiValue'>),
 'PixelSpacing': (['1.625', '1.625'], <class 'pydicom.multival.MultiValue'>)}
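A minimal sketch of the proposed treatment: coerce MultiValue to a plain list wherever list/tuple are already handled during serialization (the helper name is made up):

from pydicom.multival import MultiValue

def _listify(value):
    # give MultiValue the same treatment as list/tuple
    if isinstance(value, (list, tuple, MultiValue)):
        return [_listify(v) for v in value]
    return value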

Prepare conda recipe/package

The 0.10.0 conda package is still struggling to get built properly for Python 2 (it is OK for Python 3) due to the lack of securestorage, in turn due to the lack of old-fashioned dbus Python bindings.

crash while running on a datalad-fuse'd dataset

(git-annex)lena:/tmp/mnt/ds000001[master]git-annex
$> datalad meta-extract bids
[ERROR  ] TypeError('PosixPath' object is not subscriptable) (TypeError) 
(dev3) 1 35958 ->1.....................................:Tue 01 Feb 2022 09:32:42 AM EST:.
(git-annex)lena:/tmp/mnt/ds000001[master]git-annex
$> datalad --dbg meta-extract bids
Traceback (most recent call last):
  File "/home/yoh/proj/datalad/datalad-metalad/venvs/dev3/bin/datalad", line 8, in <module>
    sys.exit(main())
  File "/home/yoh/proj/datalad/datalad-metalad/venvs/dev3/lib/python3.9/site-packages/datalad/cmdline/main.py", line 211, in main
    ret = cmdlineargs.func(cmdlineargs)
  File "/home/yoh/proj/datalad/datalad-metalad/venvs/dev3/lib/python3.9/site-packages/datalad/interface/base.py", line 786, in call_from_parser
    ret = list(ret)
  File "/home/yoh/proj/datalad/datalad-metalad/venvs/dev3/lib/python3.9/site-packages/datalad/interface/utils.py", line 396, in generator_func
    for r in _process_results(
  File "/home/yoh/proj/datalad/datalad-metalad/venvs/dev3/lib/python3.9/site-packages/datalad/interface/utils.py", line 579, in _process_results
    for res in results:
  File "/home/yoh/proj/datalad/datalad-metalad/datalad_metalad/extract.py", line 297, in __call__
    yield from do_dataset_extraction(extraction_parameters)
  File "/home/yoh/proj/datalad/datalad-metalad/datalad_metalad/extract.py", line 341, in do_dataset_extraction
    yield from legacy_extract_dataset(ep)
  File "/home/yoh/proj/datalad/datalad-metalad/datalad_metalad/extract.py", line 678, in legacy_extract_dataset
    dataset_result, _ = extractor.get_metadata(True, False)
  File "/home/yoh/proj/datalad/datalad-neuroimaging/datalad_neuroimaging/extractors/bids.py", line 72, in get_metadata
    bids = BIDSLayout(self.ds.path, derivatives=derivative_exist)
  File "/home/yoh/proj/datalad/datalad-metalad/venvs/dev3/lib/python3.9/site-packages/bids/layout/layout.py", line 145, in __init__
    indexer(self)
  File "/home/yoh/proj/datalad/datalad-metalad/venvs/dev3/lib/python3.9/site-packages/bids/layout/index.py", line 109, in __call__
    self._index_dir(self._layout._root, self._config)
  File "/home/yoh/proj/datalad/datalad-metalad/venvs/dev3/lib/python3.9/site-packages/bids/layout/index.py", line 193, in _index_dir
    self._index_dir(d, list(config), default_action=default)
  File "/home/yoh/proj/datalad/datalad-metalad/venvs/dev3/lib/python3.9/site-packages/bids/layout/index.py", line 193, in _index_dir
    self._index_dir(d, list(config), default_action=default)
  File "/home/yoh/proj/datalad/datalad-metalad/venvs/dev3/lib/python3.9/site-packages/bids/layout/index.py", line 164, in _index_dir
    cfg = Config.load(config_file, session=self.session)
  File "/home/yoh/proj/datalad/datalad-metalad/venvs/dev3/lib/python3.9/site-packages/bids/layout/models.py", line 156, in load
    result = session.query(Config).filter_by(name=config['name']).first()
TypeError: 'PosixPath' object is not subscriptable

> /home/yoh/proj/datalad/datalad-metalad/venvs/dev3/lib/python3.9/site-packages/bids/layout/models.py(156)load()
-> result = session.query(Config).filter_by(name=config['name']).first()
(Pdb) import bids
*(Pdb) print(bids.__version__)
0.14.0

BUG: metadata extraction for superdataset reports results for subdataset

The context

I'm running into a weird issue. I have a superdataset (https://github.com/jsheunis/datalad-catalog-demo-super) which has several subdatasets, including the one at data/ds001499 which is a BIDS dataset. I am running metadata extraction on the superdataset using multiple extractors.

The problem

When I run the bids_dataset extractor (from datalad-neuroimaging), meta-extract goes into the subdataset, extracts BIDS metadata there, and then reports it for the superdataset.

Here you can see the call and the full debug output:

datalad -f json -l debug meta-extract -d ../datalad-catalog-demo-super bids_dataset        
Command output with level set to debug:
datalad -f json -l debug meta-extract -d . bids_dataset        
                                                                               
[DEBUG  ] Command line args 1st pass for DataLad 0.18.3. Parsed: Namespace(common_result_renderer='json') Unparsed: ['meta-extract', '-d', '../datalad-catalog-demo-super', 'bids_dataset']
[DEBUG  ] Processing entrypoints
[DEBUG  ] Loading entrypoint deprecated from datalad.extensions
[DEBUG  ] Loaded entrypoint deprecated from datalad.extensions
[DEBUG  ] Loading entrypoint metalad from datalad.extensions
[DEBUG  ] Loaded entrypoint metalad from datalad.extensions
[DEBUG  ] Loading entrypoint neuroimaging from datalad.extensions
[DEBUG  ] Loaded entrypoint neuroimaging from datalad.extensions
[DEBUG  ] Loading entrypoint catalog from datalad.extensions
[DEBUG  ] Loaded entrypoint catalog from datalad.extensions
[DEBUG  ] Loading entrypoint wackyextra from datalad.extensions
[DEBUG  ] Loaded entrypoint wackyextra from datalad.extensions
[DEBUG  ] Done processing entrypoints
[DEBUG  ] Building doc for <class 'datalad_metalad.extract.Extract'>
[DEBUG  ] Parsing known args among ['/Users/jsheunis/opt/miniconda3/envs/catalog-demo/bin/datalad', '-f', 'json', '-l', 'debug', 'meta-extract', '-d', '../datalad-catalog-demo-super', 'bids_dataset']
[DEBUG  ] Determined class of decorated function: <class 'datalad_metalad.extract.Extract'>
[DEBUG  ] Run ['git', 'config', '-z', '-l', '--show-origin'] (protocol_class=StdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super)
[DEBUG  ] Finished ['git', 'config', '-z', '-l', '--show-origin'] with status 0
[DEBUG  ] Run ['git', 'config', '-z', '-l', '--show-origin', '--file', '/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/.datalad/config'] (protocol_class=StdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super)
[DEBUG  ] Finished ['git', 'config', '-z', '-l', '--show-origin', '--file', '/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/.datalad/config'] with status 0
[DEBUG  ] Resolved dataset to extract metadata: /Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', 'rev-parse', '--quiet', '--verify', 'HEAD^{commit}'] (protocol_class=GeneratorStdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super)
[DEBUG  ] Using metadata extractor bids_dataset from distribution datalad-neuroimaging
[DEBUG  ] performing dataset-level metadata extraction (bids_dataset) for dataset at /Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super
[DEBUG  ] Importing datalad.api to possibly discover possibly not yet bound method 'get'
[DEBUG  ] Building doc for <class 'datalad.core.local.create.Create'>
[DEBUG  ] Building doc for <class 'datalad.core.local.status.Status'>
[DEBUG  ] Building doc for <class 'datalad.core.local.save.Save'>
[DEBUG  ] Building doc for <class 'datalad.core.distributed.clone.Clone'>
[DEBUG  ] Building doc for <class 'datalad.local.subdatasets.Subdatasets'>
[DEBUG  ] Building doc for <class 'datalad.distribution.get.Get'>
[DEBUG  ] Building doc for <class 'datalad.core.local.diff.Diff'>
[DEBUG  ] Building doc for <class 'datalad.core.distributed.push.Push'>
[DEBUG  ] Building doc for <class 'datalad.distribution.install.Install'>
[DEBUG  ] Building doc for <class 'datalad.local.unlock.Unlock'>
[DEBUG  ] Building doc for <class 'datalad.core.local.run.Run'>
[DEBUG  ] Failed to import requests_ftp, thus no ftp support: ModuleNotFoundError(No module named 'requests_ftp')
[DEBUG  ] Building doc for <class 'datalad.distributed.create_sibling_github.CreateSiblingGithub'>
[DEBUG  ] Building doc for <class 'datalad.distribution.update.Update'>
[DEBUG  ] Building doc for <class 'datalad.distribution.siblings.Siblings'>
[DEBUG  ] Building doc for <class 'datalad.distributed.create_sibling_gitlab.CreateSiblingGitlab'>
[DEBUG  ] Building doc for <class 'datalad.distributed.create_sibling_gogs.CreateSiblingGogs'>
[DEBUG  ] Building doc for <class 'datalad.distributed.create_sibling_gin.CreateSiblingGin'>
[DEBUG  ] Building doc for <class 'datalad.distributed.create_sibling_gitea.CreateSiblingGitea'>
[DEBUG  ] Building doc for <class 'datalad.distributed.create_sibling_ria.CreateSiblingRia'>
[DEBUG  ] Building doc for <class 'datalad.distribution.create_sibling.CreateSibling'>
[DEBUG  ] Building doc for <class 'datalad.distributed.drop.Drop'>
[DEBUG  ] Building doc for <class 'datalad.local.remove.Remove'>
[DEBUG  ] Building doc for <class 'datalad.local.addurls.Addurls'>
[DEBUG  ] Building doc for <class 'datalad.local.copy_file.CopyFile'>
[DEBUG  ] Building doc for <class 'datalad.local.download_url.DownloadURL'>
[DEBUG  ] Building doc for <class 'datalad.local.foreach_dataset.ForEachDataset'>
[DEBUG  ] Building doc for <class 'datalad.local.rerun.Rerun'>
[DEBUG  ] Building doc for <class 'datalad.local.run_procedure.RunProcedure'>
[DEBUG  ] Building doc for <class 'datalad.local.configuration.Configuration'>
[DEBUG  ] Building doc for <class 'datalad.local.wtf.WTF'>
[DEBUG  ] Building doc for <class 'datalad.local.clean.Clean'>
[DEBUG  ] Building doc for <class 'datalad.local.add_archive_content.AddArchiveContent'>
[DEBUG  ] Building doc for <class 'datalad.local.add_readme.AddReadme'>
[DEBUG  ] Building doc for <class 'datalad.local.export_archive.ExportArchive'>
[DEBUG  ] Building doc for <class 'datalad.distributed.export_archive_ora.ExportArchiveORA'>
[DEBUG  ] Building doc for <class 'datalad.distributed.export_to_figshare.ExportToFigshare'>
[DEBUG  ] Building doc for <class 'datalad.local.no_annex.NoAnnex'>
[DEBUG  ] Building doc for <class 'datalad.local.check_dates.CheckDates'>
[DEBUG  ] Building doc for <class 'datalad.distribution.uninstall.Uninstall'>
[DEBUG  ] Building doc for <class 'datalad.distribution.create_test_dataset.CreateTestDataset'>
[DEBUG  ] Building doc for <class 'datalad.support.sshrun.SSHRun'>
[DEBUG  ] Building doc for <class 'datalad.interface.shell_completion.ShellCompletion'>
[DEBUG  ] Processing entrypoints
[DEBUG  ] Loading entrypoint deprecated from datalad.extensions
[DEBUG  ] Building doc for <class 'datalad_deprecated.ls.Ls'>
[DEBUG  ] Building doc for <class 'datalad_deprecated.annotate_paths.AnnotatePaths'>
[DEBUG  ] Building doc for <class 'datalad_deprecated.publish.Publish'>
[DEBUG  ] Building doc for <class 'datalad_metalad.dump.Dump'>
[DEBUG  ] Building doc for <class 'datalad_deprecated.metadata.metadata.Metadata'>
[DEBUG  ] Building doc for <class 'datalad_deprecated.metadata.search.Search'>
[DEBUG  ] Building doc for <class 'datalad_deprecated.metadata.extract_metadata.ExtractMetadata'>
[DEBUG  ] Building doc for <class 'datalad_deprecated.metadata.aggregate.AggregateMetaData'>
[DEBUG  ] Loaded entrypoint deprecated from datalad.extensions
[DEBUG  ] Loading entrypoint metalad from datalad.extensions
[DEBUG  ] Building doc for <class 'datalad_metalad.aggregate.Aggregate'>
[DEBUG  ] Building doc for <class 'datalad_metalad.add.Add'>
[DEBUG  ] Building doc for <class 'datalad_metalad.conduct.Conduct'>
[DEBUG  ] Building doc for <class 'datalad_metalad.filter.Filter'>
[DEBUG  ] Loaded entrypoint metalad from datalad.extensions
[DEBUG  ] Loading entrypoint neuroimaging from datalad.extensions
[DEBUG  ] Building doc for <class 'datalad_neuroimaging.bids2scidata.BIDS2Scidata'>
[DEBUG  ] Loaded entrypoint neuroimaging from datalad.extensions
[DEBUG  ] Loading entrypoint catalog from datalad.extensions
[DEBUG  ] Building doc for <class 'datalad_catalog.catalog.Catalog'>
[DEBUG  ] Loaded entrypoint catalog from datalad.extensions
[DEBUG  ] Loading entrypoint wackyextra from datalad.extensions
[DEBUG  ] Building doc for <class 'datalad_wackyextra.translate.Translate'>
[DEBUG  ] Loaded entrypoint wackyextra from datalad.extensions
[DEBUG  ] Done processing entrypoints
[DEBUG  ] Determined class of decorated function: <class 'datalad.distribution.get.Get'>
[DEBUG  ] Resolved dataset to get content of <<PosixPath('/Us++98 chars++on')>>: /Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super
[DEBUG  ] Determined class of decorated function: <class 'datalad.local.subdatasets.Subdatasets'>
[DEBUG  ] Resolved dataset to report on subdataset(s): /Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super
[DEBUG  ] Query subdatasets of Dataset(/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super)
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', 'config', '-z', '-l', '--file', '.gitmodules'] (protocol_class=GeneratorStdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super)
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', 'ls-files', '--stage', '-z', '--', 'data/AIDAqc_test_data', 'data/ds001499', 'data/human-connectome-project-openaccess', 'data/machinelearning-books', 'data/studyforrest-data'] (protocol_class=GeneratorStdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super)
[DEBUG  ] Run ['git', 'config', '-z', '-l', '--show-origin'] (protocol_class=StdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499)
[DEBUG  ] Finished ['git', 'config', '-z', '-l', '--show-origin'] with status 0
[DEBUG  ] Run ['git', 'config', '-z', '-l', '--show-origin', '--file', '/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499/.datalad/config'] (protocol_class=StdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499)
[DEBUG  ] Finished ['git', 'config', '-z', '-l', '--show-origin', '--file', '/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499/.datalad/config'] with status 0
[DEBUG  ] Determine what files match the query to work with
[DEBUG  ] Run ['git', 'annex', 'version', '--raw'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG  ] Finished ['git', 'annex', 'version', '--raw'] with status 0
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', '-c', 'annex.merge-annex-branches=false', 'annex', 'find', '--not', '--in', '.', '--json', '--json-error-messages', '-c', 'annex.dotfiles=true', '--', 'dataset_description.json'] (protocol_class=AnnexJsonProtocol) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499)
[DEBUG  ] Finished ['git', '-c', 'diff.ignoreSubmodules=none', '-c', 'annex.merge-annex-branches=false', 'annex', 'find', '--not', '--in', '.', '--json', '--json-error-messages', '-c', 'annex.dotfiles=true', '--', 'dataset_description.json'] with status 0
[DEBUG  ] No files found needing fetching.
[DEBUG  ] already present [get(/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499/dataset_description.json)]
action summary:
  get (notneeded: 2)
[DEBUG  ] Determined class of decorated function: <class 'datalad.distribution.get.Get'>
[DEBUG  ] Resolved dataset to get content of <<PosixPath('/Us++90 chars++sv')>>: /Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super
[DEBUG  ] Determined class of decorated function: <class 'datalad.local.subdatasets.Subdatasets'>
[DEBUG  ] Resolved dataset to report on subdataset(s): /Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super
[DEBUG  ] Query subdatasets of Dataset(/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super)
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', 'config', '-z', '-l', '--file', '.gitmodules'] (protocol_class=GeneratorStdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super)
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', 'ls-files', '--stage', '-z', '--', 'data/AIDAqc_test_data', 'data/ds001499', 'data/human-connectome-project-openaccess', 'data/machinelearning-books', 'data/studyforrest-data'] (protocol_class=GeneratorStdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super)
[DEBUG  ] Run ['git', 'config', '-z', '-l', '--show-origin'] (protocol_class=StdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499)
[DEBUG  ] Finished ['git', 'config', '-z', '-l', '--show-origin'] with status 0
[DEBUG  ] Run ['git', 'config', '-z', '-l', '--show-origin', '--file', '/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499/.datalad/config'] (protocol_class=StdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499)
[DEBUG  ] Finished ['git', 'config', '-z', '-l', '--show-origin', '--file', '/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499/.datalad/config'] with status 0
[DEBUG  ] Determine what files match the query to work with
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', '-c', 'annex.merge-annex-branches=false', 'annex', 'find', '--not', '--in', '.', '--json', '--json-error-messages', '-c', 'annex.dotfiles=true', '--', 'participants.tsv'] (protocol_class=AnnexJsonProtocol) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499)
[DEBUG  ] Finished ['git', '-c', 'diff.ignoreSubmodules=none', '-c', 'annex.merge-annex-branches=false', 'annex', 'find', '--not', '--in', '.', '--json', '--json-error-messages', '-c', 'annex.dotfiles=true', '--', 'participants.tsv'] with status 0
[DEBUG  ] No files found needing fetching.
[DEBUG  ] already present [get(/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499/participants.tsv)]
action summary:
  get (notneeded: 2)
bids_dataset metadata extraction:   0%|
/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/bids/layout/validation.py:131: UserWarning: Derivative indexing was requested, but no valid derivative datasets were found in the specified locations ([PosixPath('/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499/derivatives')]). Note that all BIDS-Derivatives datasets must meet all the requirements for BIDS-Raw datasets (a common problem is to fail to include a 'dataset_description.json' file in derivatives datasets).
Example contents of 'dataset_description.json':
{"Name": "Example dataset", "BIDSVersion": "1.0.2", "GeneratedBy": [{"Name": "Example pipeline"}]}
  warnings.warn("Derivative indexing was requested, but no valid "
{"action": "meta_extract", "metadata_record": {"agent_email": "[email protected]", "agent_name": "Stephan Heunis", "dataset_id": "ff750e89-09bf-48cc-b21c-fe94f071da00", "dataset_version": "75ce5bfa9380ff05a2046473cfa292f98f754596", "extracted_metadata": {"@context": {"@id": "https://doi.org/10.5281/zenodo.4710751", "description": "ad-hoc vocabulary for the Brain Imaging Data Structure (BIDS) standard v1.6.0", "type": "http://purl.org/dc/dcam/VocabularyEncodingScheme"}, "Acknowledgements": "We thank Scott Kurdilla for his patience as our MRI technologist throughout all data collection. We would also like to thank Austin Marcus for his assistance in various stages of this project, Jayanth Koushik for his assistance in AlexNet feature extractions, and Ana Van Gulick for her assistance with public data distribution and open science issues. Finally, we thank our participants for their participation and patience, without them this dataset would not have been possible.", "Authors": ["Nadine Chang", "John A. Pyles", "Austin Marcus", "Abhinav Gupta", "Michael J. Tarr", "Elissa M. Aminoff"], "BIDSVersion": "1.0.2", "DatasetDOI": "10.18112/openneuro.ds001499.v1.3.1", "Funding": "This dataset was collected with the support of NSF Award BCS-1439237 to Elissa M. Aminoff and Michael J. Tarr, ONR MURI N000141612007 and Sloan, Okawa Fellowship to Abhinav Gupta, and NSF Award BSC-1640681 to Michael Tarr.", "HowToAcknowledge": "Please cite our paper available on arXiv: http://arxiv.org/abs/1809.01281", "License": "CC0", "Name": "BOLD5000", "ReferencesAndLinks": ["https://bold5000.org"], "description": null, "entities": {"acquisition": ["spinecho", "spinechopf68", "AP", "PA"], "datatype": ["fmap", "func", "anat", "dwi"], "direction": ["AP", "PA"], "extension": [".json", ".tsv", ".nii.gz", ".tsv.gz", ".bval", ".bvec"], "fmap": ["epi"], "recording": ["cardiac", "respiratory", "trigger"], "run": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "session": ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "13", "14", "15", "16"], "subject": ["CSI1", "CSI2", "CSI3", "CSI4"], "suffix": ["description", "participants", "epi", "bold", "events", "physio", "T2w", "T1w", "dwi", "sessions"], "task": ["5000scenes", "localizer"]}, "variables": {"dataset": ["subject", "age", "handedness", "sex", "suffix"], "subject": ["subject", "session", "At this point in the day, you have eaten...", "Date", "Did you work out today?", "Do you drink alcoholic beverages?", "Do you drink caffeinated beverages (i.e. coffee, tea, coke, etc.)?", "Do you smoke?", "Duration (in seconds)", "End Date", "Have you taken ibuprofen today (e.g. Advil, Motrin)?", "How long ago was your last meal?", "How many hours of sleep did you get last night?", "If so, what was the activity, and how long ago?", "If so, when was the last time you had a caffeinated beverage?", "If so, when was the last time you had an alcoholic beverage?", "If so, when was the last time you smoked?", "If so, when was the last time you took it?", "In the scanner today I was mentally: - Click to write Choice 1", "In the scanner today I was physically: - Click to write Choice 1", "Is there anything you think we should know about your experience in the MRI (e.g. 
were you tired, confused about task instructions, etc)?", "Is this...", "Is this....1", "Is this....2", "Is this....3", "Is this....4", "Is this....5", "Please comment on which of these things, if any, are particularly different today and/or if you think they have affected your performance (for better or for worse):", "Progress", "Recorded Date", "Response ID", "Session", "Start Date", "Subject ID", "Time", "suffix"]}}, "extraction_parameter": {}, "extraction_time": 1680166013.927886, "extractor_name": "bids_dataset", "extractor_version": "0.0.1", "type": "dataset"}, "path": "/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super", "status": "ok", "type": "dataset"}

More info:

The superdataset's ID and VERSION are shown in the JSON output:

"dataset_id": "ff750e89-09bf-48cc-b21c-fe94f071da00", "dataset_version": "75ce5bfa9380ff05a2046473cfa292f98f754596",

The DataLad ID of the subdataset, as recorded in .gitmodules:

[submodule "data/ds001499"]
	path = data/ds001499
	url = https://github.com/OpenNeuroDatasets/ds001499.git
	datalad-id = 3e874376-b053-11e8-b9ac-0242ac130026
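
For reference, both identifiers can also be queried programmatically. A minimal sketch using the DataLad Python API (paths as in the logs above; 'gitmodule_datalad-id' is assumed to be the property name under which subdatasets reports the recorded ID):

from datalad.api import Dataset

sup = Dataset('/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super')
print(sup.id)  # -> ff750e89-09bf-48cc-b21c-fe94f071da00

# each result record carries the gitmodule_* properties from .gitmodules
for sub in sup.subdatasets(result_renderer='disabled'):
    print(sub['path'], sub.get('gitmodule_datalad-id'))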

Relevant comments

Comment 1

This same problem occurs when I run meta-conduct on the superdataset with traverser.traverse_sub_datasets=True (I actually first came across the issue when using meta-conduct):

#!/bin/zsh
 
EXTRACTOR=$1
DATASET_PATH=`pwd`
PIPELINE_PATH="$DATASET_PATH/code/extract_single_pipeline.json"
touch "$DATASET_PATH/outputs/dataset_metadata_$EXTRACTOR.jsonl"
touch "$DATASET_PATH/outputs/dataset_metadata_$EXTRACTOR.err"
datalad -f json meta-conduct "$PIPELINE_PATH" \
    traverser.top_level_dir=$DATASET_PATH \
    traverser.item_type=dataset \
    traverser.traverse_sub_datasets=True \
    extractor1.extractor_type=dataset \
    extractor1.extractor_name=$EXTRACTOR \
    > "$DATASET_PATH/outputs/dataset_metadata_$EXTRACTOR.jsonl" \
    2> "$DATASET_PATH/outputs/dataset_metadata_$EXTRACTOR.err"

Note that the output contains two extraction results with BIDS metadata: one for the superdataset and one for the subdataset. These objects differ in content; specifically, the superdataset object has a description field equal to null, while the subdataset object's description field contains the actual description content:

{"action": "meta_conduct", "message": "concurrent.futures.process._RemoteTraceback: \n\"\"\"\nTraceback (most recent call last):\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/concurrent/futures/process.py\", line 246, in _process_worker\n    r = call_item.fn(*call_item.args, **call_item.kwargs)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/pipeline/processor/base.py\", line 25, in execute\n    return context, self.process(pipeline_data)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/pipeline/processor/extract.py\", line 116, in process\n    for extract_result in meta_extract(**kwargs):\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/base.py\", line 773, in eval_func\n    return return_func(*args, **kwargs)\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/base.py\", line 763, in return_func\n    results = list(results)\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/base.py\", line 873, in _execute_command_\n    for r in _process_results(\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/utils.py\", line 319, in _process_results\n    for res in results:\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/extract.py\", line 321, in __call__\n    yield from do_extraction(ep=extraction_arguments)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/extract.py\", line 414, in do_extraction\n    yield from perform_metadata_extraction(ep, extractor)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/extract.py\", line 435, in perform_metadata_extraction\n    res = extractor.get_required_content()\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad_neuroimaging/extractors/bids_dataset.py\", line 150, in get_required_content\n    bids_dir = _find_bids_root(self.dataset.path)\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad_neuroimaging/extractors/bids_dataset.py\", line 308, in _find_bids_root\n    raise FileNotFoundError(msg)\nFileNotFoundError: The file 'participants.tsv' should be part of the BIDS dataset in order for the 'bids_dataset' extractor to function correctly\n\"\"\"\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/conduct.py\", line 378, in process_parallel\n    source_index, pipeline_data = future.result()\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/concurrent/futures/_base.py\", line 451, in result\n    return self.__get_result()\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/concurrent/futures/_base.py\", line 403, in __get_result\n    raise self._exception\nFileNotFoundError: The file 'participants.tsv' should be part of the BIDS dataset in order for the 'bids_dataset' extractor to function correctly\n", "status": "error"}
{"action": "meta_conduct", "message": "concurrent.futures.process._RemoteTraceback: \n\"\"\"\nTraceback (most recent call last):\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/concurrent/futures/process.py\", line 246, in _process_worker\n    r = call_item.fn(*call_item.args, **call_item.kwargs)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/pipeline/processor/base.py\", line 25, in execute\n    return context, self.process(pipeline_data)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/pipeline/processor/extract.py\", line 116, in process\n    for extract_result in meta_extract(**kwargs):\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/base.py\", line 773, in eval_func\n    return return_func(*args, **kwargs)\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/base.py\", line 763, in return_func\n    results = list(results)\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/base.py\", line 873, in _execute_command_\n    for r in _process_results(\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/utils.py\", line 319, in _process_results\n    for res in results:\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/extract.py\", line 321, in __call__\n    yield from do_extraction(ep=extraction_arguments)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/extract.py\", line 414, in do_extraction\n    yield from perform_metadata_extraction(ep, extractor)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/extract.py\", line 435, in perform_metadata_extraction\n    res = extractor.get_required_content()\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad_neuroimaging/extractors/bids_dataset.py\", line 150, in get_required_content\n    bids_dir = _find_bids_root(self.dataset.path)\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad_neuroimaging/extractors/bids_dataset.py\", line 308, in _find_bids_root\n    raise FileNotFoundError(msg)\nFileNotFoundError: The file 'participants.tsv' should be part of the BIDS dataset in order for the 'bids_dataset' extractor to function correctly\n\"\"\"\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/conduct.py\", line 378, in process_parallel\n    source_index, pipeline_data = future.result()\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/concurrent/futures/_base.py\", line 451, in result\n    return self.__get_result()\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/concurrent/futures/_base.py\", line 403, in __get_result\n    raise self._exception\nFileNotFoundError: The file 'participants.tsv' should be part of the BIDS dataset in order for the 'bids_dataset' extractor to function correctly\n", "status": "error"}
{"action": "meta_conduct", "message": "concurrent.futures.process._RemoteTraceback: \n\"\"\"\nTraceback (most recent call last):\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/concurrent/futures/process.py\", line 246, in _process_worker\n    r = call_item.fn(*call_item.args, **call_item.kwargs)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/pipeline/processor/base.py\", line 25, in execute\n    return context, self.process(pipeline_data)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/pipeline/processor/extract.py\", line 116, in process\n    for extract_result in meta_extract(**kwargs):\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/base.py\", line 773, in eval_func\n    return return_func(*args, **kwargs)\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/base.py\", line 763, in return_func\n    results = list(results)\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/base.py\", line 873, in _execute_command_\n    for r in _process_results(\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/utils.py\", line 319, in _process_results\n    for res in results:\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/extract.py\", line 321, in __call__\n    yield from do_extraction(ep=extraction_arguments)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/extract.py\", line 414, in do_extraction\n    yield from perform_metadata_extraction(ep, extractor)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/extract.py\", line 435, in perform_metadata_extraction\n    res = extractor.get_required_content()\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad_neuroimaging/extractors/bids_dataset.py\", line 150, in get_required_content\n    bids_dir = _find_bids_root(self.dataset.path)\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad_neuroimaging/extractors/bids_dataset.py\", line 308, in _find_bids_root\n    raise FileNotFoundError(msg)\nFileNotFoundError: The file 'participants.tsv' should be part of the BIDS dataset in order for the 'bids_dataset' extractor to function correctly\n\"\"\"\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/conduct.py\", line 378, in process_parallel\n    source_index, pipeline_data = future.result()\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/concurrent/futures/_base.py\", line 451, in result\n    return self.__get_result()\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/concurrent/futures/_base.py\", line 403, in __get_result\n    raise self._exception\nFileNotFoundError: The file 'participants.tsv' should be part of the BIDS dataset in order for the 'bids_dataset' extractor to function correctly\n", "status": "error"}
{"action": "meta_conduct", "message": "concurrent.futures.process._RemoteTraceback: \n\"\"\"\nTraceback (most recent call last):\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/concurrent/futures/process.py\", line 246, in _process_worker\n    r = call_item.fn(*call_item.args, **call_item.kwargs)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/pipeline/processor/base.py\", line 25, in execute\n    return context, self.process(pipeline_data)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/pipeline/processor/extract.py\", line 116, in process\n    for extract_result in meta_extract(**kwargs):\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/base.py\", line 773, in eval_func\n    return return_func(*args, **kwargs)\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/base.py\", line 763, in return_func\n    results = list(results)\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/base.py\", line 873, in _execute_command_\n    for r in _process_results(\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad/interface/utils.py\", line 319, in _process_results\n    for res in results:\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/extract.py\", line 321, in __call__\n    yield from do_extraction(ep=extraction_arguments)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/extract.py\", line 414, in do_extraction\n    yield from perform_metadata_extraction(ep, extractor)\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/extract.py\", line 435, in perform_metadata_extraction\n    res = extractor.get_required_content()\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad_neuroimaging/extractors/bids_dataset.py\", line 150, in get_required_content\n    bids_dir = _find_bids_root(self.dataset.path)\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/site-packages/datalad_neuroimaging/extractors/bids_dataset.py\", line 308, in _find_bids_root\n    raise FileNotFoundError(msg)\nFileNotFoundError: The file 'participants.tsv' should be part of the BIDS dataset in order for the 'bids_dataset' extractor to function correctly\n\"\"\"\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/Users/jsheunis/Documents/psyinf/dl-meta/datalad_metalad/conduct.py\", line 378, in process_parallel\n    source_index, pipeline_data = future.result()\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/concurrent/futures/_base.py\", line 451, in result\n    return self.__get_result()\n  File \"/Users/jsheunis/opt/miniconda3/envs/catalog-demo/lib/python3.10/concurrent/futures/_base.py\", line 403, in __get_result\n    raise self._exception\nFileNotFoundError: The file 'participants.tsv' should be part of the BIDS dataset in order for the 'bids_dataset' extractor to function correctly\n", "status": "error"}
{"action": "meta_conduct", "path": "/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499", "pipeline_data": {"result": {"dataset-traversal-record": [{"state": "SUCCESS"}], "metadata": [{"metadata_record": {"agent_email": "[email protected]", "agent_name": "Stephan Heunis", "dataset_id": "3e874376-b053-11e8-b9ac-0242ac130026", "dataset_version": "5be66b27ab5e033e9163caa94cec882bd4cee1d0", "extracted_metadata": {"@context": {"@id": "https://doi.org/10.5281/zenodo.4710751", "description": "ad-hoc vocabulary for the Brain Imaging Data Structure (BIDS) standard v1.6.0", "type": "http://purl.org/dc/dcam/VocabularyEncodingScheme"}, "Acknowledgements": "We thank Scott Kurdilla for his patience as our MRI technologist throughout all data collection. We would also like to thank Austin Marcus for his assistance in various stages of this project, Jayanth Koushik for his assistance in AlexNet feature extractions, and Ana Van Gulick for her assistance with public data distribution and open science issues. Finally, we thank our participants for their participation and patience, without them this dataset would not have been possible.", "Authors": ["Nadine Chang", "John A. Pyles", "Austin Marcus", "Abhinav Gupta", "Michael J. Tarr", "Elissa M. Aminoff"], "BIDSVersion": "1.0.2", "DatasetDOI": "10.18112/openneuro.ds001499.v1.3.1", "Funding": "This dataset was collected with the support of NSF Award BCS-1439237 to Elissa M. Aminoff and Michael J. Tarr, ONR MURI N000141612007 and Sloan, Okawa Fellowship to Abhinav Gupta, and NSF Award BSC-1640681 to Michael Tarr.", "HowToAcknowledge": "Please cite our paper available on arXiv: http://arxiv.org/abs/1809.01281", "License": "CC0", "Name": "BOLD5000", "ReferencesAndLinks": ["https://bold5000.org"], "description": [{"extension": "", "text": "BOLD5000: Brains, Objects, Landscapes Dataset\n\nFor details please refer to BOLD5000.org and our paper on arXiv (http://arxiv.org/abs/1809.01281)\n\n*Participant Directories Content*\n1) Four participants: CSI1, CSI2, CSI3, & CSI4\n2) Functional task data acquisition sessions: sessions datalad/datalad-metalad#1-15\nEach functional session includes:\n-3 sets of fieldmaps (EPI opposite phase encoding; spin-echo opposite phase encoding pairs with partial & non-partial Fourier)\n-9 or 10 functional scans of slow event-related 5000 scene data (5000scenes)\n-1 or 0 functional localizer scans used to define scene selective regions (localizer)\n-each event.json file lists each stimulus, the onset time, and the participant\u2019s response (participants performed a simple valence task) \n3) Anatomical data acquisition session: datalad/datalad-metalad#16\nAnatomical Data: T1 weighted MPRAGE scan, a T2 weighted SPACE, diffusion spectrum imaging   \n\nNotes:\n-All MRI and fMRI data provided is with Siemens pre-scan normalization filter.  \n-CSI4 only participated in 10 MRI sessions: 1-9 were functional acquisition sessions, and 10 was the anatomical data acquisition session.\n\n*Derivatives Directory Content*\n1) fMRIprep: \n-Preprocessed data for all functional data of CSI1 through CSI4 (listed in folders for each participant: derivatives/fmriprep/sub-CSIX). Data was preprocessed both in T1w image space and on surface space. Functional data was motion corrected, susceptibility distortion corrected, and aligned to the anatomical data using bbregister. 
Please refer to the paper for the details on preprocessing.\n-Reports resulting from fMRI prep, which include the success of anatomical alignment and distortion correction, among other measures of preprocessing success are all listed in the sub-CSIX.html files.  \n2) Freesurfer: Freesurfer reconstructions as a result of fMRIprep preprocessing stream. \n3) MRIQC: Image quality metrics (IQMs) of the dataset using MRIQC. \n-CSIX-func.csv files are text files with a list of all IQMs for each session, for each run.\n-CSIX-anat.csv files are text files with a list of all IQMs for the scans acquired in the anatomical session (e.g., MPRAGE). \n-CSIX_IQM.xls an excel workbook, each sheet of workbook lists the IQMs for a single run. This is the same data as CSIX-func.csv, except formatted differently. \n-sub-CSIX/derivatives: contain .json with the MRIQC/IQM results for each run. \n-sub-CSIX/reports: contains .html file with MRIQC/IQM results for each run along with mean signal and standard deviation maps. \n4)spm: A directory that contains the masks used to define each region of interest (ROI) in each participant. There were 10 ROIs: early visual (EarlyVis), lateral occipital cortex (LOC), occipital place area (OPA), parahippocampal place area (PPA), retrosplenial complex (RSC) for the left hemisphere (LH) and right hemisphere (RH)."}], "entities": {"acquisition": ["spinecho", "spinechopf68", "AP", "PA"], "datatype": ["fmap", "func", "anat", "dwi"], "direction": ["AP", "PA"], "extension": [".json", ".tsv", ".nii.gz", ".tsv.gz", ".bval", ".bvec"], "fmap": ["epi"], "recording": ["cardiac", "respiratory", "trigger"], "run": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "session": ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "13", "14", "15", "16"], "subject": ["CSI1", "CSI2", "CSI3", "CSI4"], "suffix": ["description", "participants", "epi", "bold", "events", "physio", "T2w", "T1w", "dwi", "sessions"], "task": ["5000scenes", "localizer"]}, "variables": {"dataset": ["subject", "age", "handedness", "sex", "suffix"], "subject": ["session", "subject", "At this point in the day, you have eaten...", "Date", "Did you work out today?", "Do you drink alcoholic beverages?", "Do you drink caffeinated beverages (i.e. coffee, tea, coke, etc.)?", "Do you smoke?", "Duration (in seconds)", "End Date", "Have you taken ibuprofen today (e.g. Advil, Motrin)?", "How long ago was your last meal?", "How many hours of sleep did you get last night?", "If so, what was the activity, and how long ago?", "If so, when was the last time you had a caffeinated beverage?", "If so, when was the last time you had an alcoholic beverage?", "If so, when was the last time you smoked?", "If so, when was the last time you took it?", "In the scanner today I was mentally: - Click to write Choice 1", "In the scanner today I was physically: - Click to write Choice 1", "Is there anything you think we should know about your experience in the MRI (e.g. 
were you tired, confused about task instructions, etc)?", "Is this...", "Is this....1", "Is this....2", "Is this....3", "Is this....4", "Is this....5", "Please comment on which of these things, if any, are particularly different today and/or if you think they have affected your performance (for better or for worse):", "Progress", "Recorded Date", "Response ID", "Session", "Start Date", "Subject ID", "Time", "suffix"]}}, "extraction_parameter": {}, "extraction_time": 1680124486.907684, "extractor_name": "bids_dataset", "extractor_version": "0.0.1", "type": "dataset"}, "path": "/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499", "state": "SUCCESS"}], "path": "/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499"}, "state": "CONTINUE"}, "status": "ok"}
action summary:
  get (notneeded: 2)
action summary:
  get (notneeded: 2)
{"action": "meta_conduct", "path": "/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super", "pipeline_data": {"result": {"dataset-traversal-record": [{"state": "SUCCESS"}], "metadata": [{"metadata_record": {"agent_email": "[email protected]", "agent_name": "Stephan Heunis", "dataset_id": "ff750e89-09bf-48cc-b21c-fe94f071da00", "dataset_version": "75ce5bfa9380ff05a2046473cfa292f98f754596", "extracted_metadata": {"@context": {"@id": "https://doi.org/10.5281/zenodo.4710751", "description": "ad-hoc vocabulary for the Brain Imaging Data Structure (BIDS) standard v1.6.0", "type": "http://purl.org/dc/dcam/VocabularyEncodingScheme"}, "Acknowledgements": "We thank Scott Kurdilla for his patience as our MRI technologist throughout all data collection. We would also like to thank Austin Marcus for his assistance in various stages of this project, Jayanth Koushik for his assistance in AlexNet feature extractions, and Ana Van Gulick for her assistance with public data distribution and open science issues. Finally, we thank our participants for their participation and patience, without them this dataset would not have been possible.", "Authors": ["Nadine Chang", "John A. Pyles", "Austin Marcus", "Abhinav Gupta", "Michael J. Tarr", "Elissa M. Aminoff"], "BIDSVersion": "1.0.2", "DatasetDOI": "10.18112/openneuro.ds001499.v1.3.1", "Funding": "This dataset was collected with the support of NSF Award BCS-1439237 to Elissa M. Aminoff and Michael J. Tarr, ONR MURI N000141612007 and Sloan, Okawa Fellowship to Abhinav Gupta, and NSF Award BSC-1640681 to Michael Tarr.", "HowToAcknowledge": "Please cite our paper available on arXiv: http://arxiv.org/abs/1809.01281", "License": "CC0", "Name": "BOLD5000", "ReferencesAndLinks": ["https://bold5000.org"], "description": null, "entities": {"acquisition": ["spinecho", "spinechopf68", "AP", "PA"], "datatype": ["fmap", "func", "anat", "dwi"], "direction": ["AP", "PA"], "extension": [".json", ".tsv", ".nii.gz", ".tsv.gz", ".bval", ".bvec"], "fmap": ["epi"], "recording": ["cardiac", "respiratory", "trigger"], "run": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "session": ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "13", "14", "15", "16"], "subject": ["CSI1", "CSI2", "CSI3", "CSI4"], "suffix": ["description", "participants", "epi", "bold", "events", "physio", "T2w", "T1w", "dwi", "sessions"], "task": ["5000scenes", "localizer"]}, "variables": {"dataset": ["subject", "age", "handedness", "sex", "suffix"], "subject": ["subject", "session", "At this point in the day, you have eaten...", "Date", "Did you work out today?", "Do you drink alcoholic beverages?", "Do you drink caffeinated beverages (i.e. coffee, tea, coke, etc.)?", "Do you smoke?", "Duration (in seconds)", "End Date", "Have you taken ibuprofen today (e.g. Advil, Motrin)?", "How long ago was your last meal?", "How many hours of sleep did you get last night?", "If so, what was the activity, and how long ago?", "If so, when was the last time you had a caffeinated beverage?", "If so, when was the last time you had an alcoholic beverage?", "If so, when was the last time you smoked?", "If so, when was the last time you took it?", "In the scanner today I was mentally: - Click to write Choice 1", "In the scanner today I was physically: - Click to write Choice 1", "Is there anything you think we should know about your experience in the MRI (e.g. 
were you tired, confused about task instructions, etc)?", "Is this...", "Is this....1", "Is this....2", "Is this....3", "Is this....4", "Is this....5", "Please comment on which of these things, if any, are particularly different today and/or if you think they have affected your performance (for better or for worse):", "Progress", "Recorded Date", "Response ID", "Session", "Start Date", "Subject ID", "Time", "suffix"]}}, "extraction_parameter": {}, "extraction_time": 1680124488.3530781, "extractor_name": "bids_dataset", "extractor_version": "0.0.1", "type": "dataset"}, "path": "/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super", "state": "SUCCESS"}], "path": "/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super"}, "state": "CONTINUE"}, "status": "ok"}
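
To make that difference easy to spot, a small helper can scan the JSONL output written by the script above (hypothetical, not part of metalad; it assumes the result structure shown here):

import json

with open('outputs/dataset_metadata_bids_dataset.jsonl') as f:
    for line in f:
        rec = json.loads(line)
        if rec.get('status') != 'ok':
            continue  # skip the FileNotFoundError results
        for md in rec['pipeline_data']['result'].get('metadata', []):
            record = md['metadata_record']
            desc = record['extracted_metadata'].get('description')
            print(record['dataset_id'], '->',
                  'null' if desc is None else type(desc).__name__)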

Comment 2

The problem seems to occur only for bids_dataset and not for other extractors. I created an analogous test with metalad_studyminimeta (i.e., a superdataset with no metadata and a subdataset with a .studyminimeta.yaml file). This only reported that extraction was not possible, since the required metadata file is absent from the superdataset:

datalad -f json -l debug meta-extract -d minimeta_test_super metalad_studyminimeta
Meta-extract output:
[DEBUG  ] Command line args 1st pass for DataLad 0.18.3. Parsed: Namespace(common_result_renderer='json') Unparsed: ['meta-extract', '-d', '.', 'metalad_studyminimeta']
[DEBUG  ] Processing entrypoints
[DEBUG  ] Loading entrypoint deprecated from datalad.extensions
[DEBUG  ] Loaded entrypoint deprecated from datalad.extensions
[DEBUG  ] Loading entrypoint metalad from datalad.extensions
[DEBUG  ] Loaded entrypoint metalad from datalad.extensions
[DEBUG  ] Loading entrypoint neuroimaging from datalad.extensions
[DEBUG  ] Loaded entrypoint neuroimaging from datalad.extensions
[DEBUG  ] Loading entrypoint catalog from datalad.extensions
[DEBUG  ] Loaded entrypoint catalog from datalad.extensions
[DEBUG  ] Loading entrypoint wackyextra from datalad.extensions
[DEBUG  ] Loaded entrypoint wackyextra from datalad.extensions
[DEBUG  ] Done processing entrypoints
[DEBUG  ] Building doc for <class 'datalad_metalad.extract.Extract'>
[DEBUG  ] Parsing known args among ['/Users/jsheunis/opt/miniconda3/envs/catalog-demo/bin/datalad', '-f', 'json', '-l', 'debug', 'meta-extract', '-d', '.', 'metalad_studyminimeta']
[DEBUG  ] Determined class of decorated function: <class 'datalad_metalad.extract.Extract'>
[DEBUG  ] Run ['git', 'config', '-z', '-l', '--show-origin'] (protocol_class=StdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test)
[DEBUG  ] Finished ['git', 'config', '-z', '-l', '--show-origin'] with status 0
[DEBUG  ] Run ['git', 'config', '-z', '-l', '--show-origin', '--file', '/Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test/.datalad/config'] (protocol_class=StdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test)
[DEBUG  ] Finished ['git', 'config', '-z', '-l', '--show-origin', '--file', '/Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test/.datalad/config'] with status 0
[DEBUG  ] Resolved dataset to extract metadata: /Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', 'rev-parse', '--quiet', '--verify', 'HEAD^{commit}'] (protocol_class=GeneratorStdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test)
[DEBUG  ] Using metadata extractor metalad_studyminimeta from distribution datalad-metalad
[DEBUG  ] performing legacy dataset-level metadata extraction (metalad_studyminimeta) for dataset at /Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test
[DEBUG  ] Importing datalad.api to possibly discover possibly not yet bound method 'subdatasets'
[DEBUG  ] Building doc for <class 'datalad.core.local.create.Create'>
[DEBUG  ] Building doc for <class 'datalad.core.local.status.Status'>
[DEBUG  ] Building doc for <class 'datalad.core.local.save.Save'>
[DEBUG  ] Building doc for <class 'datalad.core.distributed.clone.Clone'>
[DEBUG  ] Building doc for <class 'datalad.local.subdatasets.Subdatasets'>
[DEBUG  ] Building doc for <class 'datalad.distribution.get.Get'>
[DEBUG  ] Building doc for <class 'datalad.core.local.diff.Diff'>
[DEBUG  ] Building doc for <class 'datalad.core.distributed.push.Push'>
[DEBUG  ] Building doc for <class 'datalad.distribution.install.Install'>
[DEBUG  ] Building doc for <class 'datalad.local.unlock.Unlock'>
[DEBUG  ] Building doc for <class 'datalad.core.local.run.Run'>
[DEBUG  ] Failed to import requests_ftp, thus no ftp support: ModuleNotFoundError(No module named 'requests_ftp')
[DEBUG  ] Building doc for <class 'datalad.distributed.create_sibling_github.CreateSiblingGithub'>
[DEBUG  ] Building doc for <class 'datalad.distribution.update.Update'>
[DEBUG  ] Building doc for <class 'datalad.distribution.siblings.Siblings'>
[DEBUG  ] Building doc for <class 'datalad.distributed.create_sibling_gitlab.CreateSiblingGitlab'>
[DEBUG  ] Building doc for <class 'datalad.distributed.create_sibling_gogs.CreateSiblingGogs'>
[DEBUG  ] Building doc for <class 'datalad.distributed.create_sibling_gin.CreateSiblingGin'>
[DEBUG  ] Building doc for <class 'datalad.distributed.create_sibling_gitea.CreateSiblingGitea'>
[DEBUG  ] Building doc for <class 'datalad.distributed.create_sibling_ria.CreateSiblingRia'>
[DEBUG  ] Building doc for <class 'datalad.distribution.create_sibling.CreateSibling'>
[DEBUG  ] Building doc for <class 'datalad.distributed.drop.Drop'>
[DEBUG  ] Building doc for <class 'datalad.local.remove.Remove'>
[DEBUG  ] Building doc for <class 'datalad.local.addurls.Addurls'>
[DEBUG  ] Building doc for <class 'datalad.local.copy_file.CopyFile'>
[DEBUG  ] Building doc for <class 'datalad.local.download_url.DownloadURL'>
[DEBUG  ] Building doc for <class 'datalad.local.foreach_dataset.ForEachDataset'>
[DEBUG  ] Building doc for <class 'datalad.local.rerun.Rerun'>
[DEBUG  ] Building doc for <class 'datalad.local.run_procedure.RunProcedure'>
[DEBUG  ] Building doc for <class 'datalad.local.configuration.Configuration'>
[DEBUG  ] Building doc for <class 'datalad.local.wtf.WTF'>
[DEBUG  ] Building doc for <class 'datalad.local.clean.Clean'>
[DEBUG  ] Building doc for <class 'datalad.local.add_archive_content.AddArchiveContent'>
[DEBUG  ] Building doc for <class 'datalad.local.add_readme.AddReadme'>
[DEBUG  ] Building doc for <class 'datalad.local.export_archive.ExportArchive'>
[DEBUG  ] Building doc for <class 'datalad.distributed.export_archive_ora.ExportArchiveORA'>
[DEBUG  ] Building doc for <class 'datalad.distributed.export_to_figshare.ExportToFigshare'>
[DEBUG  ] Building doc for <class 'datalad.local.no_annex.NoAnnex'>
[DEBUG  ] Building doc for <class 'datalad.local.check_dates.CheckDates'>
[DEBUG  ] Building doc for <class 'datalad.distribution.uninstall.Uninstall'>
[DEBUG  ] Building doc for <class 'datalad.distribution.create_test_dataset.CreateTestDataset'>
[DEBUG  ] Building doc for <class 'datalad.support.sshrun.SSHRun'>
[DEBUG  ] Building doc for <class 'datalad.interface.shell_completion.ShellCompletion'>
[DEBUG  ] Processing entrypoints
[DEBUG  ] Loading entrypoint deprecated from datalad.extensions
[DEBUG  ] Building doc for <class 'datalad_deprecated.ls.Ls'>
[DEBUG  ] Building doc for <class 'datalad_deprecated.annotate_paths.AnnotatePaths'>
[DEBUG  ] Building doc for <class 'datalad_deprecated.publish.Publish'>
[DEBUG  ] Building doc for <class 'datalad_metalad.dump.Dump'>
[DEBUG  ] Building doc for <class 'datalad_deprecated.metadata.metadata.Metadata'>
[DEBUG  ] Building doc for <class 'datalad_deprecated.metadata.search.Search'>
[DEBUG  ] Building doc for <class 'datalad_deprecated.metadata.extract_metadata.ExtractMetadata'>
[DEBUG  ] Building doc for <class 'datalad_deprecated.metadata.aggregate.AggregateMetaData'>
[DEBUG  ] Loaded entrypoint deprecated from datalad.extensions
[DEBUG  ] Loading entrypoint metalad from datalad.extensions
[DEBUG  ] Building doc for <class 'datalad_metalad.aggregate.Aggregate'>
[DEBUG  ] Building doc for <class 'datalad_metalad.add.Add'>
[DEBUG  ] Building doc for <class 'datalad_metalad.conduct.Conduct'>
[DEBUG  ] Building doc for <class 'datalad_metalad.filter.Filter'>
[DEBUG  ] Loaded entrypoint metalad from datalad.extensions
[DEBUG  ] Loading entrypoint neuroimaging from datalad.extensions
[DEBUG  ] Building doc for <class 'datalad_neuroimaging.bids2scidata.BIDS2Scidata'>
[DEBUG  ] Loaded entrypoint neuroimaging from datalad.extensions
[DEBUG  ] Loading entrypoint catalog from datalad.extensions
[DEBUG  ] Building doc for <class 'datalad_catalog.catalog.Catalog'>
[DEBUG  ] Loaded entrypoint catalog from datalad.extensions
[DEBUG  ] Loading entrypoint wackyextra from datalad.extensions
[DEBUG  ] Building doc for <class 'datalad_wackyextra.translate.Translate'>
[DEBUG  ] Loaded entrypoint wackyextra from datalad.extensions
[DEBUG  ] Done processing entrypoints
[DEBUG  ] Determined class of decorated function: <class 'datalad.local.subdatasets.Subdatasets'>
[DEBUG  ] Resolved dataset to report on subdataset(s): /Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test
[DEBUG  ] Query subdatasets of Dataset(/Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test)
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', 'config', '-z', '-l', '--file', '.gitmodules'] (protocol_class=GeneratorStdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test)
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', 'ls-files', '--stage', '-z', '--', 'super-duper-octo-engine'] (protocol_class=GeneratorStdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test)
{"action": "meta_extract", "message": "file /Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test/.studyminimeta.yaml could not be opened", "path": "/Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test", "status": "error", "type": "dataset"}
[DEBUG  ] could not perform all requested actions: IncompleteResultsError(Command did not complete successfully. 1 failed:
[{'action': 'meta_extract',
  'message': 'file '
             '/Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test/.studyminimeta.yaml '
             'could not be opened',
  'path': '/Users/jsheunis/Documents/psyinf/Data/duplicate-extraction-test',
  'status': 'error',
  'type': 'dataset'}])
Studyminimeta metadata extraction:   0%|
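
For completeness, the analogous test setup can be created with a few DataLad Python API calls (a sketch; paths and YAML content are illustrative):

import datalad.api as dl

sup = dl.create('minimeta_test_super')  # superdataset without a metadata file
sub = sup.create('sub')                 # registered subdataset
(sub.pathobj / '.studyminimeta.yaml').write_text('study:\n  name: test\n')
sub.save(message='add studyminimeta file')
sup.save(recursive=True, message='record subdataset state')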

Comment 3

The above comment suggests the problem lies in the extractor code. But something that confuses me in the initial meta-extract debug logs is how the process dives into the subdatasets:

[DEBUG  ] Query subdatasets of Dataset(/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super)
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', 'config', '-z', '-l', '--file', '.gitmodules'] (protocol_class=GeneratorStdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super)
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', 'ls-files', '--stage', '-z', '--', 'data/AIDAqc_test_data', 'data/ds001499', 'data/human-connectome-project-openaccess', 'data/machinelearning-books', 'data/studyforrest-data'] (protocol_class=GeneratorStdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super)
[DEBUG  ] Run ['git', 'config', '-z', '-l', '--show-origin'] (protocol_class=StdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499)
[DEBUG  ] Finished ['git', 'config', '-z', '-l', '--show-origin'] with status 0
[DEBUG  ] Run ['git', 'config', '-z', '-l', '--show-origin', '--file', '/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499/.datalad/config'] (protocol_class=StdOutErrCapture) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499)
[DEBUG  ] Finished ['git', 'config', '-z', '-l', '--show-origin', '--file', '/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499/.datalad/config'] with status 0
[DEBUG  ] Determine what files match the query to work with
[DEBUG  ] Run ['git', 'annex', 'version', '--raw'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG  ] Finished ['git', 'annex', 'version', '--raw'] with status 0
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', '-c', 'annex.merge-annex-branches=false', 'annex', 'find', '--not', '--in', '.', '--json', '--json-error-messages', '-c', 'annex.dotfiles=true', '--', 'dataset_description.json'] (protocol_class=AnnexJsonProtocol) (cwd=/Users/jsheunis/Documents/psyinf/Data/datalad-catalog-demo-super/data/ds001499)
[DEBUG  ] Finished ['git', '-c', 'diff.ignoreSubmodules=none', '-c', 'annex.merge-annex-branches=false', 'annex', 'find', '--not', '--in', '.', '--json', '--json-error-messages', '-c', 'annex.dotfiles=true', '--', 'dataset_description.json'] with status 0

I'm not sure why/how this happens.

bids extractor: susceptible to symlinks in the path

What is the problem?

It seems that path resolution/matching happens somewhere, and symlinks in the path get resolved, but not consistently, which results in tests failing for me. Don't those tests also run on Travis in a similar setup?

$> TMPDIR=/tmp python -m nose -s -v --pdb datalad/metadata/extractors/tests/test_bids.py 
datalad.metadata.extractors.tests.test_bids.test_get_metadata ... ok
datalad.metadata.extractors.tests.test_bids.test_get_metadata_with_README ... ok
datalad.metadata.extractors.tests.test_bids.test_get_metadata_with_description_and_README ... ok
Versions: appdirs=1.4.3 boto=2.44.0 cmd:annex=6.20180206+gitg638032f3a-1~ndall+1 cmd:git=2.11.0 cmd:system-git=2.15.1 cmd:system-ssh=7.6p1 git=2.1.8 gitdb=2.0.2 humanize=0.5.1 iso8601=0.1.11 msgpack=0.4.8 requests=2.18.4 scrapy=1.4.0 six=1.11.0 tqdm=4.19.5 wrapt=1.10.11

----------------------------------------------------------------------
Ran 3 tests in 3.322s

OK

passes, but with a symlinked /tmp it fails:

$> TMPDIR=$HOME/.tmp python -m nose -s -v --pdb datalad/metadata/extractors/tests/test_bids.py 
datalad.metadata.extractors.tests.test_bids.test_get_metadata ... > /home/yoh/proj/bids/pybids/bids/grabbids/bids_layout.py(64)_get_nearest_helper()
-> if 'type' not in self.files[path].entities:
(Pdb) p self.files
{}
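
The suspected mechanism can be shown in isolation (a minimal illustration, independent of the test suite): once any component of a path is a symlink, the path and its resolved form differ, so an index keyed on one form misses lookups done with the other.

import os
import os.path as op
import tempfile

real = tempfile.mkdtemp()  # a real directory, e.g. /tmp/xxxx
link = real + '.lnk'
os.symlink(real, link)     # stands in for a symlinked TMPDIR

p = op.join(link, 'ds')
print(p == op.realpath(p))  # False: realpath swaps link -> real
# a dict keyed on op.realpath(p) will not contain p, and vice versa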

pybids API is changing

With bids 0.6.5+89.gb25be29, we get:

    from bids.grabbids import BIDSLayout
ImportError: No module named grabbids

due to the 0.7 changes in bids-standard/pybids#247:

The module names are simplified: grabbids is now gone (in favor of layout) and bidslayout.py and bidsvalidator.py are just layout.py and validation.py. A deprecation warning is no longer issued (we had said this change would happen in 0.8, but I think it's better to introduce it early and have people fix all of the breaking changes at once).
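
Until the extension requires pybids >= 0.7 outright, a compatibility import along these lines would cover both generations (a sketch, not the code actually adopted):

try:
    from bids.layout import BIDSLayout    # pybids >= 0.7
except ImportError:
    from bids.grabbids import BIDSLayout  # pybids < 0.7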

DICOM import failed with datalad hirni-import-dcm

Hi guys,
We have successfully imported most of our acquisitions with datalad hirni-import-dcm. However, two acquisitions resulted in errors. The datalad hirni commands do not differ, and the DICOM acquisitions themselves seem to be OK.

datalad hirni-import-dcm --anon-subject 001 /path/to/file.tar ab01
[Hirni ADDONS] Running following command:
[Hirni ADDONS] datalad hirni-import-dcm --anon-subject 001 /path/to/file.tar ab01
[INFO ] Creating a new annex repo at /sourcedata/ab01/dicoms
[INFO ] Adding content of the archive ab01_dicomsorted_nr111.tar into annex <AnnexRepo path=/sourcedata/ab01/dicoms (<class 'datalad.support.annexrepo.A$
[INFO ] Finished adding ab01_dicomsorted_nr111.tar: Files processed: 3200, +annex: 3200
Metadata aggregation: 0%| | 0.00/1.00 [00:00<?, ? datasets/s
[ERROR ] Failed to get metadata (dicom): Unknown Value Representation '0x00 0x00' in tag (0000, 0000) [dataelem.py:DataElement_from_raw:759] [meta_extract(/sourcedata/ab01/dicoms)] ########################################################8| 3.20k/3.20k [03:28<00:00, 25.1 Files/s]

datalad hirni-import-dcm --anon-subject 002 /path/to/file.tar yx99
[Hirni ADDONS] Running following command:
[Hirni ADDONS] datalad hirni-import-dcm --anon-subject 002 /path/to/file.tar yx99
[INFO ] Creating a new annex repo at /sourcedata/yx99/dicoms
[INFO ] Adding content of the archive yx99_dicomsorted_nr222.tar into annex <AnnexRepo path=/sourcedata/yx99/dicoms (<class 'datalad.support.annexrepo.A$
[INFO ] Finished adding yx99_dicomsorted_nr222.tar: Files processed: 3331, +annex: 3331
here
[ERROR ] 'id' [spec_helpers.py:get_specval:33] (KeyError)

The DICOMs are imported, with all subfolders of the different measurements, but in both cases the studyspec.json files are not created.

I would be grateful for any help.

Best,
Linda
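
One way to narrow this down, independent of hirni, is to scan the imported DICOMs with pydicom directly and print the files that trip the parser (a sketch; the path is the one from the first log above):

from pathlib import Path

import pydicom

for f in Path('/sourcedata/ab01/dicoms').rglob('*'):
    if not f.is_file():
        continue
    try:
        pydicom.dcmread(f, stop_before_pixels=True)
    except Exception as exc:
        print(f, '->', exc)  # e.g. Unknown Value Representation '0x00 0x00' ...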

Docs failing "no module named `numpy`"

See https://readthedocs.org/projects/datalad-neuroimaging/builds/

Excerpt:

Installed /home/docs/checkouts/readthedocs.org/user_builds/datalad-neuroimaging/envs/latest/lib/python3.7/site-packages/datalad_neuroimaging-0.3.1-py3.7.egg
Processing dependencies for datalad-neuroimaging==0.3.1
Searching for pandas
Reading https://pypi.org/simple/pandas/
Downloading https://files.pythonhosted.org/packages/4d/aa/e7078569d20f45e8cf6512a24bf2945698f13a7975650773c01366ea96dc/pandas-1.4.0.tar.gz#sha256=cdd76254c7f0a1583bd4e4781fb450d0ebf392e10d3f12e92c95575942e37df5
Best match: pandas 1.4.0
Processing pandas-1.4.0.tar.gz
Writing /tmp/easy_install-s5qiofc6/pandas-1.4.0/setup.cfg
Running pandas-1.4.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-s5qiofc6/pandas-1.4.0/egg-dist-tmp-mwnrd8fq
Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/datalad-neuroimaging/envs/latest/lib/python3.7/site-packages/setuptools/sandbox.py", line 156, in save_modules
    yield saved
  File "/home/docs/checkouts/readthedocs.org/user_builds/datalad-neuroimaging/envs/latest/lib/python3.7/site-packages/setuptools/sandbox.py", line 198, in setup_context
    yield
  File "/home/docs/checkouts/readthedocs.org/user_builds/datalad-neuroimaging/envs/latest/lib/python3.7/site-packages/setuptools/sandbox.py", line 259, in run_setup
    _execfile(setup_script, ns)
  File "/home/docs/checkouts/readthedocs.org/user_builds/datalad-neuroimaging/envs/latest/lib/python3.7/site-packages/setuptools/sandbox.py", line 46, in _execfile
    exec(code, globals, locals)
  File "/tmp/easy_install-s5qiofc6/pandas-1.4.0/setup.py", line 18, in <module>
    """Find files under subdir having specified extensions
ModuleNotFoundError: No module named 'numpy'

I'm not sure why this is happening. Should numpy be an explicit dependency?
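
The log shows easy_install building pandas from an sdist whose setup.py imports numpy before any dependency is in place. One possible mitigation (an assumption, not the project's actual fix) is to pre-install the affected packages as wheels in the docs environment, e.g. via a Read the Docs requirements file:

# docs/requirements.txt (hypothetical), referenced from the RTD configuration,
# so numpy and pandas are installed before the package's own install step runs
numpy
pandas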

ds000174 participants is problematic for pybids

Just filing this now so I can check it in detail later.

$> DATALAD_EXC_STR_TBLIMIT=10 datalad --dbg aggregate-metadata --force-extraction --incremental $PWD
[INFO   ] Aggregate metadata for dataset /mnt/btrfs/datasets/datalad/crawl/openfmri/ds000174
Metadata extraction:  60%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                                                                                   | 3.00/5.00 [00:01<00:01, 1.63 extractors/s]/usr/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Metadata extraction: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.00/5.00 [00:03<00:00, 1.43 extractors/s]/home/yoh/proj/datalad/datalad-neuroimaging/venvs/dev/local/lib/python2.7/site-packages/grabbit/core.py:448: UserWarning: Domain with name 'bids' already exists; returning existing Domain configuration.
  warnings.warn(msg)
[WARNING] Failed to load participants info due to: Can only use .str accessor with string values, which use np.object_ dtype in pandas [bids.py:_get_cnmeta:123,bids.py:yield_participant_info:197,bids_layout.py:get_collections:317,io.py:load_variables:76,io.py:_load_tsv_variables:364,generic.py:__getattr__:4372,accessor.py:__get__:133,strings.py:__init__:1895,strings.py:_validate:1917]. Skipping the rest of file
[INFO   ] Update aggregate metadata in dataset at: /mnt/btrfs/datasets/datalad/crawl/openfmri/ds000174
aggregate_metadata(ok): /mnt/btrfs/datasets/datalad/crawl/openfmri/ds000174 (dataset)
[INFO   ] Attempting to save 6 files/datasets
action summary:
  aggregate_metadata (ok: 1)
  save (notneeded: 1)
DATALAD_EXC_STR_TBLIMIT=10 datalad --dbg aggregate-metadata --force-extractio  5.35s user 3.58s system 137% cpu 6.496 total
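
The warning points at pandas' .str accessor being used on a non-string column. A minimal reproduction of that failure mode (an assumption about the underlying cause: a participants.tsv column that pandas reads as numeric):

import pandas as pd

df = pd.DataFrame({'participant_id': [1, 2, 3]})  # parsed as int64, not str
try:
    df['participant_id'].str.startswith('sub-')
except AttributeError as exc:
    print(exc)  # Can only use .str accessor with string values ...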

Upgrade `bids.py` metadata extractor

It would be useful if the bids extractor (and, on some points, eventually all other extractors in this extension) could:

  • be compatible with the new generation of metadata handling, i.e.:
    • inherit metadata classes from datalad-metalad (not datalad.metadata.*)
    • make use of the distinction between dataset-level and file-level metadata extraction
  • extract everything there is to extract from a DataLad dataset (i.e. not from annexed data) before datalad get-ing any file content that might be necessary for further extraction (currently the extraction starts by getting all required file content)
  • be compatible with, and use updated functionality of, the latest stable version of pybids

I've made a start at this. I'm working on this within the context of the catalog: likely many of our future users will be working with BIDS data and would want to extract BIDS metadata and have it rendered in the catalog. So I have an idea of the BIDS-related metadata that would be useful in the catalog, but I'm keen to get input from other @datalad/developers if there are features that you think will be useful to include.
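
A rough skeleton of how the first two points could translate into code (class and method names as defined in datalad_metalad.extractors.base; everything else is an illustrative assumption, not the finished extractor):

import uuid

from datalad_metalad.extractors.base import (
    DataOutputCategory,
    DatasetMetadataExtractor,
    ExtractorResult,
)


class BIDSDatasetExtractor(DatasetMetadataExtractor):
    """Dataset-level BIDS extractor in the metalad style (sketch)."""

    def get_id(self) -> uuid.UUID:
        # stable identifier for this extractor (value is illustrative)
        return uuid.UUID('20000000-0000-0000-0000-000000000001')

    def get_version(self) -> str:
        return '0.0.1'

    def get_data_output_category(self) -> DataOutputCategory:
        return DataOutputCategory.IMMEDIATE

    def get_required_content(self) -> bool:
        # fetch only what dataset-level extraction actually needs,
        # instead of getting all file content up front
        self.dataset.get('dataset_description.json')
        return True

    def extract(self, _=None) -> ExtractorResult:
        metadata = {}  # ... run pybids here and assemble the record ...
        return ExtractorResult(
            extractor_version=self.get_version(),
            extraction_parameter=self.parameter or {},
            extraction_success=True,
            datalad_result_dict={'type': 'dataset', 'status': 'ok'},
            immediate_data=metadata,
        )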
