dandi / dandi-cli
DANDI command line client to facilitate common operations
Home Page: https://dandi.readthedocs.io/
License: Apache License 2.0
$> dandi download https://gui.dandiarchive.org/#/file-browser/folder/5e9f9588b5c9745bad9f58fe
2020-04-22 13:52:43,982 [ INFO] Downloading folder with id 5e9f9588b5c9745bad9f58fe from https://girder.dandiarchive.org/
2020-04-22 13:52:44,496 [ INFO] Traversing remote folders (sub-mouse-AAYYT) recursively and downloading them locally
sub-mouse-AAYYT_ses-20180420-sample-4_slice-20180420-slice-4_cell-20180420-sample-4.nwb: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 9.50M/9.50M [00:03<00:00, 2.91MB/s]
sub-mouse-AAYYT_ses-20180420-sample-2_slice-20180420-slice-2_cell-20180420-sample-2.nwb: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 9.34M/9.34M [00:03<00:00, 2.61MB/s]
Error: HTTP error 400: get https://girder.dandiarchive.org/api/v1/file/5e9f95e7b5c9745bad9f592e/download
Response text: {"message": "Unable to connect to S3 assetstore", "type": "validation"}
(dev3) 1 21415 ->1.....................................:Wed 22 Apr 2020 01:52:50 PM EDT:.
lena:/tmp
$> dandi download https://gui.dandiarchive.org/#/file-browser/folder/5e9f9588b5c9745bad9f58fe
2020-04-22 13:53:03,607 [ INFO] Downloading folder with id 5e9f9588b5c9745bad9f58fe from https://girder.dandiarchive.org/
2020-04-22 13:53:03,930 [ INFO] Traversing remote folders (sub-mouse-AAYYT) recursively and downloading them locally
2020-04-22 13:53:04,955 [ INFO] './sub-mouse-AAYYT/sub-mouse-AAYYT_ses-20180420-sample-2_slice-20180420-slice-2_cell-20180420-sample-2.nwb' - same time and size, skipping
2020-04-22 13:53:04,955 [ INFO] './sub-mouse-AAYYT/sub-mouse-AAYYT_ses-20180420-sample-4_slice-20180420-slice-4_cell-20180420-sample-4.nwb' - same time and size, skipping
sub-mouse-AAYYT_ses-20180420-sample-3_slice-20180420-slice-3_cell-20180420-sample-3.nwb: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 9.31M/9.31M [00:02<00:00, 3.65MB/s]
ATM `ls` (see #8) is taking a while even though we are accessing only basic metadata fields. We should see how to speed things up:
$> dandi ls /home/yoh/proj/dandi/nwb-datasets/najafi-2018-nwb/data/FN_dataSharing/nwb/mouse1_fni16_15081{7,8}*nwb
PATH SIZE EXPERIMENTER SESSION_ID SESSION_START_TIME KEYWORDS
...ni16_150817_001_ch2-PnevPanResults-170808-190057.nwb 3... Farzaneh ... 1708081... 2015-08-16/20:0...
...ni16_150818_001_ch2-PnevPanResults-170808-180842.nwb 2... Farzaneh ... 1708081... 2015-08-17/20:0...
Summary: 5... 2015-08-16/20:0...
2015-08-17/20:0...
dandi ls 18.04s user 0.56s system 79% cpu 23.369 total
so it took 23 sec just to get those basic fields.
If those are still to be provided in the dandiset.yaml -- they better be validated.
Reflecting upon NeurodataWithoutBorders/pynwb#1077 -- PyNWB fails to open some (NWB 1.0? or just not NWB:N?) NWB files. So I wondered if we should collect a list of specification versions (Groups) and list them with `dandi ls` before even trying to feed that file to PyNWB?
@bendichter what do you think?
although `pip install dandi` correctly reports:
ERROR: Package u'dandi' requires a different Python: 2.7.17 not in '>=3.5.1'
@bendichter ran into the situation where pip install --upgrade dandi
failed with
bdichter@smaug:~$ pip install --upgrade dandi
Collecting dandi
Using cached https://files.pythonhosted.org/packages/b9/4d/f2e8243beeb0427e713f4342017768f79fe6917e9cf2dd107ce10d23456c/dandi-0.2.0.tar.gz
Installing build dependencies ... done
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-HoAKb4/dandi/setup.py", line 20, in <module>
"version": versioneer.get_version(),
File "versioneer.py", line 1498, in get_version
return get_versions()["version"]
File "versioneer.py", line 1452, in get_versions
keywords = get_keywords_f(versionfile_abs)
File "versioneer.py", line 985, in git_get_keywords
except (FileNotFoundError, KeyError):
NameError: global name 'FileNotFoundError' is not defined
The reason for which (pip of Python 2) was immediately obvious. I think we should just add a check to setup.py to kaboom with an informative message before versioneer is approached.
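A minimal sketch of such a guard, placed at the very top of setup.py before versioneer is touched (message text and version bound taken from the error above; exact wording is illustrative):

```python
import sys


def check_python(version_info=sys.version_info, minimum=(3, 5, 1)):
    """Abort with an informative message when run under a too-old Python."""
    if tuple(version_info[:3]) < minimum:
        sys.exit(
            "dandi requires Python >= %d.%d.%d, but you are running "
            "%d.%d.%d (pip may have picked a Python 2 interpreter)."
            % (minimum + tuple(version_info[:3]))
        )


# call before `import versioneer` in setup.py
check_python()
```

Because the check runs before any Python-3-only syntax or stdlib name is hit, Python 2 users would see the message instead of the `FileNotFoundError` NameError.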
See dandi/dandiarchive-legacy#54
We seem to not provide explicit settings to make uploads public.
register/upload process
update process questions:
`dandi download` currently does not edit keys in dandiset.yaml that are locally present. It should do so for any key that is not generated from the nwb section, i.e. treat the archive metadata for non-nwb keys as master.
To have the ability from the client to interact with available environments.
We could simplify access to environments available through the DANDI JupyterHub, etc.; but also possibly just start (list, stop) local instances of those.
ATM the implementations are separate, and the analysis for whether to download/upload a file happens right before the actual download/upload. That forbids implementing proper "sync" functionality, where analysis first needs to be done on whether any file needs to be removed (or maybe moved! ;-)) on either the local or server end. So an RF should be done to minimize the difference between the API and implementation of the two:
- `--sync`
- `--sync` with `--file-mode` not being "overwrite" or "force" (if some file was already newer and we didn't perform the transfer)

For download would be needed:
If `organize` is (re)run on a full collection of files, some of which were modified, then by "injecting" new files and leaving previously organized ones we might end up breeding stale files. With something like `--disappeared` we could detect which paths were no longer considered and offer to remove them.
Alternatively, we could make the default mode "complete", which would imply removing files which were not "re-organized" (probably should ask first, since this could lead to loss of data), and only with an explicit `--mode=incremental` allow new files to be added in while all previous ones stick around.
On .nwb files: `nib-diff` (nibabel's diff, basic implementation) would show differences in metadata fields. With a dedicated option, it could also do summaries of data diffs. To diff local and remote files we would just fetch metadata records from dandiarchive.
On directories: the object ids we upload and/or later checksums (we do not upload those yet).
On dandisets: relates to the use case of #70, where it is desired to see which files were already `dandi organize`d or not. So users then could

dandi diff --find-renames /path/myorigdata /path/organized

or even

dandi diff --find-renames /path/myorigdata http://dandiarchive.org/dandiset/000XXX/draft
$ dandi upload -d 000001
2020-03-14 21:53:02,654 [ INFO] Found 136 files to consider
Error: Both paths must either be absolute or relative. Got '/net/vast-storage.ib.cluster/scratch/Mon/satra/000001/dandiset.yaml' and '000001'
A request by labs via @bendichter: to know which file was converted into which in the dandiset. Perhaps `organize` can upload this with the file/item metadata, and dandi cli can look it up from the server.
However, in general, perhaps we can add a simple provenance file:
<filename 1> wasDerivedFrom <old filename> .
<filename1> <sha512_or_some_such> <sha> .
This may be useful to create a manifest file later, or for checking against the Amazon store on download.
At present `organize` doesn't change checksums, so that notion is simply a flag for now.
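A sketch of emitting those two lines for one organized file, using sha512 for the `<sha512_or_some_such>` placeholder (the function name and exact line layout are illustrative):

```python
import hashlib
from pathlib import Path


def provenance_lines(new_path, old_path):
    """Emit the two triple-style provenance lines sketched above."""
    digest = hashlib.sha512(Path(new_path).read_bytes()).hexdigest()
    return [
        "<%s> wasDerivedFrom <%s> ." % (new_path, old_path),
        "<%s> sha512 %s ." % (new_path, digest),
    ]
```

Concatenating these lines across all organized files would already give the manifest usable for download-time checking.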
Very basic functionality.
There are a few issues around `subject_id`:
- `dandi validate` does not enforce that `subject_id` is present, but `dandi organize` requires it.
- When `dandi organize` runs on a file that is missing `subject_id`, it complains about missing metadata, but does not specify what this missing metadata is.
- It is easy to miss `dandi organize` complaining, so a user will just carry on and will be confused when no data is uploaded.

This is all exacerbated by the fact that ipfx, the main conversion software for ABF and DAT icephys files out of the Allen Institute, doesn't currently even have a way to add a `Subject` object to the nwb file in the first place (so you can't specify `subject_id`). I have made a pull request to be able to add this information here.
In the light of #21 it might be highly beneficial to compress files during upload.
@mgrauer - does girder support receiving a compressed payload?
@bendichter - do you have a quick way/code to assess if an hdf5 file used compression, so we could include that in the `ls` output and dynamically decide whether to compress the payload to girder?
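For the "quick way to assess if an hdf5 file used compression": with plain h5py one can inspect every dataset's compression filter. A sketch, assuming h5py is available:

```python
import h5py


def dataset_compression(path):
    """Map each dataset name in an HDF5 file to its compression filter
    (None means the dataset is stored uncompressed)."""
    result = {}

    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset):
            result[name] = obj.compression

    with h5py.File(path, "r") as f:
        f.visititems(visitor)
    return result
```

Any non-None value (e.g. "gzip") means the file used compression for that dataset; `all(v is None for v in result.values())` would flag a candidate for payload compression.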
Similar to `datalad wtf` but with details pertinent to dandi. Here is a datalad example
Question to @bendichter (doing it via github so there is a trace, or I will forget the answer and will ask again ;)): in `validate` I opened a file via `pynwb.NWBHDF5IO(path, 'r', load_namespaces=True)`. On some files this then causes hdmf to whine that
UserWarning: No cached namespaces found in testfile.nwb
so I wondered whether for validation I have to (or not) explicitly request loading of the namespaces?
/home/yoh/proj/dandi/trash > dandi validate testfile.nwb
/home/yoh/deb/gits/pkg-exppsy/hdmf/src/hdmf/backends/hdf5/h5tools.py:99: UserWarning: No cached namespaces found in testfile.nwb
warnings.warn(msg)
No validation errors among 1 files
/home/yoh/proj/dandi/trash > cat mksample_nwb.py
import os
import time
from datetime import datetime
from dateutil.tz import tzlocal, tzutc

tm1 = time.time()
from pynwb import NWBFile, TimeSeries
from pynwb import NWBHDF5IO
from pynwb.file import Subject, ElectrodeTable
from pynwb.epoch import TimeIntervals
t0 = time.time()
print("Took %.2f sec to import pynwb" % (t0 - tm1))

filename = "testfile.nwb"
nwbfile = NWBFile('session description',
                  'identifier',
                  datetime.now(tzlocal()),
                  file_create_date=datetime.now(tzlocal()),
                  lab='a Lab',
                  # Keywords cause that puke upon `repr` ValueError: Not a dataset (not a dataset)
                  keywords=('these', 'are', 'keywords')
                  )
t1 = time.time()
print("Took %.2f sec to create NWB instance" % (t1 - t0))

with NWBHDF5IO(filename, 'w') as io:
    io.write(nwbfile, cache_spec=False)
t2 = time.time()
print("Took %.2f sec to write NWB instance" % (t2 - t1))

with NWBHDF5IO(filename, 'r', load_namespaces=True) as reader:
    nwbfile = reader.read()
    t3 = time.time()
    print("Took %.2f sec to read NWB instance" % (t3 - t2))
    print(nwbfile)
    t4 = time.time()
    print("Took %.2f sec to print repr of NWB instance" % (t4 - t3))
print(filename)
We would like to be able to spit out metadata in alternative formats/renderings:
- `pyout` -- current
- `json`
- `json_pp` (pretty printed) (both json flavors might be not a single json record, but rather a json lines stream)
- `yaml` (maybe when/if there is interest/demand)
- `pandas` -- dataframe rendering (will require obtaining all records first)
- `auto` -- idea by @satra: "if multiple files, show columns; if single file, show key-value pairs in rows". So we could flip between `pyout` and `yaml` (or `json_pp`, which would give details) depending on the number of files.

nwb-schema says:
- name: identifier
dtype: text
doc: A unique text identifier for the file. For example, concatenated lab name,
file creation date/time and experimentalist, or a hash of these and/or other
values. The goal is that the string should be unique to all other files.
but I ran into:
$> dandi ls -F path,nwb_version,identifier /home/yoh/proj/dandi/nwb-datasets/bendichter/Kriegstein2020/1853103{4,5}.nwb
PATH NWB IDENTIFIER
/home/yoh/proj/dandi/nwb-datasets/bendichter/Kriegstein2020/18531034.nwb 2.0b 638a0dd87776dd9a06e03dd658b3c702d55096871d68728254d6d814703b7630
/home/yoh/proj/dandi/nwb-datasets/bendichter/Kriegstein2020/18531035.nwb 2.0b 638a0dd87776dd9a06e03dd658b3c702d55096871d68728254d6d814703b7630
attn @bendichter
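To satisfy the schema's uniqueness goal, the identifier could include a random component rather than only a hash of metadata that may coincide across files. A minimal sketch (the lab/session naming parts are illustrative):

```python
import uuid


def make_identifier(lab="a Lab", session="session-001"):
    """Build a per-file identifier; the uuid4 suffix guarantees that even
    files with identical lab/session metadata get distinct identifiers."""
    return "%s-%s-%s" % (lab.replace(" ", "_"), session, uuid.uuid4())
```

Two calls with identical metadata still produce distinct identifiers, which is exactly what the hash-of-metadata scheme above failed to provide.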
In general it would be useful to upload/download(/sync) content from the hard drive to the DANDI archive by pointing to entire directories or specific files. E.g. while being within some specific study (already known to DANDI) I could simply `dandi upload FILES*` without worrying about mapping them "properly" to some collection/folder on the DANDI (girder) server.
So we need to decide on organization of datasets on Girder and locally.
There are a few options. It is not clear to me yet whether we should define a "Collection" per each dataset or just rely on some flat or non-flat hierarchy within a specific (e.g. per center/study) collection to upload datasets.
Not unlike git (with `.git`), we could define/rely on a `.dandi/` folder to define the dataset boundary. Such a folder could contain the necessary configuration/credentials etc. to interoperate with the DANDI archive while the user is within that directory or under any of its subdirectories.
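The boundary discovery could work like git's repository-root search: walk up from a path until a directory containing the marker is found. A sketch under that assumption (the `.dandi` name is the proposal above, not an implemented convention):

```python
from pathlib import Path


def find_dandiset_root(start):
    """Walk from `start` up to the filesystem root; return the first
    directory containing a `.dandi/` folder, or None if there is none."""
    start = Path(start).resolve()
    for candidate in [start] + list(start.parents):
        if (candidate / ".dandi").is_dir():
            return candidate
    return None
```

Commands like `dandi upload FILES*` could then call this to decide which dandiset (and which credentials) the current working directory belongs to.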
User reported error, I cannot reproduce:
dandi download https://gui.dandiarchive.org/#/file-browser/folder/5e72840a3da50caa9adb0489
...
2020-04-14 15:19:52,862 [ INFO] Updating dandiset.yaml from obtained dandiset metadat
Error: 'generator' object is not subscriptable
Which is strange because this command works for me, as well as:
dandi download https://dandiarchive.org/dandiset/000009/draft
The user is on a Windows machine. I'll update as I gather more info from the user.
Mac users with an external drive are likely to have hidden "resource fork" files that have the prefix `._`. These files are invisible and can normally be safely ignored, but they break the `organize` command. These files are difficult for mac users to identify, as they are hidden in Finder, and only findable with `ls -a` on the command line. I think it would be better to ignore them in `organize`. If not, then maybe we could throw an error right away, instead of waiting for all of the metadata to be gathered (for 18 minutes in this case) before throwing the error.
(base) Bens-MacBook-Pro-2:dandi_staging bendichter$ dandi organize -d 000008 /Volumes/easystore5T/data/Tolias/nwb -f symlink
2020-03-17 14:43:47,494 [ INFO] Loading metadata from 1319 files
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 9.1s
[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 10.4s
...
[Parallel(n_jobs=-1)]: Done 1285 tasks | elapsed: 18.1min
[Parallel(n_jobs=-1)]: Done 1319 out of 1319 | elapsed: 18.6min finished
2020-03-17 15:02:25,252 [ WARNING] Completely empty record for /Volumes/easystore5T/data/Tolias/nwb/._20171204_sample_2.nwb
Error: 1 out of 1319 files were found not containing all necessary metadata: /Volumes/easystore5T/data/Tolias/nwb/._20171204_sample_2.nwb
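The proposed ignoring could be a simple filename filter applied before metadata extraction, so `organize` never stalls on AppleDouble files; a sketch (adding `.DS_Store` to the ignore set is my own assumption):

```python
from pathlib import Path


def filter_resource_forks(paths):
    """Drop macOS AppleDouble '._*' companion files (and .DS_Store)
    before any expensive per-file metadata loading."""
    ignored_names = {".DS_Store"}
    return [
        p for p in paths
        if not Path(p).name.startswith("._") and Path(p).name not in ignored_names
    ]
```

Applied to the 1319 paths above, this would have silently dropped `._20171204_sample_2.nwb` instead of failing after 18 minutes.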
is there an option to change the api key once stored in the keyring?
This dataset created 87 folders, but there are only 59 subjects: https://dandiarchive.org/dandiset/000004/draft
Very likely the NOID* is an appendix to the ID, but it would have been impossible for the cli to know.
Perhaps we can provide an option to clean up IDs during conversion.
ATM disabled since it was causing troubles: https://github.com/dandi/dandi-cli/blob/master/dandi/girder.py#L61. Yet to provide more information, and not assigning this to anyone, so it is available for grabbers -- just enable that code, and try to upload to any girder instance using `dandi upload`.
As a use case with openscope shows, it might be impossible for researchers to have all files in a single local dandiset to populate fields such as `number_subjects` etc. Options:
- do it on the archive side in those cases, adjusting dandiset.yaml (or the dandiset metadata if no physical file)
- extract/include in dandiset.yaml not only `number_` but also the actual values for the corresponding items; then `number_` could always be computed. Discussion of context remains relevant though, since it could affect counts (e.g. if cell id context is within tissue id)
- disallow partial uploads
- more approaches?
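The second option above (carrying the actual values so `number_*` fields can always be recomputed) can be sketched as follows; the record shape and keys are hypothetical, not the dandiset.yaml schema:

```python
def count_unique(records, key):
    """Recompute a number_* field from per-file metadata records."""
    return len({r[key] for r in records if r.get(key) is not None})


records = [  # illustrative per-file metadata
    {"subject_id": "mouse-01", "cell_id": "c1"},
    {"subject_id": "mouse-01", "cell_id": "c2"},
    {"subject_id": "mouse-02", "cell_id": "c1"},
]
number_of_subjects = count_unique(records, "subject_id")
```

Note that `cell_id` "c1" appears under two subjects, illustrating the context caveat: counting cell ids globally gives 2, while counting them within each subject would give 3.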
I was alerted to the issue by @satra, reproduced:
$> dandi register -n "Anticipatory Activity in Mouse Motor Cortex" -D "Activity in the mouse anterior lateral motor cortex (ALM) instructs directional movements, often seconds before movement initiation. It is unknown whether this preparatory activity is localized to ALM or widely distributed within motor cortex. Here we imaged activity across motor cortex while mice performed a whisker-based object localization task with a delayed, directional licking response. During tactile sensation and the delay epoch, object location was represented in motor cortex areas that are medial and posterior relative to ALM, including vibrissal motor cortex."
Error: HTTP error 401: POST https://girder.dandiarchive.org/api/v1/dandi?name=Anticipatory+Activity+in+Mouse+Motor+Cortex&description=Activity+in+the+mouse+anterior+lateral+motor+cortex+%28ALM%29+instructs+directional+movements%2C+often+seconds+before+movement+initiation.+It+is+unknown+whether+this+preparatory+activity+is+localized+to+ALM+or+widely+distributed+within+motor+cortex.+Here+we+imaged+activity+across+motor+cortex+while+mice+performed+a+whisker-based+object+localization+task+with+a+delayed%2C+directional+licking+response.+During+tactile+sensation+and+the+delay+epoch%2C+object+location+was+represented+in+motor+cortex+areas+that+are+medial+and+posterior+relative+to+ALM%2C+including+vibrissal+motor+cortex.
Response text: {"message": "You must be logged in.", "type": "access"}
but it works on my local deployment, which might be a bit of an older version of the dandiarchive.
$> DANDI_DEVEL=1 dandi register -i local-docker -n "Anticipatory Activity in Mouse Motor Cortex" -D "Activity in the mouse anterior lateral motor cortex (ALM) instructs directional movements, often seconds before movement initiation. It is unknown whether this preparatory activity is localized to ALM or widely distributed within motor cortex. Here we imaged activity across motor cortex while mice performed a whisker-based object localization task with a delayed, directional licking response. During tactile sensation and the delay epoch, object location was represented in motor cortex areas that are medial and posterior relative to ALM, including vibrissal motor cortex."
2020-03-14 21:17:58,505 [ INFO] Registered dandiset at None/dandiset/000023/draft. Please visit and adjust metadata.
2020-03-14 21:17:58,505 [ INFO] No dandiset path was provided and no dandiset detected in the path. Here is a record for dandiset.yaml
# DO NOT EDIT this file manually.
# It can be edied online and obtained from the dandiarchive.
# It also gets updated using dandi organize
description: Activity in the mouse anterior lateral motor cortex (ALM) instructs directional
movements, often seconds before movement initiation. It is unknown whether this
preparatory activity is localized to ALM or widely distributed within motor cortex.
Here we imaged activity across motor cortex while mice performed a whisker-based
object localization task with a delayed, directional licking response. During tactile
sensation and the delay epoch, object location was represented in motor cortex areas
that are medial and posterior relative to ALM, including vibrissal motor cortex.
identifier: '000023'
name: Anticipatory Activity in Mouse Motor Cortex
Ideally this should be kept within the .nwb itself (extension?) -- @bendichter could correct me if I am wrong, and I should just check later, but upon each save `pynwb` should regenerate a UUID within a file. We could then establish a trail of UUIDs for a file, not unlike git history, where the parent commit which contained the previous version of the file is known. This could provide a reliable mechanism for "fast-forward only" updates to/from dandiarchive. Any possible conflicts (divergence in the history of the trail, thus requiring a merge) would be beyond the scope here, although we could provide the tool to review the differences (if any) and accept one version or another, or establish a "merged" one.
edit 1: note that this is very .nwb specific, and wouldn't provide a remedy for any other type of file we might need to allow to be uploaded to the archive. A general solution IMHO would be to use a VCS-based platform.
basic validation of .nwb files via pynwb
this line https://github.com/dandi/dandi-cli/blob/master/dandi/__init__.py#L44 returns False
when dandi cli is called in the terminal.
{'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <_frozen_importlib_external.SourceFileLoader object at 0x107f7b250>,
'__spec__': None, '__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>,
'__file__': '/Users/satra/software/miniconda3/envs/dandi/bin/dandi', '__cached__': None,
're': <module 're' from '/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/re.py'>,
'sys': <module 'sys' (built-in)>}
so possibly remove the if statement and call it every time?
The main things are that publications should be a list of objects, and that it is now being edited online.
see NeurodataWithoutBorders/pynwb#1091
It might take them a bit to resolve it properly, since it would (as far as I see) require quite fundamental changes. Meanwhile we could simply do ad-hoc filtering for those few specific ones, based on the nwb_version of the file, and later (when pynwb fixes it) disable that filtering.
which tries to use `-d` to specify the dandiset to operate on:
dandi register
dandi download dataset-permalink
dandi organize -d folder_download-typically-datasetid source_folder -f some_mode
dandi upload -d folder_download-typically-datasetid
E.g., as reported on slack, there was a bunch of
TimeSeries/data (processing/spikes/Sweep_27): argument missing
validation errors. Thanks to the feedback from @rly on nwb slack channel:
from the path, i guess that the user wishes to store spike times in a TimeSeries. the most straightforward way to do that right now is to put the spike times in the timestamps array, but the data array is still required. here, until we implement a proper Events type (see NeurodataWithoutBorders/nwb-schema#301), i recommend the user create a dummy array data consisting of ones for every value in the timestamps array.
Which gave me an idea that we might want to post-process some of the pynwb validation errors and provide guesses on what could have led to them and how to mitigate them.
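Such post-processing could be a table of known error patterns with attached hints; a sketch (the pattern table, function names, and hint text are illustrative, with the hint paraphrasing @rly's advice above):

```python
import re

# hypothetical pattern -> hint table; extend with other known pynwb errors
HINTS = [
    (re.compile(r"TimeSeries/data .*: argument missing"),
     "If you only have spike times, put them in `timestamps` and provide a "
     "dummy `data` array of ones until a proper Events type exists "
     "(see NeurodataWithoutBorders/nwb-schema#301)."),
]


def annotate(error):
    """Return the validation error, with a mitigation hint appended
    when the error matches a known pattern."""
    for pattern, hint in HINTS:
        if pattern.search(error):
            return "%s\n  hint: %s" % (error, hint)
    return error
```

Errors without a matching pattern pass through unchanged, so the annotation layer is safe to apply to every validator message.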
Came up during the call: having a command to assist with re-layouting a dataset, since NWB doesn't enforce any file system organization/structure. It would be very close to what `git annex view` does.
Examples of layouts found in the "wild" (thanks to @bendichter for summarizing; I hope it is ok to post here, slap me if I am wrong):
data is stored across separate servers for PHI/HIPAA reasons, and the entire lab shares data, so they do not separate by experimenter:
raw/
EC61/ (subject)
EC61_B1/ (session)
data
processed/
EC61/
EC61_B1/
imaging/
EC61/
.. is probably the most common among neurophysiology labs that have a central data storage architecture at all. Here, subjects belong to specific experimenters who have unique subject naming conventions, so their first level is “experimenter”. Most of their data is shared publicly and you can navigate their file structure here (this is the data referenced by Peter’s Database).
SenzaiY/ (experimenter)
YutaMouse-41/ (subject)
YutaMouse-41-150819/ (session)
eeg.dat
spikes.dat
...
I like that the sessions are named by subject and date, because they are easy to manage by eye.
Analogously to BIDS, we could use subject > session. I think we should allow for a bit of extra metadata around NWB files, so our structure could be something like:
sub-01/
sub-01_20190505/
data
dashboard_config.py
sub-01_20190506/
sub-02/
sub-03/
sub-04/
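The subject > session proposal above can be sketched as a template-driven re-layout, in the spirit of `git annex view`; the metadata keys and template syntax here are illustrative, not an existing dandi API:

```python
def layout_path(metadata, template="sub-{subject_id}/sub-{subject_id}_{session_id}/"):
    """Derive a target directory for a file from its metadata record.
    Different labs' conventions become different templates, e.g.
    '{experimenter}/{subject_id}/{subject_id}-{session_id}/'."""
    return template.format(**metadata)
```

A re-layout command would then just extract each file's metadata, compute `layout_path`, and move (or symlink) the file there.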
$> dandi download https://gui.dandiarchive.org/\#/folder/5e6d855776569eb93f451e50
2020-03-16 23:49:38,770 [ INFO] Downloading folder with id 5e6d855776569eb93f451e50 from https://girder.dandiarchive.org/
2020-03-16 23:49:38,885 [ INFO] Traversing remote dandisets (000002) recursively and downloading them locally
2020-03-16 23:49:38,885 [ INFO] Updating fdandiset.yaml from obtained dandiset metadata
(dev3) 3 10975.....................................:Mon 16 Mar 2020 11:49:39 PM EDT:.
smaug:/mnt/datasets/dandi
$> cat 000002/dandiset.yaml
# DO NOT EDIT this file manually.
# It can be edied online and obtained from the dandiarchive.
# It also gets updated using dandi organize
{description: 'Activity in the mouse anterior lateral motor cortex (ALM) instructs
directional movements, often seconds before movement initiation. It is unknown
whether this preparatory activity is localized to ALM or widely distributed within
motor cortex. Here we imaged activity across motor cortex while mice performed
a whisker-based object localization task with a delayed, directional licking response.
During tactile sensation and the delay epoch, object location was represented
in motor cortex areas that are medial and posterior relative to ALM, including
vibrissal motor cortex.', identifier: '000002', name: Anticipatory Activity in
Mouse Motor Cortex}
$> dandi --version
0.4.1+2.g21b8505
but on my laptop it is all good... So likely it is the version of yaml and/or the parameters passed to it. We need `dandi wtf` (hence new #57).
on bad host:
smaug:/mnt/datasets/dandi
$> welp yaml
PATH : /usr/lib/python3/dist-packages/yaml/__init__.py
SRC PATH : /usr/lib/python3/dist-packages/yaml/__init__.py
VERSION : Not found
__version__: '3.13'
PACKAGE : python3-yaml
ii python3-yaml 3.13-2 amd64 YAML parser and emitter for Python3
on good:
$> welp yaml
PATH : /usr/lib/python3/dist-packages/yaml/__init__.py
SRC PATH : /usr/lib/python3/dist-packages/yaml/__init__.py
VERSION : Not found
__version__: '5.3'
PACKAGE : python3-yaml
ii python3-yaml 5.3-1 amd64 YAML parser and emitter for Python3
python3-yaml:
Installed: 5.3-1
Candidate: 5.3-2
Version table:
5.3-2 900
900 http://deb.debian.org/debian bullseye/main amd64 Packages
600 http://http.debian.net/debian sid/main amd64 Packages
*** 5.3-1 100
100 /var/lib/dpkg/status
I feel that I had such an issue somewhere, but now it is too late to try to remember how to overcome it.
Per our zoom chat with Tom: he recommended letting users specify the target numbers of subjects etc. they expect in the dandiset. So it might be worth adding an option to `organize`; for `validate` there is a separate issue, #90.
Turn dandi cli into pydandi/dandipy/...
Related to this, I think if we turned dandi into a library, one should be able to do:
import dandi as di
di.get_dataset('ds00001', dataset_base_dir='/data/', mmap=True)
and this call could use whatever necessary (datalad, dandi api, etc.) under the hood. In some ways this is similar to what reactopya does, but it does not provide direct access to the object. It would be nice if we had a zarr-based mode (https://zarr.readthedocs.io/en/stable/).
reporting the failure to install on our test aws ec2 box:
yoh@ip-172-31-33-190:~$ git clone https://github.com/dandi/dandi-cli && cd dandi-cli && virtualenv --system-site-packages --python=python3 venvs/dev3 && source venvs/dev3/bin/activate && pip install -e .
Cloning into 'dandi-cli'...
remote: Enumerating objects: 113, done.
remote: Counting objects: 100% (113/113), done.
remote: Compressing objects: 100% (75/75), done.
remote: Total 113 (delta 56), reused 84 (delta 34), pack-reused 0
Receiving objects: 100% (113/113), 40.93 KiB | 10.23 MiB/s, done.
Resolving deltas: 100% (56/56), done.
Already using interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in /home/yoh/dandi-cli/venvs/dev3/bin/python3
Also creating executable in /home/yoh/dandi-cli/venvs/dev3/bin/python
Installing setuptools, pkg_resources, pip, wheel...done.
Obtaining file:///home/yoh/dandi-cli
Installing build dependencies ... done
Getting requirements to build wheel ... error
ERROR: Command errored out with exit status 1:
command: /home/yoh/dandi-cli/venvs/dev3/bin/python3 /home/yoh/dandi-cli/venvs/dev3/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmprj7_0raa
cwd: /home/yoh/dandi-cli
Complete output (10 lines):
Traceback (most recent call last):
File "/home/yoh/dandi-cli/venvs/dev3/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py", line 207, in <module>
main()
File "/home/yoh/dandi-cli/venvs/dev3/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py", line 197, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/home/yoh/dandi-cli/venvs/dev3/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py", line 48, in get_requires_for_build_wheel
backend = _build_backend()
File "/home/yoh/dandi-cli/venvs/dev3/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py", line 39, in _build_backend
obj = getattr(obj, path_part)
AttributeError: module 'setuptools.build_meta' has no attribute '__legacy__'
----------------------------------------
ERROR: Command errored out with exit status 1: /home/yoh/dandi-cli/venvs/dev3/bin/python3 /home/yoh/dandi-cli/venvs/dev3/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmprj7_0raa Check the logs for full command output.
(dev3) yoh@ip-172-31-33-190:~/dandi-cli$ apt-cache policy python3-setuptools
python3-setuptools:
Installed: 39.0.1-2
Candidate: 39.0.1-2
Version table:
*** 39.0.1-2 500
500 http://us-east-2.ec2.archive.ubuntu.com/ubuntu bionic/main amd64 Packages
100 /var/lib/dpkg/status
#26 introduced the `upload` command. We should enable some kind of unit testing for it.
It seems that github actions support docker: https://help.github.com/en/articles/creating-a-docker-container-action and even I found an issue in a now removed "docker" action repository issues: https://webcache.googleusercontent.com/search?q=cache:mPPb1xbgUgEJ:https://github.com/actions/docker/issues/11+&cd=2&hl=en&ct=clnk&gl=us which suggests that docker compose is also available.
But we would first need to establish some user creds and obtain the key programmatically. @mgrauer - can you help with that?
We mandate subject, session, and then an optional list of others... @satra has mentioned a use case by @bendichter where people prefer to avoid using session altogether. I am yet to argue one way (against) or another (support), but here is at least an issue to document this desire/use case ;)
(git-annex)lena:~/proj/dandi/nwb-datasets[master]bendichter/Gao2018
$> dandi validate *nwb
anm00314746_2015-10-20 09:36:04 (1).nwb: ok
anm00314746_2015-10-21 11:25:41 (1).nwb: ok
anm00314746_2015-10-22 15:17:38 (1).nwb: ok
anm00314756_2015-10-20 19:42:11 (1).nwb: ok
anm00314756_2015-10-23 14:10:29 (1).nwb: ok
anm00314757_2015-10-20 17:37:31 (1).nwb: ok
anm00314757_2015-10-21 18:02:48 (1).nwb: ok
anm00314758_2015-10-20 10:49:30.nwb: ok
anm00314758_2015-10-21 10:10:14 (1).nwb: ok
anm00314758_2015-10-22 11:20:47 (1).nwb: ok
anm00314758_2015-10-23 09:49:01 (1).nwb: ok
anm00314760_2015-10-20 15:52:30.nwb: ok
anm00314760_2015-10-21 16:44:27.nwb: ok
anm00314760_2015-10-22 16:39:13 (1).nwb: ok
BAYLORCD12_2018-01-25 19:16:01.nwb: ok
BAYLORCD12_2018-01-26 12:25:06.nwb: ok
Error: Could not construct DynamicTableRegion object due to The index 63 is out of range for this DynamicTable of length 63
$> dandi --version
0.4.4+7.g2f45a27.dirty
The dandi cli should support DANDI identifiers. Since we know what these map to, we should be able to route them directly without going through identifiers.org. So the following should all be feasible:
dandi download DANDI:000008
dandi download https://identifiers.org/DANDI:000008
dandi download https://dandiarchive.org/dandiset/000008
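The routing could be a small normalizer that maps any of the three spellings to the canonical dandiset URL; the regex and resulting URL shape below are my assumptions about the scheme, not implemented behavior:

```python
import re


def resolve_dandi_url(spec):
    """Normalize DANDI:NNNNNN, identifiers.org, or dandiarchive.org
    spellings to the canonical dandiset URL."""
    m = re.fullmatch(
        r"(?:https://identifiers\.org/)?DANDI:(\d{6})"
        r"|https://dandiarchive\.org/dandiset/(\d{6})(?:/draft)?",
        spec,
    )
    if not m:
        raise ValueError("not a recognized DANDI identifier: %r" % spec)
    dandiset_id = m.group(1) or m.group(2)
    return "https://dandiarchive.org/dandiset/%s" % dandiset_id
```

`dandi download` would then only ever see the normalized form, regardless of which spelling the user pasted.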
I have noted that network traffic while rcloning Svoboda's data is only about 10% of the local "write" IO.
That observation is confirmed by simply compressing the obtained .nwb files using tar/gz:
smaug:/mnt/btrfs/datasets/datalad/crawl-misc/svoboda-rclone/Exported NWB 2.0
$> du -scm Chen\ 2017*
35113 Chen 2017
3298 Chen 2017.tgz
38410 total
so indeed -- a x10 factor!
Apparently hdmf/pynwb does not bother compressing the data arrays stored in the .nwb. They do both document the ability to pass compression parameters down (to h5py I guess), but as far as I saw, compression is not on by default. Sure, the hdf5-level compression ratio might not reach 10 since not all data will be compressed, but I expect that it will be notable.
As we keep running into this, it might be valuable to provide a dandi compress
command which would take care of (re)compressing the given .nwb files (in place or into a new file).
Prospective interface:
dandi compress [-i|--inplace] [-o|--output FILE] [-c|--compression METHOD (default gzip)] [-l|--level LEVEL (default 5)] [FILES]
--inplace
  explicitly (re)compress each file in place (it might be better not to do it truly "in place", but rather to write into a new file and then replace the old one -- this would give a better workflow for git-annex'ed files, where the originals are read-only by default)
--output FILE
  where to store the output file (then only a single FILE is expected to be provided)

Moving a symlink with a relative target should account for the directory change.
TODO: check what happens with copy -- does it copy the symlink or dereference it?
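A minimal sketch of what such a `dandi compress` could do: copy every group and dataset of an HDF5/.nwb file into a new file with a gzip filter applied, preserving attributes. Purely illustrative, not the dandi-cli implementation -- a real version would also have to handle scalar datasets (which cannot be chunked/compressed), object references, and the atomic replacement of the original for --inplace.

```python
import h5py

def recompress(src: str, dst: str, method: str = "gzip", level: int = 5) -> None:
    """Copy src into dst, applying a compression filter to every dataset (sketch)."""
    with h5py.File(src, "r") as fin, h5py.File(dst, "w") as fout:
        def copy_item(name, obj):
            if isinstance(obj, h5py.Group):
                out = fout.require_group(name)
            else:  # h5py.Dataset; assumes non-scalar, non-reference data
                out = fout.create_dataset(
                    name, data=obj[()], compression=method, compression_opts=level
                )
            for key, value in obj.attrs.items():
                out.attrs[key] = value
        fin.visititems(copy_item)
        for key, value in fin.attrs.items():
            fout.attrs[key] = value
```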
Now that we can get download links from the GUI, this should be possible, but it results in an error:
$ dandi download https://girder.dandiarchive.org/api/v1/item/5e7b9e41529c28f35128c743/download
Error:
More details:
$ dandi -l DEBUG --pdb download -o . https://girder.dandiarchive.org/api/v1/item/5e7b9e41529c28f35128c743/download
Traceback (most recent call last):
File "/Users/satra/software/miniconda3/envs/dandi/bin/dandi", line 8, in <module>
sys.exit(main())
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/dandi/cli/command.py", line 118, in wrapper
return f(*args, **kwargs)
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/dandi/cli/cmd_download.py", line 53, in download
return download(url, output_dir, existing=existing, develop_debug=develop_debug)
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/dandi/download.py", line 139, in download
girder_server_url, asset_type, asset_id = parse_dandi_url(url)
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/dandi/download.py", line 75, in parse_dandi_url
assert not u.query
AssertionError
> /Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/dandi/download.py(75)parse_dandi_url()
-> assert not u.query
(Pdb) url
'https://dandiarchive.s3.amazonaws.com/girder-assetstore/3d/2e/3d2e88d88a974644a6722bdd5790a27b?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA3GIMZPVVBOFDICEV%2F20200430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20200430T152456Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=FwoGZXIvYXdzEPH%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDJC9loqF4xaxKesvsSK%2FAauvxzItfPkUwulK2K4nVW%2F2xmiIDfonE5UF6fB3KjGlIASW0MTXj8IC6GCFx6kD90HjOIjirro7WPFfYM%2FhhztHjwvDC3bBBH76mzMup1sr3U8wrHkw3S5wIXjIj%2B244Us0maaDVsefgO%2B8g1hfYf7SUDQiLvCe%2BdseyTn4DqAX5NS9TVEG2baaoN2u1FFkX7%2Biy1C1xq3ZAKtq%2FYQEXKZ54UlzoPrH%2BnXzwl1ex%2FKyKtJmRWuoC9CF1sk0dpHUKMvaq%2FUFMi2UKOyzApfKMsXDLIETFNj4hOcunLbVCHYvEuqtvWmAWUCkkxGHyIv%2Ft6JvLCA%3D&X-Amz-Signature=d94191f2886c8b192c4bb4fb44f9dc35cf9b4aa296914baaa64e400f67096389'
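The failure is the bare `assert not u.query` in `parse_dandi_url`: the GUI link redirects to a pre-signed S3 URL, which legitimately carries query parameters. A more forgiving check might classify the URL instead of asserting the query is empty -- the helper below is hypothetical, not the actual fix:

```python
from urllib.parse import urlparse

def classify_url(url: str) -> str:
    """Roughly classify a download URL instead of asserting the query is empty."""
    u = urlparse(url)
    if u.netloc.endswith("s3.amazonaws.com") and u.query:
        # Pre-signed S3 link: the query string carries the signature, so it
        # must be kept intact and the URL downloaded directly over HTTP.
        return "presigned-s3"
    if u.path.endswith("/download"):
        # Girder API download link, as produced by the GUI.
        return "girder-download"
    return "unknown"
```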
pip install dandi
does the following:
...
Created wheel for dandi: filename=dandi-0.0.0-cp37-none-any.whl size=67033 sha256=46af8b37a25b16b497236d7bd4aee6163b5d6fcf01da6f290bc85df90b9de8c9
Stored in directory: /Users/satra/Library/Caches/pip/wheels/60/a6/58/d841466d5d3849c392a32d656c10daa16affd4d0cc2a0a5bdc
Successfully built dandi
Installing collected packages: tqdm, appdirs, joblib, dandi
Successfully installed appdirs-1.4.3 dandi-0.0.0 joblib-0.14.1 tqdm-4.43.0