dandi / dandi-cli
DANDI command line client to facilitate common operations
Home Page: https://dandi.readthedocs.io/
License: Apache License 2.0
$> dandi download https://gui.dandiarchive.org/#/file-browser/folder/5e9f9588b5c9745bad9f58fe
2020-04-22 13:52:43,982 [ INFO] Downloading folder with id 5e9f9588b5c9745bad9f58fe from https://girder.dandiarchive.org/
2020-04-22 13:52:44,496 [ INFO] Traversing remote folders (sub-mouse-AAYYT) recursively and downloading them locally
sub-mouse-AAYYT_ses-20180420-sample-4_slice-20180420-slice-4_cell-20180420-sample-4.nwb: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 9.50M/9.50M [00:03<00:00, 2.91MB/s]
sub-mouse-AAYYT_ses-20180420-sample-2_slice-20180420-slice-2_cell-20180420-sample-2.nwb: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 9.34M/9.34M [00:03<00:00, 2.61MB/s]
Error: HTTP error 400: get https://girder.dandiarchive.org/api/v1/file/5e9f95e7b5c9745bad9f592e/download
Response text: {"message": "Unable to connect to S3 assetstore", "type": "validation"}
(dev3) 1 21415 ->1.....................................:Wed 22 Apr 2020 01:52:50 PM EDT:.
lena:/tmp
$> dandi download https://gui.dandiarchive.org/#/file-browser/folder/5e9f9588b5c9745bad9f58fe
2020-04-22 13:53:03,607 [ INFO] Downloading folder with id 5e9f9588b5c9745bad9f58fe from https://girder.dandiarchive.org/
2020-04-22 13:53:03,930 [ INFO] Traversing remote folders (sub-mouse-AAYYT) recursively and downloading them locally
2020-04-22 13:53:04,955 [ INFO] './sub-mouse-AAYYT/sub-mouse-AAYYT_ses-20180420-sample-2_slice-20180420-slice-2_cell-20180420-sample-2.nwb' - same time and size, skipping
2020-04-22 13:53:04,955 [ INFO] './sub-mouse-AAYYT/sub-mouse-AAYYT_ses-20180420-sample-4_slice-20180420-slice-4_cell-20180420-sample-4.nwb' - same time and size, skipping
sub-mouse-AAYYT_ses-20180420-sample-3_slice-20180420-slice-3_cell-20180420-sample-3.nwb: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 9.31M/9.31M [00:02<00:00, 3.65MB/s]
ATM `ls` (see #8) is taking a while even though we are accessing only basic metadata fields. We should see how to speed things up:
$> dandi ls /home/yoh/proj/dandi/nwb-datasets/najafi-2018-nwb/data/FN_dataSharing/nwb/mouse1_fni16_15081{7,8}*nwb
PATH SIZE EXPERIMENTER SESSION_ID SESSION_START_TIME KEYWORDS
...ni16_150817_001_ch2-PnevPanResults-170808-190057.nwb 3... Farzaneh ... 1708081... 2015-08-16/20:0...
...ni16_150818_001_ch2-PnevPanResults-170808-180842.nwb 2... Farzaneh ... 1708081... 2015-08-17/20:0...
Summary: 5... 2015-08-16/20:0...
2015-08-17/20:0...
dandi ls 18.04s user 0.56s system 79% cpu 23.369 total
so it took 23 sec just to get those basic fields.
If those are still to be provided in the dandiset.yaml -- they better be validated.
Reflecting upon NeurodataWithoutBorders/pynwb#1077 -- PyNWB fails to open some (NWB 1.0? or just not NWB:N?) NWB files. So I wondered if we should collect a list of specification versions (Groups) and list them with `dandi ls` before even trying to feed that file to PyNWB?
@bendichter what do you think?
although `pip install dandi` correctly reports:
ERROR: Package u'dandi' requires a different Python: 2.7.17 not in '>=3.5.1'
@bendichter ran into the situation where pip install --upgrade dandi
failed with
bdichter@smaug:~$ pip install --upgrade dandi
Collecting dandi
Using cached https://files.pythonhosted.org/packages/b9/4d/f2e8243beeb0427e713f4342017768f79fe6917e9cf2dd107ce10d23456c/dandi-0.2.0.tar.gz
Installing build dependencies ... done
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-HoAKb4/dandi/setup.py", line 20, in <module>
"version": versioneer.get_version(),
File "versioneer.py", line 1498, in get_version
return get_versions()["version"]
File "versioneer.py", line 1452, in get_versions
keywords = get_keywords_f(versionfile_abs)
File "versioneer.py", line 985, in git_get_keywords
except (FileNotFoundError, KeyError):
NameError: global name 'FileNotFoundError' is not defined
The reason for which (pip of Python 2) was immediately obvious. I think we should just add a check to setup.py to kaboom with an informative message before versioneer is approached.
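A minimal sketch of such a guard, placed at the very top of setup.py before versioneer is touched (message text and version bound taken from the error above; exact wording is illustrative):

```python
import sys


def check_python(version_info=sys.version_info, minimum=(3, 5, 1)):
    """Abort with an informative message when run under a too-old Python."""
    if tuple(version_info[:3]) < minimum:
        sys.exit(
            "dandi requires Python >= %d.%d.%d, but you are running "
            "%d.%d.%d (pip may have picked a Python 2 interpreter)."
            % (minimum + tuple(version_info[:3]))
        )


# call before `import versioneer` in setup.py
check_python()
```

Because the check runs before any Python-3-only syntax or stdlib name is hit, Python 2 users would see the message instead of the `FileNotFoundError` NameError.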
See dandi/dandiarchive-legacy#54
We seem to not provide explicit settings to make uploads public.
register/upload process
update process questions:
`dandi download` currently does not edit keys in dandiset.yaml that are locally present. It should do so for any key that is not generated from the nwb section, i.e. treat the archive metadata for non-nwb keys as master.
To have the ability from the client to interact with available environments.
We could simplify access to environments available through the DANDI JupyterHub, etc.; but also possibly just start (list, stop) local instances of those.
ATM the implementations are separate, and the analysis for whether to download/upload a file happens right before the actual download/upload. That forbids implementing proper "sync" functionality, where analysis first needs to be done on whether any file needs to be removed (or maybe moved! ;-)) on either the local or server end. So an RF should be done to minimize the difference between the API and implementation of the two:
- `--sync`
- `--sync` with `--file-mode` not being "overwrite" or "force" (if some file was already newer and we didn't perform the transfer)

For download would be needed:
If `organize` is (re)run on a full collection of files, some of which were modified, then by "injecting" new files and leaving previously organized ones we might end up breeding stale files. With something like `--disappeared` we could detect which paths were no longer considered and offer to remove them.
Alternatively, we could make the default mode "complete", which would imply removing files which were not "re-organized" (probably should ask first, since this could lead to loss of data), and only with an explicit `--mode=incremental` allow new files to be added in while all previous ones stick around.
On .nwb files: `nib-diff` (nibabel's diff, basic implementation) would show differences in metadata fields. With a dedicated option, it could also do summaries of data diffs. To diff local and remote files we would just fetch metadata records from dandiarchive.
On directories: the object ids we upload and/or later checksums (we do not upload those yet).
On dandisets: relates to the use case of #70, where it is desired to see which files were already `dandi organize`d or not. So users then could

dandi diff --find-renames /path/myorigdata /path/organized

or even

dandi diff --find-renames /path/myorigdata http://dandiarchive.org/dandiset/000XXX/draft
$ dandi upload -d 000001
2020-03-14 21:53:02,654 [ INFO] Found 136 files to consider
Error: Both paths must either be absolute or relative. Got '/net/vast-storage.ib.cluster/scratch/Mon/satra/000001/dandiset.yaml' and '000001'
A request by labs via @bendichter: to know which file was converted into which in the dandiset. Perhaps `organize` can upload this with the file/item metadata, and dandi cli can look it up from the server.
However, in general, perhaps we can add a simple provenance file:
<filename 1> wasDerivedFrom <old filename> .
<filename1> <sha512_or_some_such> <sha> .
This may be useful to create a manifest file later, or for checking against the Amazon store on download.
At present `organize` doesn't change checksums, so that notion is simply a flag for now.
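A sketch of emitting those two lines for one organized file, using sha512 for the `<sha512_or_some_such>` placeholder (the function name and exact line layout are illustrative):

```python
import hashlib
from pathlib import Path


def provenance_lines(new_path, old_path):
    """Emit the two triple-style provenance lines sketched above."""
    digest = hashlib.sha512(Path(new_path).read_bytes()).hexdigest()
    return [
        "<%s> wasDerivedFrom <%s> ." % (new_path, old_path),
        "<%s> sha512 %s ." % (new_path, digest),
    ]
```

Concatenating these lines across all organized files would already give the manifest usable for download-time checking.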
Very basic functionality.
There are a few issues around `subject_id`:
- `dandi validate` does not enforce that `subject_id` is present, but `dandi organize` requires it.
- When `dandi organize` runs on a file that is missing `subject_id`, it complains about missing metadata, but does not specify what this missing metadata is.
- It is easy to miss `dandi organize` complaining, so a user will just carry on and will be confused when no data is uploaded.

This is all exacerbated by the fact that ipfx, the main conversion software for ABF and DAT icephys files out of the Allen Institute, doesn't currently even have a way to add a `Subject` object to the nwb file in the first place (so you can't specify `subject_id`). I have made a pull request to be able to add this information here.
In the light of #21 it might be highly beneficial to compress files during upload.
@mgrauer - does girder support receiving a compressed payload?
@bendichter - do you have a quick way/code to assess if an hdf5 file used compression, so we could include that in the `ls` output and dynamically decide whether to compress the payload to girder?
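For the "quick way to assess if an hdf5 file used compression": with plain h5py one can inspect every dataset's compression filter. A sketch, assuming h5py is available:

```python
import h5py


def dataset_compression(path):
    """Map each dataset name in an HDF5 file to its compression filter
    (None means the dataset is stored uncompressed)."""
    result = {}

    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset):
            result[name] = obj.compression

    with h5py.File(path, "r") as f:
        f.visititems(visitor)
    return result
```

Any non-None value (e.g. "gzip") means the file used compression for that dataset; `all(v is None for v in result.values())` would flag a candidate for payload compression.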
Similar to `datalad wtf` but with details pertinent to dandi. Here is a datalad example
Question to @bendichter (doing it via github so there is a trace, or I will forget the answer and will ask again ;)): in `validate` I opened a file via `pynwb.NWBHDF5IO(path, 'r', load_namespaces=True)`. On some files this then causes hdmf to whine that
UserWarning: No cached namespaces found in testfile.nwb
so I wondered whether for validation I have to (or not) explicitly request loading of the namespaces?
/home/yoh/proj/dandi/trash > dandi validate testfile.nwb
/home/yoh/deb/gits/pkg-exppsy/hdmf/src/hdmf/backends/hdf5/h5tools.py:99: UserWarning: No cached namespaces found in testfile.nwb
warnings.warn(msg)
No validation errors among 1 files
/home/yoh/proj/dandi/trash > cat mksample_nwb.py
import os
import time
from datetime import datetime
from dateutil.tz import tzlocal, tzutc

tm1 = time.time()
from pynwb import NWBFile, TimeSeries
from pynwb import NWBHDF5IO
from pynwb.file import Subject, ElectrodeTable
from pynwb.epoch import TimeIntervals
t0 = time.time()
print("Took %.2f sec to import pynwb" % (t0 - tm1))

filename = "testfile.nwb"
nwbfile = NWBFile('session description',
                  'identifier',
                  datetime.now(tzlocal()),
                  file_create_date=datetime.now(tzlocal()),
                  lab='a Lab',
                  # Keywords cause that puke upon `repr` ValueError: Not a dataset (not a dataset)
                  keywords=('these', 'are', 'keywords')
                  )
t1 = time.time()
print("Took %.2f sec to create NWB instance" % (t1 - t0))

with NWBHDF5IO(filename, 'w') as io:
    io.write(nwbfile, cache_spec=False)
t2 = time.time()
print("Took %.2f sec to write NWB instance" % (t2 - t1))

with NWBHDF5IO(filename, 'r', load_namespaces=True) as reader:
    nwbfile = reader.read()
    t3 = time.time()
    print("Took %.2f sec to read NWB instance" % (t3 - t2))
    print(nwbfile)
    t4 = time.time()
    print("Took %.2f sec to print repr of NWB instance" % (t4 - t3))
print(filename)
We would like to be able to spit out metadata in alternative formats/renderings:
- `pyout` -- current
- `json`
- `json_pp` (pretty printed) (both json flavors might be not a single json record, but rather a json lines stream)
- `yaml` (maybe when/if there is interest/demand)
- `pandas` -- dataframe rendering (will require obtaining all records first)
- `auto` -- idea by @satra: "if multiple files, show columns; if single file, show key-value pairs in rows". So we could flip between `pyout` and `yaml` (or `json_pp`, which would give details) depending on the number of files.

nwb-schema says:
- name: identifier
dtype: text
doc: A unique text identifier for the file. For example, concatenated lab name,
file creation date/time and experimentalist, or a hash of these and/or other
values. The goal is that the string should be unique to all other files.
but I ran into:
$> dandi ls -F path,nwb_version,identifier /home/yoh/proj/dandi/nwb-datasets/bendichter/Kriegstein2020/1853103{4,5}.nwb
PATH NWB IDENTIFIER
/home/yoh/proj/dandi/nwb-datasets/bendichter/Kriegstein2020/18531034.nwb 2.0b 638a0dd87776dd9a06e03dd658b3c702d55096871d68728254d6d814703b7630
/home/yoh/proj/dandi/nwb-datasets/bendichter/Kriegstein2020/18531035.nwb 2.0b 638a0dd87776dd9a06e03dd658b3c702d55096871d68728254d6d814703b7630
attn @bendichter
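To satisfy the schema's uniqueness goal, the identifier could include a random component rather than only a hash of metadata that may coincide across files. A minimal sketch (the lab/session naming parts are illustrative):

```python
import uuid


def make_identifier(lab="a Lab", session="session-001"):
    """Build a per-file identifier; the uuid4 suffix guarantees that even
    files with identical lab/session metadata get distinct identifiers."""
    return "%s-%s-%s" % (lab.replace(" ", "_"), session, uuid.uuid4())
```

Two calls with identical metadata still produce distinct identifiers, which is exactly what the hash-of-metadata scheme above failed to provide.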
In general it would be useful to upload/download(/sync) content from the hard drive to the DANDI archive by pointing to entire directories or specific files. E.g. while being within some specific study (already known to DANDI) I could simply `dandi upload FILES*` without worrying about mapping them "properly" to some collection/folder on the DANDI (girder) server.
So we need to decide on organization of datasets on Girder and locally.
There are a few options. It is not clear to me yet whether we should define a "Collection" per each dataset or just rely on some flat or non-flat hierarchy within a specific (e.g. per center/study) collection to upload datasets.
Not unlike git (with `.git`), we could define/rely on a `.dandi/` folder to define the dataset boundary. Such a folder could contain the necessary configuration/credentials etc. to interoperate with the DANDI archive while the user is within that directory or under any of its subdirectories.
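The boundary discovery could work like git's repository-root search: walk up from a path until a directory containing the marker is found. A sketch under that assumption (the `.dandi` name is the proposal above, not an implemented convention):

```python
from pathlib import Path


def find_dandiset_root(start):
    """Walk from `start` up to the filesystem root; return the first
    directory containing a `.dandi/` folder, or None if there is none."""
    start = Path(start).resolve()
    for candidate in [start] + list(start.parents):
        if (candidate / ".dandi").is_dir():
            return candidate
    return None
```

Commands like `dandi upload FILES*` could then call this to decide which dandiset (and which credentials) the current working directory belongs to.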
User reported error, I cannot reproduce:
dandi download https://gui.dandiarchive.org/#/file-browser/folder/5e72840a3da50caa9adb0489
...
2020-04-14 15:19:52,862 [ INFO] Updating dandiset.yaml from obtained dandiset metadat
Error: 'generator' object is not subscriptable
Which is strange because this command works for me, as well as:
dandi download https://dandiarchive.org/dandiset/000009/draft
The user is on a Windows machine. I'll update as I gather more info from the user.
Mac users with an external drive are likely to have hidden "resource fork" files that have the prefix `._`. These files are invisible and can normally be safely ignored, but they break the `organize` command. These files are difficult for mac users to identify, as they are hidden in Finder, and only findable with `ls -a` on the command line. I think it would be better to ignore them in `organize`. If not, then maybe we could throw an error right away, instead of waiting for all of the metadata to be gathered (for 18 minutes in this case) before throwing the error.
(base) Bens-MacBook-Pro-2:dandi_staging bendichter$ dandi organize -d 000008 /Volumes/easystore5T/data/Tolias/nwb -f symlink
2020-03-17 14:43:47,494 [ INFO] Loading metadata from 1319 files
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 9.1s
[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 10.4s
...
[Parallel(n_jobs=-1)]: Done 1285 tasks | elapsed: 18.1min
[Parallel(n_jobs=-1)]: Done 1319 out of 1319 | elapsed: 18.6min finished
2020-03-17 15:02:25,252 [ WARNING] Completely empty record for /Volumes/easystore5T/data/Tolias/nwb/._20171204_sample_2.nwb
Error: 1 out of 1319 files were found not containing all necessary metadata: /Volumes/easystore5T/data/Tolias/nwb/._20171204_sample_2.nwb
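The proposed ignoring could be a simple filename filter applied before metadata extraction, so `organize` never stalls on AppleDouble files; a sketch (adding `.DS_Store` to the ignore set is my own assumption):

```python
from pathlib import Path


def filter_resource_forks(paths):
    """Drop macOS AppleDouble '._*' companion files (and .DS_Store)
    before any expensive per-file metadata loading."""
    ignored_names = {".DS_Store"}
    return [
        p for p in paths
        if not Path(p).name.startswith("._") and Path(p).name not in ignored_names
    ]
```

Applied to the 1319 paths above, this would have silently dropped `._20171204_sample_2.nwb` instead of failing after 18 minutes.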
is there an option to change the api key once stored in the keyring?
This dataset created 87 folders, but there are only 59 subjects: https://dandiarchive.org/dandiset/000004/draft
Very likely the NOID* is an appendix to the ID, but it would have been impossible for the cli to know.
Perhaps we can provide an option to clean up IDs during conversion.
ATM disabled since it was causing troubles: https://github.com/dandi/dandi-cli/blob/master/dandi/girder.py#L61. Yet to provide more information, and not assigning this to anyone, so it is available for grabbers -- just enable that code, and try to upload to any girder instance using `dandi upload`.
As a use case with openscope shows, it might be impossible for researchers to have all files in a single local dandiset to populate fields such as `number_subjects` etc. Options:
- do it on the archive side in those cases, adjusting dandiset.yaml (or the dandiset metadata if no physical file)
- extract/include in dandiset.yaml not only `number_` but also the actual values for the corresponding items; then `number_` could always be computed. Discussion of context remains relevant though, since it could affect counts (e.g. if cell id context is within tissue id)
- disallow partial uploads
- more approaches?
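The second option above (carrying the actual values so `number_*` fields can always be recomputed) can be sketched as follows; the record shape and keys are hypothetical, not the dandiset.yaml schema:

```python
def count_unique(records, key):
    """Recompute a number_* field from per-file metadata records."""
    return len({r[key] for r in records if r.get(key) is not None})


records = [  # illustrative per-file metadata
    {"subject_id": "mouse-01", "cell_id": "c1"},
    {"subject_id": "mouse-01", "cell_id": "c2"},
    {"subject_id": "mouse-02", "cell_id": "c1"},
]
number_of_subjects = count_unique(records, "subject_id")
```

Note that `cell_id` "c1" appears under two subjects, illustrating the context caveat: counting cell ids globally gives 2, while counting them within each subject would give 3.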
I was alerted to the issue by @satra, reproduced:
$> dandi register -n "Anticipatory Activity in Mouse Motor Cortex" -D "Activity in the mouse anterior lateral motor cortex (ALM) instructs directional movements, often seconds before movement initiation. It is unknown whether this preparatory activity is localized to ALM or widely distributed within motor cortex. Here we imaged activity across motor cortex while mice performed a whisker-based object localization task with a delayed, directional licking response. During tactile sensation and the delay epoch, object location was represented in motor cortex areas that are medial and posterior relative to ALM, including vibrissal motor cortex."
Error: HTTP error 401: POST https://girder.dandiarchive.org/api/v1/dandi?name=Anticipatory+Activity+in+Mouse+Motor+Cortex&description=Activity+in+the+mouse+anterior+lateral+motor+cortex+%28ALM%29+instructs+directional+movements%2C+often+seconds+before+movement+initiation.+It+is+unknown+whether+this+preparatory+activity+is+localized+to+ALM+or+widely+distributed+within+motor+cortex.+Here+we+imaged+activity+across+motor+cortex+while+mice+performed+a+whisker-based+object+localization+task+with+a+delayed%2C+directional+licking+response.+During+tactile+sensation+and+the+delay+epoch%2C+object+location+was+represented+in+motor+cortex+areas+that+are+medial+and+posterior+relative+to+ALM%2C+including+vibrissal+motor+cortex.
Response text: {"message": "You must be logged in.", "type": "access"}
but it works on my local deployment, which might be a bit of an older version of the dandiarchive.
$> DANDI_DEVEL=1 dandi register -i local-docker -n "Anticipatory Activity in Mouse Motor Cortex" -D "Activity in the mouse anterior lateral motor cortex (ALM) instructs directional movements, often seconds before movement initiation. It is unknown whether this preparatory activity is localized to ALM or widely distributed within motor cortex. Here we imaged activity across motor cortex while mice performed a whisker-based object localization task with a delayed, directional licking response. During tactile sensation and the delay epoch, object location was represented in motor cortex areas that are medial and posterior relative to ALM, including vibrissal motor cortex."
2020-03-14 21:17:58,505 [ INFO] Registered dandiset at None/dandiset/000023/draft. Please visit and adjust metadata.
2020-03-14 21:17:58,505 [ INFO] No dandiset path was provided and no dandiset detected in the path. Here is a record for dandiset.yaml
# DO NOT EDIT this file manually.
# It can be edied online and obtained from the dandiarchive.
# It also gets updated using dandi organize
description: Activity in the mouse anterior lateral motor cortex (ALM) instructs directional
movements, often seconds before movement initiation. It is unknown whether this
preparatory activity is localized to ALM or widely distributed within motor cortex.
Here we imaged activity across motor cortex while mice performed a whisker-based
object localization task with a delayed, directional licking response. During tactile
sensation and the delay epoch, object location was represented in motor cortex areas
that are medial and posterior relative to ALM, including vibrissal motor cortex.
identifier: '000023'
name: Anticipatory Activity in Mouse Motor Cortex
Ideally this should be kept within the .nwb itself (extension?) -- @bendichter could correct me if I am wrong, and I should just check later, but upon each save `pynwb` should regenerate a UUID within a file. We could then establish a trail of UUIDs for a file, not unlike git history, where the parent commit which contained the previous version of the file is known. This could provide a reliable mechanism for "fast-forward only" updates to/from dandiarchive. Any possible conflicts (divergence in the history of the trail, thus requiring a merge) would be beyond the scope here, although we could provide the tool to review the differences (if any) and accept one version or another, or establish a "merged" one.
edit 1: note that this is very .nwb specific, and wouldn't provide a remedy for any other type of file we might need to allow to be uploaded to the archive. A general solution IMHO would be to use a VCS-based platform.
basic validation of .nwb files via pynwb
this line https://github.com/dandi/dandi-cli/blob/master/dandi/__init__.py#L44 returns False
when dandi cli is called in the terminal.
{'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <_frozen_importlib_external.SourceFileLoader object at 0x107f7b250>,
'__spec__': None, '__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>,
'__file__': '/Users/satra/software/miniconda3/envs/dandi/bin/dandi', '__cached__': None,
're': <module 're' from '/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/re.py'>,
'sys': <module 'sys' (built-in)>}
so possibly remove the if statement and call it every time?
The main things are that publications should be a list of objects, and that it is now being edited online.
see NeurodataWithoutBorders/pynwb#1091
It might take them a bit to resolve it properly, since it would (as far as I see) require quite fundamental changes. Meanwhile we could simply do ad-hoc filtering for those few specific ones, based on the nwb_version of the file, and later (when pynwb fixes it) disable that filtering.
which tries to use `-d` to specify the dandiset to operate on:
dandi register
dandi download dataset-permalink
dandi organize -d folder_download-typically-datasetid source_folder -f some_mode
dandi upload -d folder_download-typically-datasetid
E.g., as reported on slack, there was a bunch of
TimeSeries/data (processing/spikes/Sweep_27): argument missing
validation errors. Thanks to the feedback from @rly on nwb slack channel:
from the path, i guess that the user wishes to store spike times in a TimeSeries. the most straightforward way to do that right now is to put the spike times in the timestamps array, but the data array is still required. here, until we implement a proper Events type (see NeurodataWithoutBorders/nwb-schema#301), i recommend the user create a dummy array data consisting of ones for every value in the timestamps array.
Which gave me an idea that we might want to post-process some of the pynwb validation errors and provide guesses on what could have led to them and how to mitigate them.
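Such post-processing could be a table of known error patterns with attached hints; a sketch (the pattern table, function names, and hint text are illustrative, with the hint paraphrasing @rly's advice above):

```python
import re

# hypothetical pattern -> hint table; extend with other known pynwb errors
HINTS = [
    (re.compile(r"TimeSeries/data .*: argument missing"),
     "If you only have spike times, put them in `timestamps` and provide a "
     "dummy `data` array of ones until a proper Events type exists "
     "(see NeurodataWithoutBorders/nwb-schema#301)."),
]


def annotate(error):
    """Return the validation error, with a mitigation hint appended
    when the error matches a known pattern."""
    for pattern, hint in HINTS:
        if pattern.search(error):
            return "%s\n  hint: %s" % (error, hint)
    return error
```

Errors without a matching pattern pass through unchanged, so the annotation layer is safe to apply to every validator message.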
Came up during the call: having a command to assist with re-layouting a dataset, since NWB doesn't enforce any file system organization/structure. It would be very close to what `git annex view` does.
Examples of layouts found in the "wild" (thanks to @bendichter for summarizing; I hope it is ok to post here, slap me if I am wrong):
data is stored across separate servers for PHI/HIPAA reasons, and the entire lab shares data, so they do not separate by experimenter:
raw/
EC61/ (subject)
EC61_B1/ (session)
data
processed/
EC61/
EC61_B1/
imaging/
EC61/
.. is probably the most common among neurophysiology labs that have a central data storage architecture at all. Here, subjects belong to specific experimenters who have unique subject naming conventions, so their first level is “experimenter”. Most of their data is shared publicly and you can navigate their file structure here (this is the data referenced by Peter’s Database).
SenzaiY/ (experimenter)
YutaMouse-41/ (subject)
YutaMouse-41-150819/ (session)
eeg.dat
spikes.dat
...
I like that the sessions are named by subject and date, because they are easy to manage by eye.
Analogously to BIDS, we could use subject > session. I think we should allow for a bit of extra metadata around NWB files, so our structure could be something like:
sub-01/
sub-01_20190505/
data
dashboard_config.py
sub-01_20190506/
sub-02/
sub-03/
sub-04/
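The subject > session proposal above can be sketched as a template-driven re-layout, in the spirit of `git annex view`; the metadata keys and template syntax here are illustrative, not an existing dandi API:

```python
def layout_path(metadata, template="sub-{subject_id}/sub-{subject_id}_{session_id}/"):
    """Derive a target directory for a file from its metadata record.
    Different labs' conventions become different templates, e.g.
    '{experimenter}/{subject_id}/{subject_id}-{session_id}/'."""
    return template.format(**metadata)
```

A re-layout command would then just extract each file's metadata, compute `layout_path`, and move (or symlink) the file there.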
$> dandi download https://gui.dandiarchive.org/\#/folder/5e6d855776569eb93f451e50
2020-03-16 23:49:38,770 [ INFO] Downloading folder with id 5e6d855776569eb93f451e50 from https://girder.dandiarchive.org/
2020-03-16 23:49:38,885 [ INFO] Traversing remote dandisets (000002) recursively and downloading them locally
2020-03-16 23:49:38,885 [ INFO] Updating fdandiset.yaml from obtained dandiset metadata
(dev3) 3 10975.....................................:Mon 16 Mar 2020 11:49:39 PM EDT:.
smaug:/mnt/datasets/dandi
$> cat 000002/dandiset.yaml
# DO NOT EDIT this file manually.
# It can be edied online and obtained from the dandiarchive.
# It also gets updated using dandi organize
{description: 'Activity in the mouse anterior lateral motor cortex (ALM) instructs
directional movements, often seconds before movement initiation. It is unknown
whether this preparatory activity is localized to ALM or widely distributed within
motor cortex. Here we imaged activity across motor cortex while mice performed
a whisker-based object localization task with a delayed, directional licking response.
During tactile sensation and the delay epoch, object location was represented
in motor cortex areas that are medial and posterior relative to ALM, including
vibrissal motor cortex.', identifier: '000002', name: Anticipatory Activity in
Mouse Motor Cortex}
$> dandi --version
0.4.1+2.g21b8505
but on my laptop it is all good... So likely it is the version of yaml and/or the parameters passed to it. We need `dandi wtf` (hence new #57).
on bad host:
smaug:/mnt/datasets/dandi
$> welp yaml
PATH : /usr/lib/python3/dist-packages/yaml/__init__.py
SRC PATH : /usr/lib/python3/dist-packages/yaml/__init__.py
VERSION : Not found
__version__: '3.13'
PACKAGE : python3-yaml
ii python3-yaml 3.13-2 amd64 YAML parser and emitter for Python3
on good:
$> welp yaml
PATH : /usr/lib/python3/dist-packages/yaml/__init__.py
SRC PATH : /usr/lib/python3/dist-packages/yaml/__init__.py
VERSION : Not found
__version__: '5.3'
PACKAGE : python3-yaml
ii python3-yaml 5.3-1 amd64 YAML parser and emitter for Python3
python3-yaml:
Installed: 5.3-1
Candidate: 5.3-2
Version table:
5.3-2 900
900 http://deb.debian.org/debian bullseye/main amd64 Packages
600 http://http.debian.net/debian sid/main amd64 Packages
*** 5.3-1 100
100 /var/lib/dpkg/status
I feel that I had such an issue somewhere, but now it is too late to try to remember how to overcome it.
Per our zoom chat with Tom: he recommended letting users specify the target numbers of subjects etc. they expect in the dandiset. So it might be worth adding an option to `organize`; for `validate` there is a separate issue, #90.
Turn dandi cli into pydandi/dandipy/...
Related to this, I think if we turned dandi into a library, one should be able to do:
import dandi as di
di.get_dataset('ds00001', dataset_base_dir='/data/', mmap=True)
and this call could use whatever necessary (datalad, dandi api, etc.) under the hood. In some ways this is similar to what reactopya does, but it does not provide direct access to the object. It would be nice if we had a zarr-based mode (https://zarr.readthedocs.io/en/stable/).
reporting the failure to install on our test aws ec2 box:
yoh@ip-172-31-33-190:~$ git clone https://github.com/dandi/dandi-cli && cd dandi-cli && virtualenv --system-site-packages --python=python3 venvs/dev3 && source venvs/dev3/bin/activate && pip install -e .
Cloning into 'dandi-cli'...
remote: Enumerating objects: 113, done.
remote: Counting objects: 100% (113/113), done.
remote: Compressing objects: 100% (75/75), done.
remote: Total 113 (delta 56), reused 84 (delta 34), pack-reused 0
Receiving objects: 100% (113/113), 40.93 KiB | 10.23 MiB/s, done.
Resolving deltas: 100% (56/56), done.
Already using interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in /home/yoh/dandi-cli/venvs/dev3/bin/python3
Also creating executable in /home/yoh/dandi-cli/venvs/dev3/bin/python
Installing setuptools, pkg_resources, pip, wheel...done.
Obtaining file:///home/yoh/dandi-cli
Installing build dependencies ... done
Getting requirements to build wheel ... error
ERROR: Command errored out with exit status 1:
command: /home/yoh/dandi-cli/venvs/dev3/bin/python3 /home/yoh/dandi-cli/venvs/dev3/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmprj7_0raa
cwd: /home/yoh/dandi-cli
Complete output (10 lines):
Traceback (most recent call last):
File "/home/yoh/dandi-cli/venvs/dev3/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py", line 207, in <module>
main()
File "/home/yoh/dandi-cli/venvs/dev3/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py", line 197, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/home/yoh/dandi-cli/venvs/dev3/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py", line 48, in get_requires_for_build_wheel
backend = _build_backend()
File "/home/yoh/dandi-cli/venvs/dev3/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py", line 39, in _build_backend
obj = getattr(obj, path_part)
AttributeError: module 'setuptools.build_meta' has no attribute '__legacy__'
----------------------------------------
ERROR: Command errored out with exit status 1: /home/yoh/dandi-cli/venvs/dev3/bin/python3 /home/yoh/dandi-cli/venvs/dev3/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmprj7_0raa Check the logs for full command output.
(dev3) yoh@ip-172-31-33-190:~/dandi-cli$ apt-cache policy python3-setuptools
python3-setuptools:
Installed: 39.0.1-2
Candidate: 39.0.1-2
Version table:
*** 39.0.1-2 500
500 http://us-east-2.ec2.archive.ubuntu.com/ubuntu bionic/main amd64 Packages
100 /var/lib/dpkg/status
#26 introduced the `upload` command. We should enable some kind of unit testing for it.
It seems that github actions support docker: https://help.github.com/en/articles/creating-a-docker-container-action and even I found an issue in a now removed "docker" action repository issues: https://webcache.googleusercontent.com/search?q=cache:mPPb1xbgUgEJ:https://github.com/actions/docker/issues/11+&cd=2&hl=en&ct=clnk&gl=us which suggests that docker compose is also available.
But we would first need to establish some user creds and obtain the key programmatically. @mgrauer - can you help with that?
We mandate subject, session, and then an optional list of others... @satra has mentioned a use case by @bendichter where people prefer to avoid using session altogether. I am yet to argue one way (against) or another (support), but here is at least an issue to document this desire/use case ;)
(git-annex)lena:~/proj/dandi/nwb-datasets[master]bendichter/Gao2018
$> dandi validate *nwb
anm00314746_2015-10-20 09:36:04 (1).nwb: ok
anm00314746_2015-10-21 11:25:41 (1).nwb: ok
anm00314746_2015-10-22 15:17:38 (1).nwb: ok
anm00314756_2015-10-20 19:42:11 (1).nwb: ok
anm00314756_2015-10-23 14:10:29 (1).nwb: ok
anm00314757_2015-10-20 17:37:31 (1).nwb: ok
anm00314757_2015-10-21 18:02:48 (1).nwb: ok
anm00314758_2015-10-20 10:49:30.nwb: ok
anm00314758_2015-10-21 10:10:14 (1).nwb: ok
anm00314758_2015-10-22 11:20:47 (1).nwb: ok
anm00314758_2015-10-23 09:49:01 (1).nwb: ok
anm00314760_2015-10-20 15:52:30.nwb: ok
anm00314760_2015-10-21 16:44:27.nwb: ok
anm00314760_2015-10-22 16:39:13 (1).nwb: ok
BAYLORCD12_2018-01-25 19:16:01.nwb: ok
BAYLORCD12_2018-01-26 12:25:06.nwb: ok
Error: Could not construct DynamicTableRegion object due to The index 63 is out of range for this DynamicTable of length 63
$> dandi --version
0.4.4+7.g2f45a27.dirty
The dandi cli should support DANDI identifiers. Since we know what these map to, we should be able to route them directly without going through identifiers.org. So the following should all be feasible:
dandi download DANDI:000008
dandi download https://identifiers.org/DANDI:000008
dandi download https://dandiarchive.org/dandiset/000008
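The routing could be a small normalizer that maps any of the three spellings to the canonical dandiset URL; the regex and resulting URL shape below are my assumptions about the scheme, not implemented behavior:

```python
import re


def resolve_dandi_url(spec):
    """Normalize DANDI:NNNNNN, identifiers.org, or dandiarchive.org
    spellings to the canonical dandiset URL."""
    m = re.fullmatch(
        r"(?:https://identifiers\.org/)?DANDI:(\d{6})"
        r"|https://dandiarchive\.org/dandiset/(\d{6})(?:/draft)?",
        spec,
    )
    if not m:
        raise ValueError("not a recognized DANDI identifier: %r" % spec)
    dandiset_id = m.group(1) or m.group(2)
    return "https://dandiarchive.org/dandiset/%s" % dandiset_id
```

`dandi download` would then only ever see the normalized form, regardless of which spelling the user pasted.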
I have noted that network traffic while rcloning Svoboda's data is only about 10% of the local "write" IO.
That observation is confirmed by simply compressing the obtained .nwb files using tar/gz:
smaug:/mnt/btrfs/datasets/datalad/crawl-misc/svoboda-rclone/Exported NWB 2.0
$> du -scm Chen\ 2017*
35113 Chen 2017
3298 Chen 2017.tgz
38410 total
so indeed -- a x10 factor!
Apparently hdmf/pynwb does not bother compressing the data arrays stored in the .nwb. They do both document the ability to pass compression parameters down (to h5py I guess), but as far as I saw, compression is not on by default. Sure, the hdf5-level compression ratio might not reach 10 since not all data will be compressed, but I expect that it will be notable.
As we keep running into this, it might be valuable to provide a dandi compress
command which would take care of (re)compressing the given .nwb files (in place or into a new file).
Prospective interface:
dandi compress [-i|--inplace] [-o|--output FILE] [-c|--compression METHOD (default gzip)] [-l|--level LEVEL (default 5)] [FILES]
--inplace
  explicitly (re)compress each file in place (it might be better not to do it truly "in place", but rather to write into a new file and then replace the old one -- this would give a better workflow for git-annex'ed files, where the originals are read-only by default)
--output FILE
  where to store the output file (then only a single FILE is expected to be provided)

Moving a symlink with a relative target should account for the directory change.
TODO: check what happens with copy -- does it copy the symlink or dereference it?
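A minimal sketch of what such a `dandi compress` could do: copy every group and dataset of an HDF5/.nwb file into a new file with a gzip filter applied, preserving attributes. Purely illustrative, not the dandi-cli implementation -- a real version would also have to handle scalar datasets (which cannot be chunked/compressed), object references, and the atomic replacement of the original for --inplace.

```python
import h5py

def recompress(src: str, dst: str, method: str = "gzip", level: int = 5) -> None:
    """Copy src into dst, applying a compression filter to every dataset (sketch)."""
    with h5py.File(src, "r") as fin, h5py.File(dst, "w") as fout:
        def copy_item(name, obj):
            if isinstance(obj, h5py.Group):
                out = fout.require_group(name)
            else:  # h5py.Dataset; assumes non-scalar, non-reference data
                out = fout.create_dataset(
                    name, data=obj[()], compression=method, compression_opts=level
                )
            for key, value in obj.attrs.items():
                out.attrs[key] = value
        fin.visititems(copy_item)
        for key, value in fin.attrs.items():
            fout.attrs[key] = value
```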
Now that we can get download links from the GUI, this should be possible, but it results in an error:
$ dandi download https://girder.dandiarchive.org/api/v1/item/5e7b9e41529c28f35128c743/download
Error:
More details:
$ dandi -l DEBUG --pdb download -o . https://girder.dandiarchive.org/api/v1/item/5e7b9e41529c28f35128c743/download
Traceback (most recent call last):
File "/Users/satra/software/miniconda3/envs/dandi/bin/dandi", line 8, in <module>
sys.exit(main())
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/dandi/cli/command.py", line 118, in wrapper
return f(*args, **kwargs)
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/dandi/cli/cmd_download.py", line 53, in download
return download(url, output_dir, existing=existing, develop_debug=develop_debug)
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/dandi/download.py", line 139, in download
girder_server_url, asset_type, asset_id = parse_dandi_url(url)
File "/Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/dandi/download.py", line 75, in parse_dandi_url
assert not u.query
AssertionError
> /Users/satra/software/miniconda3/envs/dandi/lib/python3.7/site-packages/dandi/download.py(75)parse_dandi_url()
-> assert not u.query
(Pdb) url
'https://dandiarchive.s3.amazonaws.com/girder-assetstore/3d/2e/3d2e88d88a974644a6722bdd5790a27b?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA3GIMZPVVBOFDICEV%2F20200430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20200430T152456Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=FwoGZXIvYXdzEPH%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDJC9loqF4xaxKesvsSK%2FAauvxzItfPkUwulK2K4nVW%2F2xmiIDfonE5UF6fB3KjGlIASW0MTXj8IC6GCFx6kD90HjOIjirro7WPFfYM%2FhhztHjwvDC3bBBH76mzMup1sr3U8wrHkw3S5wIXjIj%2B244Us0maaDVsefgO%2B8g1hfYf7SUDQiLvCe%2BdseyTn4DqAX5NS9TVEG2baaoN2u1FFkX7%2Biy1C1xq3ZAKtq%2FYQEXKZ54UlzoPrH%2BnXzwl1ex%2FKyKtJmRWuoC9CF1sk0dpHUKMvaq%2FUFMi2UKOyzApfKMsXDLIETFNj4hOcunLbVCHYvEuqtvWmAWUCkkxGHyIv%2Ft6JvLCA%3D&X-Amz-Signature=d94191f2886c8b192c4bb4fb44f9dc35cf9b4aa296914baaa64e400f67096389'
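The failure is the bare `assert not u.query` in `parse_dandi_url`: the GUI link redirects to a pre-signed S3 URL, which legitimately carries query parameters. A more forgiving check might classify the URL instead of asserting the query is empty -- the helper below is hypothetical, not the actual fix:

```python
from urllib.parse import urlparse

def classify_url(url: str) -> str:
    """Roughly classify a download URL instead of asserting the query is empty."""
    u = urlparse(url)
    if u.netloc.endswith("s3.amazonaws.com") and u.query:
        # Pre-signed S3 link: the query string carries the signature, so it
        # must be kept intact and the URL downloaded directly over HTTP.
        return "presigned-s3"
    if u.path.endswith("/download"):
        # Girder API download link, as produced by the GUI.
        return "girder-download"
    return "unknown"
```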
pip install dandi
does the following:
...
Created wheel for dandi: filename=dandi-0.0.0-cp37-none-any.whl size=67033 sha256=46af8b37a25b16b497236d7bd4aee6163b5d6fcf01da6f290bc85df90b9de8c9
Stored in directory: /Users/satra/Library/Caches/pip/wheels/60/a6/58/d841466d5d3849c392a32d656c10daa16affd4d0cc2a0a5bdc
Successfully built dandi
Installing collected packages: tqdm, appdirs, joblib, dandi
Successfully installed appdirs-1.4.3 dandi-0.0.0 joblib-0.14.1 tqdm-4.43.0