neurodatawithoutborders / nwbinspector
Tool to help inspect NWB files for compliance with NWB Best Practices
Home Page: https://nwbinspector.readthedocs.io/
License: Other
In light of nwbinspector becoming a dependency of dandi-cli (https://github.com/dandi/dandi-cli/), nwbinspector should become available for conda distribution within the conda-forge channel, so we would be able to continue having the dandi client package there for the next release too.
@jwodder, since you have experience with creating feedstock recipes, would you be so kind as to prepare one for nwbinspector?
Yes.
Video files that are linked to nwbfiles under ImageSeries.external_file should be inspected for existence, file format, and path type (absolute/relative/URL):
import os
from pathlib import Path

import requests

video_file_path = "documents/folder0/video0.mp4"
video_file_url = "https://dandiarchive.s3.amazonaws.com/blobs/a52/253/a5225398-ec1c-4325-8bbe-83dc73fa8c87"

def video_exist_check(video_path, nwbfile_path):
    # Relative external_file paths are resolved against the directory
    # containing the NWB file, not against the NWB file path itself.
    video_path_type = video_pathtype_check_absolute(video_path)
    if video_path_type == "relative":
        return (Path(nwbfile_path).parent / video_path).exists()
    elif video_path_type == "url":
        return requests.head(video_path).ok
    else:
        return Path(video_path).exists()

def video_format_check(video_path):
    # Restrict to common video container formats (case-insensitive).
    return Path(video_path).suffix.lower() in (".mp4", ".avi", ".wmv", ".mov", ".flv")

def video_pathtype_check_absolute(video_path):
    # Classify the external_file entry as an absolute path, a URL, or a relative path.
    if os.path.isabs(video_path):
        return "absolute"
    elif any(str(video_path).startswith(prefix) for prefix in ("ftp://", "http://", "https://")):
        return "url"
    else:
        return "relative"
Yes.
In advance of the DANDI intention flag feature, need to add a check that subject species is specified. Without DANDI intention this is probably just a BEST_PRACTICE_SUGGESTION.
Also need to add a check that if subject species is specified, it is specified using binomial nomenclature (with proper capitalization as well). Without DANDI intention this is probably a BEST_PRACTICE_VIOLATION, since they did take the time to fill it out. There are some special cases that might be OK; something we've done before is allowing a special sub-species like 'Rattus norvegicus - Long Evans'. It also looks like DANDI supports URLs for taxonomies: https://github.com/dandi/dandi-cli/blob/5229292b777f644c5620bddf2abd4738168f0e81/dandi/metadata.py#L477-L486
But both of these will get elevated to CRITICAL when run with intention=DANDI_UPLOAD or DANDI_PUBLISH.
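A minimal sketch of what the species-format check could look like (the function name and regex are illustrative, not the actual nwbinspector implementation):

import re

# "Genus species" with a capitalized genus and lowercase epithet; trailing
# qualifiers like 'Rattus norvegicus - Long Evans' still match.
BINOMIAL_PATTERN = re.compile(r"^[A-Z][a-z]+ [a-z]+")

def check_subject_species_format(subject):  # hypothetical helper name
    species = subject.species
    if species is None:
        return "Subject species is missing."
    if species.startswith(("http://", "https://")):
        return None  # DANDI accepts taxonomy URLs
    if BINOMIAL_PATTERN.match(species) is None:
        return "Species should use Latin binomial nomenclature, e.g. 'Mus musculus'."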
On https://nwbinspector.readthedocs.io/en/dev/best_practices/ogen.html the link to check_optogenetic_stimulus_site_has_optogenetic_series() is missing/broken.
No.
e.g. the number of elements in ElectricalSeries.electrodes should be the same as the size of the 2nd dimension of ElectricalSeries.data, and the number of elements in TimeSeries.timestamps should be the same as the size of the 1st dimension of TimeSeries.data.
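A minimal sketch of such consistency checks, assuming pynwb-style objects with array-like data (function names are illustrative):

def check_electrodes_dim_match(electrical_series):
    # electrodes is a DynamicTableRegion; its length should equal data.shape[1]
    if len(electrical_series.electrodes) != electrical_series.data.shape[1]:
        return "Number of electrodes does not match the 2nd dimension of data."

def check_timestamps_dim_match(time_series):
    # only applies when explicit timestamps are present
    if time_series.timestamps is not None and len(time_series.timestamps) != time_series.data.shape[0]:
        return "Length of timestamps does not match the 1st dimension of data."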
Check that the name field of all objects does not include slashes ('/' or '\').
Not really sure what the neurodata_type should be for this, though. Maybe have it act on anything that is a subclass of NWBContainer, and then the logic of the test scans all the items inside that container? Or just have it be a recursive function that starts at the NWBFile container level and then recurses down any objects that are themselves containers? (A sketch of the recursive variant is below.)
It's already in the Best Practices: towards the bottom of https://nwbinspector.readthedocs.io/en/dev/best_practices/naming.html#neurodata-types (which will get cleaned up in the next week)
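A minimal sketch of the recursive variant, assuming hdmf/pynwb containers expose .children (which they do):

def check_names_for_slashes(container):
    # Recurse from the NWBFile down through all child containers.
    messages = []
    if "/" in container.name or "\\" in container.name:
        messages.append(f"Object name '{container.name}' contains a slash.")
    for child in getattr(container, "children", ()):
        messages.extend(check_names_for_slashes(child))
    return messages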
No response
Yes.
If a user provides a comma-separated scalar string for nwbfile.experimenter, the inspector should recommend that they instead store an array of experimenter names. Same for keywords and related publications.
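A minimal sketch of the proposed check (the function name is illustrative, not the actual nwbinspector API):

def check_experimenter_is_array(nwbfile):
    experimenter = nwbfile.experimenter
    # A scalar string containing commas suggests multiple names crammed into one field.
    if isinstance(experimenter, str) and "," in experimenter:
        return "Consider storing experimenter as an array of names rather than a comma-separated string."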
We were in the process of making something like this, inspired by the previously commented code, but it was also discovered in a recent manual inspection.
Basically, an entire row or column of a DynamicTable, or an entire axis of a TimeSeries, should not all be NaN.
There are some exceptions that we have to observe, such as x, y, z for the ElectrodeTable. Basically, any field that is required schematically but unknown in practice we'd have to skip. But axes that are optionally added, or certainly custom fields, should not be included if they are unknown.
import numpy as np

np.all(np.isnan(getattr(obj, "data")))  # flags a dataset whose values are entirely NaN
Yes.
The most common usage of cache actions is to save the environment used to run tests, so setup is not necessary on every run. The recommended approach is to key the cache on a hash of requirements.txt, so fresh installs only happen when the pinned versions change.
This will make the CI even faster and reduce matrix-strategy computational waste (so we could easily extend to 3.8 testing as well).
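For reference, a minimal sketch of the standard actions/cache pattern for pip, keyed on the hash of requirements.txt (path and key layout are illustrative):

- uses: actions/cache@v2
  with:
    # cache pip's download cache, keyed on OS + pinned requirements
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}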
For consistency with the other NWB repos it would be nice to rename the master branch to dev
Let's think about how to make this more extendable. How about this: we have a Check class, and this defines:
then nwbinspector simply loops through all neurodata objects, tests which checks apply to each object, and runs them.
NWBInspector would also have a method for adding or removing Checks.
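A minimal sketch of that shape (all names here are hypothetical, not the actual nwbinspector API):

class Check:
    neurodata_type = object  # the type of object this check applies to

    def applies_to(self, obj):
        return isinstance(obj, self.neurodata_type)

    def run(self, obj):
        raise NotImplementedError

class NWBInspector:
    def __init__(self, checks=()):
        self.checks = list(checks)

    def add_check(self, check):
        self.checks.append(check)

    def remove_check(self, check):
        self.checks.remove(check)

    def inspect(self, neurodata_objects):
        # loop through all neurodata objects, test which checks apply, and run them
        for obj in neurodata_objects:
            for check in self.checks:
                if check.applies_to(obj):
                    yield check.run(obj)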
Also, RoiResponseSeries.rois points to a PlaneSegmentation table
No response
Yes.
Warning: Unexpected input(s) 'yml', valid inputs are ['token', 'files', 'directory', 'flags', 'aws_curl_args', 'codecov_curl_args', 'commit_parent', 'env_vars', 'fail_ci_if_error', 'file', 'functionalities', 'gcov_args', 'gcov_executable', 'gcov_path_exclude', 'gcov_path_include', 'gcov_prefix', 'gcov_root_dir', 'move_coverage_to_trash', 'name', 'network_filter', 'override_branch', 'override_build', 'override_commit', 'override_pr', 'override_tag', 'path_to_write_report', 'root_dir', 'verbose', 'working-directory', 'xcode_derived_data', 'xcode_package']
The current Best Practices (ported from the old ones), https://nwbinspector.readthedocs.io/en/dev/best_practices/ecephys.html#anatomical-coordinates, indicate that for Allen Institute CCF v3, (+x = posterior, +y = inferior, +z = right).
Yes.
This line is cluttering up the inspect output:
Lines 111 to 112 in 7ceab62
I get why we are detecting a single unique value or two because it could be a boolean, but should we print anything for 3 or 4 unique values? What is the thinking behind this?
Right now you can save the report file with any suffix, but what is actually written is a .txt format with .rst sectioning. It would be nice to allow multiple formats to be saved, such as .md, .doc, etc.
.md would be easy, as the main difference is simply the section headings: #/##/### instead of underlines of =/-/~.
For .doc, it is proposed to use https://pypi.org/project/rstdoc/ ; I tried it a bit earlier today but didn't have any luck getting it to work through the CLI. It seems to use https://pypi.org/project/pandoc/ on the backend, so maybe just tapping into that library would be the way to go.
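A minimal sketch of the .md conversion idea, turning rst-style underlined headings into markdown prefixes (illustrative only; a bare horizontal rule of dashes would be misread):

import re

UNDERLINE_TO_PREFIX = {"=": "#", "-": "##", "~": "###"}

def rst_headings_to_md(text):
    out = []
    for line in text.splitlines():
        if out and re.fullmatch(r"([=\-~])\1+", line):
            # replace the previous line (the heading text) with a prefixed heading
            out[-1] = f"{UNDERLINE_TO_PREFIX[line[0]]} {out[-1]}"
        else:
            out.append(line)
    return "\n".join(out)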
Yes.
During the recent NWB-DANDI dev hackathon demonstration of the nwbinspector, I forgot to ask -- why does it produce/save the log into a file instead of just printing it to the screen (which could then be redirected to a file if desired, or there could be an option --output-file FILENAME to make it explicit)?
We've been running the inspector on DANDIsets using the ros3 mode for a while now, but a lot of it is still manually specified code for resolving the S3 paths. It's pretty easy to do automatically, though, so we might as well port over the code so we can just call a nice simple function directly each time.
The only concern is whether it might produce circular dependencies when the dandi-cli merges the integration with the inspector.
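For reference, a minimal sketch of the usual resolution pattern (as in the pynwb streaming docs); the asset path is a hypothetical placeholder:

from dandi.dandiapi import DandiAPIClient
from pynwb import NWBHDF5IO

with DandiAPIClient() as client:
    asset = client.get_dandiset("000004", "draft").get_asset_by_path(
        "sub-XXX/sub-XXX_ses-YYY.nwb"  # hypothetical path of the asset within the dandiset
    )
    s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)

with NWBHDF5IO(s3_url, mode="r", load_namespaces=True, driver="ros3") as io:
    nwbfile = io.read()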
Yes.
Also observing a very bizarre behavior: if there are actual Errors (like a ValueError) triggered within one of the test functions (see PR #106 that fixes the ones I'm seeing), they aren't grabbed within the try/except of run_checks and included in the report, but are instead raised to the console as true errors:
but if I don't include --config dandi, they behave as expected and go into that extra section of the final report.
From a console with working directory located in a folder that has dandiset 000004 (or ros3, alternatively)...
nwbinspector ./000004 --config dandi
and compare to...
nwbinspector ./000004
Screenshot in the first section; the particulars of the traceback don't really matter, it's the fact that it IS a traceback pushed to the console as standard out instead of getting grabbed by the report.
Operating system: Windows
Python version: 3.9
@satra has requested checks for more complete metadata associated with intracellular electrophysiology patch clamp experiments, including checking for cell and sample IDs.
No response
Yes, but I would need guidance.
It currently only tells which files, and how many files, failed due to validation and other errors. Basically, just copy the Exception traceback outside of those try/excepts.
I am working on a test for making sure that all time columns of a TimeIntervals object are ascending. This is a situation where a single object could throw multiple messages, one for each column. I think it would be useful to have the check functions yield the results, which would allow us to iterate over them.
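A minimal sketch of the generator style, assuming a pynwb TimeIntervals table (the check name is illustrative):

import numpy as np

def check_time_interval_columns_ascending(time_intervals):
    # yield one message per offending column instead of returning a single result
    for column in time_intervals.columns:
        if column.name.endswith("_time") and not np.all(np.diff(column.data[:]) >= 0):
            yield f"Column '{column.name}' of TimeIntervals '{time_intervals.name}' is not ascending."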
As per breakout session discussion, it may be worth it to re-examine the question of associating explicit error codes to check functions.
The main point is this could (a) simplify skipping references, and (b) allow us to change the underlying function name but keep the error code the same.
A companion to #76 but largely orthogonal.
Although we have not standardized much in dandi-cli either, and having looked into nwbinspector's internals, keeping in mind dandi/dandi-cli#924 (i.e. integration with dandi-cli), I think it would be great if internally all the "hints/warnings/whatever" were represented as some basic data structure (well -- a dict), so that from the CLI I could request just a dump of those hints in machine-readable form (e.g. JSON). The default "renderer" of the output could format them as it does now. And then dandi-cli code could use nwbinspector as a library and just get a list of those records for the path(s) it inquires about.
To some degree such an idea of "alternative output formats" can be seen in the dandi ls --format option, which provides a range of those. Also, datalad-inspired folks could use -f with any datalad command ;)
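A minimal sketch of what such a record could look like (field names are hypothetical, not the actual nwbinspector structure):

import json
from dataclasses import asdict, dataclass

@dataclass
class InspectorRecord:
    message: str
    importance: str
    check_name: str
    object_name: str
    file_path: str

record = InspectorRecord(
    message="Consider enabling compression when writing a large dataset.",
    importance="BEST_PRACTICE_SUGGESTION",
    check_name="check_dataset_compression",
    object_name="root",
    file_path="./000004/sub-1.nwb",
)
print(json.dumps(asdict(record), indent=2))  # machine-readable dump for dandi-cli to consume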
Optional flag or config for DANDI alteration of importance levels of certain checks.
Idea from Dorota: warn the user that the output report may appear different.
Some NWB files with extensions cannot be loaded purely with NWBHDF5IO(path, load_namespaces=True). They require the API classes to be loaded first, e.g., import ndx_franklab_novela.
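The workaround in code form (assuming the extension package is installed; the file path is illustrative):

import ndx_franklab_novela  # noqa: F401 -- registers the extension's API classes
from pynwb import NWBHDF5IO

with NWBHDF5IO("frank_lab_session.nwb", mode="r", load_namespaces=True) as io:
    nwbfile = io.read()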
The -j --njobs flag should control the number of simultaneous jobs that run the inspector over NWB files in parallel.
Yes.
This is showing up for many datasets:
BEST PRACTICE VIOLATION
-----------------------
1.1.1 NWBFile 'root' located in '/'
check_dataset_compression: Consider enabling compression when writing a large dataset.
It's not clear what dataset is causing the violation.
Should probably check the session start time has been set properly within a file; not just that it is a datetime object, but that it falls in a valid date range for performing neurophysiology studies. E.g., if session_start_time <= datetime(1970, 1, 1), print "Session start time may not be set to the true date of the recording."
In advance of the DANDI intention flag feature, need to add a check that subject age is specified. Without DANDI intention this is probably just a BEST_PRACTICE_SUGGESTION.
Also need to add a check that if subject age is specified, it is specified in ISO 8601 duration format. Without DANDI intention this is probably a BEST_PRACTICE_VIOLATION, since they did take the time to fill it out.
But both of these will get elevated to CRITICAL when run with intention=DANDI_UPLOAD or DANDI_PUBLISH.
An idea that came up during a recent meeting with some BIDS folk would be to have a nice user-friendly way of automatically fixing common issues that come up during a run of the NWBInspector.
Clearly, some types of Best Practice violations, such as wrong data orientation, no data compression/chunking, etc. would require a complete reconversion of the data, so are not applicable here.
Likewise, the overlay/sidecar project Ryan is working on would be required for editing the form of any metadata violations (like ISO 8601, subject naming conventions, etc.) for fields that had already been written to the file.
There is, however, one case where this sort of feature could be quite useful: missing metadata.
The feature would be as follows:
- A .yml file (with structure very similar to how we do this in the NWBCT) would be saved, and the user would be asked to review it one more time and confirm the values within it are accurate.
- The user would then feed the confirmed .yml file back in, which would result in all the missing metadata from the .yml being written to the NWBFile(s) in append mode, and the source .yml file being deleted to keep things clean (only if no errors occurred during the process).
Basically the idea is to 'fold' the .yml content into the NWBFile.
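A minimal sketch of the 'fold' step, assuming a hypothetical flat .yml layout like {"NWBFile": {"experiment_description": "..."}} (the real structure, cf. the NWBCT, is TBD):

import yaml
from pynwb import NWBHDF5IO

with open("missing_metadata.yml") as stream:
    missing = yaml.safe_load(stream)

with NWBHDF5IO("session.nwb", mode="a", load_namespaces=True) as io:
    nwbfile = io.read()
    # apply the confirmed values to the in-memory file, then write back in append mode
    for field, value in missing.get("NWBFile", {}).items():
        setattr(nwbfile, field, value)
    io.write(nwbfile)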
Yes.
Not yet sure about all the "features" such a config file should support, but it must allow specifying checks to ignore.
This is largely inspired by the config file in bids-validator, see https://github.com/bids-standard/bids-validator#configuration
It is needed since some such ignores are "per dataset", and thus it should be possible to specify what to ignore in a specific dataset (or dandiset) in DANDI land, instead of relying on the user to know what command line options to provide.
https://github.com/organizations/NeurodataWithoutBorders/settings/repository-defaults
Yes.
Originally requested in #37. I think that the implementation in #86 went on a different "tangent" from what my bids-validator-rotten mind had in mind: e.g., given the "stock" (provided by nwbinspector) dandi configuration, I want to disable some tests/checks, e.g. reacting to some benign check_data_orientation as given in the example of dandi/handbook#40.
Then I would have created some .nwbinspector.cfg or alike, which would have instructed nwbinspector to ignore such errors if it was run in this folder.
Also #86 has TODO: expose to command line -- was that done?
No.
Just need to add the little badge displays for version, tests, coverage, etc.
I'm getting this:
1.1.2 Subject 'subject' located in ''
check_subject_age: Subject is missing age.
even if the Subject has date_of_birth defined.
I think check_subject_age should be changed to check_subject_age_or_dob
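A minimal sketch of the suggested logic (the combined check name comes from the suggestion above):

def check_subject_age_or_dob(subject):
    # only report a missing age when date_of_birth is also absent
    if subject.age is None and subject.date_of_birth is None:
        return "Subject is missing age; specify either age or date_of_birth."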
Yes.
At times in the past, if a starting_time for some TimeSeries was unknown, we set it to np.nan to indicate this fact. I just wanted to raise the question here officially, to see if this is indeed encouraged or if the default behavior makes sense.
It's not stated in the best practices, and the default behavior seems to assume that it's always 0.0, but especially in cases where multiple data streams are incoming to the NWBFile this may not be correct due to synchronizing offsets.
No response
Yes.
@bendichter or @lawrence-mbf In the Best Practices for the NWBFile identifier, I instruct PyNWB users towards the UUID project. What is the supported analog in MatNWB, and do you have a link to a tutorial or other instructions that I can include?
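For reference, the PyNWB-side recommendation looks like this (a minimal sketch; the session details are placeholders):

from datetime import datetime
from uuid import uuid4

from pynwb import NWBFile

nwbfile = NWBFile(
    session_description="placeholder session",
    identifier=str(uuid4()),  # globally unique identifier, per the Best Practice
    session_start_time=datetime.now().astimezone(),
)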
Yes.
In advance of the DANDI intention flag feature, need to add a check that subject ID is specified. Without DANDI intention this is probably just a BEST_PRACTICE_SUGGESTION, but will get elevated to CRITICAL when run with intention=DANDI_UPLOAD or DANDI_PUBLISH.
Two files in the repo:
https://github.com/NeurodataWithoutBorders/nwbinspector/blob/dev/Legal.txt
and
https://github.com/NeurodataWithoutBorders/nwbinspector/blob/dev/license.txt
Do we need them both?
Testing out this new button on GitHub - when you go into the ...
options in a GitHub discussion and select 'raise in new issue'.
Anyway, this is super low priority ATM, but leaving it here for future reference. Especially now that we have parallelization functioning well, performance per job is less of an issue.
An analogous approach could also be made to improve the main iteration loops over the NWBFile objects vs. applicable neurodata_types in inspect_nwb.
From Ben's original posting:
How about this: I create an objects_by_attr dict so that the objects are only looped over once per recursion:
from collections import defaultdict

def hier_org(objects, levels, reverse=None):
    # Recursively organize objects into nested dicts keyed by the attributes in `levels`.
    if reverse is None:
        reverse = [False] * len(levels)
    unique_attrs = sorted(set(getattr(obj, levels[0]) for obj in objects), reverse=reverse[0])
    # group objects by the current attribute so they are only looped over once per recursion
    objects_by_attr = defaultdict(list)
    for obj in objects:
        objects_by_attr[getattr(obj, levels[0])].append(obj)
    if len(levels) > 1:
        return {attr: hier_org(objects_by_attr[attr], levels[1:], reverse[1:]) for attr in unique_attrs}
    return {attr: objects_by_attr[attr][0] for attr in unique_attrs}
# test
from collections import namedtuple
Obj = namedtuple("Obj", "a b c")
objects = []
for a in range(3):
for b in range(3):
for c in range(3):
objects.append(Obj(a=a, b=b, c=c))
hier_org(objects, ["a", "b", "c"], [True, False, True])
Originally posted by @bendichter in #89 (comment)