PEP validation tool based on jsonschema. See documentation for usage.
Validator for PEP objects
Home Page: http://eido.databio.org
License: BSD 2-Clause "Simplified" License
We should have a service that people can upload their PEP to and it will validate it for them.
Maybe even make this an API that a tool could call.
Ideally, maybe there's a schema repository.
We need to have a built-in filter that returns a processed PEP.
So, the input is, of course, a PEP. The output is a PEP that has run through sample modifiers and project modifiers -- that is, a processed PEP.
So, the user who calls the filter, if they want to write the output to files, needs to provide several output file paths; in fact, all the possible files that could go into a PEP:
See PR #38.
All the filter does is load the PEP with peppy, and then return the objects (as strings), which will have been processed.
@nleroy917 does this make sense to you?
How can I write a schema that will validate the existence of files specified in the subsample table? I almost always specify read1 and read2 in the subsample table because I rarely have just a single pair of FASTQ files per sample (they usually come from multiple lanes). The schema below (adapted from the examples page) passes validation without any FASTQ files present, so I assume that when the read1 and read2 attributes are arrays, it doesn't check for the existence of each item in the array?
description: Schema
imports:
  - http://schema.databio.org/pep/2.0.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_name:
          type: string
          description: "Name of the sample"
        read1:
          anyOf:
            - type: string
              description: "Fastq file for read 1"
            - type: array
              items:
                type: string
        read2:
          anyOf:
            - type: string
              description: "Fastq file for read 2 (for paired-end experiments)"
            - type: array
              items:
                type: string
      required_files:
        - read1
        - read2
      files:
        - read1
        - read2
      required:
        - sample_name
        - read1
        - read2
required:
  - samples
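For illustration, a check like the one required_files implies has to flatten array-valued attributes before testing existence. Here is a hypothetical sketch of such a helper (not eido's actual implementation), assuming sample attributes are plain strings or lists of strings:

```python
import os


def missing_required_files(sample, required_file_attrs):
    """Return the paths named by the given attributes that do not exist on disk.

    Array-valued attributes (e.g. multiple lanes of FASTQs) are flattened
    so that every element is checked individually.
    """
    missing = []
    for attr in required_file_attrs:
        value = sample.get(attr)
        if value is None:
            continue
        paths = value if isinstance(value, list) else [value]
        missing.extend(p for p in paths if not os.path.isfile(p))
    return missing
```

A sample whose read1 is a list of per-lane FASTQs would then fail validation if any single element of the list is absent.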
eido repository is > 20M, not sure why. probably need to rewrite history on it.
I am using Eido as a part of snakemake to enforce the PEP metadata format declaration for my work. I have written the required schemas following the tutorial; however, when I try to run validation, I get the following error:
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.9/site-packages/snakemake/__init__.py", line 593, in snakemake
workflow.include(
File "/opt/homebrew/lib/python3.9/site-packages/snakemake/workflow.py", line 1182, in include
exec(compile(code, snakefile.get_path_or_uri(), "exec"), self.globals)
File "/Users/g-kodes/Documents/Pharmacogenetic-Analysis-Pipeline/workflow/Snakefile", line 31, in <module>
# DEFINE CONTEXT-VARIABLES:
File "/opt/homebrew/lib/python3.9/site-packages/snakemake/workflow.py", line 1267, in pepschema
eido.validate_project(project=pep, schema=schema, exclude_case=True)
File "/opt/homebrew/lib/python3.9/site-packages/eido/validation.py", line 45, in validate_project
_validate_object(project_dict, preprocess_schema(schema_dict), exclude_case)
File "/opt/homebrew/lib/python3.9/site-packages/eido/schema.py", line 32, in preprocess_schema
"items" in schema_dict[PROP_KEY]["_samples"]
KeyError: '_samples'
I have tried importing my PEP using peppy, the indicated Python package, and it imports fine there. When I try to validate manually using the eido CLI, I receive the following error, which points to the same issue, so I don't think this is a snakemake or peppy issue:
Traceback (most recent call last):
File "/opt/homebrew/bin/eido", line 8, in <module>
sys.exit(main())
File "/opt/homebrew/lib/python3.9/site-packages/eido/cli.py", line 89, in main
validate_project(p, args.schema, args.exclude_case)
File "/opt/homebrew/lib/python3.9/site-packages/eido/validation.py", line 45, in validate_project
_validate_object(project_dict, preprocess_schema(schema_dict), exclude_case)
File "/opt/homebrew/lib/python3.9/site-packages/eido/schema.py", line 32, in preprocess_schema
"items" in schema_dict[PROP_KEY]["_samples"]
KeyError: '_samples'
As noted in databio/schema.databio.org@cfb577b:
for a PEP, any attributes with subsamples will be arrays, while attributes without will not... so this will be universal to just about all attributes in all PEPs.
Eido needs to accommodate this.
@stolarczyk is the mkdocs serving working for you like this? Here's what I get:
~/code/eido$ mkdocs serve
INFO - Building documentation...
WARNING - Config value: 'pypi_name'. Warning: Unrecognised configuration name: pypi_name
Running AutoDocumenter plugin
[Errno 2] No such file or directory: '/home/nsheff/code/eido/docs_jupyter/build/cli.md'
is this something local to my setup? I can build other sites without problem...
It would be great if eido could simply print the help when called on the command line with no arguments, instead of throwing an error.
Something like:
eido.validate(pep, schema):
...
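The no-argument behavior could be handled in the CLI entry point. A hedged sketch with argparse follows; the subcommands shown are placeholders, not eido's real parser:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="eido", description="Validate a PEP against a schema."
    )
    sub = parser.add_subparsers(dest="command")
    sub.add_parser("validate", help="Validate a PEP against a schema.")
    sub.add_parser("convert", help="Convert a PEP using an available filter.")
    return parser


def main(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.command is None:
        # No subcommand given: print the help text instead of raising.
        parser.print_help()
        return 1
    return 0
```

Running eido with no arguments would then show the usage text and exit nonzero rather than dumping a traceback.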
eido should accept a URL for the schema:
eido -p example/cfg.yaml -s http://schemas.databio.org/bed_maker.yaml
Reading sample annotations sheet: '/home/nsheff/code/bedmaker/example/samples_to_convert.csv'
Storing sample table from file '/home/nsheff/code/bedmaker/example/samples_to_convert.csv'
Traceback (most recent call last):
File "/home/nsheff/.local/bin/eido", line 8, in <module>
sys.exit(main())
File "/home/nsheff/.local/lib/python3.7/site-packages/eido/eido.py", line 199, in main
validate_project(p, args.schema, args.exclude_case)
File "/home/nsheff/.local/lib/python3.7/site-packages/eido/eido.py", line 126, in validate_project
schema_dict = _read_schema(schema=schema)
File "/home/nsheff/.local/lib/python3.7/site-packages/eido/eido.py", line 97, in _read_schema
raise TypeError("schema has to be either a dict or a path to an existing file")
TypeError: schema has to be either a dict or a path to an existing file
I've just done something similar in henge. Perhaps we should move this function into ubiquerg.
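Accepting a URL means distinguishing remote schemas from local paths before reading. A sketch of the dispatch, with hypothetical helper names (and the kind of generic utility that could indeed live in ubiquerg):

```python
from urllib.parse import urlparse


def is_url(maybe_url: str) -> bool:
    """True if the string looks like an http(s) URL rather than a local path."""
    parsed = urlparse(maybe_url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def schema_source_kind(schema: str) -> str:
    """Decide how a schema argument should be loaded (sketch only)."""
    return "remote" if is_url(schema) else "local"
```

The schema reader would then fetch remote sources before handing the text to the YAML parser, instead of raising the "dict or path" TypeError above.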
Is there a way to specify which amendments when using eido? I cannot find anything about this in the documentation, and I would like to be able to validate a project with a subset of its amendments against a specific schema.
It would also be nice to be able to specify amendments when converting a project to a different format using eido, or at least activate all amendments so that all information is present in the CSV output.
I'm trying to add to the docs a list of how eido differs from basic jsonschema. I don't think this is listed anywhere. Is this complete?
- required_input_attrs, which allows a schema author to specify which attributes must point to files that exist.
- input_attrs, which specifies which attributes point to files that may or may not exist.
- An imports section that lists schemas that should be validated prior to this schema (a more detailed description of importing can be found here: http://eido.databio.org/en/dev/demo/).
- Sample attributes of types "string", "number", "boolean" (if a schema restricts an attribute to type X, an array of Xs is also valid).
Since eido is really only relevant in the context of PEP, do you think I should consolidate the eido docs page into the new pep spec site? I mean, we've already basically started documenting eido at pep.databio.org... so as I'm trying to make this a self-contained docs page, I'm realizing that I'm duplicating a lot of info...
eido should also indicate missing optional attributes, the way looper used to do that.
...
...
...
../../../.cache/pypoetry/virtualenvs/looptrace-FyWUP9y7-py3.10/lib/python3.10/site-packages/pypiper/__init__.py:6: in <module>
from .manager import *
../../../.cache/pypoetry/virtualenvs/looptrace-FyWUP9y7-py3.10/lib/python3.10/site-packages/pypiper/manager.py:32: in <module>
from pipestat import PipestatError, PipestatManager
../../../.cache/pypoetry/virtualenvs/looptrace-FyWUP9y7-py3.10/lib/python3.10/site-packages/pipestat/__init__.py:8: in <module>
from .pipestat import (
../../../.cache/pypoetry/virtualenvs/looptrace-FyWUP9y7-py3.10/lib/python3.10/site-packages/pipestat/pipestat.py:24: in <module>
from .reports import HTMLReportBuilder, _create_stats_objs_summaries
../../../.cache/pypoetry/virtualenvs/looptrace-FyWUP9y7-py3.10/lib/python3.10/site-packages/pipestat/reports.py:13: in <module>
from eido import read_schema
../../../.cache/pypoetry/virtualenvs/looptrace-FyWUP9y7-py3.10/lib/python3.10/site-packages/eido/__init__.py:7: in <module>
from .conversion import *
../../../.cache/pypoetry/virtualenvs/looptrace-FyWUP9y7-py3.10/lib/python3.10/site-packages/eido/conversion.py:6: in <module>
from pkg_resources import iter_entry_points
E ModuleNotFoundError: No module named 'pkg_resources'
Hello guys,
firstly, thanks for your project and the effort you have put into it :)
After weeks of using PEP without any issues, I encountered this error in the last few days:
RemoteYAMLError in line 3 of /sc-scratch/sc-scratch-btg/olik_splicing_project/splice-prediction/snakemake_workflows/Snakemake_Main/Snakefile:
Could not load remote file: http://schema.databio.org/pep/2.1.0.yaml. Original exception: <HTTPError 403: 'Forbidden'>
File "/sc-scratch/sc-scratch-btg/olik_splicing_project/splice-prediction/snakemake_workflows/Snakemake_Main/Snakefile", line 3, in <module>
File "/home/kuechleo/mambaforge/envs/snakemake/lib/python3.10/site-packages/eido/validation.py", line 50, in validate_project
File "/home/kuechleo/mambaforge/envs/snakemake/lib/python3.10/site-packages/eido/schema.py", line 76, in read_schema
File "/home/kuechleo/mambaforge/envs/snakemake/lib/python3.10/site-packages/eido/schema.py", line 68, in _recursively_read_schemas
File "/home/kuechleo/mambaforge/envs/snakemake/lib/python3.10/site-packages/eido/schema.py", line 76, in read_schema
File "/home/kuechleo/mambaforge/envs/snakemake/lib/python3.10/site-packages/peppy/utils.py", line 122, in load_yaml
At first I thought it could be the settings of the cluster where I run the pipeline, but the same error occurs even on my home computer.
Sanity check with wget: wget http://schema.databio.org/pep/2.1.0.yaml
runs without any problems.
But in Python:
from urllib.request import urlopen
urlopen("http://schema.databio.org/pep/2.1.0.yaml")
returns urllib.error.HTTPError: HTTP Error 403: Forbidden.
Do you have ideas for the potential reason?
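One possible explanation, offered here as an assumption rather than a confirmed cause: some servers return 403 for urllib's default "Python-urllib/3.x" User-Agent while accepting wget's, which would match the wget-works-urllib-fails behavior. Setting an explicit User-Agent is a common workaround:

```python
from urllib.request import Request


def schema_request(url: str) -> Request:
    # Some servers reject urllib's default "Python-urllib/3.x" User-Agent
    # with 403; an explicit, browser-like value often gets through.
    return Request(url, headers={"User-Agent": "Mozilla/5.0 (schema-fetch)"})


req = schema_request("http://schema.databio.org/pep/2.1.0.yaml")
# urlopen(req) would then be used in place of urlopen(url)
```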
Also: is there a way to use a local PEP version config file, so that instead of pep_version: 2.1.0 in the PEP configuration file I can directly reference a local file?
My set-up:
After the peppy update, eido's CSV filter raises an error within the _convert_sample_to_row() method.
The validate_inputs function behaves differently and has more responsibilities than the other validation functions, which was dictated by our use case in looper. Instead of raising an exception, it records missing files and calculates their sizes. Here's an example:
validate_inputs(sample=p.samples[0], schema="schema.yaml")
1 input files missing, job input size was not calculated accurately
Out[5]:
{'missing': ['/Users/mstolarczyk/Desktop/testing/eido/file11A.txt'],
'required_inputs': {'/Users/mstolarczyk/Desktop/testing/eido/file11A.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file11B.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file12A.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file12B.txt'},
'all_inputs': {'/Users/mstolarczyk/Desktop/testing/eido/file11A.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file11B.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file12A.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file12B.txt'},
'input_file_size': 0.0}
So based on this output it is the responsibility of the client software to decide what to do in case one or more files are missing.
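A non-raising checker matching the output shape shown above could look roughly like this (a hypothetical stdlib-only re-implementation, not eido's code):

```python
import os


def check_inputs(sample: dict, required_attrs: list) -> dict:
    """Record missing files and total the sizes of present ones, without raising.

    Here required_inputs and all_inputs coincide; with additional optional
    input attributes the two sets would differ.
    """
    required = set()
    for attr in required_attrs:
        value = sample.get(attr, [])
        required.update(value if isinstance(value, list) else [value])
    missing = sorted(p for p in required if not os.path.isfile(p))
    size = sum(os.path.getsize(p) for p in required if os.path.isfile(p))
    return {
        "missing": missing,
        "required_inputs": required,
        "all_inputs": required,
        "input_file_size": size,
    }
```

The caller then decides whether a non-empty "missing" list is fatal, exactly as the comment above describes.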
Originally posted by @stolarczyk in #26 (comment)
Latest updates in pepkit allow users to use a PEP without config.yaml, just having sample_table.csv; however, when sample_table.csv does not have a sample_name column (e.g. it has sample instead), then the only way to specify this and allow correct validation from the command line is to have a config.yaml defining the new sample table index column name.
What we need is to make sure eido can validate sample_table.csv alone, without config.yaml, even when the sample table has a different index column name. The idea is to allow the user to pass this information in schema.yaml and make eido read it.
Possible options:
- Allow specifying a value in schema.yaml to be used as the index column. This would change the resolution hierarchy from:
1. Value specified in Project constructor
2. Value specified in Config
3. Default value (sample_name)
to:
1. Value specified in Project constructor
2. Value specified in Config
3. Value specified in schema.yaml
4. Default value (sample_name)
This way the schema will not be required for validation if there is a config or if the sample column name is the default.
Tasks to do here:
- validate sample_table.csv without config.yaml in case of a different index_column_name
Giving a bogus filter fails silently:
eido convert https://raw.githubusercontent.com/databio/bedshift_analysis/master/pep_main/project_config.yaml -f bogus_filter
It should instead say something like: bogus filter not found. options are: {filters}
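A sketch of the suggested behavior (the filter names and lookup-table shape here are illustrative, not eido's internals):

```python
def get_filter(name: str, filters: dict):
    """Look up a conversion filter, failing loudly with the available options."""
    try:
        return filters[name]
    except KeyError:
        raise ValueError(
            f"filter '{name}' not found. options are: {sorted(filters)}"
        ) from None


# placeholder registry standing in for the real plugin-discovered filters
FILTERS = {"basic": None, "csv": None, "yaml": None}
```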
Per discussion, we would like eido to use the same attribute names as pipestat (project vs config) while still maintaining backwards compatibility.
Originally posted by @donaldcampbelljr in pepkit/pipestat#85 (comment)
Hi guys,
thanks for your tool! The concept is really cool and a handy feature :)
I have decided to integrate this tool into my Snakemake pipeline.
However, I have already stumbled multiple times over the issue of metadata files failing eido validation, where the error messages returned by the tool are not helpful at all.
Thus, every time I face such an error, I have to invest a lot of time to figure out the reason for the failing validation.
Here is a minimal reproducible example:
pep_schema.yaml
description: Minimal example
imports:
  - http://schema.databio.org/pep/2.1.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_directory:
          type: string
          pattern: "^/\\S+$|None"
input_no_error.csv
sample_name,sample_directory
test,/testung
input_error.csv
sample_name,sample_directory
test,testung
Then the output:
# No error
$ eido validate input_no_error.csv -s pep_schema.yaml
Validation successful
# Error
$ eido validate input_error.csv -s pep_schema.yaml
Traceback (most recent call last):
File "/Users/oliverkuchler/miniforge3/envs/snakemake7/bin/eido", line 10, in <module>
sys.exit(main())
^^^^^^
File "/Users/oliverkuchler/miniforge3/envs/snakemake7/lib/python3.12/site-packages/eido/cli.py", line 159, in main
validator(*arguments)
File "/Users/oliverkuchler/miniforge3/envs/snakemake7/lib/python3.12/site-packages/eido/validation.py", line 73, in validate_project
_validate_object(
File "/Users/oliverkuchler/miniforge3/envs/snakemake7/lib/python3.12/site-packages/eido/validation.py", line 45, in _validate_object
instance_name = error.instance[sample_name_colname]
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
Are there any plans to improve the output in the future?
It would be very helpful to at least see which input causes the validation to fail.
If peppy.Project objects are no longer attmap, then sample objects will not validate as objects, because they are MutableMapping, which doesn't validate by default with jsonschema due to the type customization:
https://python-jsonschema.readthedocs.io/en/stable/validate/#validating-types
It's the same as this issue: python-jsonschema/jsonschema#592
The answer to that is that we have to customize the validator so that MutableMapping will pass as type object.
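The type-checker customization described here can be expressed with jsonschema's public API; a minimal sketch, assuming jsonschema >= 3 (where TYPE_CHECKER and validators.extend are available):

```python
from collections.abc import MutableMapping

from jsonschema import Draft7Validator, validators

# Redefine "object" so any MutableMapping, not just dict, passes type checks.
mapping_checker = Draft7Validator.TYPE_CHECKER.redefine(
    "object",
    lambda checker, instance: isinstance(instance, MutableMapping),
)
MappingValidator = validators.extend(Draft7Validator, type_checker=mapping_checker)
```

A validator built this way accepts any mapping-like sample object wherever a schema says type: object, while the stock Draft7Validator rejects non-dict mappings.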
The current eido dev docs say:
In the above example, we listed read1 and read2 attributes as required. This will enforce that these attributes must be defined on the samples, but for this example, this is not enough -- these also must point to files that exist. Checking for files is outside the scope of JSON Schema, which only validates JSON documents, so eido extends JSON Schema with the ability to specify which attributes should point to files.
Eido provides two ways to do it: input_attrs and required_input_attrs. The basic input_attrs is simply used to specify which attributes point to files, which are not required to exist. This is useful for tools that want to calculate the total size of any provided inputs, for example. The required_input_attrs list specifies that the attributes point to files that must exist, otherwise the PEP doesn't validate. Here's an example of specifying an optional and required input attribute:
What should we name these?
required_input_attrs
input_attrs
@stolarczyk says the current implementation actually uses:
required_inputs_attr
all_inputs_attr
Today talking with some nf-core Nextflow developers, it came up that it would be useful to be able to output a processed PEP, either in CSV format or in yaml/json format.
So, think of it as a PEP (yaml+csv) -> YAML converter... it's kind of a "filter" that would read the PEP and output it in the other format. This is basically what looper does when it creates the sample yaml files, which can be modulated with looper plugins. The difference here I guess is that we don't need all the rest of the looper capability -- just the printing of sample yaml files, perhaps all in one file. We need just some command-line tool that would output the PEP in YAML format.
I think this might make sense to have as part of eido, since it already provides a command-line interface... And in fact, could go to the point of, maybe, extracting out the looper sample-writing capabilities to put into eido. In that case, the plugin system may actually be useful here.
@stolarczyk thoughts?
I'd like to get a list of errors, for example, like this:
https://python-jsonschema.readthedocs.io/en/stable/errors/
but right now there's no real way to do this; could we restructure to allow this kind of construct instead of just throwing the exceptions?
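jsonschema already supports collecting all errors via iter_errors, so a restructured API could return them instead of raising. A sketch of such a wrapper (the function name is hypothetical):

```python
from jsonschema import Draft7Validator


def collect_validation_errors(instance, schema):
    """Return every validation error message instead of raising on the first."""
    validator = Draft7Validator(schema)
    return [error.message for error in validator.iter_errors(instance)]
```

The CLI could then print the list, while library callers inspect it programmatically.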
The convert_project function will successfully convert a PEP given a valid filter; however, it runs a sys.exit(0), which seems like odd behavior. We are using this for the pephub server and it results in server crashes.
Attempts to circumvent this by directly calling run_filter only display the conversion result on stdout; is it possible to return the result as YAML instead of returning None?
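As a stopgap until the conversion result is returned directly, a caller can capture what a print-only filter writes. A sketch, assuming the filter prints its result to stdout:

```python
import contextlib
import io


def filter_to_string(filter_fn, *args, **kwargs) -> str:
    """Capture what a print-only filter writes to stdout and return it."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        filter_fn(*args, **kwargs)
    return buffer.getvalue()
```

This sidesteps the sys.exit problem only if the exit happens in convert_project, not in the filter itself; the real fix is for the library to return the string.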
It looks like the use_case parameter in _validate_object is unused, and that some other functions have that parameter simply to pass it through to _validate_object. @nsheff is this a known issue that's been retained for backward- or cross-project compatibility, or can we try removing it?
A CLI to validate a PEP against a schema.
Maybe something like: eido --pep pep-config.yaml --schema schema.yaml
$ eido validate project_config.yml -s schema.yml
Traceback (most recent call last):
File "micromamba/envs/PEP/bin/eido", line 10, in <module>
sys.exit(main())
^^^^^^
File "micromamba/envs/PEP/lib/python3.12/site-packages/eido/cli.py", line 159, in main
validator(*arguments)
TypeError: validate_project() takes 2 positional arguments but 3 were given
Install eido 0.1.9
If you have a sample with an attribute with multiple values, the CSV writer will write them into the CSV in Python list form:
eido convert https://raw.githubusercontent.com/pepkit/nf-core-pep/master/samplesheet_test.csv --st-index sample -f csv
Result:
Found 2 samples with non-unique names: {'WT_REP1', 'RAP1_UNINDUCED_REP2'}. Attempting to auto-merge.
Running plugin csv
sample,sample,fastq_1,fastq_2,strandedness
WT_REP2,WT_REP2,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz,reverse
RAP1_UNINDUCED_REP1,RAP1_UNINDUCED_REP1,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz,,reverse
RAP1_IAA_30M_REP1,RAP1_IAA_30M_REP1,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz,reverse
WT_REP1,WT_REP1,"['https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz', 'https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz']","['https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz', 'https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz']",reverse
RAP1_UNINDUCED_REP2,RAP1_UNINDUCED_REP2,"['https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz', 'https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz']",,reverse
Might make more sense to do this as multiple rows per sample, for the purposes of the CSV filter.
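The multiple-rows alternative could work roughly like this (a hypothetical helper, not eido's current behavior): expand each list-valued attribute into one row per element, subsample-style:

```python
def sample_to_rows(sample: dict) -> list:
    """Expand list-valued attributes into one row per element,
    padding shorter lists with empty strings (as in a subsample table)."""
    n_rows = max(
        (len(v) for v in sample.values() if isinstance(v, list)), default=1
    )
    rows = []
    for i in range(n_rows):
        row = {}
        for key, value in sample.items():
            if isinstance(value, list):
                row[key] = value[i] if i < len(value) else ""
            else:
                row[key] = value
        rows.append(row)
    return rows
```

A sample like WT_REP1 above would then emit two CSV rows, one per lane, instead of a stringified Python list in a single cell.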
How can I get the list of available output formats?
nsheff@zither:~/code/bedshift_paper$ eido convert --help
usage: eido convert [-h] [--st-index ST_INDEX] -f FORMAT [-n SAMPLE_NAME [SAMPLE_NAME ...]]
[-a ARGS [ARGS ...]]
PEP
Convert a PEP using an available filter
positional arguments:
PEP Path to a PEP configuration file in yaml format.
optional arguments:
-h, --help show this help message and exit
--st-index ST_INDEX Sample table index to use, samples are identified by 'sample_name' by default.
-f FORMAT, --format FORMAT
Path to a PEP schema file in yaml format.
-n SAMPLE_NAME [SAMPLE_NAME ...], --sample-name SAMPLE_NAME [SAMPLE_NAME ...]
Name of the samples to inspect.
-a ARGS [ARGS ...], --args ARGS [ARGS ...]
Provide arguments to the filter function (e.g. arg1=val1 arg2=val2).
Also, are the -f option's docs here correct? I think that's the wrong help string.
There are some issues in looper that are related to this, since right now the pipeline interface is basically serving this purpose (poorly)...
How JK does something similar in snakemake:
https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/blob/master/rules/common.smk#L11
Docs are not building on dev because:
from peppy.utils import load_yaml
ImportError: cannot import name 'load_yaml' from 'peppy.utils' (/home/docs/checkouts/readthedocs.org/user_builds/eido/envs/dev/lib/python3.7/site-packages/peppy/utils.py)
Right now, the validation functions print notices and then raise exceptions. Instead, what if raising the exception happened in the calling tool?
The validation function would collect and format the errors. Then the calling function should decide what to do with them. For example, in the case of the CLI, it should print them out. But in other cases, we may just want a list of the validation errors. So, should the printing move from the validation function to the calling context?
Is there a schema for a generic PEP? It may be nice to validate PEPs generally.