PEP validation tool based on jsonschema. See documentation for usage.
Validator for PEP objects
Home Page: http://eido.databio.org
License: BSD 2-Clause "Simplified" License
We should have a service that people can upload their PEP to and it will validate it for them.
Maybe even make this an API that a tool could call.
Ideally, maybe there's a schema repository.
We need to have a built-in filter that returns a processed PEP.
So, the input is, of course, a PEP. The output is a PEP that has run through sample modifiers and project modifiers -- that is, a processed PEP.
So, the user who calls the filter, if they want to write the output to files, needs to provide several output file paths; in fact, all the possible files that could go into a PEP:
See PR #38.
All the filter does is load the PEP with peppy, and then return the objects (as strings), which will have been processed.
@nleroy917 does this make sense to you?
How can I write a schema that will validate the existence of files specified in the subsample table? I almost always specify read1 and read2 in the subsample table because I rarely have just a single pair of FASTQ files per sample (they usually come from multiple lanes). The schema below (adapted from the examples page) passes validation without any FASTQ files present, so I assume that when the read1 and read2 attributes are arrays, it doesn't check for the existence of each item in the array?
description: Schema
imports:
  - http://schema.databio.org/pep/2.0.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_name:
          type: string
          description: "Name of the sample"
        read1:
          anyOf:
            - type: string
              description: "Fastq file for read 1"
            - type: array
              items:
                type: string
        read2:
          anyOf:
            - type: string
              description: "Fastq file for read 2 (for paired-end experiments)"
            - type: array
              items:
                type: string
      required_files:
        - read1
        - read2
      files:
        - read1
        - read2
      required:
        - sample_name
        - read1
        - read2
required:
  - samples
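For illustration, a check like the one required_files implies has to flatten array-valued attributes before testing existence. Here is a hypothetical sketch of such a helper (not eido's actual implementation), assuming sample attributes are plain strings or lists of strings:

```python
import os


def missing_required_files(sample, required_file_attrs):
    """Return the paths named by the given attributes that do not exist on disk.

    Array-valued attributes (e.g. multiple lanes of FASTQs) are flattened
    so that every element is checked individually.
    """
    missing = []
    for attr in required_file_attrs:
        value = sample.get(attr)
        if value is None:
            continue
        paths = value if isinstance(value, list) else [value]
        missing.extend(p for p in paths if not os.path.isfile(p))
    return missing
```

A sample whose read1 is a list of per-lane FASTQs would then fail validation if any single element of the list is absent.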
eido repository is > 20M, not sure why. probably need to rewrite history on it.
I am using Eido as a part of snakemake to enforce the PEP metadata format declaration for my work. I have written the required schemas following the tutorial; however, when I try to run validation, I get the following error:
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.9/site-packages/snakemake/__init__.py", line 593, in snakemake
workflow.include(
File "/opt/homebrew/lib/python3.9/site-packages/snakemake/workflow.py", line 1182, in include
exec(compile(code, snakefile.get_path_or_uri(), "exec"), self.globals)
File "/Users/g-kodes/Documents/Pharmacogenetic-Analysis-Pipeline/workflow/Snakefile", line 31, in <module>
# DEFINE CONTEXT-VARIABLES:
File "/opt/homebrew/lib/python3.9/site-packages/snakemake/workflow.py", line 1267, in pepschema
eido.validate_project(project=pep, schema=schema, exclude_case=True)
File "/opt/homebrew/lib/python3.9/site-packages/eido/validation.py", line 45, in validate_project
_validate_object(project_dict, preprocess_schema(schema_dict), exclude_case)
File "/opt/homebrew/lib/python3.9/site-packages/eido/schema.py", line 32, in preprocess_schema
"items" in schema_dict[PROP_KEY]["_samples"]
KeyError: '_samples'
I have tried importing my PEP using peppy, the indicated Python package, and it imports fine there. When I try to validate manually using the eido CLI, I receive the following error, which points to the same issue, so I don't think this is a snakemake or peppy issue:
Traceback (most recent call last):
File "/opt/homebrew/bin/eido", line 8, in <module>
sys.exit(main())
File "/opt/homebrew/lib/python3.9/site-packages/eido/cli.py", line 89, in main
validate_project(p, args.schema, args.exclude_case)
File "/opt/homebrew/lib/python3.9/site-packages/eido/validation.py", line 45, in validate_project
_validate_object(project_dict, preprocess_schema(schema_dict), exclude_case)
File "/opt/homebrew/lib/python3.9/site-packages/eido/schema.py", line 32, in preprocess_schema
"items" in schema_dict[PROP_KEY]["_samples"]
KeyError: '_samples'
As noted in databio/schema.databio.org@cfb577b:
for a PEP, any attributes with subsamples will be arrays, while attributes without will not... so this will be universal to just about all attributes in all PEPs.
Eido needs to accommodate this.
@stolarczyk is the mkdocs serving working for you like this? Here's what I get:
~/code/eido$ mkdocs serve
INFO - Building documentation...
WARNING - Config value: 'pypi_name'. Warning: Unrecognised configuration name: pypi_name
Running AutoDocumenter plugin
[Errno 2] No such file or directory: '/home/nsheff/code/eido/docs_jupyter/build/cli.md'
is this something local to my setup? I can build other sites without problem...
It would be great if eido could simply print the help when called on the command line with no arguments, instead of throwing an error.
Something like:
eido.validate(pep, schema):
...
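The no-argument behavior could be handled in the CLI entry point. A hedged sketch with argparse follows; the subcommands shown are placeholders, not eido's real parser:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="eido", description="Validate a PEP against a schema."
    )
    sub = parser.add_subparsers(dest="command")
    sub.add_parser("validate", help="Validate a PEP against a schema.")
    sub.add_parser("convert", help="Convert a PEP using an available filter.")
    return parser


def main(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.command is None:
        # No subcommand given: print the help text instead of raising.
        parser.print_help()
        return 1
    return 0
```

Running eido with no arguments would then show the usage text and exit nonzero rather than dumping a traceback.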
eido should accept a URL for the schema:
eido -p example/cfg.yaml -s http://schemas.databio.org/bed_maker.yaml
Reading sample annotations sheet: '/home/nsheff/code/bedmaker/example/samples_to_convert.csv'
Storing sample table from file '/home/nsheff/code/bedmaker/example/samples_to_convert.csv'
Traceback (most recent call last):
File "/home/nsheff/.local/bin/eido", line 8, in <module>
sys.exit(main())
File "/home/nsheff/.local/lib/python3.7/site-packages/eido/eido.py", line 199, in main
validate_project(p, args.schema, args.exclude_case)
File "/home/nsheff/.local/lib/python3.7/site-packages/eido/eido.py", line 126, in validate_project
schema_dict = _read_schema(schema=schema)
File "/home/nsheff/.local/lib/python3.7/site-packages/eido/eido.py", line 97, in _read_schema
raise TypeError("schema has to be either a dict or a path to an existing file")
TypeError: schema has to be either a dict or a path to an existing file
I've just done something similar in henge. Perhaps we should move this function into ubiquerg.
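Accepting a URL means distinguishing remote schemas from local paths before reading. A sketch of the dispatch, with hypothetical helper names (and the kind of generic utility that could indeed live in ubiquerg):

```python
from urllib.parse import urlparse


def is_url(maybe_url: str) -> bool:
    """True if the string looks like an http(s) URL rather than a local path."""
    parsed = urlparse(maybe_url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def schema_source_kind(schema: str) -> str:
    """Decide how a schema argument should be loaded (sketch only)."""
    return "remote" if is_url(schema) else "local"
```

The schema reader would then fetch remote sources before handing the text to the YAML parser, instead of raising the "dict or path" TypeError above.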
Is there a way to specify which amendments when using eido? I cannot find anything about this in the documentation, and I would like to be able to validate a project with a subset of its amendments against a specific schema.
It would also be nice to be able to specify amendments when converting a project to a different format using eido, or at least activate all amendments so that all information is present in the CSV output.
I'm trying to add to the docs a list of how eido differs from basic jsonschema. I don't think this is listed anywhere. Is this complete?
- required_input_attrs, which allows a schema author to specify which attributes must point to files that exist.
- input_attrs, which specifies which attributes point to files that may or may not exist.
- An imports section that lists schemas that should be validated prior to this schema (a more detailed description of importing can be found here: http://eido.databio.org/en/dev/demo/).
- Sample attributes of types "string", "number", "boolean" (if a schema restricts an attribute to type X, an array of Xs is also valid).
Since eido is really only relevant in the context of PEP, do you think I should consolidate the eido docs page into the new pep spec site? I mean, we've already basically started documenting eido at pep.databio.org... so as I'm trying to make this a self-contained docs page, I'm realizing that I'm duplicating a lot of info...
eido should also indicate missing optional attributes, the way looper used to do that.
...
...
...
../../../.cache/pypoetry/virtualenvs/looptrace-FyWUP9y7-py3.10/lib/python3.10/site-packages/pypiper/__init__.py:6: in <module>
from .manager import *
../../../.cache/pypoetry/virtualenvs/looptrace-FyWUP9y7-py3.10/lib/python3.10/site-packages/pypiper/manager.py:32: in <module>
from pipestat import PipestatError, PipestatManager
../../../.cache/pypoetry/virtualenvs/looptrace-FyWUP9y7-py3.10/lib/python3.10/site-packages/pipestat/__init__.py:8: in <module>
from .pipestat import (
../../../.cache/pypoetry/virtualenvs/looptrace-FyWUP9y7-py3.10/lib/python3.10/site-packages/pipestat/pipestat.py:24: in <module>
from .reports import HTMLReportBuilder, _create_stats_objs_summaries
../../../.cache/pypoetry/virtualenvs/looptrace-FyWUP9y7-py3.10/lib/python3.10/site-packages/pipestat/reports.py:13: in <module>
from eido import read_schema
../../../.cache/pypoetry/virtualenvs/looptrace-FyWUP9y7-py3.10/lib/python3.10/site-packages/eido/__init__.py:7: in <module>
from .conversion import *
../../../.cache/pypoetry/virtualenvs/looptrace-FyWUP9y7-py3.10/lib/python3.10/site-packages/eido/conversion.py:6: in <module>
from pkg_resources import iter_entry_points
E ModuleNotFoundError: No module named 'pkg_resources'
Hello guys,
firstly, thanks for your project and the effort you have put into it :)
After weeks of using PEP without any issues, I encountered this error in the last few days:
RemoteYAMLError in line 3 of /sc-scratch/sc-scratch-btg/olik_splicing_project/splice-prediction/snakemake_workflows/Snakemake_Main/Snakefile:
Could not load remote file: http://schema.databio.org/pep/2.1.0.yaml. Original exception: <HTTPError 403: 'Forbidden'>
File "/sc-scratch/sc-scratch-btg/olik_splicing_project/splice-prediction/snakemake_workflows/Snakemake_Main/Snakefile", line 3, in <module>
File "/home/kuechleo/mambaforge/envs/snakemake/lib/python3.10/site-packages/eido/validation.py", line 50, in validate_project
File "/home/kuechleo/mambaforge/envs/snakemake/lib/python3.10/site-packages/eido/schema.py", line 76, in read_schema
File "/home/kuechleo/mambaforge/envs/snakemake/lib/python3.10/site-packages/eido/schema.py", line 68, in _recursively_read_schemas
File "/home/kuechleo/mambaforge/envs/snakemake/lib/python3.10/site-packages/eido/schema.py", line 76, in read_schema
File "/home/kuechleo/mambaforge/envs/snakemake/lib/python3.10/site-packages/peppy/utils.py", line 122, in load_yaml
At first I thought it could be the settings of the cluster where I run the pipeline, but the same error occurs even on my home computer.
Sanity check with wget: wget http://schema.databio.org/pep/2.1.0.yaml
runs without any problems.
But in Python:
from urllib.request import urlopen
urlopen("http://schema.databio.org/pep/2.1.0.yaml")
returns urllib.error.HTTPError: HTTP Error 403: Forbidden.
Do you have ideas for the potential reason?
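One possible explanation, offered here as an assumption rather than a confirmed cause: some servers return 403 for urllib's default "Python-urllib/3.x" User-Agent while accepting wget's, which would match the wget-works-urllib-fails behavior. Setting an explicit User-Agent is a common workaround:

```python
from urllib.request import Request


def schema_request(url: str) -> Request:
    # Some servers reject urllib's default "Python-urllib/3.x" User-Agent
    # with 403; an explicit, browser-like value often gets through.
    return Request(url, headers={"User-Agent": "Mozilla/5.0 (schema-fetch)"})


req = schema_request("http://schema.databio.org/pep/2.1.0.yaml")
# urlopen(req) would then be used in place of urlopen(url)
```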
Also: is there a way to use a local PEP version config file, so that instead of pep_version: 2.1.0 in the PEP configuration file I can directly reference a local file?
My set-up:
After the peppy update, eido's CSV filter raises an error within the _convert_sample_to_row() method.
The validate_inputs function behaves differently and has more responsibilities than the other validation functions, which was dictated by our use case in looper. Instead of raising an exception, it records missing files and calculates their sizes. Here's an example:
validate_inputs(sample=p.samples[0], schema="schema.yaml")
1 input files missing, job input size was not calculated accurately
Out[5]:
{'missing': ['/Users/mstolarczyk/Desktop/testing/eido/file11A.txt'],
'required_inputs': {'/Users/mstolarczyk/Desktop/testing/eido/file11A.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file11B.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file12A.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file12B.txt'},
'all_inputs': {'/Users/mstolarczyk/Desktop/testing/eido/file11A.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file11B.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file12A.txt',
'/Users/mstolarczyk/Desktop/testing/eido/file12B.txt'},
'input_file_size': 0.0}
So based on this output it is the responsibility of the client software to decide what to do in case one or more files are missing.
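A non-raising checker matching the output shape shown above could look roughly like this (a hypothetical stdlib-only re-implementation, not eido's code):

```python
import os


def check_inputs(sample: dict, required_attrs: list) -> dict:
    """Record missing files and total the sizes of present ones, without raising.

    Here required_inputs and all_inputs coincide; with additional optional
    input attributes the two sets would differ.
    """
    required = set()
    for attr in required_attrs:
        value = sample.get(attr, [])
        required.update(value if isinstance(value, list) else [value])
    missing = sorted(p for p in required if not os.path.isfile(p))
    size = sum(os.path.getsize(p) for p in required if os.path.isfile(p))
    return {
        "missing": missing,
        "required_inputs": required,
        "all_inputs": required,
        "input_file_size": size,
    }
```

The caller then decides whether a non-empty "missing" list is fatal, exactly as the comment above describes.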
Originally posted by @stolarczyk in #26 (comment)
Latest updates in pepkit allow users to use a PEP without config.yaml, just having sample_table.csv; however, when sample_table.csv does not have a sample_name column (e.g. it has sample instead), then the only way to specify this and allow correct validation from the command line is to have a config.yaml defining the new sample table index column name.
What we need is to make sure eido can validate sample_table.csv alone, without config.yaml, even when the sample table has a different index column name. The idea is to allow the user to pass this information in schema.yaml and make eido read it.
Possible options:
- Allow specifying a value in schema.yaml to be used as the index column. This would change the resolution hierarchy from:
1. Value specified in Project constructor
2. Value specified in Config
3. Default value (sample_name)
to:
1. Value specified in Project constructor
2. Value specified in Config
3. Value specified in schema.yaml
4. Default value (sample_name)
This way the schema will not be required for validation if there is a config or if the sample column name is the default.
Tasks to do here:
- validate sample_table.csv without config.yaml in case of a different index_column_name
Giving a bogus filter fails silently:
eido convert https://raw.githubusercontent.com/databio/bedshift_analysis/master/pep_main/project_config.yaml -f bogus_filter
It should instead say something like: bogus filter not found. options are: {filters}
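A sketch of the suggested behavior (the filter names and lookup-table shape here are illustrative, not eido's internals):

```python
def get_filter(name: str, filters: dict):
    """Look up a conversion filter, failing loudly with the available options."""
    try:
        return filters[name]
    except KeyError:
        raise ValueError(
            f"filter '{name}' not found. options are: {sorted(filters)}"
        ) from None


# placeholder registry standing in for the real plugin-discovered filters
FILTERS = {"basic": None, "csv": None, "yaml": None}
```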
Per discussion, we would like eido to use the same attribute names as pipestat (project vs config) while still maintaining backwards compatibility.
Originally posted by @donaldcampbelljr in pepkit/pipestat#85 (comment)
Hi guys,
thanks for your tool! The concept is really cool and a handy feature :)
I have decided to integrate this tool into my Snakemake pipeline.
However, I have already stumbled multiple times over the issue of metadata files failing eido validation, where the error messages returned by the tool are not helpful at all.
Thus, every time I face such an error, I have to invest a lot of time to figure out the reason for the failing validation.
Here is a minimal reproducible example:
pep_schema.yaml
description: Minimal example
imports:
  - http://schema.databio.org/pep/2.1.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_directory:
          type: string
          pattern: "^/\\S+$|None"
input_no_error.csv
sample_name,sample_directory
test,/testung
input_error.csv
sample_name,sample_directory
test,testung
Then the output:
# No error
$ eido validate input_no_error.csv -s pep_schema.yaml
Validation successful
# Error
$ eido validate input_error.csv -s pep_schema.yaml
Traceback (most recent call last):
File "/Users/oliverkuchler/miniforge3/envs/snakemake7/bin/eido", line 10, in <module>
sys.exit(main())
^^^^^^
File "/Users/oliverkuchler/miniforge3/envs/snakemake7/lib/python3.12/site-packages/eido/cli.py", line 159, in main
validator(*arguments)
File "/Users/oliverkuchler/miniforge3/envs/snakemake7/lib/python3.12/site-packages/eido/validation.py", line 73, in validate_project
_validate_object(
File "/Users/oliverkuchler/miniforge3/envs/snakemake7/lib/python3.12/site-packages/eido/validation.py", line 45, in _validate_object
instance_name = error.instance[sample_name_colname]
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
Are there any plans to improve the output in the future?
It would be very helpful to at least see which input causes the validation to fail.
If peppy.Project objects are no longer attmap, then sample objects will not validate as objects, because they are MutableMapping, which doesn't validate by default with jsonschema due to the type customization:
https://python-jsonschema.readthedocs.io/en/stable/validate/#validating-types
It's the same as this issue: python-jsonschema/jsonschema#592
The answer to that is that we have to customize the validator so that MutableMapping will pass as type object.
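The type-checker customization described here can be expressed with jsonschema's public API; a minimal sketch, assuming jsonschema >= 3 (where TYPE_CHECKER and validators.extend are available):

```python
from collections.abc import MutableMapping

from jsonschema import Draft7Validator, validators

# Redefine "object" so any MutableMapping, not just dict, passes type checks.
mapping_checker = Draft7Validator.TYPE_CHECKER.redefine(
    "object",
    lambda checker, instance: isinstance(instance, MutableMapping),
)
MappingValidator = validators.extend(Draft7Validator, type_checker=mapping_checker)
```

A validator built this way accepts any mapping-like sample object wherever a schema says type: object, while the stock Draft7Validator rejects non-dict mappings.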
The current eido dev docs say:
In the above example, we listed read1 and read2 attributes as required. This will enforce that these attributes must be defined on the samples, but for this example, this is not enough -- these also must point to files that exist. Checking for files is outside the scope of JSON Schema, which only validates JSON documents, so eido extends JSON Schema with the ability to specify which attributes should point to files.
Eido provides two ways to do it: input_attrs and required_input_attrs. The basic input_attrs is simply used to specify which attributes point to files, which are not required to exist. This is useful for tools that want to calculate the total size of any provided inputs, for example. The required_input_attrs list specifies that the attributes point to files that must exist, otherwise the PEP doesn't validate. Here's an example of specifying an optional and required input attribute:
What should we name these?
required_input_attrs
input_attrs
@stolarczyk says the current implementation actually uses:
required_inputs_attr
all_inputs_attr
Today talking with some nf-core Nextflow developers, it came up that it would be useful to be able to output a processed PEP, either in CSV format or in yaml/json format.
So, think of it as a PEP (yaml+csv) -> YAML converter... it's kind of a "filter" that would read the PEP and output it in the other format. This is basically what looper does when it creates the sample yaml files, which can be modulated with looper plugins. The difference here I guess is that we don't need all the rest of the looper capability -- just the printing of sample yaml files, perhaps all in one file. We need just some command-line tool that would output the PEP in YAML format.
I think this might make sense to have as part of eido, since it already provides a command-line interface... And in fact, could go to the point of, maybe, extracting out the looper sample-writing capabilities to put into eido. In that case, the plugin system may actually be useful here.
@stolarczyk thoughts?
I'd like to get a list of errors, for example, like this:
https://python-jsonschema.readthedocs.io/en/stable/errors/
but right now there's no real way to do this; could we restructure to allow this kind of construct instead of just throwing the exceptions?
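jsonschema already supports collecting all errors via iter_errors, so a restructured API could return them instead of raising. A sketch of such a wrapper (the function name is hypothetical):

```python
from jsonschema import Draft7Validator


def collect_validation_errors(instance, schema):
    """Return every validation error message instead of raising on the first."""
    validator = Draft7Validator(schema)
    return [error.message for error in validator.iter_errors(instance)]
```

The CLI could then print the list, while library callers inspect it programmatically.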
The convert_project function will successfully convert a PEP given a valid filter; however, it runs a sys.exit(0), which seems like odd behavior. We are using this for the pephub server and it results in server crashes.
Attempts to circumvent this by directly calling run_filter only display the conversion result on stdout; is it possible to return the result as YAML instead of returning None?
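As a stopgap until the conversion result is returned directly, a caller can capture what a print-only filter writes. A sketch, assuming the filter prints its result to stdout:

```python
import contextlib
import io


def filter_to_string(filter_fn, *args, **kwargs) -> str:
    """Capture what a print-only filter writes to stdout and return it."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        filter_fn(*args, **kwargs)
    return buffer.getvalue()
```

This sidesteps the sys.exit problem only if the exit happens in convert_project, not in the filter itself; the real fix is for the library to return the string.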
It looks like the use_case parameter in _validate_object is unused, and that some other functions have that parameter simply to pass it through to _validate_object. @nsheff is this a known issue that's been retained for backward- or cross-project compatibility, or can we try removing it?
A CLI to validate a PEP against a schema.
Maybe something like: eido --pep pep-config.yaml --schema schema.yaml
$ eido validate project_config.yml -s schema.yml
Traceback (most recent call last):
File "micromamba/envs/PEP/bin/eido", line 10, in <module>
sys.exit(main())
^^^^^^
File "micromamba/envs/PEP/lib/python3.12/site-packages/eido/cli.py", line 159, in main
validator(*arguments)
TypeError: validate_project() takes 2 positional arguments but 3 were given
Install eido 0.1.9
If you have a sample with an attribute with multiple values, the CSV writer will write them into the CSV in Python list form:
eido convert https://raw.githubusercontent.com/pepkit/nf-core-pep/master/samplesheet_test.csv --st-index sample -f csv
Result:
Found 2 samples with non-unique names: {'WT_REP1', 'RAP1_UNINDUCED_REP2'}. Attempting to auto-merge.
Running plugin csv
sample,sample,fastq_1,fastq_2,strandedness
WT_REP2,WT_REP2,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz,reverse
RAP1_UNINDUCED_REP1,RAP1_UNINDUCED_REP1,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz,,reverse
RAP1_IAA_30M_REP1,RAP1_IAA_30M_REP1,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz,reverse
WT_REP1,WT_REP1,"['https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz', 'https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz']","['https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz', 'https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz']",reverse
RAP1_UNINDUCED_REP2,RAP1_UNINDUCED_REP2,"['https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz', 'https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz']",,reverse
Might make more sense to do this as multiple rows per sample, for the purposes of the CSV filter.
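The multiple-rows alternative could work roughly like this (a hypothetical helper, not eido's current behavior): expand each list-valued attribute into one row per element, subsample-style:

```python
def sample_to_rows(sample: dict) -> list:
    """Expand list-valued attributes into one row per element,
    padding shorter lists with empty strings (as in a subsample table)."""
    n_rows = max(
        (len(v) for v in sample.values() if isinstance(v, list)), default=1
    )
    rows = []
    for i in range(n_rows):
        row = {}
        for key, value in sample.items():
            if isinstance(value, list):
                row[key] = value[i] if i < len(value) else ""
            else:
                row[key] = value
        rows.append(row)
    return rows
```

A sample like WT_REP1 above would then emit two CSV rows, one per lane, instead of a stringified Python list in a single cell.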
How can I get the list of available output formats?
nsheff@zither:~/code/bedshift_paper$ eido convert --help
usage: eido convert [-h] [--st-index ST_INDEX] -f FORMAT [-n SAMPLE_NAME [SAMPLE_NAME ...]]
[-a ARGS [ARGS ...]]
PEP
Convert a PEP using an available filter
positional arguments:
PEP Path to a PEP configuration file in yaml format.
optional arguments:
-h, --help show this help message and exit
--st-index ST_INDEX Sample table index to use, samples are identified by 'sample_name' by default.
-f FORMAT, --format FORMAT
Path to a PEP schema file in yaml format.
-n SAMPLE_NAME [SAMPLE_NAME ...], --sample-name SAMPLE_NAME [SAMPLE_NAME ...]
Name of the samples to inspect.
-a ARGS [ARGS ...], --args ARGS [ARGS ...]
Provide arguments to the filter function (e.g. arg1=val1 arg2=val2).
Also, are the -f option's docs here correct? I think that's the wrong help string.
There are some issues in looper that are related to this, since right now the pipeline interface is basically serving this purpose (poorly)...
How JK does something similar in snakemake:
https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/blob/master/rules/common.smk#L11
Docs are not building on dev because:
from peppy.utils import load_yaml
ImportError: cannot import name 'load_yaml' from 'peppy.utils' (/home/docs/checkouts/readthedocs.org/user_builds/eido/envs/dev/lib/python3.7/site-packages/peppy/utils.py)
Right now, the validation functions print notices and then raise exceptions. Instead, what if raising the exception happened in the calling tool?
The validation function would collect and format the errors. Then the calling function should decide what to do with them. For example, in the case of the CLI, it should print them out. But in other cases, we may just want a list of the validation errors. So, should the printing move from the validation function to the calling context?
Is there a schema for a generic PEP? It may be nice to validate PEPs generally.