pha4ge / hamronization

Parse multiple Antimicrobial Resistance Analysis Reports into a common data structure

License: GNU Lesser General Public License v3.0

Python 92.50% Shell 6.98% Dockerfile 0.52%
bioinformatics antimicrobial-resistance parsers data-harmonization

hamronization's People

Contributors

alexmanuele, antunderwood, awitney, cimendes, danymatute, dfornika, fmaguire, imendes93, jodyphelan, pvanheus, raphenya, thanhleviet


hamronization's Issues

PyPi not updated, 1.0.4 tarball reports version 1.0.3

Hello,

We are using hAMRonization in a pipeline (nf-core/funcscan), and I saw there was a new version of the tool, so I went to update it in our pipeline.

However, when I went to do so, I saw the update wasn't on Bioconda, and when I tried to update the recipe, the CI test failed saying that 1.0.4 doesn't exist on PyPI.

Secondly, when I built the package locally from the tarball on the releases page, I saw that running hamronization --version still reports 1.0.3.

It would maybe be good to have a release update with the correct version

(or ideally, if possible, a 1.0.5 release with a fix for #66 included. I would try to contribute this myself as it doesn't seem complicated, but unfortunately I'm not a Python dev)

help understanding resfinder run

Hi devs, I'm trying to run hamronize on some results I generated from the latest resfinder docker image. Here is the command I am trying to run:

hamronize resfinder resfinder/resfinder/results/ResFinder_results_tab.txt  --reference_database_version db_v_1 --analysis_software_version tool_v_1 --output hamr_out/resfinder_out.tsv  

I get the following error:

Traceback (most recent call last):
  File "/home/ewissel/miniconda3/bin/hamronize", line 8, in <module>
    sys.exit(main())
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/hamronize.py", line 7, in main
    hAMRonization.Interfaces.generic_cli_interface()
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 275, in generic_cli_interface
    output_format=args.format)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 115, in write
    first_result = next(self)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 75, in __next__
    return next(self.hAMRonized_results)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/ResFinderIO.py", line 48, in parse
    report = json.load(handle)
  File "/home/ewissel/miniconda3/lib/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/home/ewissel/miniconda3/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/ewissel/miniconda3/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/ewissel/miniconda3/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The only JSON produced by my resfinder run is std_format_under_development.json, and I don't see any indication that hamronize wants to use this file rather than the resfinder results table. Is this an issue with my resfinder run (does hamronize expect resfinder to produce different JSONs), an issue with my input arguments, or something else?

[BUG] Generated output does not follow CSP rules

Describe the bug
The generated output HTML uses inline JavaScript code. This is a violation of Content Security Policy (CSP) rules.

Input
Any input

Input file
Any

Error log
NA

hAMRonization Version
NA

Expected behavior
The generated output file follows CSP rules.

Desktop (please complete the following information):

  • OS: any
  • Browser: any
  • Version: any

Extend schema(s) to include point mutation info

Our schemas currently only incorporate resistance gene detection information, but don't include fields that are relevant to point mutations. Point mutations (and other variants like indels) are important mechanisms of antibiotic resistance and several of our tools include that type of information in their output.

Gene detection information was incorporated first because it was generally deemed to be simpler and more consistent than point mutations, but our schema should support both types of information.
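
As a starting point for discussion, here is a sketch of what additional variant-related fields might look like; these names are illustrative assumptions, not the final spec:

    # Illustrative sketch only: possible extra fields for variant-level results.
    # None of these field names are final; they are assumptions for discussion.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class VariantFields:
        genetic_variation_type: Optional[str] = None  # e.g. "protein_variant" or "nucleotide_variant"
        nucleotide_mutation: Optional[str] = None      # e.g. "g.1234A>T"
        amino_acid_mutation: Optional[str] = None      # e.g. "p.S83L"
        nucleotide_mutation_interpretation: Optional[str] = None
        amino_acid_mutation_interpretation: Optional[str] = None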

Srst2 parser implementation

With the following single-entry output this is currently what is being parsed:

Sample DB gene allele coverage depth diffs uncertainty divergence length maxMAF clusterid seqid annotation
Dummy ResFinder oqxA oqxA 100.0 75.852 1snp 0.152 660 0.037 470 1995 oqxA_1_V00622; V00622; fluoroquinolone

The metadata passed is the following:
metadata = {"analysis_software_version": "0.0.1", "reference_database_version": "2019-Jul-28", "input_file_name": "Dummy", "reference_database_id": 'resfinder'}

This is the current output:

    assert result.input_file_name == 'Dummy'
    assert result.gene_symbol == 'oqxA'
    assert result.gene_name == 'oqxA'
    assert result.reference_database_id == 'ResFinder'
    assert result.reference_database_version == '2019-Jul-28'
    assert result.reference_accession == '1995'
    assert result.analysis_software_name == 'srst2'
    assert result.analysis_software_version == '0.0.1'
    assert result.coverage_percentage == 100
    assert result.reference_gene_length == 660
    assert result.coverage_depth == 75.852

My question is regarding reference_database_id, which is currently required in the metadata but is being (correctly!) parsed from the report file. I suggest removing this from the required metadata fields.

Summary options

Create a summary report of just the AMR genes detected per genome, with links to the detailed reports. Options (a rough sketch of the aggregation is below):

  • One line per sample
  • One line per software
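
A rough sketch of the aggregation, assuming hAMRonized TSV reports with the spec's input_file_name, analysis_software_name and gene_symbol columns (the file paths and joining logic are placeholders, not the actual summarize implementation):

    import pandas as pd

    # Combine several hAMRonized TSV reports; the input paths are placeholders.
    reports = [pd.read_csv(path, sep="\t") for path in ["rgi.tsv", "abricate.tsv"]]
    combined = pd.concat(reports, ignore_index=True)

    # One line per sample: all detected genes joined into a single field.
    per_sample = (
        combined.groupby("input_file_name")["gene_symbol"]
        .apply(lambda genes: "; ".join(sorted(set(genes))))
        .reset_index(name="amr_genes_detected")
    )

    # One line per software: also group on the analysis tool.
    per_tool = (
        combined.groupby(["input_file_name", "analysis_software_name"])["gene_symbol"]
        .apply(lambda genes: "; ".join(sorted(set(genes))))
        .reset_index(name="amr_genes_detected")
    )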

Add fARGene

This is a suggestion to add the tool fARGene to the hAMRonization tool list. fARGene detects AMR genes based on pre-defined HMM models (provided together with the tool). It would be great to have the fARGene output also standardized in the form of a hAMRonization summary. The output is described in the fARGene tutorial. I attached an example output folder (command: fargene -i contigs.fasta --hmm-model class_a -o output_dir) here: output_dir.zip

Using owl:equivalentClass to connect things semantically

I noticed the JSON-LD spec has things like

  "owl:equivalentClass": {
    "@id": "edam_data:1050"
  },

The cross-referencing to ontology terms is great, but I think owl:equivalentClass is too strong. It is harmless right now, but in future, if that were somehow converted into a native OWL context for this and other terms (as a result of data federation etc.), it brings along reasoner baggage that might not be desired. For reaching across vocabularies, how about skos:closeMatch or skos:exactMatch? In particular, very few GenEpiO terms are used in an OWL context as rdf:Properties, so linking to them via owl:equivalentClass might lead to misinterpretation by some brainless computer somewhere.

After another round of GenEpiO edits I'll circle back to check out the term mappings here.

Cheers!

RgiIO.py: Typo in line 79

In pha4ge/hAMRonization/blob/master/hAMRonization/RgiIO.py
Typo in line 79: 'Percentage Length of '

HTH,
Svetlana

Global variable refactoring

Use of global variables should probably be removed; this should fall out of a larger refactoring to remove code duplication.

Should tackle #22 and facilitate #23

Output options

Currently the only output options for parsers are tsv or json printed to stdout.

While users can redirect from the CLI, it might be nice to provide an output_file option

staramr issue

I pip installed this tool and am trying to run the following:
hamronize staramr staramr_out/detailed_summary.tsv --reference_database_version db_v_1 --analysis_software_version tool_v_1 --format tsv --output hamr_out/staramr_out.tsv

I get the following error:

Traceback (most recent call last):
  File "/home/ewissel/miniconda3/bin/hamronize", line 8, in <module>
    sys.exit(main())
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/hamronize.py", line 7, in main
    hAMRonization.Interfaces.generic_cli_interface()
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 275, in generic_cli_interface
    output_format=args.format)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 115, in write
    first_result = next(self)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 75, in __next__
    return next(self.hAMRonized_results)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/StarAmrIO.py", line 40, in parse
    result['_gene_name'] = result['Gene']
KeyError: 'Gene'

I see that a new version was recently pushed. Should I use a different download method?

AMR Variant detection - Parsers to be updated

The following parsers need to be updated to comply with the new spec:

The following parsers are currently skipping the variant detection results:

The following parsers should be added:

Galaxy implementation

To facilitate adding hAMRonization to production workflows, we should make a Galaxy tool wrapper for it.

There are special converter type tools that should make this a little easier.

ORF_ID missing once RGI report hAMRonized

Hi Finlay,

When running ORFs through the RGI-hAMRonization pipeline, the ORF_ID (important for the final hAMR report) is skipped.

Here: https://github.com/SvetlanaUP/hAMRonization/blob/master/hAMRonization/RgiIO.py
you can see that I fixed the ORF_ID entry in self.field_mapping manually (the mapping had 'ORF_ID': 'None' and 'Contig': 'input_sequence_id', which should be the opposite).

This solution works for us, but for the future it would maybe be good to have this handled in RgiIO.py itself, e.g. if there is no 'Contig', use 'ORF_ID'; something along these lines is sketched below.
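
Purely a sketch of the suggested fallback; row here stands for one parsed RGI record and is not the actual variable name in RgiIO.py:

    # Sketch of the suggested fallback: prefer 'Contig', fall back to 'ORF_ID'.
    # 'row' is a dict for one parsed RGI record; the names here are illustrative.
    def pick_input_sequence_id(row):
        contig = row.get("Contig")
        if contig not in (None, ""):
            return contig
        return row.get("ORF_ID")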

Thanks,
Svetlana

Flag overlapping ranges in hAMRonization

If the ranges of detected AMR genes overlap by >90% in genomic coordinates, flag them in the summary HTML somehow.

Problem: indices are 1-based in some tools and 0-based in others (and many tools don't report genomic coordinates at all)
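
A minimal sketch of the overlap check itself, assuming coordinates have already been normalised to 0-based, half-open intervals (that normalisation is exactly the part each parser would have to handle):

    def overlap_fraction(start_a, end_a, start_b, end_b):
        """Fraction of the shorter interval covered by the overlap.

        Assumes 0-based, half-open coordinates on the same contig.
        """
        overlap = max(0, min(end_a, end_b) - max(start_a, start_b))
        shorter = min(end_a - start_a, end_b - start_b)
        return overlap / shorter if shorter > 0 else 0.0

    # Flag pairs of hits whose ranges overlap by more than 90%.
    if overlap_fraction(100, 1100, 150, 1150) > 0.9:
        print("flag: overlapping AMR gene calls")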

Add custom error handling

When a file is passed without the expected format/fields, a generic error is thrown. This could easily be handled with a custom exception informing the user about why it's failing.
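
A sketch of what that could look like; the exception and helper names here are illustrative, not existing hAMRonization code:

    # Illustrative sketch: wrap the underlying KeyError in a custom exception
    # that explains the likely cause to the user.
    class UnexpectedReportFormatError(Exception):
        pass

    def get_required_field(record, field, tool_name):
        try:
            return record[field]
        except KeyError as err:
            raise UnexpectedReportFormatError(
                f"Column '{field}' not found in the input file. "
                f"Check that the file really is a {tool_name} report "
                f"and that the correct parser was selected."
            ) from err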

Simplify AntimicrobialResistanceResult.read()

The AntimicrobialResistanceGenomicAnalysisResult.read() method takes a dictionary as input and loads values into the class by matching the keys in the dictionary against the attribute names in the class.

Since each dictionary lookup could fail, each lookup is wrapped in a try: / except: block. This leads to a really verbose (and inefficient?) implementation.

There may be a simpler way to convert from a dict to our AntimicrobialResistanceGenomicAnalysisResult class via a namedtuple and/or a dataclass; one possibility is sketched below.
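
For example, with a dataclass (the field list is truncated and purely illustrative; the real class carries the full spec's fields):

    from dataclasses import dataclass, fields
    from typing import Optional

    @dataclass
    class AntimicrobialResistanceGenomicAnalysisResult:
        # Truncated field list, for illustration only.
        input_file_name: Optional[str] = None
        gene_symbol: Optional[str] = None
        gene_name: Optional[str] = None
        reference_accession: Optional[str] = None

        @classmethod
        def from_dict(cls, data):
            # Keep only keys that match declared fields; missing keys simply
            # fall back to their defaults, so no try/except per lookup.
            known = {f.name for f in fields(cls)}
            return cls(**{k: v for k, v in data.items() if k in known})

    result = AntimicrobialResistanceGenomicAnalysisResult.from_dict(
        {"gene_symbol": "oqxA", "unexpected_key": 42}
    )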

Wrong Docker link

Dear Finlay,
I am Giovanni Iacono from EFSA; we had a video call the other day.
Building the Dockerfile from https://github.com/pha4ge/hAMRonization_workflow fails at step 18: RUN cd data/test && bash get_test_data.sh && cd ../..

The reason is that in the current repository https://github.com/pha4ge/hamronization the data folder is not present. This folder is present in https://github.com/pha4ge/hAMRonization_workflow.

Also a question: the Dockerfile in https://github.com/pha4ge/hAMRonization_workflow installs only the parsers, correct?

Implementing logging and debug flag to simplify exception messages

To handle the concerns raised by @cimendes in #39 and make issues with input selection clearer to users (e.g., #54), we should make proper use of the logging library, default to simple error messages, and add a --debug flag to argparse that displays the full traceback. A rough sketch is included after the list below.

  1. Add boolean debug flag in the generic CLI parser (default False):
    https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/Interfaces.py#L217

  2. Set up the logging levels based on args.debug: https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/Interfaces.py#L257 (e.g., https://stackoverflow.com/questions/14097061/easier-way-to-enable-verbose-logging)

  3. Add a wrong-input-file exception using the logging library; specifically, add try: and except KeyError: at
    https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/Interfaces.py#L66 with a message explaining to the user that the expected input columns can't be found and that they should check they are using the correct AMR prediction file.

  4. Update the validation of input fields exception at https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/hAMRonizedResult.py#L57 to use logging + debug
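
A rough sketch of how these pieces could fit together (argument names, messages, and the stand-in failure are assumptions, not the final implementation):

    import argparse
    import logging
    import sys

    # Step 1: boolean debug flag in the generic CLI parser (default False).
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true",
                        help="show debug-level logging and full tracebacks")
    args = parser.parse_args()

    # Step 2: set the logging level based on args.debug.
    logging.basicConfig(
        level=logging.DEBUG if args.debug else logging.INFO,
        format="%(levelname)s: %(message)s",
    )

    # Steps 3-4: catch the expected failure and keep the message short by default.
    try:
        raise KeyError("Gene")  # stand-in for the real field lookup
    except KeyError as err:
        logging.error(
            "Expected input column %s not found; check that the correct "
            "AMR prediction file was supplied.", err)
        if args.debug:
            raise
        sys.exit(1)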

Groot parser implementation

With the following single-entry output this is currently what is being parsed:

OqxA.3003470.EU370913.4407527-4408202.4553 266 657 3D648M6D

The metadata passed is the following:
metadata = {"analysis_software_version": "0.0.1", "reference_database_version": "2019-Jul-28", "input_file_name": "Dummy", 'reference_database_id': "argannot"}

This is the current output:

    assert result.input_file_name == 'Dummy'
    assert result.gene_symbol == 'OqxA'
    assert result.gene_name == 'OqxA.3003470.EU370913' 
    assert result.reference_database_id == 'argannot'
    assert result.reference_database_version == '2019-Jul-28'
    assert result.reference_accession == 'OqxA.3003470.EU370913.4407527-4408202.4553' 
    assert result.analysis_software_name == 'groot'
    assert result.analysis_software_version == '0.0.1'
    assert result.reference_gene_length == 657
    assert result.coverage_depth == 266

As you can see, the gene_symbol, gene_name and reference_accession fields are all storing the same information.

I'm having a bit of trouble mapping the fields in OqxA.3003470.EU370913.4407527-4408202.4553 to the spec. The gene_name is technically not present in the report. Should we store the same value as gene_symbol, or keep it as None?
For the reference_accession, shouldn't we keep just the EU370913 value? I'm unsure what 3003470 represents, as well as the 4407527-4408202.4553 part.
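
If the consensus is to keep only the accession-like token, a naive extraction could look like the sketch below; it assumes the third dot-separated field is always the accession, which, as noted above, is not certain:

    # Naive sketch: pull a candidate accession out of a groot reference name.
    # Assumes the accession is always the third dot-separated token, which may
    # not hold for every database groot is run against.
    reference_name = "OqxA.3003470.EU370913.4407527-4408202.4553"
    parts = reference_name.split(".")
    gene_symbol = parts[0]          # "OqxA"
    candidate_accession = parts[2]  # "EU370913"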

Any input is welcomed!

Update README

  • Improve installation instructions
  • Update parsers included
  • Add wiki with small tutorial

Decide on a single 'authoritative' schema format

There are several schema definition technologies available to us:

  1. JSON Schema
  2. SALAD
  3. AVRO
  4. JSON-LD

Ideally we would have a single 'authoritative' schema, and any other schema could be automatically derived from it. Which schema definition technology would make the most sense to use as the authoritative schema? Would it be possible to derive all the others from it in a robust and automated way?

Fix issue of very similar runs falsely combining results in summary

If the exact same tool is run with different settings (but all the other metadata, such as version, stays the same), summarize falsely combines them. E.g., running RGI on contigs and on reads, hAMRonizing each output, and then combining them in a summary will end up with an interactive summary implying that both the RGI contig and RGI-bwt results are from the same single run of RGI.

Solution: Summarize should treat each file separately and assign a new config # if multiple hAMRonized files are supplied; a rough sketch is below.

https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/summarize.py#L16
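
A rough sketch of that behaviour: tag each hAMRonized input with its own run index before concatenating, so two runs with identical metadata stay distinct (the column name and file paths here are assumptions):

    import pandas as pd

    # Sketch only: give every input report its own run index so that
    # otherwise-identical metadata (same tool, same version) no longer
    # collapses two separate runs into one summary entry.
    report_paths = ["rgi_contigs.tsv", "rgi_bwt.tsv"]  # placeholder paths
    frames = []
    for run_index, path in enumerate(report_paths):
        frame = pd.read_csv(path, sep="\t")
        frame["analysis_run"] = run_index  # hypothetical column name
        frames.append(frame)

    combined = pd.concat(frames, ignore_index=True)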

Flag to filter report for genomic/non-genomic audience

For the interactive or tabular report, possibly have an option to output just the summarised results (genome, gene, tool, versions, phenotype annotation) versus the full genomics results (i.e., the whole spec with start/stop, contig, coverage, etc.).

[BUG] `KeyError: 'reference_database_name'` when running summarize

Describe the bug

I get the following error when running summarize:

Warning: <_io.TextIOWrapper name='WAL001-megahit.mapping.potential.ARG.deeparg.json' mode='r' encoding='UTF-8'> report is empty
Traceback (most recent call last):
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/bin/hamronize", line 8, in <module>
    sys.exit(main())
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/hamronize.py", line 7, in main
    hAMRonization.Interfaces.generic_cli_interface()
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/Interfaces.py", line 299, in generic_cli_interface
    hAMRonization.summarize.summarize_reports(
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/summarize.py", line 752, in summarize_reports
    combined_reports = combined_reports.sort_values(
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/util/_decorators.py", line 317, in wrapper
    return func(*args, **kwargs)
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/frame.py", line 6886, in sort_values
    keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/frame.py", line 6886, in <listcomp>
    keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/generic.py", line 1849, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'reference_database_name'

Input

hamronize \
    summarize \
    <huge_list_of_jsons> \
    -t interactive \
    -o hamronization_combined_report.html

Input file
I can send a zip of everything privately if necessary (it includes unpublished data)

Error log
See above

hAMRonization Version
1.1.0

Desktop (please complete the following information):

  • OS: SUSE Linux Enterprise High Performance Computing 15 SP1
  • Version: hAMRonization 1.1.0

Obtain specification field data information from JSON schema

In the hAMRonizedResult class definition, the field terms for the parsers, and their value types, should be obtained from the schema JSON file rather than hardcoded into the tool. The necessary file is already provided in the schema/ directory. A parser should be included to retrieve this information directly from the file, making it easier to update the field terms when necessary.

⬇️
https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/hAMRonizedResult.py#L13-L52
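
A minimal sketch of the idea, assuming a JSON Schema file with a top-level properties mapping (the file path and the type mapping below are assumptions, not the repository's actual layout):

    import json

    # Sketch: derive field names and Python types from the schema JSON instead
    # of hardcoding them in hAMRonizedResult. Path and type mapping are assumed.
    TYPE_MAP = {"string": str, "integer": int, "number": float}

    with open("schema/antimicrobial_resistance_result.schema.json") as handle:
        schema = json.load(handle)

    field_types = {
        name: TYPE_MAP.get(spec.get("type"), str)
        for name, spec in schema["properties"].items()
    }

    # field_types could then be used to build the hAMRonizedResult dataclass
    # dynamically, e.g. via dataclasses.make_dataclass(...).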
