pha4ge / hamronization

Parse multiple Antimicrobial Resistance Analysis Reports into a common data structure

License: GNU Lesser General Public License v3.0

Python 92.50% Shell 6.98% Dockerfile 0.52%
bioinformatics antimicrobial-resistance parsers data-harmonization

hamronization's People

Contributors

alexmanuele, antunderwood, awitney, cimendes, danymatute, dfornika, fmaguire, imendes93, jodyphelan, pvanheus, raphenya, thanhleviet


hamronization's Issues

PyPi not updated, 1.0.4 tarball reports version 1.0.3

Hello,

We are using hAMRonization in a pipeline (nf-core/funcscan), and I saw there was a new version of the tool, so I went to update it in our pipeline.

However, when I went to do so, I saw the update wasn't on Bioconda, and when I tried to update the recipe, the CI test failed saying that 1.0.4 doesn't exist on PyPI.

Secondly, when I built the package locally from the tarball on the releases page, I saw that running hamronization --version still reports 1.0.3.

It would maybe be good to have a release update with the correct version

(or ideally, if possible, a 1.0.5 release with a fix for #66 included. I would try to contribute this myself as it doesn't seem complicated, but unfortunately I'm not a Python dev)

help understanding resfinder run

Hi devs, I'm trying to run hamronize on some results I generated from the latest resfinder docker image. Here is the command I am trying to run:

hamronize resfinder resfinder/resfinder/results/ResFinder_results_tab.txt  --reference_database_version db_v_1 --analysis_software_version tool_v_1 --output hamr_out/resfinder_out.tsv  

I get the following error:

Traceback (most recent call last):
  File "/home/ewissel/miniconda3/bin/hamronize", line 8, in <module>
    sys.exit(main())
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/hamronize.py", line 7, in main
    hAMRonization.Interfaces.generic_cli_interface()
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 275, in generic_cli_interface
    output_format=args.format)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 115, in write
    first_result = next(self)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 75, in __next__
    return next(self.hAMRonized_results)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/ResFinderIO.py", line 48, in parse
    report = json.load(handle)
  File "/home/ewissel/miniconda3/lib/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/home/ewissel/miniconda3/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/ewissel/miniconda3/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/ewissel/miniconda3/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The only JSON produced by my resfinder run is std_format_under_development.json, and I don't see any indication that hamronize wants to use this file rather than the resfinder results table. Is this an issue with my resfinder run (does hamronize expect resfinder to produce different JSONs), an issue with my input arguments, or something else?

[BUG] Generated output does not follow CSP rules

Describe the bug
The generated output HTML uses inline JavaScript code. This is a violation of Content Security Policy (CSP) rules.

Input
Any input

Input file
Any

Error log
NA

hAMRonization Version
NA

Expected behavior
The generated output file follows CSP rules.

Desktop (please complete the following information):

  • OS: any
  • Browser: any
  • Version: any

Extend schema(s) to include point mutation info

Our schemas currently only incorporate resistance gene detection information, but don't include fields that are relevant to point mutations. Point mutations (and other variants like indels) are important mechanisms of antibiotic resistance and several of our tools include that type of information in their output.

Gene detection information was incorporated first because it was generally deemed to be simpler and more consistent than point mutations, but our schema should support both types of information.
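
As a starting point for discussion, here is a sketch of what additional variant-related fields might look like; these names are illustrative assumptions, not the final spec:

    # Illustrative sketch only: possible extra fields for variant-level results.
    # None of these field names are final; they are assumptions for discussion.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class VariantFields:
        genetic_variation_type: Optional[str] = None  # e.g. "protein_variant" or "nucleotide_variant"
        nucleotide_mutation: Optional[str] = None      # e.g. "g.1234A>T"
        amino_acid_mutation: Optional[str] = None      # e.g. "p.S83L"
        nucleotide_mutation_interpretation: Optional[str] = None
        amino_acid_mutation_interpretation: Optional[str] = None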

Srst2 parser implementation

With the following single-entry output this is currently what is being parsed:

Sample DB gene allele coverage depth diffs uncertainty divergence length maxMAF clusterid seqid annotation
Dummy ResFinder oqxA oqxA 100.0 75.852 1snp 0.152 660 0.037 470 1995 oqxA_1_V00622; V00622; fluoroquinolone

The metadata passed is the following:
metadata = {"analysis_software_version": "0.0.1", "reference_database_version": "2019-Jul-28", "input_file_name": "Dummy", "reference_database_id": 'resfinder'}

This is the current output:

    assert result.input_file_name == 'Dummy'
    assert result.gene_symbol == 'oqxA'
    assert result.gene_name == 'oqxA'
    assert result.reference_database_id == 'ResFinder'
    assert result.reference_database_version == '2019-Jul-28'
    assert result.reference_accession == '1995'
    assert result.analysis_software_name == 'srst2'
    assert result.analysis_software_version == '0.0.1'
    assert result.coverage_percentage == 100
    assert result.reference_gene_length == 660
    assert result.coverage_depth == 75.852

My question is regarding reference_database_id, which is currently required in the metadata but is being (correctly!) parsed from the report file. I suggest removing this from the required metadata fields.

Summary options

Create a summary report of just the AMR genes detected per genome, with links to the detailed reports. Options (a rough sketch of the aggregation is below):

  • One line per sample
  • One line per software
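
A rough sketch of the aggregation, assuming hAMRonized TSV reports with the spec's input_file_name, analysis_software_name and gene_symbol columns (the file paths and joining logic are placeholders, not the actual summarize implementation):

    import pandas as pd

    # Combine several hAMRonized TSV reports; the input paths are placeholders.
    reports = [pd.read_csv(path, sep="\t") for path in ["rgi.tsv", "abricate.tsv"]]
    combined = pd.concat(reports, ignore_index=True)

    # One line per sample: all detected genes joined into a single field.
    per_sample = (
        combined.groupby("input_file_name")["gene_symbol"]
        .apply(lambda genes: "; ".join(sorted(set(genes))))
        .reset_index(name="amr_genes_detected")
    )

    # One line per software: also group on the analysis tool.
    per_tool = (
        combined.groupby(["input_file_name", "analysis_software_name"])["gene_symbol"]
        .apply(lambda genes: "; ".join(sorted(set(genes))))
        .reset_index(name="amr_genes_detected")
    )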

Add fARGene

This is a suggestion to add the tool fARGene to the hAMRonization tool list. fARGene detects AMR genes based on pre-defined HMM models (provided together with the tool). It would be great to have the fARGene output also standardized in the form of a hAMRonization summary. The output is described in the fARGene tutorial. I attached an example output folder (command: fargene -i contigs.fasta --hmm-model class_a -o output_dir) here: output_dir.zip

Using owl:equivalentClass to connect things semantically

I noticed the JSON-LD spec has things like

  "owl:equivalentClass": {
    "@id": "edam_data:1050"
  },

The cross-referencing to ontology terms is great, but I think owl:equivalentClass is too strong. It is harmless right now, but in future, if that were somehow converted into a native OWL context for this and other terms (as a result of data federation etc.), it brings along reasoner baggage that might not be desired. For reaching across vocabularies, how about skos:closeMatch or skos:exactMatch? In particular, very few GenEpiO terms are used in an OWL context as rdf:Properties, so linking to them via owl:equivalentClass might lead to misinterpretation by some brainless computer somewhere.

After another round of GenEpiO edits I'll circle back to check out the term mappings here.

Cheers!

RgiIO.py: Typo in line 79

In pha4ge/hAMRonization/blob/master/hAMRonization/RgiIO.py
Typo in line 79: 'Percentage Length of '

HTH,
Svetlana

Global variable refactoring

Use of global variables should probably be removed; this should fall out of a larger refactoring to remove code duplication.

Should tackle #22 and facilitate #23

Output options

Currently the only output options for parsers are tsv or json printed to stdout.

While users can redirect from the CLI, it might be nice to provide an output_file option

staramr issue

I pip installed this tool and am trying to run the following:
hamronize staramr staramr_out/detailed_summary.tsv --reference_database_version db_v_1 --analysis_software_version tool_v_1 --format tsv --output hamr_out/staramr_out.tsv

I get the following error:

Traceback (most recent call last):
  File "/home/ewissel/miniconda3/bin/hamronize", line 8, in <module>
    sys.exit(main())
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/hamronize.py", line 7, in main
    hAMRonization.Interfaces.generic_cli_interface()
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 275, in generic_cli_interface
    output_format=args.format)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 115, in write
    first_result = next(self)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 75, in __next__
    return next(self.hAMRonized_results)
  File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/StarAmrIO.py", line 40, in parse
    result['_gene_name'] = result['Gene']
KeyError: 'Gene'

I see that a new version was recently pushed. Should I use a different download method?

AMR Variant detection - Parsers to be updated

The following parsers need to be updated to comply with the new spec:

The following parsers are currently skipping the variant detection results:

The following parsers should be added:

Galaxy implementation

To facilitate adding hAMRonization to production workflows, we should make a Galaxy tool wrapper for it.

There are special converter type tools that should make this a little easier.

ORF_ID missing once RGI report hAMRonized

Hi Finlay,

When running ORFs through the RGI-hAMRonization pipeline, the ORF_ID (important for the final hAMR report) is skipped.

Here: https://github.com/SvetlanaUP/hAMRonization/blob/master/hAMRonization/RgiIO.py
you can see that I fixed the ORF_ID entry in self.field_mapping manually (the mapping had 'ORF_ID': 'None' and 'Contig': 'input_sequence_id', which should be the opposite).

This solution works for us, but for the future it would maybe be good to have this handled in RgiIO.py itself, e.g. if there is no 'Contig', use 'ORF_ID'; something along these lines is sketched below.
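
Purely a sketch of the suggested fallback; row here stands for one parsed RGI record and is not the actual variable name in RgiIO.py:

    # Sketch of the suggested fallback: prefer 'Contig', fall back to 'ORF_ID'.
    # 'row' is a dict for one parsed RGI record; the names here are illustrative.
    def pick_input_sequence_id(row):
        contig = row.get("Contig")
        if contig not in (None, ""):
            return contig
        return row.get("ORF_ID")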

Thanks,
Svetlana

Flag overlapping ranges in hAMRonization

If the ranges of detected AMR genes overlap by >90% in genomic coordinates, flag them in the summary HTML somehow.

Problem: indices are 1-based in some tools and 0-based in others (and many tools don't report genomic coordinates at all)
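
A minimal sketch of the overlap check itself, assuming coordinates have already been normalised to 0-based, half-open intervals (that normalisation is exactly the part each parser would have to handle):

    def overlap_fraction(start_a, end_a, start_b, end_b):
        """Fraction of the shorter interval covered by the overlap.

        Assumes 0-based, half-open coordinates on the same contig.
        """
        overlap = max(0, min(end_a, end_b) - max(start_a, start_b))
        shorter = min(end_a - start_a, end_b - start_b)
        return overlap / shorter if shorter > 0 else 0.0

    # Flag pairs of hits whose ranges overlap by more than 90%.
    if overlap_fraction(100, 1100, 150, 1150) > 0.9:
        print("flag: overlapping AMR gene calls")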

Add custom error handling

When a file is passed without the expected format/fields, a generic error is thrown. This could easily be handled with a custom exception informing the user about why it's failing.
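
A sketch of what that could look like; the exception and helper names here are illustrative, not existing hAMRonization code:

    # Illustrative sketch: wrap the underlying KeyError in a custom exception
    # that explains the likely cause to the user.
    class UnexpectedReportFormatError(Exception):
        pass

    def get_required_field(record, field, tool_name):
        try:
            return record[field]
        except KeyError as err:
            raise UnexpectedReportFormatError(
                f"Column '{field}' not found in the input file. "
                f"Check that the file really is a {tool_name} report "
                f"and that the correct parser was selected."
            ) from err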

Simplify AntimicrobialResistanceResult.read()

The AntimicrobialResistanceGenomicAnalysisResult.read() method takes a dictionary as input and loads values into the class by matching the keys in the dictionary against the attribute names in the class.

Since each dictionary lookup could fail, each lookup is wrapped in a try: / except: block. This leads to a really verbose (and inefficient?) implementation.

There may be a simpler way to convert from a dict to our AntimicrobialResistanceGenomicAnalysisResult class via a namedtuple and/or a dataclass; one possibility is sketched below.
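
For example, with a dataclass (the field list is truncated and purely illustrative; the real class carries the full spec's fields):

    from dataclasses import dataclass, fields
    from typing import Optional

    @dataclass
    class AntimicrobialResistanceGenomicAnalysisResult:
        # Truncated field list, for illustration only.
        input_file_name: Optional[str] = None
        gene_symbol: Optional[str] = None
        gene_name: Optional[str] = None
        reference_accession: Optional[str] = None

        @classmethod
        def from_dict(cls, data):
            # Keep only keys that match declared fields; missing keys simply
            # fall back to their defaults, so no try/except per lookup.
            known = {f.name for f in fields(cls)}
            return cls(**{k: v for k, v in data.items() if k in known})

    result = AntimicrobialResistanceGenomicAnalysisResult.from_dict(
        {"gene_symbol": "oqxA", "unexpected_key": 42}
    )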

Wrong Docker link

Dear Finlay,
I am Giovanni Iacono from EFSA; we had a video call the other day.
Building the Dockerfile from https://github.com/pha4ge/hAMRonization_workflow fails at step 18: RUN cd data/test && bash get_test_data.sh && cd ../..

The reason is that in the current repository https://github.com/pha4ge/hamronization the data folder is not present. This folder is present in https://github.com/pha4ge/hAMRonization_workflow.

Also a question: the Dockerfile in https://github.com/pha4ge/hAMRonization_workflow installs only the parsers, correct?

Implementing logging and debug flag to simplify exception messages

To handle the concerns raised by @cimendes in #39 and make issues with input selection clearer to users (e.g., #54), we should make proper use of the logging library, default to simple error messages, and add a --debug flag to argparse that displays the full traceback. A rough sketch is included after the list below.

  1. Add boolean debug flag in the generic CLI parser (default False):
    https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/Interfaces.py#L217

  2. Set up the logging levels based on args.debug: https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/Interfaces.py#L257 (e.g., https://stackoverflow.com/questions/14097061/easier-way-to-enable-verbose-logging)

  3. Add a wrong-input-file exception using the logging library; specifically, add try: and except KeyError: at
    https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/Interfaces.py#L66 with a message explaining to the user that the expected input columns can't be found and that they should check they are using the correct AMR prediction file.

  4. Update the validation of input fields exception at https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/hAMRonizedResult.py#L57 to use logging + debug
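
A rough sketch of how these pieces could fit together (argument names, messages, and the stand-in failure are assumptions, not the final implementation):

    import argparse
    import logging
    import sys

    # Step 1: boolean debug flag in the generic CLI parser (default False).
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true",
                        help="show debug-level logging and full tracebacks")
    args = parser.parse_args()

    # Step 2: set the logging level based on args.debug.
    logging.basicConfig(
        level=logging.DEBUG if args.debug else logging.INFO,
        format="%(levelname)s: %(message)s",
    )

    # Steps 3-4: catch the expected failure and keep the message short by default.
    try:
        raise KeyError("Gene")  # stand-in for the real field lookup
    except KeyError as err:
        logging.error(
            "Expected input column %s not found; check that the correct "
            "AMR prediction file was supplied.", err)
        if args.debug:
            raise
        sys.exit(1)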

Groot parser implementation

With the following single-entry output this is currently what is being parsed:

OqxA.3003470.EU370913.4407527-4408202.4553 266 657 3D648M6D

The metadata passed is the following:
metadata = {"analysis_software_version": "0.0.1", "reference_database_version": "2019-Jul-28", "input_file_name": "Dummy", 'reference_database_id': "argannot"}

This is the current output:

    assert result.input_file_name == 'Dummy'
    assert result.gene_symbol == 'OqxA'
    assert result.gene_name == 'OqxA.3003470.EU370913' 
    assert result.reference_database_id == 'argannot'
    assert result.reference_database_version == '2019-Jul-28'
    assert result.reference_accession == 'OqxA.3003470.EU370913.4407527-4408202.4553' 
    assert result.analysis_software_name == 'groot'
    assert result.analysis_software_version == '0.0.1'
    assert result.reference_gene_length == 657
    assert result.coverage_depth == 266

As you can see, the gene_symbol, gene_name and reference_accession fields are all storing the same information.

I'm having a bit of trouble mapping the fields in OqxA.3003470.EU370913.4407527-4408202.4553 to the spec. The gene_name is technically not present in the report. Should we store the same value as gene_symbol, or keep it as None?
For the reference_accession, shouldn't we keep just the EU370913 value? I'm unsure what 3003470 represents, as well as the 4407527-4408202.4553 part.
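
If the consensus is to keep only the accession-like token, a naive extraction could look like the sketch below; it assumes the third dot-separated field is always the accession, which, as noted above, is not certain:

    # Naive sketch: pull a candidate accession out of a groot reference name.
    # Assumes the accession is always the third dot-separated token, which may
    # not hold for every database groot is run against.
    reference_name = "OqxA.3003470.EU370913.4407527-4408202.4553"
    parts = reference_name.split(".")
    gene_symbol = parts[0]          # "OqxA"
    candidate_accession = parts[2]  # "EU370913"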

Any input is welcomed!

Update README

  • Improve installation instructions
  • Update parsers included
  • Add wiki with small tutorial

Decide on a single 'authoritative' schema format

There are several schema definition technologies available to us:

  1. JSON Schema
  2. SALAD
  3. AVRO
  4. JSON-LD

Ideally we would have a single 'authoritative' schema, and any other schema could be automatically derived from it. Which schema definition technology would make the most sense to use as the authoritative schema? Would it be possible to derive all the others from it in a robust and automated way?

Fix issue of very similar runs falsely combining results in summary

If the exact same tool is run with different settings (but all the other metadata, such as version, stays the same), summarize falsely combines them. E.g., running RGI on contigs and on reads, hAMRonizing each output, and then combining them in a summary will end up with an interactive summary implying that both the RGI contig and RGI-bwt results are from the same single run of RGI.

Solution: Summarize should treat each file separately and assign a new config # if multiple hAMRonized files are supplied; a rough sketch is below.

https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/summarize.py#L16
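
A rough sketch of that behaviour: tag each hAMRonized input with its own run index before concatenating, so two runs with identical metadata stay distinct (the column name and file paths here are assumptions):

    import pandas as pd

    # Sketch only: give every input report its own run index so that
    # otherwise-identical metadata (same tool, same version) no longer
    # collapses two separate runs into one summary entry.
    report_paths = ["rgi_contigs.tsv", "rgi_bwt.tsv"]  # placeholder paths
    frames = []
    for run_index, path in enumerate(report_paths):
        frame = pd.read_csv(path, sep="\t")
        frame["analysis_run"] = run_index  # hypothetical column name
        frames.append(frame)

    combined = pd.concat(frames, ignore_index=True)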

Flag to filter report for genomic/non-genomic audience

For the interactive or tabular report, possibly have an option to output just the summarised results (genome, gene, tool, versions, phenotype annotation) versus the full genomics results (i.e., the whole spec with start/stop, contig, coverage, etc.).

[BUG] `KeyError: 'reference_database_name'` when running summarize

Describe the bug

I get the following error when running summarize:

Warning: <_io.TextIOWrapper name='WAL001-megahit.mapping.potential.ARG.deeparg.json' mode='r' encoding='UTF-8'> report is empty
Traceback (most recent call last):
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/bin/hamronize", line 8, in <module>
    sys.exit(main())
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/hamronize.py", line 7, in main
    hAMRonization.Interfaces.generic_cli_interface()
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/Interfaces.py", line 299, in generic_cli_interface
    hAMRonization.summarize.summarize_reports(
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/summarize.py", line 752, in summarize_reports
    combined_reports = combined_reports.sort_values(
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/util/_decorators.py", line 317, in wrapper
    return func(*args, **kwargs)
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/frame.py", line 6886, in sort_values
    keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/frame.py", line 6886, in <listcomp>
    keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/generic.py", line 1849, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'reference_database_name'

Input

hamronize \
    summarize \
    <huge_list_of_jsons> \
    -t interactive \
    -o hamronization_combined_report.html

Input file
I can send a zip of everything privately if necessary (it includes unpublished data)

Error log
See above

hAMRonization Version
1.1.0

Desktop (please complete the following information):

  • OS: SUSE Linux Enterprise High Performance Computing 15 SP1
  • Version: hAMRonization 1.1.0

Obtain specification field data information from JSON schema

In the hAMRonizedResult class definition, the field terms for the parsers, and their value types, should be obtained from the schema JSON file rather than hardcoded into the tool. The necessary file is already provided in the schema/ directory. A parser should be included to retrieve this information directly from the file, making it easier to update the field terms when necessary.

⬇️
https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/hAMRonizedResult.py#L13-L52
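
A minimal sketch of the idea, assuming a JSON Schema file with a top-level properties mapping (the file path and the type mapping below are assumptions, not the repository's actual layout):

    import json

    # Sketch: derive field names and Python types from the schema JSON instead
    # of hardcoding them in hAMRonizedResult. Path and type mapping are assumed.
    TYPE_MAP = {"string": str, "integer": int, "number": float}

    with open("schema/antimicrobial_resistance_result.schema.json") as handle:
        schema = json.load(handle)

    field_types = {
        name: TYPE_MAP.get(spec.get("type"), str)
        for name, spec in schema["properties"].items()
    }

    # field_types could then be used to build the hAMRonizedResult dataclass
    # dynamically, e.g. via dataclasses.make_dataclass(...).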
