pha4ge / hamronization
Parse multiple Antimicrobial Resistance Analysis Reports into a common data structure
License: GNU Lesser General Public License v3.0
Need folks to come in and sanity check parser output
I pip installed this tool and am trying to run the following:
hamronize staramr staramr_out/detailed_summary.tsv --reference_database_version db_v_1 --analysis_software_version tool_v_1 --format tsv --output hamr_out/staramr_out.tsv
I get the following error:
Traceback (most recent call last):
File "/home/ewissel/miniconda3/bin/hamronize", line 8, in <module>
sys.exit(main())
File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/hamronize.py", line 7, in main
hAMRonization.Interfaces.generic_cli_interface()
File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 275, in generic_cli_interface
output_format=args.format)
File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 115, in write
first_result = next(self)
File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 75, in __next__
return next(self.hAMRonized_results)
File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/StarAmrIO.py", line 40, in parse
result['_gene_name'] = result['Gene']
KeyError: 'Gene'
I see that a new version was recently pushed. Should I use a different download method?
This is a suggestion to add the tool fARGene to the hAMRonization tool list. fARGene detects AMR genes based on pre-defined HMM models (provided together with the tool). It would be great to have the fARGene output standardized in the form of a hAMRonization summary as well. The output is described in the fARGene tutorial. I have attached an example output folder (command: fargene -i contigs.fasta --hmm-model class_a -o output_dir) here: output_dir.zip
Generate the same single hit in each tool's format for a simple test case
This should be made easier; currently you also have to run the parser script externally.
Add a parser for srst2
For the interactive or tabular report, possibly add an option to summarise just the key results (genome, gene, tool, versions, phenotype annotation) as well as the full genomic results (i.e., the whole spec with start/stop, contig, coverage, etc.).
Update documentation to clearly specify WHICH output file (and its name/naming scheme) the parsers are designed to work on for each tool.
Add same documentation to CLI for each relevant parser.
Our schemas currently only incorporate resistance gene detection information, but don't include fields that are relevant to point mutations. Point mutations (and other variants like indels) are important mechanisms of antibiotic resistance and several of our tools include that type of information in their output.
Gene detection information was incorporated first because it was generally deemed to be simpler and more consistent than point mutations, but our schema should support both types of information.
Describe the bug
The summarize function fails when no genes are present in the input files. This should be caught and handled by the summarize function.
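A minimal sketch of the guard the summarize function could apply before sorting the combined results; the `read_report` callable here is a hypothetical stand-in for however the real code loads each hAMRonized report:

```python
def summarize_reports(report_paths, read_report):
    """Combine hAMRonized reports, tolerating empty inputs.

    `read_report` is a hypothetical callable that returns a list of
    record dicts for a report file (an empty list for an empty report).
    """
    combined = []
    for path in report_paths:
        records = read_report(path)
        if not records:
            # Warn and skip rather than let an empty frame reach
            # the sort step, where the missing column would raise
            # KeyError: 'reference_database_name'.
            print(f"Warning: {path} report is empty")
            continue
        combined.extend(records)
    if not combined:
        # No genes detected in any input: return an empty summary
        # instead of crashing.
        return []
    return sorted(combined, key=lambda r: r.get("reference_database_name", ""))
```

The key point is that the empty case is decided before any column-based operation runs, so the pandas `sort_values` KeyError seen in the traceback below cannot occur.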
Add a parser for staramr
Describe the bug
I get the following error when running with summarize
Warning: <_io.TextIOWrapper name='WAL001-megahit.mapping.potential.ARG.deeparg.json' mode='r' encoding='UTF-8'> report is empty
Traceback (most recent call last):
File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/bin/hamronize", line 8, in <module>
sys.exit(main())
File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/hamronize.py", line 7, in main
hAMRonization.Interfaces.generic_cli_interface()
File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/Interfaces.py", line 299, in generic_cli_interface
hAMRonization.summarize.summarize_reports(
File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/summarize.py", line 752, in summarize_reports
combined_reports = combined_reports.sort_values(
File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/util/_decorators.py", line 317, in wrapper
return func(*args, **kwargs)
File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/frame.py", line 6886, in sort_values
keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/frame.py", line 6886, in <listcomp>
keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/generic.py", line 1849, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'reference_database_name'
Input
hamronize \
summarize \
<huge_list_of_jsons> \
-t interactive \
\
-o hamronization_combined_report.html
Input file
I can send a zip of the entire input privately if necessary (it includes unpublished data)
Error log
See above
hAMRonization Version
1.1.0
In each parser, I suggest adding validation of the expected file format (tsv, txt, json, ...) and structure (fields present), with a clean exit when validation fails.
If ranges of detected AMR genes overlap in genomic coords >90% then flag them in the summary html somehow.
Problem: indices are 1-based AND 0-based in different tools (and many tools don't have genomic coords at all)
At the moment, 'variant type' is set as non-mandatory but should probably be made mandatory. This attribute was added as part of the introduction of the variant spec.
There are several schema definition technologies available to us:
Ideally we would have a single 'authoritative' schema, and any other schema could be automatically derived from it. Which schema definition technology would make the most sense to use as the authoritative schema? Would it be possible to derive all the others from it in a robust and automated way?
For each parser, fields need to be tidied and coerced to the appropriate type if they are floats or ints.
This includes stripping "%" etc.
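A small sketch of what such a coercion helper could look like (only strings are touched, so values that are already numeric pass through unchanged):

```python
def tidy_value(value):
    """Coerce a raw string field from a report into int or float,
    stripping a trailing '%' first; non-numeric values pass through."""
    if not isinstance(value, str):
        return value
    cleaned = value.strip().rstrip("%")
    # Try int first so "660" stays an int; fall back to float for "75.852".
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue
    return value
```

Each parser could run this over the fields the schema declares as numeric before building the result object.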
Dear Finlay,
I am Giovanni Iacono from EFSA, we had a video call the other day.
The compilation of the Dockerfile from https://github.com/pha4ge/hAMRonization_workflow fails at task 18 RUN cd data/test && bash get_test_data.sh && cd ../..
The reason is that in the current repository https://github.com/pha4ge/hamronization the data folder is not present. This folder is present in https://github.com/pha4ge/hAMRonization_workflow.
Also a question: the Dockerfile in https://github.com/pha4ge/hAMRonization_workflow installs only the parsers, correct?
As part of our testing suite, we run flake8 against the parsers and the AntimicrobialResistanceGenomicAnalysisResult class, using GitHub Actions.
We can see in the logs that flake8 is finding issues, but the workflow steps are passing anyway. See:
https://github.com/pha4ge/harmonized-amr-parsers/runs/546839506?check_suite_focus=true#step:4:23
The linting workflow steps should fail if there are linting issues.
To handle concerns raised by @cimendes in #39, and to make issues with input selection clearer to users (e.g., #54), we should add proper use of the logging library, default to simple error messages, and add a --debug flag to argparse which displays the full traceback.
Add a boolean debug flag in the generic CLI parser (default False): https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/Interfaces.py#L217
Set up the logging levels based on args.debug: https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/Interfaces.py#L257 (e.g., https://stackoverflow.com/questions/14097061/easier-way-to-enable-verbose-logging)
Add a wrong-input-file exception using the logging library; specifically, add try: and except KeyError: at https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/Interfaces.py#L66 that explains to the user that the expected input columns can't be found and that they should check whether they are using the correct AMR prediction file.
Update the validation of input fields exception at https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/hAMRonizedResult.py#L57 to use logging + debug
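The steps above could be sketched roughly as follows; `parse_report` is a hypothetical stand-in for the real parser entry point, not the actual hAMRonization API:

```python
import argparse
import logging
import sys

def parse_report():
    # Hypothetical stand-in for a parser that hits a missing column
    # because the wrong report file was supplied.
    raise KeyError("Gene")

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true",
                        help="show debug logging and full tracebacks")
    args = parser.parse_args(argv)
    logging.basicConfig(
        level=logging.DEBUG if args.debug else logging.INFO,
        format="%(levelname)s: %(message)s")
    try:
        parse_report()
    except KeyError as exc:
        # Full traceback only when --debug is set; otherwise a
        # simple, actionable message.
        logging.debug("Full traceback:", exc_info=True)
        logging.error("Expected input column %s not found; check that "
                      "the correct AMR prediction file was supplied.", exc)
        sys.exit(1)
```

With this shape, default runs show only the one-line error, while `--debug` surfaces the traceback via `exc_info=True` at the DEBUG level.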
I noticed the JSON-LD spec has things like
"owl:equivalentClass": {
"@id": "edam_data:1050"
},
The cross-referencing to ontology terms is great, but I think owl:equivalentClass is too strong. It is harmless right now, but in future, if that were somehow converted into a native owl context for this and other terms (as a result of data federation etc.), it brings along reasoner baggage that might not be desired. For reaching across vocabularies, how about skos:closeMatch or skos:exactMatch? In particular, very few GenEpiO terms are used in owl context as rdf:Properties, so linking to them via owl:equivalentClass might lead to misinterpretation by some brainless computer somewhere.
After another round of GenEpiO edits I'll circle back to check out the term mappings here.
Cheers!
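For illustration, the suggested weaker mapping could look something like the following, keeping the same edam_data target as the snippet above (the surrounding term name and skos prefix declaration are assumptions, not the spec's actual layout):

```json
{
  "@context": {
    "skos": "http://www.w3.org/2004/02/skos/core#"
  },
  "gene_symbol": {
    "skos:closeMatch": {
      "@id": "edam_data:1050"
    }
  }
}
```

skos:closeMatch asserts similarity without entailing class equivalence, so an OWL reasoner consuming the federated data would not merge the two terms.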
Should we ignore point mutations in the output for RGI + AMRFinderPlus, as the specification does not support them?
The amrfinder parsers are saving the amino acid reference length in the reference_gene_length field.
Currently, the only output options for parsers are tsv or json printed to stdout.
While users can redirect from the CLI, it might be nice to provide an output_file option.
Adjust the resfinder parser to support output from resfinder v4.
Ideally, this will still be a single parser that auto-detects resfinder v3 or v4 output and parses each appropriately, à la https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/RgiIO.py#L23 and https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/AmrFinderPlusIO.py#L22
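One possible shape for the auto-detection, sketched as a content-sniffing heuristic: it assumes v4 reports are JSON while v3 reports are tab-delimited text, which should be verified against actual ResFinder output before relying on it:

```python
def detect_resfinder_format(first_chunk):
    """Guess whether report text is ResFinder v4 JSON or v3
    tab-delimited output by sniffing its first non-blank characters.

    Purely an illustrative heuristic: the real distinguishing
    features of v3 vs v4 reports should be checked against actual
    output files before this is trusted in a parser.
    """
    head = first_chunk.lstrip()
    if head.startswith(("{", "[")):
        return "v4-json"
    return "v3-tsv"
```

A single parser entry point could call this on the first kilobyte of the file and dispatch to a version-specific parse function, mirroring the dispatch style in RgiIO and AmrFinderPlusIO.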
The AntimicrobialResistanceGenomicAnalysisResult.read() method takes a dictionary as input and loads values into the class by matching the keys in the dictionary against the attribute names in the class.
Since each dictionary lookup could fail, each lookup is wrapped in a try:/except: block. This leads to a really verbose (and inefficient?) implementation.
There may be a simpler way to convert from a dict to our AntimicrobialResistanceGenomicAnalysisResult class via a namedtuple and/or a dataclass.
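The dataclass route could look roughly like this; the class below is a trimmed, illustrative stand-in with only a few of the real class's attributes:

```python
from dataclasses import dataclass, fields

@dataclass
class AnalysisResult:
    # Illustrative subset of the real result class's attributes.
    gene_symbol: str = None
    gene_name: str = None
    reference_accession: str = None

def read_result(data):
    """Build a result from a dict by intersecting its keys with the
    dataclass fields, replacing the per-key try/except blocks."""
    known = {f.name for f in fields(AnalysisResult)}
    return AnalysisResult(**{k: v for k, v in data.items() if k in known})
```

One dict comprehension replaces dozens of try/except blocks, and unknown keys are silently dropped (or could be logged, if stricter behaviour is wanted).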
Hello,
We are using hAMRonization in a pipeline (nf-core/funcscan), and I saw there was a new version of the tool, so I went to update it in our pipeline.
However, when I did so, I saw the update wasn't on Bioconda, and when I tried to update the recipe, the CI test failed, saying that 1.0.4 doesn't exist on PyPI.
Secondly, when I built the package locally myself from the tarball under the releases page, I saw that running hamronize --version still reports 1.0.4.
It would be good to have a release update with the correct version
(or, ideally, if possible, a 1.0.5 with a fix for #66 included. I would try to contribute this myself as it doesn't seem complicated, but I'm not a Python dev, unfortunately.)
To facilitate adding to production workflows we should make a galaxy tool wrapper for hAMRonization.
There are special converter type tools that should make this a little easier.
Describe the bug
The generated output HTML uses inline JavaScript code. This is a violation of CSP rules.
Input
Any input
Input file
Any
Error log
NA
hAMRonization Version
NA
Expected behavior
The generated output file follows CSP rules.
Automated validation of results needs to be added.
Hi devs, I'm trying to run hamronize on some results I generated from the latest resfinder docker image. Here is the command I am trying to run:
hamronize resfinder resfinder/resfinder/results/ResFinder_results_tab.txt --reference_database_version db_v_1 --analysis_software_version tool_v_1 --output hamr_out/resfinder_out.tsv
I get the following error:
Traceback (most recent call last):
File "/home/ewissel/miniconda3/bin/hamronize", line 8, in <module>
sys.exit(main())
File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/hamronize.py", line 7, in main
hAMRonization.Interfaces.generic_cli_interface()
File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 275, in generic_cli_interface
output_format=args.format)
File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 115, in write
first_result = next(self)
File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/Interfaces.py", line 75, in __next__
return next(self.hAMRonized_results)
File "/home/ewissel/miniconda3/lib/python3.7/site-packages/hAMRonization/ResFinderIO.py", line 48, in parse
report = json.load(handle)
File "/home/ewissel/miniconda3/lib/python3.7/json/__init__.py", line 296, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/home/ewissel/miniconda3/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/home/ewissel/miniconda3/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/ewissel/miniconda3/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
The only json produced by my resfinder run is std_format_under_development.json, and I don't think hamronize is indicating it wants to use this file over the resfinder results table. Is this an issue with my resfinder run (does hamronize expect resfinder to produce different jsons), an issue with my input arguments, or something else?
Create a summary report of just the AMR genes detected per genome with linked detailed reports.
One line per sample
One line per software
With the following single-entry output this is currently what is being parsed:
OqxA.3003470.EU370913.4407527-4408202.4553 266 657 3D648M6D
The metadata passed is the following:
metadata = {"analysis_software_version": "0.0.1", "reference_database_version": "2019-Jul-28", "input_file_name": "Dummy", 'reference_database_id': "argannot"}
This is the current output:
assert result.input_file_name == 'Dummy'
assert result.gene_symbol == 'OqxA'
assert result.gene_name == 'OqxA.3003470.EU370913'
assert result.reference_database_id == 'argannot'
assert result.reference_database_version == '2019-Jul-28'
assert result.reference_accession == 'OqxA.3003470.EU370913.4407527-4408202.4553'
assert result.analysis_software_name == 'groot'
assert result.analysis_software_version == '0.0.1'
assert result.reference_gene_length == 657
assert result.coverage_depth == 266
As you can see, the gene_symbol, gene_name and reference_accession fields are all storing the same information.
I'm having a bit of trouble mapping the fields in OqxA.3003470.EU370913.4407527-4408202.4553 to the spec. The gene_name is technically not present in the report. Should we store the same value as gene_symbol, or keep it as None?
For the reference_accession, shouldn't we keep just the EU370913 value? I'm unsure what 3003470 represents, as well as the 4407527-4408202.4553.
Any input is welcome!
If the exact same tool is run with different settings (but all the other metadata, such as version, stays the same), summarize falsely combines them. E.g., running RGI on contigs and on reads, hamronizing each output, and then combining them in a summary will produce an interactive summary implying that both the RGI contig and RGI-bwt results come from the same single run of RGI.
Solution: summarize should treat each file separately and assign a new config # if multiple hAMRonized files are supplied.
https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/summarize.py#L16
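One way to sketch the proposed fix is to tag every record with an identifier derived from its source file before combining, so identical tool/version metadata from different runs can no longer collapse together (the `parsed_reports` shape and key names here are assumptions for illustration):

```python
def tag_reports(parsed_reports):
    """Attach a distinct run identifier to the records of each
    hAMRonized input file.

    `parsed_reports` maps file name -> list of record dicts
    (a hypothetical shape, not the real summarize internals).
    """
    tagged = []
    # Sort by file name so run ids are deterministic across runs.
    for run_id, (fname, records) in enumerate(sorted(parsed_reports.items()), 1):
        for record in records:
            tagged.append({**record, "run_id": run_id, "source_file": fname})
    return tagged
```

Downstream grouping would then key on (tool, version, run_id) rather than tool metadata alone, keeping RGI-on-contigs and RGI-bwt results apart.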
Hi Finlay,
when running ORFs through the RGI-hAMRonization pipeline, the ORF_ID (important for the final hAMR report) is skipped.
Here https://github.com/SvetlanaUP/hAMRonization/blob/master/hAMRonization/RgiIO.py you can see that I fixed the ORF_ID entry in self.field_mapping manually (fixed 'ORF_ID': 'None' and 'Contig': 'input_sequence_id', which should be the opposite).
This solution works for us, but maybe in future it would be good to have this defined in RgiIO.py, e.g. if there is no 'Contig', use 'ORF_ID'.
Thanks,
Svetlana
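The "if there is no 'Contig' use 'ORF_ID'" suggestion could be sketched as a small helper rather than a hard-coded field mapping; the real fix would live in RgiIO's field mapping, and the empty-value sentinels checked here are assumptions about what RGI rows may contain:

```python
def input_sequence_id(rgi_row):
    """Pick the input sequence id from an RGI row, falling back to
    ORF_ID when Contig is absent or empty (sketch of the suggested
    behaviour, not the actual RgiIO implementation)."""
    contig = rgi_row.get("Contig")
    if contig not in (None, "", "None"):
        return contig
    return rgi_row.get("ORF_ID")
```

This keeps contig-based runs unchanged while ORF inputs retain their ORF_ID in the final hAMRonized report.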
For citing hamronization in publications, it would be helpful to have a persistent DOI. This could be easily achieved by linking the GitHub repo to Zenodo, see the GitHub referencing docs.
With the following single-entry output this is currently what is being parsed:
Sample DB gene allele coverage depth diffs uncertainty divergence length maxMAF clusterid seqid annotation
Dummy ResFinder oqxA oqxA 100.0 75.852 1snp 0.152 660 0.037 470 1995 oqxA_1_V00622; V00622; fluoroquinolone
The metadata passed is the following:
metadata = {"analysis_software_version": "0.0.1", "reference_database_version": "2019-Jul-28", "input_file_name": "Dummy", "reference_database_id": 'resfinder'}
This is the current output:
assert result.input_file_name == 'Dummy'
assert result.gene_symbol == 'oqxA'
assert result.gene_name == 'oqxA'
assert result.reference_database_id == 'ResFinder'
assert result.reference_database_version == '2019-Jul-28'
assert result.reference_accession == '1995'
assert result.analysis_software_name == 'srst2'
assert result.analysis_software_version == '0.0.1'
assert result.coverage_percentage == 100
assert result.reference_gene_length == 660
assert result.coverage_depth == 75.852
My question is regarding the reference_database_id that is currently required in the metadata but is being (correctly!) parsed from the report file. I suggest removing this from the required metadata fields.
In pha4ge/hAMRonization/blob/master/hAMRonization/RgiIO.py, there is a typo in line 79: 'Percentage Length of '
HTH,
Svetlana
Tests need to be implemented for the new parsers.
The following parsers need to be updated to comply with the new spec:
The following parsers currently are skipping the variant detection results:
The following parsers should be added:
When a file is passed without the expected format/fields, a generic error is thrown. This could easily be handled with a custom exception informing the user why it is failing.
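A minimal sketch of such a custom exception; the exception name and the header-check helper are illustrative, not part of the existing codebase:

```python
class UnsupportedReportError(ValueError):
    """Raised when an input report lacks the columns a parser expects."""

def check_required_fields(header, required, tool_name):
    """Validate a parsed header row against a parser's required
    columns, raising a user-friendly error instead of a bare KeyError."""
    missing = [f for f in required if f not in header]
    if missing:
        raise UnsupportedReportError(
            f"Input does not look like {tool_name} output: missing "
            f"column(s) {', '.join(missing)}; check that the correct "
            "report file was supplied.")
```

Each parser would call this once after reading the header, so a wrong file fails fast with a message naming the tool and the missing columns.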
In the hAMRonizedResult class definition, the field terms for the parsers, and their value types, should be obtained from the schema JSON file rather than hardcoded into the tool. The necessary file is already provided in the schema/ directory. A parser should be included to retrieve this information directly from the file, facilitating updates to the field terms when necessary.
⬇️
https://github.com/pha4ge/hAMRonization/blob/master/hAMRonization/hAMRonizedResult.py#L13-L52
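A sketch of loading field names and types from the schema instead of hardcoding them. This assumes a JSON-Schema-like layout with a top-level "properties" mapping; the actual file in schema/ may be structured differently:

```python
import json

def load_schema_fields(schema_json_text):
    """Derive {field name: Python type} from a schema JSON document.

    Assumes a JSON-Schema-style top-level 'properties' mapping with
    'type' entries; unknown types default to str.
    """
    type_map = {"string": str, "integer": int, "number": float}
    schema = json.loads(schema_json_text)
    return {name: type_map.get(spec.get("type"), str)
            for name, spec in schema.get("properties", {}).items()}
```

The resulting mapping could then drive dataclass construction and value coercion, so a schema change no longer requires touching lines 13-52 of hAMRonizedResult.py by hand.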