arkadiy-garber / fegenie

HMM-based identification and categorization of iron genes and iron gene operons in genomes and metagenomes

License: GNU Affero General Public License v3.0

Python 92.07% Shell 2.76% R 4.71% Dockerfile 0.46%

iron metagenomics annotation genes magnetosome oxidation reduction siderophore transporter schubert

fegenie's Introduction

FeGenie

Please see the Wiki page for an introduction and a tutorial on how to use this tool.

Citing FeGenie:

Garber AI, Nealson KH, Okamoto A, McAllister SM, Chan CS, Barco RA and Merino N (2020) FeGenie: A Comprehensive Tool for the Identification of Iron Genes and Iron Gene Neighborhoods in Genome and Metagenome Assemblies. Front. Microbiol. 11:37. doi: 10.3389/fmicb.2020.00037

Special thanks to Michael Lee for helping to put together the Conda environment for FeGenie. Thanks to Natasha Pavlovikj for creating the Conda recipe for FeGenie. Thanks to Michał Sitko for creating a Dockerfile for FeGenie.

Easy Installation (if you have Conda installed)

conda create -n fegenie -c conda-forge -c bioconda -c defaults fegenie=1.0 --yes
conda activate fegenie
FeGenie.py -h

When you are done using FeGenie, deactivate its Conda environment:

conda deactivate

Installation (if you don't have Conda)

git clone https://github.com/Arkadiy-Garber/FeGenie.git
cd FeGenie
bash setup.sh
./FeGenie.py -h

Quick-start

FeGenie.py -bin_dir /directory/of/bins/ -bin_ext fasta -t 16

The argument for -bin_ext needs to represent the filename extension of the FASTA files in the selected directory that you would like analyzed (e.g. fa, fasta, fna, etc).
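
As a rough illustration of how the -bin_ext value relates to the files in -bin_dir, the sketch below lists the files in a directory whose names end with a given extension. This is not FeGenie's own code: the helper name list_bins is hypothetical, and it assumes that selection is a plain filename-suffix match.

```python
from pathlib import Path


def list_bins(bin_dir, bin_ext):
    """Return sorted names of files in bin_dir that end in '.<bin_ext>'.

    Hypothetical helper: assumes selection is a plain filename-suffix match.
    """
    # Accept both "fasta" and ".fasta" as the extension argument.
    suffix = "." + bin_ext.lstrip(".")
    return sorted(
        p.name
        for p in Path(bin_dir).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )
```

Calling list_bins("/directory/of/bins/", "fasta") before a run is a quick way to confirm that the extension you pass actually matches the files you expect to be analyzed.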

Quick-start (if you installed using the 'setup_noconda.sh' script)

./FeGenie.py -bin_dir /directory/of/bins/ -bin_ext fasta -t 16 -out output_fegenie

The hmms/iron directory can be found within FeGenie's main repository. -t 16 means that 16 threads will be used for HMMER and BLAST. If you have fewer than 16 available on your system, set this number lower (default = 1).

Tutorial (Binder)

FeGenie introductory slideshow:

Content | Video presentation

FeGenie video tutorial:

Content | Video presentation

To start the tutorial, hit the 'launch binder' button below, and follow the commands in 'Walkthrough'

(Initially forked from here. Thank you to the awesome binder team!)

Walkthrough

Enter the main FeGenie directory

cd FeGenie

Print the FeGenie help menu

FeGenie.py -h

Run FeGenie on the test dataset

FeGenie.py -bin_dir genomes/ -bin_ext fna -out fegenie_out

Go into the output directory and check out the output files

cd fegenie_out
less FeGenie-geneSummary-clusters.csv

Run FeGenie on gene calls

FeGenie.py -bin_dir ORFs/ -bin_ext faa -out fegenie_out --orfs

Run FeGenie on gene calls, and use a reference database (RefSeq sub-sample) for cross-validation

FeGenie.py -bin_dir ORFs/ -bin_ext faa -out fegenie_out --orfs -ref refseq_db/refseq_nr.sample.faa

Running with docker

To run FeGenie with Docker, the only dependency you need to have installed is Docker itself (installation guide).

With docker installed you can run FeGenie in the following way:

docker run -it -v $(pwd):/data --env iron_hmms=/data/hmms/iron --env rscripts=/data/rscripts note/fegenie-deps ./FeGenie.py -bin_dir /data/test_dataset -bin_ext txt -out fegenie_out -t $(nproc)

./FeGenie.py ... follows normal, non-dockerized flow of arguments.

Beware that you need to mount the directories containing the files FeGenie is supposed to read. If you are not familiar with Docker, run the docker run command from the directory into which you cloned the FeGenie repository. If all the files you pass to FeGenie are inside this directory and you use relative file paths (e.g. hmms/iron), everything will work just fine.

Upcoming Updates (we welcome more suggestions, which can be submitted as an Issue)

Ability to accept previously-annotated genomes and gene-calls.
Include Cytochrome 579 (and possibly rusticyanin)
Improve delineation between MtrA and MtoA for better resolution with respect to identification of iron reduction and iron oxidation, respectively.
Option to report absolute values for gene counts (rather than normalized gene counts)
Include option to release all results (regardless of whether rules for reporting were met)
Identification of iron-sulfur proteins.

fegenie's People


note thexiyang astrobiomike jameyzhu rui0511z enuhblaise nancy-merino vicmarin789 coryklujeske drrumble jbitencourt

fegenie's Issues

No protein sequence for Cyc2

Hey! Thanks for your tool :)

I am looking for genes involved in iron oxidation in genomes and MAGs. When Cyc2 is found, there is no protein sequence associated with it. It is the only gene that is noted as "empty" in the protein_sequence column of the geneSummary outputs. Is this normal? If not, why do I always get this result?

Thanks,

Eva

Use of ORFs

Hi,

Thank you for your tool. I am currently using FeGenie on contigs and have tried to use it on ORFs as well. My results are quite different and those for ORFs are incoherent. I have the impression that in the current version, there is a problem with the use of ORFs. I just saw that you have put in your Upcoming Updates "the ability to accept previously annotated genomes and gene calls". Is this coming soon?

Thank you!

Error "hmmsearch: not found ... local variable 'hmmout' referenced before assignment" while running FeGenie

Hi, I'm interested in identifying iron-related genes within a metatranscriptomics dataset. I have tried providing a FASTA file with gene (nucleotide) or protein sequences as shown below, but I hit an error saying that hmmsearch was not found. Please see below.

Thanks! Ana~

command: FeGenie.py -bin_dir ORFs/ -bin_ext fasta -out TEST_out --orfs

All required arguments provided!

HMM
AsbE_petrobactin_synthesis-rep
Cyc1
Cyc2_repCluster1
.
.
.
GACE_1846
GACE_1847
nramp
zip
starting main pipeline...

2
Looking for following iron-related functional category: iron_storage
sh: 1: hmmsearch: not foundta: 33%
rm: cannot remove 'TEST_out/test_proteins.fasta-HMM/Transferrin_TbpB_binding_protein_Haemophilus_influenzae_P44971.hmm.txt': No such file or directory
FeGenie cannot find the correct hmmsearch output files. If you provided gene or ORF-call sequences, please be sure to specify this in the command using the '--orfs' flag
Traceback (most recent call last):
File "/home/.../FeGenie/FeGenie.py", line 3083, in <module>
main()
File "/home/.../FeGenie/FeGenie.py", line 821, in main
for line in hmmout:
UnboundLocalError: local variable 'hmmout' referenced before assignment

'Namespace' object has no attribute 'hbm'

Hi Arkadiy,

I am trying to use FeGenie to identify proteins involved in iron reduction.

My inputs are two amino acid FASTA files from two metatranscriptome datasets. They contain the significant genes estimated by edgeR.
I ran the command:
(fegenie) sxxn@Seans-MacBook-Pro ~ % FeGenie.py -bin_dir /Volumes/SEAN/edgeR_DEG/ --orfs --skip -bin_ext fasta -out fegenie_out

Then I got this:
Ok, proceeding with analysis!
All required arguments provided!

Traceback (most recent call last):
File "/Users/sxxn/miniconda3/envs/fegenie/bin/FeGenie.py", line 3020, in <module>
main()
File "/Users/sxxn/miniconda3/envs/fegenie/bin/FeGenie.py", line 2744, in main
if args.hbm:
^^^^^^^^
AttributeError: 'Namespace' object has no attribute 'hbm'

and here is my input fasta:

4085_2
MRLTTKGRFAVTAMIDVALRQHAGPVTLAGIAERQKISLSYLEQLFGKLRRNQLVASTRGPGGGYTLAKPLAAVSVADIISAVDEPLDATSCGGRGNCHDDHPCMTHDLWMSLNARMHEYLSSVNLDHLVRQQGIKGCANEAQPVTIGKAPARRIPVMATA*
7_1
MKVTKIFTHGLLWVIIVMLVIPPGIMAQDTEETVQSAQFNEEELAQMLAPIALYPDSLIAEILMASTYPIEVVEAER
9_1
MKGMKIYIKGLSWVIIVMLMMPPGLMAQDSGQIEQPVKFNKEELAQMLAPIALYPDSLIAQILMASTYPIEVVEAERWIRKNKNLTGDELDSALQEKTWDPSVKSLCHFPDILYAMSEKLDQTTKLGDAFLSQQDEVMDTIQELRRKAQEQGNLTTTKEQKVIVEQETIYIEPANPEIIYVPAYDPLYVYGPWWYPAYP

Any idea how I can fix this error?

Thanks,
Sean

ValueError: could not convert string to float: 'EMPTY' in -bams option

Running FeGenie using -bams options with 4 metagenomes, each with their own single bam file. FeGenie proceeds through the mapping of the first set of contigs then crashes. This happens on the first file no matter what (I have put them in different orders) and each individual metagenome and bam file work fine when run singly with the -bam option.

processing... P1DNA.contigs.fa
Output depth matrix to geomics_bam_test_out/P1DNA.contigs.fa.depth
jgi_summarize_bam_contig_depths 2.15 (Bioconda) 2020-07-03T11:59:07
Output matrix to geomics_bam_test_out/P1DNA.contigs.fa.depth
0: Opening bam: /Users/gabrielle/Desktop/GenomeFun/FeGenie/Testing/BamFiles/P1DNA.bam
Processing bam files
Thread 0 finished: P1DNA.bam with 3840282 reads and 2255380 readsWellMapped
Creating depth matrix file: geomics_bam_test_out/P1DNA.contigs.fa.depth
Closing most bam files
Closing last bam file
Finished
processing... P1DNA.contigs.fa
Traceback (most recent call last):
  File "/Users/gabrielle/Desktop/GenomeFun/FeGenie/FeGenie.py", line 2707, in <module>
    main()
  File "/Users/gabrielle/Desktop/GenomeFun/FeGenie/FeGenie.py", line 2341, in main
    Dict[cell][process].append(float(depthDict[cell][contig]))
ValueError: could not convert string to float: 'EMPTY'

Note--this was happening last week as well but I was worried it was perhaps because of how I had gotten the single -bam option to work by the metabat install, so I did not post then. Attaching my bam location file just in case.
Bamlocations2.txt
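
The traceback above shows float() being called on the literal string 'EMPTY'. A minimal sketch of tolerating such placeholders when reading depth values is given below. The function names parse_depth and mean_depth are hypothetical, and skipping non-numeric fields is an assumption about what a reasonable fallback could be, not a description of FeGenie's behavior.

```python
def parse_depth(value, default=None):
    """Convert a depth field to float; return default for placeholders like 'EMPTY'."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # Non-numeric fields (for example the string 'EMPTY') land here.
        return default


def mean_depth(values):
    """Average the numeric depth values, ignoring fields that are not numbers."""
    numeric = [d for d in (parse_depth(v) for v in values) if d is not None]
    return sum(numeric) / len(numeric) if numeric else None
```

Whether a missing depth should be skipped, treated as zero, or reported as an error depends on what 'EMPTY' means for that contig, which the issue does not establish.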

Error Syntax warning and location

Hi,
I am encountering the following error when I am trying to run the software. Any ideas on how to resolve it?

About iron-sulfur proteins

Hi, thanks for your good work. I want to identify the iron-sulfur proteins in MAGs. I see that you plan to add this function in the near future. When will it be available? Is there a tool for this? Thank you!

Why do the FeGenie installed by conda and the FeGenie installed manually give very different results?

Hi,

Thanks for the tool. The annotations I obtained from FeGenie installed by conda were few, so I tested both the conda-installed FeGenie and the manually installed FeGenie. The results of the two approaches are very different. Why?

Thank you!

Issue with Installation

Please help. Running the command (bash setup_noconda.sh) returns an error as given below:

test_dataset/
test_dataset/Aggregatibacter_actinomycetemcomitans.txt
test_dataset/Mariprofundus_ferrooxydans_PV-1.txt
test_dataset/Streptococcus_mutans.txt
test_dataset/Pseudomonas_aeruginosa_PA01.txt
test_dataset/Porphyromonas_gingivalis.txt
test_dataset/Rhodopseudomonas_palustris_TIE-1.txt
test_dataset/Geobacter_bemidjiensis.txt
test_dataset/Shewanella_oneidensis_MR-1.txt
test_dataset/Magnetospirillum_magneticum_AMB-1.txt
test_dataset/Rhodopseudomonas_palustris_TIE-1.txt-proteins.faa
test_dataset/Acidithiobacillus_ferrooxidans.txt
Installing package into ‘/usr/lib64/R/library’
(as ‘lib’ is unspecified)
Warning in install.packages("grid", repos = "http://cran.us.r-project.org") :
'lib = "/usr/lib64/R/library"' is not writable
Error in install.packages("grid", repos = "http://cran.us.r-project.org") :
unable to install packages
Execution halted
Error: unexpected end of input
Execution halted
setup_noconda.sh: line 8: syntax error near unexpected token `('
setup_noconda.sh: line 8: `Rscript -e 'install.packages("ggplot2", repos = "http://cran.us.r-project.org")''

How can I overcome this issue?

An error with FeGenie installed without conda (hmm path problem)

Hi,

I've encountered an error during running FeGenie installed without conda.

The error message is as below.

#######
Traceback (most recent call last):
File "./FeGenie.py", line 458, in main
test = open(bits)
FileNotFoundError: [Errno 2] No such file or directory: '/HMM-bitcutoffs.txt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./FeGenie.py", line 1984, in <module>
main()
File "./FeGenie.py", line 466, in main
location = allButTheLast(location, "/")
#######

This error can be solved by manually setting a path to "iron_hmms".
$ export iron_hmms=/path_to/FeGenie/hmms/iron/

Do I have to set the hmm path for FeGenie installed without conda?

Thanks.
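
The workaround above relies on the iron_hmms environment variable, which the Docker command in the README also sets. The sketch below shows one way a script could resolve the HMM directory: use iron_hmms when it is set, and otherwise fall back to hmms/iron next to the script. The function name resolve_hmm_dir is hypothetical, and the fallback layout is an assumption based on the README's statement that hmms/iron lives in the main repository; it is not FeGenie's actual lookup logic.

```python
import os


def resolve_hmm_dir(script_path, environ=None):
    """Pick the HMM directory: $iron_hmms if set, else <script dir>/hmms/iron.

    Hypothetical helper; the fallback location is an assumption.
    """
    environ = os.environ if environ is None else environ
    from_env = environ.get("iron_hmms")
    if from_env:
        # normpath drops a trailing slash such as in ".../hmms/iron/".
        return os.path.normpath(from_env)
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(script_dir, "hmms", "iron")
```

Resolving the directory from the script's own path avoids depending on which FeGenie.py happens to be first on PATH.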

Where to hmms/iron

According to the README, the hmms/iron directory can be found within FeGenie's main repository. I was assuming that you meant the github repository, but I can't find it there. Can you point me in the right direction as to where to find this?

How to use -bams for a coassembly of some metagenomes

Dear Arkadiy Garber,

I just watched your FeGenie lesson and tutorial in the BVCN and, first of all, I wanted to thank you for the nice tool (and teaching materials) you & your colleagues released!
I could easily install FeGenie in our server and now I am excited to give it a try.

I have a set of bins generated with several binning tools + das tool from 4 co-assembled metagenomes, so I have 4 bam files to which all bins relate.

My question is: would -bams work if a provide a tab-delimited file with all bin names in the first column and then, in the second column, a pathway to the folder containing the 4 bam files without specifying each bam file name?

From the tutorial I get this function is possible by having each bam file in a different column, right? In the near future I will get 12 co-assembled metagenomes and it would be easier to just point to a folder with all bam files :D

If this function is not supported yet, here is my suggestion :)
Thanks and best regards,
Paula Dalcin Martins
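
Until pointing -bams at a folder is supported, a small script could generate the map file from a folder of BAM files. The sketch below assumes the layout described in this post (bin file name in the first column, then one BAM path per column, tab-delimited); that layout should be checked against the FeGenie wiki before use. The function names build_bam_map and write_bam_map are hypothetical.

```python
from pathlib import Path


def build_bam_map(bin_dir, bin_ext, bam_dir):
    """Return tab-delimited lines: bin file name, then every .bam path in bam_dir.

    Assumed layout (taken from the question above, not verified against FeGenie):
    column 1 is the bin file name, the remaining columns are BAM files.
    """
    bams = sorted(str(p) for p in Path(bam_dir).glob("*.bam"))
    bins = sorted(p.name for p in Path(bin_dir).glob("*." + bin_ext))
    return ["\t".join([name] + bams) for name in bins]


def write_bam_map(lines, out_path):
    """Write the map lines to out_path, one bin per line."""
    Path(out_path).write_text("\n".join(lines) + "\n")
```

With 12 co-assembled metagenomes, every bin line would simply carry 12 BAM columns, so the file never has to be edited by hand.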

FileNotFoundError on "Easy Installation (if you don't have Conda)"

Hello, congratulations for your work.

I had an error while executing the last command in the "Easy Installation (if you don't have Conda)" instructions:

./FeGenie.py -h

which gives me:

Traceback (most recent call last):
  File "./FeGenie.py", line 459, in main
    test = open(bits)
FileNotFoundError: [Errno 2] No such file or directory: '/HMM-bitcutoffs.txt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./FeGenie.py", line 2170, in <module>
    main()
  File "./FeGenie.py", line 467, in main
    location = allButTheLast(location, "/")
UnboundLocalError: local variable 'location' referenced before assignment

After a little digging, I changed line 462 of the FeGenie.py file from:

os.system("which FeGenie.py > mainDir.txt")

to:

os.system("which ./FeGenie.py > mainDir.txt")

This solved the problem for me. Posting it just in case anyone has the same error.

Annotation collisions and confusion about HMM model names

Hello Arkadiy,

I used FeGenie and I think it works very well and is simple to use. I am not sure where I can find more information about some of the HMM model calls. I checked the publication text and supplementary files, but can't find what the names DFE_0465 and DFE_0448 refer to.

This came up because I'm trying to reconcile some results between different annotation software.
I've attached the few instances where KEGG and FeGenie annotations disagree.
FeGenieCollisions.txt

I was able to find closely related sequences in NCBI refseq_protein using BLASTp, which I summarize below.

BLASTp calls the proteins identified as FmnB as FAD:protein FMN transferase.
BLASTp calls the proteins identified as DmkB and DmkA a polyprenyl synthetase family protein and a UbiA prenyltransferase protein, which I think is what DmkA and DmkB are.
BLASTp and KEGG agree that the proteins identified as DFE_0465 and DFE_0448 are the cytochrome c proteins shown in the attachment.

Any help on understanding how to reconcile these annotations and on where to find more information for the HMMs included in the software would be great.

Thanks,
Keith

Where was the data for protein DmkA obtained?

Hi,

Thank you for your tool. We would like to ask where the data for protein DmkA was obtained. We couldn't find a DmkA protein sequence in the NR database, and when the proteins annotated as DmkA by FeGenie were re-annotated against the NR database, the results were 1,4-dihydroxy-2-naphthoate octaprenyltransferase, prenyltransferase, etc.

Thank you!

Undetermined error

Hello, I ran FeGenie successfully (thanks).
I was wondering what the error message at the end of the job (the traceback below) means. Could you tell me?

I ran the following command:
./FeGenie.py -bin_dir ./FeGenie_bins -bin_ext fasta -t 4 -out output_fegenie5
checking arguments
.
.
.
All required arguments provided!
Finding ORFs for MTL4.1-CDSs.fasta
starting main pipeline...
.
Looking for following iron-related functional category: iron_storage
analyzing MTL4.1-CDSs.fasta: 100%
.
.........
.........
.
Looking for following iron-related functional category: magnetosome_formation
analyzing MTL4.1-CDSs.fasta: 100%
Consolidating summary files into one master summary file
Identifying genomic proximities and putative operons
Traceback (most recent call last):
File "./FeGenie.py", line 2170, in <module>
main()
File "./FeGenie.py", line 843, in main
CoordDict[i][contig].append(int(numOrf))
ValueError: invalid literal for int() with base 10: '130050|ID:75309160|bioA'

geneSummary file

Hi Arkadiy,

Just a quick question -- are the outputs in the geneSummary files showing all encoded HMM hits and their ORFs or are these hits that are both encoded and have coverage in the supplied BAMs? Thanks!!

Best,
Joy

Easy install with conda didn't locate packages

Hi. I'm trying to install FeGenie with conda (v4.8.3) but when I run the first command line for the conda install I get the following error messages:

Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package clone
E: Unable to locate package https://github.com/Arkadiy-Garber
E: Couldn't find any package by glob 'https://github.com/Arkadiy-Garber'
E: Couldn't find any package by regex 'https://github.com/Arkadiy-Garber'

I'm running Ubuntu 18.04.4 in case that helps.

bam option looks for jgi_summarize_bam_contig_depths

Hi Arkadiy
Thanks for your work developing FeGenie. I installed it on a MacBookPro with Conda and the basic commands are working on my dataset, but when I tried to use the -bam option I got the following error.

processing... GeoMICS_all_contigs.fa
sh: jgi_summarize_bam_contig_depths: command not found
processing... GeoMICS_all_contigs.fa
Traceback (most recent call last):
  File "/Users/gabrielle/Desktop/GenomeFun/FeGenie/FeGenie.py", line 2293, in main
    depth = open("%s/contigDepths/%s.depth" % (args.out, cell))
FileNotFoundError: [Errno 2] No such file or directory: 'geomics_all_bam_test_out/contigDepths/GeoMICS_all_contigs.fa.depth'

I recognized this script name from my prior efforts at binning this metagenome, so I just installed metabat2 in the same Conda environment and now the bam option appears to work. I am sure this is not the preferred solution, posting here in case it helps someone else or to get a better fix.
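
Both this report and the earlier "hmmsearch: not found" report fail partway through a run because an external program is missing from PATH. A minimal sketch of checking for such programs up front with shutil.which is shown below. The helper name missing_tools is hypothetical, and the tool names are taken only from the error messages in these issues; they are not a complete list of FeGenie's dependencies.

```python
import shutil


def missing_tools(tools):
    """Return the subset of tool names that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]


# Names taken from error messages in the issues; not a full dependency list.
REQUIRED = ["hmmsearch", "jgi_summarize_bam_contig_depths"]

if __name__ == "__main__":
    absent = missing_tools(REQUIRED)
    if absent:
        print("Missing from PATH: " + ", ".join(absent))
    else:
        print("All listed tools found.")
```

Running a check like this inside the activated Conda environment shows immediately whether installing an extra package (metabat2 in the case above) is needed before using -bam or -bams.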

Error in hclust(d = dist(x = fegenie.scaled))

Error in hclust(d = dist(x = fegenie.scaled)) :
NA/NaN/Inf in foreign function call (arg 10)
Calls: as.dendrogram -> hclust
Execution halted

'rm: cannot remove (...): Directory not empty' error?

Hi there,

Just installed FeGenie (again congrats on the tool!) via conda and tried running it on my bins as such: FeGenie.py -bin_dir Gal_FeGenie/ -bin_ext fa -t 16 -out Gal_FeGenie_out.

Although I do get the expected output files at the end of the pipeline, as it runs I keep getting errors in every phase after ORF finding, like this:

checking arguments
.
.
.
All required arguments provided!

Finding ORFs for Bin_24_1_1-contigs.fa
Finding ORFs for Bin_24_5_1-contigs.fa
Finding ORFs for Bin_28_2-contigs.fa
Finding ORFs for Bin_50_3-contigs.fa
Finding ORFs for Bin_56_3-contigs.fa
Finding ORFs for Bin_56_4-contigs.fa
Finding ORFs for Bin_56_6-contigs.fa
Finding ORFs for Bin_56_7_1-contigs.fa
Finding ORFs for Bin_56_8-contigs.fa
starting main pipeline...

.
Looking for following iron-related functional category: iron_aquisition-heme_oxygenase
analyzing Bin_24_1_1-contigs.fa: 100%   
rm: cannot remove ‘Gal_FeGenie///Bin_24_1_1-contigs.fa-HMM’: Directory not empty
analyzing Bin_24_5_1-contigs.fa: 100%   
rm: cannot remove ‘Gal_FeGenie///Bin_24_5_1-contigs.fa-HMM’: Directory not empty
analyzing Bin_28_2-contigs.fa: 100%   
rm: cannot remove ‘Gal_FeGenie///Bin_28_2-contigs.fa-HMM’: Directory not empty
analyzing Bin_50_3-contigs.fa: 100%   
rm: cannot remove ‘Gal_FeGenie///Bin_50_3-contigs.fa-HMM’: Directory not empty
analyzing Bin_56_3-contigs.fa: 100%   
rm: cannot remove ‘Gal_FeGenie///Bin_56_3-contigs.fa-HMM’: Directory not empty
analyzing Bin_56_4-contigs.fa: 100%   
rm: cannot remove ‘Gal_FeGenie///Bin_56_4-contigs.fa-HMM’: Directory not empty
analyzing Bin_56_6-contigs.fa: 100%   
rm: cannot remove ‘Gal_FeGenie///Bin_56_6-contigs.fa-HMM’: Directory not empty
analyzing Bin_56_7_1-contigs.fa: 100%   
rm: cannot remove ‘Gal_FeGenie///Bin_56_7_1-contigs.fa-HMM’: Directory not empty
analyzing Bin_56_8-contigs.fa: 100%   
rm: cannot remove ‘Gal_FeGenie///Bin_56_8-contigs.fa-HMM’: Directory not empty

Any ideas? Can't be sure if the pipeline ran that well with all this...

Thanks in advance :)

Renamed contigs?

Hi Arkadiy,
Thank you for all that you do! I am noticing that FeGenie renumbers/names contigs within individual MAGs' depth files. For those of us who cross-reference output from different tools, like anvi'o, would it be possible to retain the original contig names? For example, if I'm looking at output about c_000000001 from anvi'o and I use the same input fastas for FeGenie analysis, it would be really great to have that same c_000000001 contig be tied to the same MAG that anvi'o analyzed. Does this make sense? I double-checked that the input MAGs fastas for FeGenie were those made by the SUMMARIZE program within anvi'o.
Thank you!!
Best,
Joy

Normalized gene abundances in 'FeGenie-heatmap-data.csv'?

Hi, I just tested out FeGenie and it seems to be a very convenient tool to hunt for iron-metabolism-genes; many thanks for bringing it out for us in the microbiology community! I was wondering if you could shed more light on what those numbers actually represent in the output file FeGenie-heatmap-data.csv. I'm assuming "normalized abundance of genes per functional category" in each of the genomes being tested?

Option to output prodigal nucleotide file?

Hello, I've used your tool and it works great! I was wondering if you could add an option to output a nucleotide fasta file, similar to the proteins.faa file created during the Prodigal ORF searching. I would like to do some metatranscriptome mappings to the hits, and so this option would be quite useful. Thanks!

Error when cluster ORFs

Hi there,

I've run into a new problem with running FeGenie. Previously I ran the script with no issue, but now I receive the following error every time I try to run it:

Consolidating summary files into one master summary file
Identifying genomic proximities and putative operons
Clustering ORFs...
Looking for Thermincola S-layer cytochromes and Geobacter-related porin-cytochrome operons
Pre-processing of final outout file
rm: cannot remove 'fegenie_out4/GeoThermin.csv': Text file busy
rm: cannot remove 'fegenie_out4/magnetosome_formation-summary.csv': Text file busy
rm: cannot remove 'fegenie_out4/FinalSummary-dereplicated-clustered-blast.csv': Text file busy
rm: cannot remove 'fegenie_out4/Afreen_L.plantarum202195_GCA_010586945.1_ASM1058694v1_genomic.fna-thermincola.blast': Text file busy
Counting heme-binding motifs
rm: cannot remove 'fegenie_out4/FeGenie-summary.csv': Text file busy
mv: cannot move 'fegenie_out4/FeGenie-summary-blasthits.csv' to 'fegenie_out4/FeGenie-summary.csv': Operation not permitted
Final processing of output
Traceback (most recent call last):
  File "/home/qiime/FeGenie/FeGenie.py", line 2719, in <module>
    main()
  File "/home/qiime/FeGenie/FeGenie.py", line 1596, in main
    if ls[6] != "cluster":
IndexError: list index out of range

I see that updates were made to the FeGenie.py script 27 days ago. I'm assuming this may be an error due to the addition of some categories; however, I'm not sure how to troubleshoot it. Any help you can offer would be appreciated.

Thanks

geneSummary output protein sequence EMPTY

Hi,

I have several hits in the geneSummary.csv file with EMPTY in the protein sequence column. What does that mean? I provided ORFs to analyse but am unsure now if I can use the results.

iron_oxidation,protein-sequences.faa,713031,sulfocyanin,48.8,45.4,3638,0,0,0,MSVRTSTSVALSVGLGLSLVGGNALTPASAASSRYISYNAHNHTAKIILIGALNNSNQGMNFDGYANGKAIFTVPLGTKVTVAYSDGASKPHSAEIAPWSASIPAAAVTPAFKGAASADYANGSEKGDPISVFTFTASKAGKYRIMCGVTAHAILGMWDVLQVSKSAKVATLN
iron_oxidation,protein-sequences.faa,207671,Cyc2_repCluster2,298.9,27.8,3639,1,0,0,EMPTY
iron_oxidation,protein-sequences.faa,438947,Cyc2_repCluster2,333.9,27.8,3640,1,0,0,EMPTY

Thank you!!
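
To see how many hits are affected, the rows with EMPTY in the sequence column can be pulled out with a few lines of Python. The sketch below assumes the column order visible in the rows pasted above: the third field appears to be the ORF identifier, the fourth the HMM name, and the last the protein sequence. These column meanings are inferred from the example rows, not taken from FeGenie's documentation, and the function name empty_sequence_rows is hypothetical.

```python
import csv


def empty_sequence_rows(lines):
    """Return (orf_id, hmm_name) for rows whose last field is 'EMPTY'.

    Assumed column order (inferred from the pasted rows): field 3 is the
    ORF identifier, field 4 the HMM name, and the last field the sequence.
    """
    hits = []
    for row in csv.reader(lines):
        if len(row) >= 4 and row[-1].strip() == "EMPTY":
            hits.append((row[2], row[3]))
    return hits
```

Passing an open file handle for the geneSummary CSV works as well, since csv.reader accepts any iterable of lines.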

Regarding --all_results option

Hi,
Thanks for the pipeline. It made my work easy. I am interested in identifying iron siderophore synthesis in my metagenomic samples. When I run the fegenie pipeline without --all_results the output heatmap.csv shows iron_aquisition-siderophore_synthesis as 0,0,0,0,0.
If I include the --all_results flag in my command, it shows iron_aquisition-siderophore_synthesis,2592,2488,5138,5511,1979.
It is very strange to get 0 counts considering my samples are from a natural environment and consist of a diverse bacterial community.

What's the difference between the above two commands (with and without --all_results tag)?
The help section says --all_results: report all results, regardless of clustering patterns and operon structure

I didn't clearly understand what that actually means.

Does it have anything to do with search accuracy?

Heatmap has same values regardless of what inflation factor I use

Hi there, and thank you for this tool! I am trying to run some metagenomes through FeGenie and have been successful so far, but I am trying to understand the numbers in FeGenie-heatmap-data.csv a little better. From what I understand, the default is gene counts divided by the predicted number of ORFs for each metagenome, multiplied by 1000. However, I have tried inputting '-inflation 100' (to get a percentage) and '-inflation 1' (to see the raw number), but the output is exactly the same as the default, making me question if I have done something wrong. All of my input files are .fasta, and I am also using the flag '--meta', if that is important. Most of the heatmap numbers are either 0 or in the low hundreds, ranging from 1 to 424.

Any insight would be very helpful. Thanks!
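
For reference, the normalization as described in this post (gene count divided by the number of predicted ORFs, multiplied by an inflation factor) can be written out as below. This follows the poster's description only and has not been checked against FeGenie's source; the function name normalized_abundance is hypothetical.

```python
def normalized_abundance(gene_count, total_orfs, inflation=1000):
    """Gene count per predicted ORF, scaled by an inflation factor.

    Formula taken from the description in the post above, not from FeGenie's code.
    """
    if total_orfs <= 0:
        raise ValueError("total_orfs must be positive")
    return gene_count / total_orfs * inflation
```

Under this formula, 5 hits among 2,500 predicted ORFs give 2.0 with the default factor and about 0.2 with a factor of 100, so identical heatmap values across different -inflation settings would suggest that the flag is not reaching this calculation.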

Error - ValueError: could not convert string to float: 'EMPTY'

Hi there,
I am aiming to make plots with coverage data for several BAM files. I get .depth files and .csv files (as well as ORF_calls and HMM_results). However, there are no plots. Here is my input and the error:

FeGenie.py -bin_dir /home/lloydlab/klds1515/SVA08.16_Metagenomes/071817KLmetagenome-45705703/Bins_from_Derep_Dastool/Fasta_files/Indiv_Anvio_fastas -bin_ext fasta -out /klds1515/SVA08.16_Metagenomes/071817KLmetagenome-45705703/Bins_from_Derep_Dastool/Fegenie/Output/Plots -t 16 -bams Bam_file_for_fegenie.txt --makeplots --norm

Thread 0 finished: VKAB123_transcripts_MAGs_no_ribo.bam with 15432 reads and 8719 readsWellMapped
Creating depth matrix file: /klds1515/SVA08.16_Metagenomes/071817KLmetagenome-45705703/Bins_from_Derep_Dastool/Fegenie/Output/Plots/Woeseia_stnF.depth
Closing most bam files
Closing last bam file
Finished
processing... Woeseia_stnF
Traceback (most recent call last):
File "/home/lloydlab/anaconda3/envs/fegenie/bin/FeGenie.py", line 3020, in <module>
main()
File "/home/lloydlab/anaconda3/envs/fegenie/bin/FeGenie.py", line 2586, in main
Dict[cell][process].append(float(depthDict[cell][contig]))
ValueError: could not convert string to float: 'EMPTY'

Thank you!

Rewrite using a workflow manager (snakemake, nextflow)

Awesome product!

Many problems mentioned in the issues could be solved by rewriting the workflow using a proper manager. This would also properly parallelize tasks.

Best

Permission denied when moving files

Hi Arkadiy, thanks very much for developing this great tool :)

I'm trying to implement FeGenie, however, near the end of the analysis this error is raised:

`Pre-processing of final outout file
Counting heme-binding motifs
mv: cannot move '../results/fegenie/FeGenie-summary-blasthits.csv/FeGenie-summary-blasthits.csv' to '../results/fegenie/FeGenie-summary-blasthits.csv/FeGenie-summary.csv': Permission denied
Final processing of output

Traceback (most recent call last):
File "/home/fooa/miniconda3/envs/fegenie/bin/FeGenie.py", line 3020, in <module>
main()
File "/home/fooa/miniconda3/envs/fegenie/bin/FeGenie.py", line 1641, in main
infile = open(outDirectory + "/FeGenie-summary.csv")
FileNotFoundError: [Errno 2] No such file or directory: '../results/fegenie/FeGenie-summary-blasthits.csv/FeGenie-summary.csv'`

I still get the cluster.csv file but the summary.csv file is empty. I have a feeling this might be due to the server I'm using at my uni, but wondering if this is a problem you've encountered before or might have a solution for?

Thanks

Aidan

Unable to run FeGenie.py

Hello,

I just downloaded your script (nonconda way) and hoped to use it to search for Fe associated genes in my metagenome. However, I ran into an error when I typed ./FeGenie.py -h. This same error also occurred when I tried to run the program following your instruction ./FeGenie.py -hmm_lib hmms/iron -bin_dir /directory/of/bins/ -bin_ext fasta -t 16 -out output_fegenie:

Traceback (most recent call last):
File "./FeGenie.py", line 506, in main
test = open(bits)
FileNotFoundError: [Errno 2] No such file or directory: '/HMM-bitcutoffs.txt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./FeGenie.py", line 3006, in <module>
main()
File "./FeGenie.py", line 514, in main
location = allButTheLast(location, "/")
UnboundLocalError: local variable 'location' referenced before assignment

Thanks!
Clare

false negatives?

Hey! I do have another question!

After annotating my MAGs, I saw that FeGenie didn't find any transport-related clusters in any of my MAGs, which wouldn't make sense biologically (I have, among others, several cyanobacterial MAGs, and they must get their iron somewhere, right?). If I use the --all_results flag, I get some transport genes, but I'm not sure I should use them, since you mention in a different thread that this flag can create false-positives.

I imagine something goes wrong during the clustering step? I looked into one MAG specifically. According to the output produced by --all_results, it has the three EfeUOB genes, all next to each other, but they don't show up when I run the same MAG in strict mode. Are there other genes that should be present for the cluster to be complete?

Sorry for the basic question, I'm very new to the iron metabolism world :)

I can send you the MAG I looked into, or the output files, if needed.

Thanks!

IndexError during Final processing of output

Running the quick-start fasta files directly following a conda install gives the following error:

FeGenie.py -bin_dir test_pao1/ -bin_ext txt -t 1
checking arguments
.
.
.
All required arguments provided!

Finding ORFs for Pseudomonas_aeruginosa_PA01.txt
starting main pipeline...

.
Looking for following iron-related functional category: iron_aquisition-heme_oxygenase
analyzing Pseudomonas_aeruginosa_PA01.txt: 100%   

.
Looking for following iron-related functional category: iron_aquisition-heme_transport
analyzing Pseudomonas_aeruginosa_PA01.txt: 100%   

.
Looking for following iron-related functional category: iron_aquisition-iron_transport
analyzing Pseudomonas_aeruginosa_PA01.txt: 100%   

.
Looking for following iron-related functional category: iron_aquisition-siderophore_synthesis
analyzing Pseudomonas_aeruginosa_PA01.txt: 100%   

.
Looking for following iron-related functional category: iron_aquisition-siderophore_transport
analyzing Pseudomonas_aeruginosa_PA01.txt: 100%   

.
Looking for following iron-related functional category: iron_gene_regulation
analyzing Pseudomonas_aeruginosa_PA01.txt: 100%   

.
Looking for following iron-related functional category: iron_oxidation
analyzing Pseudomonas_aeruginosa_PA01.txt: 100%   

.
Looking for following iron-related functional category: iron_reduction
analyzing Pseudomonas_aeruginosa_PA01.txt: 100%   

.
Looking for following iron-related functional category: iron_storage
analyzing Pseudomonas_aeruginosa_PA01.txt: 100%   

.
Looking for following iron-related functional category: magnetosome_formation
analyzing Pseudomonas_aeruginosa_PA01.txt: 100%   


Consolidating summary files into one master summary file
Identifying genomic proximities and putative operons
Clustering ORFs...

.
Looking for Thermincola S-layer cytochromes and Geobacter-related porin-cytochrome operons
Pre-processing of final outout file
Counting heme-binding motifs
Final processing of output

Traceback (most recent call last):
  File "/home/snip/FeGenie/FeGenie.py", line 2167, in <module>
    main()
  File "/home/snip/FeGenie/FeGenie.py", line 1414, in main
    memoryDict[dataset][orf]["seq"] = ls[10]
IndexError: list index out of range

Platform: Debian 10
Python 3.7.6

error depth = open("%s/%s.depth" % (outDirectory, cell)) | HELP PLS.

Hi Arkadiy,

I am trying to run FeGenie with several BAM files aligned to 2 different genomes. The command I am using is:
FeGenie.py -bin_dir final_megahit_co_assemblies/ -bin_ext fa -out fegenie_total_A_C_output -t 10 -bams WL_fegenie_bam_map_test.tsv --meta --makeplots

In my actual run the paths are absolute, but otherwise the command is the same. I keep getting the output below.
I have removed and reinstalled FeGenie and reconfigured my environment, but I can't figure out what's wrong.

Here is the error output:

...

Looking for Thermincola S-layer cytochromes and Geobacter-related porin-cytochrome operons
Pre-processing of final outout file
Counting heme-binding motifs
Final processing of output

Writen summary to file: fegenie_total_A_C_output/FeGenie-geneSummary-clusters.csv for visual inspection
Writen summary to file: fegenie_total_A_C_output/FeGenie-geneSummary.csv for downstream parsing and analyses
Writing heatmap-formatted output file: fegenie_total_A_C_output/FeGenie-heatmap-data.csv

processing... final_megahit_co_assemblies/WL_A_metaG_co_assemblies.fa
Output depth matrix to fegenie_total_A_C_output/final_megahit_co_assemblies/WL_A_metaG_co_assemblies.fa.depth
jgi_summarize_bam_contig_depths 2.15 (Bioconda) 2020-07-03T11:59:07
Output matrix to fegenie_total_A_C_output/final_megahit_co_assemblies/WL_A_metaG_co_assemblies.fa.depth
0: Opening bam: /Users/oliviaannerumble/bioinformatics/Waterlogging_project/Metatranscriptomic_WL_files/metaT_DRY_A/RQCF_DRY_A_files/Dry_A_bam_files/sorted_bam_files/WD-19_S10_L001_A_paired_mapped_sorted.bam
0: Opening bam: /Users/oliviaannerumble/bioinformatics/Waterlogging_project/Metatranscriptomic_WL_files/metaT_DRY_A/RQCF_DRY_A_files/Dry_A_bam_files/sorted_bam_files/WD-19_S10_L002_A_paired_mapped_sorted.bam
Processing bam files
WARNING: your aligner reports an incorrect NM field. You should run samtools calmd! nm < ins + del: cmatch=0 nm=33 ( insert=0 + del=44 + mismatch=33 == 77) D00472:89:HLFHMBCXX:1:1103:11614:22137 1:N:0:GAGATTCC+TAATCTTA
Thread 0 finished: WD-19_S10_L001_A_paired_mapped_sorted.bam with 1140574 reads and 117839 readsWellMapped
Thread 0 finished: WD-19_S10_L002_A_paired_mapped_sorted.bam with 1168778 reads and 120993 readsWellMapped
Creating depth matrix file: fegenie_total_A_C_output/final_megahit_co_assemblies/WL_A_metaG_co_assemblies.fa.depth
Closing most bam files
Closing last bam file
Finished
processing... final_megahit_co_assemblies/WL_A_metaG_co_assemblies.fa
Traceback (most recent call last):
  File "/Users/oliviaannerumble/miniconda3/envs/fegenie/bin/FeGenie.py", line 2636, in main
    depth = open("%s/contigDepths/%s.depth" % (args.out, cell))
FileNotFoundError: [Errno 2] No such file or directory: 'fegenie_total_A_C_output/contigDepths/final_megahit_co_assemblies/WL_A_metaG_co_assemblies.fa.depth'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/oliviaannerumble/miniconda3/envs/fegenie/bin/FeGenie.py", line 3139, in <module>
    main()
  File "/Users/oliviaannerumble/miniconda3/envs/fegenie/bin/FeGenie.py", line 2652, in main
    depth = open("%s/%s.depth" % (outDirectory, cell))
FileNotFoundError: [Errno 2] No such file or directory: 'fegenie_total_A_C_output/final_megahit_co_assemblies/WL_A_metaG_co_assemblies.fa.depth'
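
One possible reading of this traceback, offered as a guess rather than a confirmed diagnosis: the name that FeGenie substitutes into the depth path carries the directory prefix final_megahit_co_assemblies/, so the file is looked for in a sub-directory of the output folder. The sketch below only replays the format string shown in the traceback with the values from the log. The variable names are illustrative, and the idea that a bare file name would resolve the error is an assumption.

```python
import os

# Values copied from the log above; variable names are illustrative only.
out_directory = "fegenie_total_A_C_output"
cell = "final_megahit_co_assemblies/WL_A_metaG_co_assemblies.fa"

# Same format string as in the traceback: the directory prefix inside
# `cell` turns into a sub-directory of the output folder.
nested_path = "%s/%s.depth" % (out_directory, cell)

# Assumption: with a bare file name the depth file would be expected
# directly inside the output folder.
flat_path = "%s/%s.depth" % (out_directory, os.path.basename(cell))

print(nested_path)
print(flat_path)
```

Comparing the two printed paths with what actually exists on disk after the run should show whether the prefix is the cause.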

Tagged release and license file

Hi,

I would like to create conda package for FeGenie.
In order to do this, having a license file that allows distribution in the repo is required, and having official tagged release is preferable.
Thus, I was wondering if you can add license file, as well as tag a release of FeGenie.

Thank you,
Natasha

relative abundance?

Hello Arkadiy,

As I understand the --norm flag: if it is included, the gene counts for each iron gene category are normalized to the number of predicted ORFs in each genome or metagenome. Without normalization, FeGenie creates a heatmap-compatible CSV output with raw gene counts. With normalization, FeGenie creates a heatmap-compatible CSV output with 'normalized gene abundances'.

I compared the normalized and non-normalized data and found that the normalized gene abundances produced with this flag are 1000 times larger than the percentage I calculated myself as "raw gene counts / number of predicted ORFs". Is that correct? Thank you.
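
The comparison described above can be written out as a short calculation. This is only a sketch with made-up numbers: the counts are hypothetical, and the scale factor FeGenie actually applies is not stated in this thread. It shows that a value 1000 times a percentage corresponds to the raw fraction multiplied by 100,000.

```python
import math

def percent_of_orfs(raw_count, n_orfs):
    """Raw gene count as a percentage of predicted ORFs."""
    return 100.0 * raw_count / n_orfs

def implied_scale(reported_value, raw_count, n_orfs):
    """Factor by which a reported value exceeds the plain fraction raw_count / n_orfs."""
    return reported_value / (raw_count / n_orfs)

# Hypothetical example: 12 genes in a category, 4000 predicted ORFs.
raw_count, n_orfs = 12, 4000
pct = percent_of_orfs(raw_count, n_orfs)

# Stand-in for a normalized value that is 1000 times the percentage,
# as described in the question (not read from real FeGenie output).
reported = 1000 * pct

scale = implied_scale(reported, raw_count, n_orfs)
print(round(pct, 6), round(reported, 6), round(scale, 3))
```

Running implied_scale on a few real rows from the two heatmap CSV files would show whether the factor is constant across categories and genomes.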

DIAMOND Verification: local variable 'idxDict' referenced before assignment

After running FeGenie with a reference database (nr.dmnd) I get the error shown below.
As an output I get the complete FeGenie-summary.csv and an empty file FeGenie-summary-altered.csv.

I think I can work with the complete summary file, but still wanted to report the bug.

Command
FeGenie.py -bin_dir . -bin_ext fsa -out ../output_fegenie_seqs/ --meta -ref /share/references/library/nr -t 10

Error:

Performing Diamond BLASTx search of putative iron genes against reference database
Counting heme-binding motifs
Final processing of output

Traceback (most recent call last):
  File "/XXX/XXX/miniconda3/envs/fegenie/bin/FeGenie.py", line 3020, in <module>
    main()
  File "/XXX/XXX/miniconda3/envs/fegenie/bin/FeGenie.py", line 2355, in main
    ls[0] + "," + ls[1] + "," + str(idxDict[ls[2]]) + "," + ls[3] + "," + ls[4] + "," + ls[
UnboundLocalError: local variable 'idxDict' referenced before assignment

running program with reference database for blast searches throws error

Hi,

When I run the program providing a reference database it throws an error which I cannot figure out how to solve. Any help is appreciated!

Cheers,
Christoph

Final processing of output

Traceback (most recent call last):
  File "/home/ckeuschn/miniconda3/envs/fegenie/bin/FeGenie.py", line 3020, in <module>
    main()
  File "/home/ckeuschn/miniconda3/envs/fegenie/bin/FeGenie.py", line 2355, in main
    ls[0] + "," + ls[1] + "," + str(idxDict[ls[2]]) + "," + ls[3] + "," + ls[4] + "," + ls[
                                    ^^^^^^^
UnboundLocalError: cannot access local variable 'idxDict' where it is not associated with a value
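
For readers unfamiliar with this error class, here is a minimal sketch of how an UnboundLocalError of this kind typically arises in Python: a local variable is assigned on only one branch and then read on a path where that branch was skipped. This is not FeGenie's code. The function names and the use_reference flag are hypothetical, and which branch FeGenie skips when a reference database is supplied is not known from these reports.

```python
def build_rows(keys, use_reference):
    # idx_dict is assigned only when use_reference is False ...
    if not use_reference:
        idx_dict = {key: i for i, key in enumerate(keys)}
    out = []
    for key in keys:
        # ... so this read raises UnboundLocalError when use_reference is True.
        out.append(idx_dict[key])
    return out

def build_rows_safe(keys, use_reference):
    # Binding the name before the branch means every path sees a value.
    idx_dict = {}
    if not use_reference:
        idx_dict = {key: i for i, key in enumerate(keys)}
    return [idx_dict.get(key, "NA") for key in keys]

print(build_rows(["a", "b"], use_reference=False))
try:
    build_rows(["a", "b"], use_reference=True)
except UnboundLocalError as err:
    print("UnboundLocalError:", err)
print(build_rows_safe(["a", "b"], use_reference=True))
```

The wording of the message differs between Python versions, which is why the two reports above show different text for the same failure.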

ValueError: invalid literal for int() with base 10: 'protein'

Dear Arkadiy,

I am getting the error below - could you help me understand what is going on and how to fix this?

Consolidating summary files into one master summary file
Identifying genomic proximities and putative operons
Traceback (most recent call last):
  File "/proj/pdmartins/FeGenie/FeGenie.py", line 2170, in <module>
    main()
  File "/proj/pdmartins/FeGenie/FeGenie.py", line 843, in main
    CoordDict[i][contig].append(int(numOrf))
ValueError: invalid literal for int() with base 10: 'protein'

I do get files such as FinalSummary.csv and .csv files for each category (i.e. iron_reduction-summary.csv, etc).

The commands I am running:

source activate fegenie
FeGenie.py -bin_dir /proj/pdmartins/2020_iron_reactor_analyses/analyses_2/ -bin_ext faa -t 16 -out /proj/pdmartins/2020_iron_reactor_analyses/analyses_2/fegenie --orfs --meta

The input file is a prodigal-generated and prokka-annotated fasta amino acid file.
Example of sequence in this file:

unbinned_NHGMMMNG_144129_2E-6E-farnesyl_diphosphate_synthase
MPDRITAGVDAVLDELLSERRLPDGLRGMMAYHLGWVDEDLRALPVRQRSKYGGKKMRAV
LCALACEAAGGDLETAFPAAAAVELVQNFSLVHDDIEDGDRERRHRPTVWVRWGVPQAIN
TGSAMQALVNAAVLRTPAPAETVLDVLRALTAAMVEMTEGQHLDIAFQDRTDVSVAEYED
MASRKTGALMEAAAYTGARLAMSDNRRLAAWRQFGRAFGQAFQARDDLLGVTGVPSVTGK
PVGNDIRARKKALPLLHALAHATPGDRVLLGRAFSNQAVSDEDVGRVTEVMERSGALDAT
RESVERATRSALEAFEATGALGPAADQIREMVSRAVGREQ

Here is the full slurm output file:
slurm-835700.txt

Thank you very much for your help!
Paula
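
The traceback shows int() being applied to part of a sequence header, and the example header above ends in a word rather than a number. A quick check like the one below can list headers that would not survive such a conversion. It is a sketch under one assumption that is not confirmed in this thread: that the ORF number is taken from the last underscore-delimited field of the header. The function name and the example records are hypothetical.

```python
def non_numeric_headers(fasta_lines):
    """Return headers whose last underscore-delimited field is not an integer."""
    bad = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # empty header, nothing to check
        header = fields[0]
        last_field = header.rsplit("_", 1)[-1]
        if not last_field.isdigit():
            bad.append(header)
    return bad

# Hypothetical records: one header ending in an ORF number, two ending in words.
records = [
    ">contig_17_3",
    "MPDRITAGVDAVLDELL",
    ">unbinned_NHGMMMNG_144129_2E-6E-farnesyl_diphosphate_synthase",
    "MPDRITAGVDAVLDELL",
    ">unbinned_NHGMMMNG_144130_hypothetical_protein",
    "MASRKTGALMEAAAYTG",
]
print(non_numeric_headers(records))
```

To run it on a real file, pass the open file object in place of the list, for example non_numeric_headers(open("proteins.faa")), where proteins.faa is a placeholder name.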

generate multiple output types and/or reuse intermediate files

Hi,
I have been playing with FeGenie on a small metagenome and successfully generated summary files of absolute gene counts, normalized gene counts, and gene coverage using a BAM file. However, to get each of these I had to rerun the whole thing from my FASTA files each time.
For a very large metagenome, the gene-finding and HMM-searching steps are time-consuming. Is there a way to generate the multiple flavors of the summary output in the same run? Or to direct FeGenie to use the ORF_calls and HMM_results folders generated by a previous run on the same FASTA files?
Thanks
Gabrielle
