qiyunlab / hgtector

HGTector2: Genome-wide prediction of horizontal gene transfer based on distribution of sequence homology patterns.

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
evolution genomics horizontal-gene-transfer

hgtector's Introduction

HGTector2

The development of HGTector is now at qiyunlab. Versions starting from 2.0b3 will be released from this repo. Please access HGTector using the new URL: https://github.com/qiyunlab/HGTector.

HGTector2 is a completely re-engineered software tool. It features a fully automated analytical pipeline with smart determination of parameters that requires minimal human involvement, a redesigned command-line interface that facilitates standardized scientific computing, and a high-quality Python 3 codebase.

HGTector is a computational pipeline for genome-wide detection of putative horizontal gene transfer (HGT) events based on sequence homology search hit distribution statistics.

Documentation

What's New

Installation

Tutorials

References

Quick start

Set up a Conda environment and install dependencies:

conda create -n hgtector -c conda-forge python=3 pyyaml pandas matplotlib scikit-learn bioconda::diamond
conda activate hgtector

Install HGTector2:

pip install git+https://github.com/qiyunlab/HGTector.git

You can then type hgtector to run the program. See the Installation page above for more details.

Build a reference database using the default protocol:

hgtector database -o db_dir --default

Or download a pre-built database as of 2023-01-02, and compile it.

Prepare input file(s). They should be multi-FASTA files of amino acid sequences (.faa). Each file represents the whole protein set of a complete or partial genome.
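For illustration, a minimal input file (hypothetical protein IDs and truncated sequences, not real records) might look like:

```
>WP_000000001.1 hypothetical protein
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>WP_000000002.1 DNA polymerase III subunit
MSEPRFVHLRVHSDYSMIDGLAKTAPLVKKAAE
```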

Perform homology search:

hgtector search -i input.faa -o search_dir -m diamond -p 16 -d db_dir/diamond/db -t db_dir/taxdump

Perform HGT prediction:

hgtector analyze -i search_dir -o analyze_dir -t db_dir/taxdump

Examine the prediction results under the analyze_dir directory.

It is recommended that you read the first run, second run and real runs pages to get familiar with the pipeline, the underlying methodology, and the customization options.

License

Copyright (c) 2013-2023, Qiyun Zhu and Katharina Dittmar. Licensed under BSD 3-clause. See full license statement.

Citation

Zhu Q, Kosoy M, Dittmar K. HGTector: an automated method facilitating genome-wide discovery of putative horizontal gene transfers. BMC Genomics. 2014. 15:717.

hgtector's People

Contributors

qiyunzhu, surh


hgtector's Issues

Analyzer.pl R problem

Hi - Thank you very much for making this script available!
I am running into the following problem while executing analyzer.pl. It seems to be related to reading in the table; please see below. What I can find is that the last two entries in the string of numbers are separated by a space rather than a comma. I ran the command multiple times, but this is a recurring issue. Any suggestion to fix this problem would be truly appreciated.

I wish you a great weekend,
best regards,

Markus

Computing statistics...
All protein sets:
Self group:
Skipped.
Close group:
Global cutoff (0.25) = 10.900.
Performing Hartigan's dip test... done.
D = 0.038228, p-value = 2.2e-16
The weight distribution is significantly non-unimodal.
Performing kernel density estimation... done.
Problem running the R command:
d<-data.frame(x=c(-1.95084446,-1.76566503,...............................................,1.603058e-05,1.077472e-05,6.715086e-06,3.881277e-06,2.079954e-06,1.032685e-06 4.744944e-07))
Got the error:
:1:12757: unexpected numeric constant
1: 1135e-06,4.180106e-06,7.093232e-06,1.124340e-05,1.651814e-05,2.246770e-05,2.829661e-05,3.300089e-05,3.563260e-05,3.560213e-05,3.288999e-05,2.806529e-05,2.209492e-05,1.603058e-05,1.077472e-05,6

Extracting downloaded genomic data...Killed

Hi,
first of all I would like to thank you for developing this tool!

Secondly, I am running the command line
hgtector database --output <output_dir> --cats microbe --sample 1 --rank species --reference --representative --compile diamond
to build the database. As far as I can tell, it has finished downloading the genomes, but I am not sure the extraction process ran through, since it always stops with:
Extracting downloaded genomic data...Killed

Could you clarify to me what's going on here? I am not sure whether the operation is finished and if I have all the necessary files.

Bests,
Rob

python_http_client issue

Hi Qiyun,
I am using the HGTector2 pipeline. When I ran the example, the screen always shows me a traceback message as follows:
Traceback (most recent call last):
File "/mdata/luozh/softwares/miniconda/envs/hgtector/lib/python3.7/http/client.py", line 554, in _get_chunk_left
chunk_left = self._read_next_chunk_size()
File "/mdata/luozh/softwares/miniconda/envs/hgtector/lib/python3.7/http/client.py", line 521, in _read_next_chunk_size
return int(line, 16)
ValueError: invalid literal for int() with base 16: b''

What's wrong with this? Waiting for your reply. Thank you very much.
Regards

Error in database command

Hello,

I have hgtector 2.0b3 installed and Diamond 0.9.26.

I ran the database command, but it returns an error. This is the command and the message:

hgtector database --output database --cats all --sample 1 --rank species --reference --representative --compile diamond
Database building started at 2021-03-02 19:41:31.820355.
Connecting to the NCBI FTP server... done.
Using local file taxdump.tar.gz.
Reading NCBI taxonomy database...Traceback (most recent call last):
File "/home/zaki/miniconda3/bin/hgtector", line 96, in
main()
File "/home/zaki/miniconda3/bin/hgtector", line 35, in main
module(args)
File "/home/zaki/miniconda3/lib/python3.7/site-packages/hgtector/database.py", line 119, in call
self.retrieve_taxdump()
File "/home/zaki/miniconda3/lib/python3.7/site-packages/hgtector/database.py", line 262, in retrieve_taxdump
f.extract('names.dmp', self.tmpdir)
File "/home/zaki/miniconda3/lib/python3.7/tarfile.py", line 2033, in extract
tarinfo = self.getmember(member)
File "/home/zaki/miniconda3/lib/python3.7/tarfile.py", line 1752, in getmember
tarinfo = self._getmember(name)
File "/home/zaki/miniconda3/lib/python3.7/tarfile.py", line 2327, in _getmember
members = self.getmembers()
File "/home/zaki/miniconda3/lib/python3.7/tarfile.py", line 1763, in getmembers
self._load() # all members, we first have to
File "/home/zaki/miniconda3/lib/python3.7/tarfile.py", line 2350, in _load
tarinfo = self.next()
File "/home/zaki/miniconda3/lib/python3.7/tarfile.py", line 2281, in next
self.fileobj.seek(self.offset - 1)
File "/home/zaki/miniconda3/lib/python3.7/gzip.py", line 368, in seek
return self._buffer.seek(offset, whence)
File "/home/zaki/miniconda3/lib/python3.7/_compression.py", line 143, in seek
data = self.read(min(io.DEFAULT_BUFFER_SIZE, offset))
File "/home/zaki/miniconda3/lib/python3.7/gzip.py", line 471, in read
uncompress = self._decompressor.decompress(buf, size)
zlib.error: Error -3 while decompressing data: invalid code lengths set

I am aware that there is a pre-built default database for download, but I need a broader database, not only a microbial one. I tried this file (following your compiling instructions and so on) and it worked properly, but in order to compile the complete database I need to run the database command with "--cats all", or to download the complete database from somewhere, if available.

Could you please help me with this error, or provide me a link to download the complete database?

Thank you very much in advance

_tkinter.TclError: couldn't connect to display "localhost:14.0"

I am trying to set up HGTector to run on a set of bacterial genomes on a remote server. I created a conda environment as per the instructions and was able to run the search part of the "second run" example. However, when I try to run the analysis part, I get the error below:

Analysis started at 2020-06-12 15:16:01.101009.
Reading local taxonomy database...
Done. Read 92 taxa.
Reading homology search results...
  o55h7: 5045 proteins.
Done. Read search results of 1 samples.
Auto-inferring plausible taxIds for input genomes based on taxonomy of search results...
  o55h7: 562 (Escherichia coli) (covering 93.2933% best hits).
Refining taxonomy database...
Done. Retained 92 taxa.
All input genomes belong to 562 (species Escherichia coli).
Auto-inferred self group:
  562 (species Escherichia coli)
Self group has 5 taxa.
Auto-inferred close group:
  543 (family Enterobacteriaceae)
Close group has 15 taxa.
Calculating protein scores by group...
  o55h7
Done.
Summarizing scores of all proteins... done.
Protein scores written to scores.tsv.
Removed 687 ORFans.
Removed 89 outliers.
Predicting HGTs...
Calculating thresholds for clustering...
Close group:
Traceback (most recent call last):
  File "/home/groups/hbfraser/modules/packages/conda/4.6.14/envs/hgtector/bin/hgtector", line 96, in <module>
    main()
  File "/home/groups/hbfraser/modules/packages/conda/4.6.14/envs/hgtector/bin/hgtector", line 35, in main
    module(args)
  File "/home/groups/hbfraser/modules/packages/conda/4.6.14/envs/hgtector/lib/python3.8/site-packages/hgtector/analyze.py", line 160, in __call__
    self.predict_hgt()
  File "/home/groups/hbfraser/modules/packages/conda/4.6.14/envs/hgtector/lib/python3.8/site-packages/hgtector/analyze.py", line 720, in predict_hgt
    self.plot_hist(self.df[group].tolist(),
  File "/home/groups/hbfraser/modules/packages/conda/4.6.14/envs/hgtector/lib/python3.8/site-packages/hgtector/analyze.py", line 1027, in plot_hist
    fig = plt.figure(figsize=(5, 5))
  File "/home/groups/hbfraser/modules/packages/conda/4.6.14/envs/hgtector/lib/python3.8/site-packages/matplotlib/pyplot.py", line 540, in figure
    figManager = new_figure_manager(num, figsize=figsize,
  File "/home/groups/hbfraser/modules/packages/conda/4.6.14/envs/hgtector/lib/python3.8/site-packages/matplotlib/backend_bases.py", line 3337, in new_figure_manager
    return cls.new_figure_manager_given_figure(num, fig)
  File "/home/groups/hbfraser/modules/packages/conda/4.6.14/envs/hgtector/lib/python3.8/site-packages/matplotlib/backends/_backend_tk.py", line 876, in new_figure_manage
r_given_figure
    window = tk.Tk(className="matplotlib")
  File "/home/groups/hbfraser/modules/packages/conda/4.6.14/envs/hgtector/lib/python3.8/tkinter/__init__.py", line 2261, in __init__
    self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)
_tkinter.TclError: couldn't connect to display "localhost:14.0"

My guess is that it is trying to show me some plots, but there is no graphical interface on the remote server, so it is crashing. Is that correct? Can I disable the graphical output?
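For headless servers, a common matplotlib workaround (an assumption here, not HGTector's documented fix) is to force a non-interactive backend so that no X display is needed; plots are then only written to files:

```python
# Select the file-only "Agg" backend before pyplot is imported, so no
# display connection is attempted on a headless server.
import matplotlib
matplotlib.use("Agg")

import matplotlib.pyplot as plt

# Creating and saving a figure now succeeds without an X display.
fig = plt.figure(figsize=(5, 5))
fig.savefig("hist.png")
```

Equivalently, setting the environment variable before the run (e.g. `MPLBACKEND=Agg hgtector analyze ...`) selects the backend without touching any code.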

Thanks

EOF error in database import

Hi Qiyunzhu,
I have been trying to compile the default database for HGT detection in my genome sequence
I used the following code
!hgtector database --output output_dir --cats microbe --sample 1 --rank species --reference --representative --compile diamond

After importing and sorting the files from NCBI, it gives me this error:
Traceback (most recent call last):
File "/usr/local/bin/hgtector", line 96, in
main()
File "/usr/local/bin/hgtector", line 35, in main
module(args)
File "/usr/local/lib/python3.7/site-packages/hgtector/database.py", line 137, in call
self.download_genomes()
File "/usr/local/lib/python3.7/site-packages/hgtector/database.py", line 541, in download_genomes
f'RETR {remote_dir}/{fname}', f.write)
File "/usr/local/lib/python3.7/ftplib.py", line 442, in retrbinary
with self.transfercmd(cmd, rest) as conn:
File "/usr/local/lib/python3.7/ftplib.py", line 399, in transfercmd
return self.ntransfercmd(cmd, rest)[0]
File "/usr/local/lib/python3.7/ftplib.py", line 365, in ntransfercmd
resp = self.sendcmd(cmd)
File "/usr/local/lib/python3.7/ftplib.py", line 273, in sendcmd
return self.getresp()
File "/usr/local/lib/python3.7/ftplib.py", line 236, in getresp
resp = self.getmultiline()
File "/usr/local/lib/python3.7/ftplib.py", line 222, in getmultiline
line = self.getline()
File "/usr/local/lib/python3.7/ftplib.py", line 210, in getline
raise EOFError
EOFError

Since I am new to coding, it would be helpful if you could explain this error to me.

I can try manual compiling too, but the instructions are a bit unclear.
The downloaded file is in tar.xz format;
should we compile using the following code?
echo $'accession\taccession.version\ttaxid' > prot.accession2taxid
zcat taxon.map.gz | awk -v OFS='\t' '{split($1, a, "."); print a[1], $1, $2}' >> prot.accession2taxid

diamond makedb --threads 16 --in db.faa --taxonmap prot.accession2taxid --taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp --db diamond/db

rm prot.accession2taxid

Where do we insert the downloaded file? What will be the output? How do we use it during the HGT search?
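The awk one-liner in the commands above expands each `accession.version<TAB>taxid` row of taxon.map into the three-column prot.accession2taxid format that DIAMOND expects. A minimal Python sketch of the same per-row transformation (the example accession is taken from the log above, the taxid is made up):

```python
def map_to_accession2taxid(row: str) -> str:
    """Convert one taxon.map row ("accession.version\ttaxid") into a
    three-column prot.accession2taxid row:
    "accession\taccession.version\ttaxid"."""
    acc_ver, taxid = row.rstrip("\n").split("\t")
    accession = acc_ver.split(".")[0]  # drop the ".1" version suffix
    return f"{accession}\t{acc_ver}\t{taxid}"

print(map_to_accession2taxid("WP_012634963.1\t42"))
```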

makeblastdb fails for stdb.faa

Building stdb.faa fails with the following error

Warning: (1306.7) [makeblastdb] NCBI C++ Exception:
    T0 "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_350334_130.14.22.10_9008__PrepareRelease_Linux64-Centos_1481139955/c++/compilers/unix/../../src/objects/seq/../seqloc/Seq_id.cpp", line 2146: Error: ncbi::objects::CSeq_id::Set() - Negative, excessively large, or non-numeric gi ID WP_012634963.1


No volumes were created because no sequences were found.

BLAST Database creation error: Defline lacks a proper ID around line 1

Do you think HGTector is suitable for a Candidate phylum with only 4 draft genomes obtained via metagenome binning?

Hello, I was just wondering whether it makes sense to perform an HGT search on a candidate phylum. In my opinion, an HGT search should be based on taxa with clear phylogenetic boundaries and good representatives. The genomes I am dealing with are from the candidate phylum 'Tectomicrobia' (in NCBI taxonomy it is still a genus in Deltaproteobacteria, taxid 93171). All available genomes are from metagenomes (lots of hypothetical proteins, 70-90% completeness). Do you think it is viable to find HGT events in these genomes?

Also, HGTector seems to be unable to recognize the taxid:
Enter the TaxID of v42, or press Enter if you don't know:91371
Warning: Invalid TaxID:

.

It is 2016 now, and the taxa set was based on a 2002 version.

My other question is about databaser.py; I haven't tested this script due to some disk space issues. I want to know whether it will generate a database compatible with RAPSearch/DIAMOND, and what the peak disk usage of databaser.py is.

Thanks.

Kind regards, Fang

Bug: makeblastdb "FASTA-Reader: Title ends with at least 50 valid amino acid characters. Was the sequence accidentally put in the title line?"

This is not a bug related to HGTector.

While setting up HGTector I realized that the BLAST version on our server is causing trouble. I did not use databaser.py to set up the protein database; instead I downloaded the standard database. After unpacking everything I wanted to create the BLAST database using the settings specified in databaser.py, but I was getting this error.

FASTA-Reader: Title ends with at least 50 valid amino acid characters.  Was the sequence accidentally put in the title line?

A quick look at stdb.faa revealed no obvious issue. I decided to upgrade my BLAST version (from 2.2.31 to 2.6.0) and the issue was gone.

Best

Carl-Eric

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

Hello
I have installed HGTector through Conda and used the "ref107" database to test it. But when I typed the command "hgtector search -i o55h7.faa -o /bigdata/yuanyang/software/HGTector-test -m diamond -d /bigdata/yuanyang/software/HGTector-test/diamond/db -t /bigdata/yuanyang/software/HGTector-test/taxdump", it showed the error:
Homology search started at 2020-05-31 21:02:54.802337.
Traceback (most recent call last):
File "/home/yuanyang/miniconda3/envs/hgtector/bin/hgtector", line 96, in
main()
File "/home/yuanyang/miniconda3/envs/hgtector/bin/hgtector", line 35, in main
module(args)
File "/home/yuanyang/miniconda3/envs/hgtector/lib/python3.8/site-packages/hgtector/search.py", line 131, in call
self.args_wf(args)
File "/home/yuanyang/miniconda3/envs/hgtector/lib/python3.8/site-packages/hgtector/search.py", line 330, in args_wf
if not isdir(dir_):
File "/home/yuanyang/miniconda3/envs/hgtector/lib/python3.8/genericpath.py", line 42, in isdir
st = os.stat(s)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
Could you help me? Thank you very much for your help.

Best regards

analyze fails when product empty

I was trying to run hgtector analyze after running search with a custom database. My database has no product information, so my search results look like:

# ID: S0000001
# Length: 729
# Product: 
# Score: 1499.6
# Hits: 116
WP_135038321.1  99.2    0.0e+00 1489.9  100.0   85831
WP_167961307.1  96.2    0.0e+00 1450.6  100.0   2715212
WP_073346841.1  95.5    0.0e+00 1441.0  100.0   1871006
WP_008998151.1  93.7    0.0e+00 1424.5  100.0   556259

This caused a weird error asking me to specify taxonomy even though I had passed both --self-tax and --input-tax. After some detective work I figured out that the regular expression that is supposed to match the comment lines does not match my empty product line, in the lines below:

https://github.com/DittmarLab/HGTector/blob/8f1c449f4b6fef4aad6361cc1a368362982bb243/hgtector/analyze.py#L275-L280

That happens because the line.rstrip() instruction removes the trailing whitespace in my product line, preventing the following p.match(line) from actually matching the line.

It can be solved in several ways; in my case I just specified that rstrip should only remove trailing newlines, by changing line 279 to: line = line.rstrip("\n")

An error in hgtector database command

Hello,

I was trying to run the Second run in the doc with a slight modification and encountered an error.

I ran the following command :
hgtector database -c microbe -t 2 -s 1 -r superkingdom --reference --compile blast -o <hoge>/hgtectordb

...and received an error message:

Traceback (most recent call last):
  File "/.../anaconda3-5.3.1/envs/hgtector/bin/hgtector", line 96, in <module>
    main()
  File "/.../anaconda3-5.3.1/envs/hgtector/bin/hgtector", line 35, in main
    module(args)
  File "/.../anaconda3-5.3.1/envs/hgtector/lib/python3.7/site-packages/hgtector/database.py", line 155, in __call__
    self.compile_database()
  File "/.../anaconda3-5.3.1/envs/hgtector/lib/python3.7/site-packages/hgtector/database.py", line 744, in compile_database
    self.build_blast_db()
  File "/.../anaconda3-5.3.1/envs/hgtector/lib/python3.7/site-packages/hgtector/database.py", line 763, in build_blast_db
    raise ValueError(f'makeblastdb failed with error code {ec}.')
ValueError: makeblastdb failed with error code 1.

Looking at the internal implementation, I ran the command executed by run_command() for makeblastdb:
makeblastdb -dbtype prot -in hgtectordb/db.faa -out a -title db -parse_seqids -taxid_map hgtectordb/taxon.map.gz

and found an error saying

Error: NCBI C++ Exception:
    T0 "/.../ncbistr.cpp", line 800: Error: ncbi::NStr::StringToInt8() - Cannot convert string 'Ll_23F?????) ??? ??????????????????????????????????????????????????????????????????????????????????????????) (SP??LA??2e??g???? ?O?8?2e' to Int8 (m_Pos = 0)

It seems that makeblastdb's -taxid_map does not support gzipped files, so it would be better to store the taxon.map file in its unzipped state.
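If makeblastdb indeed cannot read a gzipped -taxid_map, decompressing taxon.map.gz first works around it. A small self-contained Python sketch (the file names match the issue above; the single map row is made up for demonstration):

```python
import gzip
import shutil

# For demonstration, first write a tiny gzipped map (made-up content)...
with gzip.open("taxon.map.gz", "wt") as f:
    f.write("WP_012634963.1\t42\n")

# ...then decompress it to plain text, which is the form that
# "makeblastdb ... -taxid_map taxon.map" can consume.
with gzip.open("taxon.map.gz", "rb") as fin, open("taxon.map", "wb") as fout:
    shutil.copyfileobj(fin, fout)
```

On the command line, `gunzip -k taxon.map.gz` achieves the same.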

Best Regards,

Getting one hit per protein(itself)

Hi qiyun:
I am running my own genome against a local BLAST nr database, but I am getting just one hit per protein, and it is the protein itself with 100 percent identity; I tested the sample files and got the same results.
I also tried the stdb database, which contains one representative per species from all available non-redundant RefSeq prokaryotic proteomes. Unfortunately, the same results emerged.
Here is my config:
selfTax=REE1:651137
searchTool=BLAST
blastp=blastp
blastdbcmd=blastdbcmd
protdb=/share/home/yuanqingc/db/HGTector/stdb
taxdump=/share/home/yuanqingc/db/HGTector/taxdump
prot2taxid=/share/home/yuanqingc/db/HGTector/prot2taxid.txt
threads=38
queries=38
identity=30
coverage=50
maxHits=100

And i also try to provide pre-computed search results. The same results have emerged.
Any idea where the problem might be? Waiting on the line :)
Thanks

ValueError: Invalid self-alignment method: None.

Hi,
I was trying out the newest release 2.0b3 and the tutorial (second run).

hgtector search -i o55h7.faa.gz -o . -m diamond -d ref107/diamond/db.dmnd -t ref107/taxdump/

I am getting the following error:

Homology search started at 2020-12-28 14:00:52.356490.
Traceback (most recent call last):
  File "/home/user/python-venv/bin/hgtector", line 4, in <module>
    __import__('pkg_resources').run_script('hgtector==2.0b3', 'hgtector')
  File "/home/user/python-venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 665, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/user/python-venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1470, in run_script
    exec(script_code, namespace, namespace)
  File "/home/user/python-venv/lib/python3.7/site-packages/hgtector-2.0b3-py3.7.egg/EGG-INFO/scripts/hgtector", line 96, in <module>
  File "/home/user/python-venv/lib/python3.7/site-packages/hgtector-2.0b3-py3.7.egg/EGG-INFO/scripts/hgtector", line 35, in main
  File "/home/user/python-venv/lib/python3.7/site-packages/hgtector-2.0b3-py3.7.egg/hgtector/search.py", line 131, in __call__
    def usearch5(query, db, type, out, threads = '4', evalue = '100', alignment = 'local'):
  File "/home/user/python-venv/lib/python3.7/site-packages/hgtector-2.0b3-py3.7.egg/hgtector/search.py", line 380, in args_wf
ValueError: Invalid self-alignment method: None.

HGTector for plants

Hi Qiyun,

I'm so glad to see the update of HGTector.

I have worked with this software for a couple of days and found that the TaxID of Phyllostachys (38705) was not found in the taxonomy database. Then I realized that the database is based on microbes.

Is it possible to use HGTector2 to predict the putative horizontal transferred genes in plants? Such as Phyllostachys (38705)?

Thanks!

TaxID nan is not found in taxonomy database.

Hello
I used hgtector to analyze my MAGs, but the system could not auto-infer taxonomy for them, so I specified the tax ID manually; it still errors out.
(hgtector) [xxx@node1 2.gene_pre]$ hgtector analyze -i search -o analyze -t /xxx/database/hgtector/taxdump/ --input-tax bin.19:189774
Analysis started at 2020-06-27 20:51:20.514960.
Reading local taxonomy database...
Done. Read 40245 taxa.
Reading homology search results...
bin.19: 3049 proteins.
Done. Read search results of 1 samples.
User-specified TaxIDs of input genomes:
bin.19: 189774 (unclassified Chloroflexi (miscellaneous)).
Refining taxonomy database...
Traceback (most recent call last):
File "/xxx/miniconda3/envs/hgtector/lib/python3.8/site-packages/hgtector/util.py", line 422, in _get_taxon
return taxdump[tid]
KeyError: nan

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/xxx/miniconda3/envs/hgtector/bin/hgtector", line 96, in
main()
File "/xxx/miniconda3/envs/hgtector/bin/hgtector", line 35, in main
module(args)
File "/xxx/miniconda3/envs/hgtector/lib/python3.8/site-packages/hgtector/analyze.py", line 134, in call
self.assign_taxonomy()
File "/xxx/miniconda3/envs/hgtector/lib/python3.8/site-packages/hgtector/analyze.py", line 349, in assign_taxonomy
self.taxdump = refine_taxdump(self.sum_taxids(), self.taxdump)
File "/xxx/miniconda3/envs/hgtector/lib/python3.8/site-packages/hgtector/util.py", line 864, in refine_taxdump
ancs = set().union([get_lineage(x, taxdump) for x in tids])
File "/xxx/miniconda3/envs/hgtector/lib/python3.8/site-packages/hgtector/util.py", line 864, in
ancs = set().union(
[get_lineage(x, taxdump) for x in tids])
File "/xxx/miniconda3/envs/hgtector/lib/python3.8/site-packages/hgtector/util.py", line 654, in get_lineage
taxon = _get_taxon(cid, taxdump)
File "/xxx/miniconda3/envs/hgtector/lib/python3.8/site-packages/hgtector/util.py", line 424, in _get_taxon
raise ValueError(f'TaxID {tid} is not found in taxonomy database.')
ValueError: TaxID nan is not found in taxonomy database.

Could you help me? Thank you very much for your help.

Best regards

HGTector stuck at "Step 1: Searcher - batch protein sequence homology search."

Hi Qiyunzhu,
As you suggested, I have created, without any issue, a local database using databaser.py. However, HGTector fails to progress through the first step. After 24 hours, it is still working on the first set of 10 proteins:

"Validating task...
Done.

Step 1: Searcher - batch protein sequence homology search.

-> Searcher: Batch sequence homology searching and filtering. <-
Reading input data...
WP: 7278 proteins.
Done. 7278 proteins from 1 set(s) to query.
Reading search results from previous run(s)...
Done. 0 results found, remaining 7278 proteins to search.
Reading taxonomy database... done. 1491434 records read.
Reading protein-to-TaxID dictionary... done. 34512566 records read.
Reading taxonomy records from previous run(s)... done. 1 taxa and 13 ranks read.
Enter the TaxID of WP, or press Enter if you don't know:578459
Taxonomy of input protein sets:
WP: Rhodotorula graminis WP1 (578459)
Batch homology search of WP (7278 queries) started.
BLASTing KPV71431.1,KPV71432.1,KPV71433.1,KPV71434.1,KPV71435.1,KPV71436.1,KPV71437.1,KPV71438.1,KPV71439.1,KPV71440.1..."

This sounds suspicious.

Here is my config file:

"selfTax=Rhoba:578459
searchTool=BLAST
blastp=blastp
blastdbcmd=blastdbcmd
queries=10
protdb=/media/universita/SAMSUNG/HGTector-master/scripts/blast/stdb
taxdump=/media/universita/SAMSUNG/HGTector-master/scripts/taxdump
prot2taxid=/media/universita/SAMSUNG/HGTector-master/gi2taxid.txt
getAln=1
identity=45
maxHits=100
selfGroup=4751
closeGroup=2759
graphFp=1
boxPlot=1
histogram=1
densityPlot=1
scatterPlot=1
howCO=4
globalCO=0.25
exOutlier=1
dipTest=1
dipSig=0.05
bwF=0.3
toolExtrema=1
modKCO=1
plotKDE=1
minHits=1
minSize=34
title=wpout
outText=1
outHTML=1"

... and the list of files inside the blast/ directory:

"stdb.00.phr, stdb.00.pin, stdb.00.pnd, stdb.00.pni, stdb.00.pog, stdb.00.psd, stdb.00.psi, stdb.00.psq, stdb.01.phr, stdb.01.pin, stdb.01.pnd, stdb.01.pni, stdb.01.pog, stdb.01.psd, stdb.01.psi, stdb.01.psq, stdb.02.phr, stdb.02.pin, stdb.02.pnd, stdb.02.pni, stdb.02.pog, stdb.02.psd, stdb.02.psi, stdb.02.psq, stdb.03.phr, stdb.03.pin, stdb.03.pnd, stdb.03.pni, stdb.03.pog, stdb.03.psd, stdb.03.psi, stdb.03.psq, stdb.04.phr, stdb.04.pin, stdb.04.pnd, stdb.04.pni, stdb.04.pog, stdb.04.psd, stdb.04.psi, stdb.04.psq, stdb.05.phr, stdb.05.pin, stdb.05.pnd, stdb.05.pni, stdb.05.pog, stdb.05.psd, stdb.05.psi, stdb.05.psq, stdb.06.phr, stdb.06.pin, stdb.06.pnd, stdb.06.pni, stdb.06.pog, stdb.06.psd, stdb.06.psi, stdb.06.psq, stdb.07.phr, stdb.07.pin, stdb.07.pnd, stdb.07.pni, stdb.07.pog, stdb.07.psd, stdb.07.psi, stdb.07.psq, stdb.08.phr, stdb.08.pin, stdb.08.pnd, stdb.08.pni, stdb.08.pog, stdb.08.psd, stdb.08.psi, stdb.08.psq, stdb.09.phr, stdb.09.pin, stdb.09.pnd, stdb.09.pni, stdb.09.pog, stdb.09.psd, stdb.09.psi, stdb.09.psq, stdb.10.phr, stdb.10.pin, stdb.10.pnd, stdb.10.pni, stdb.10.pog, stdb.10.psd, stdb.10.psi, stdb.10.psq, stdb.pal"

Thank you!

Error: Identification of orthology failed.&Execution of analyzer.pl failed.

Hi,Qiyun
I ran HGTector using the following configuration:
selfTax=100787
searchTool=DIAMOND
protdb=/home/sujitmaiti/ver/HGT/HGTector/db/stdb.faa
taxdump=/home/sujitmaiti/ver/HGT/HGTector/db/taxdump
prot2taxid=/home/sujitmaiti/ver/HGT/HGTector/db/prot2taxid.txt
selfRank=genus
graphFp=1
exOutlier=1
dipTest=1
title=vl
byOrthology=1
But I got error:
Error: Identification of orthology failed.
Execution of analyzer.pl failed.
Could you help solve this problem? Thanks!

Can't download the genomes from NCBI via the databaser.py

Dear Qiyun Zhu:
I was trying to build a database automatically through databaser.py, but it stopped working after downloading the genome list; none of the genomes were downloaded. The error looks like:

Connecting to the NCBI FTP server... done.
Downloading the NCBI taxonomy database... done.
Reading the NCBI taxonomy database... done.
1585409 taxa read.
Downloading the NCBI representative genome list... done.
Reading the NCBI representative genome list... done.
4380 genomes read.
Reading RefSeq genome list... done.
84212 genomes read.
Subsampling genomes... done.
13254 genomes retained.
Reading RefSeq genomic data... done.
0 genomes to download.
Extracting non-redundant proteomic data from NCBI RefSeq...
done.
0 genomes, 0 protein sequences, and 0 residues downloaded.
0 proteins extracted.
Task completed.

Could you help me solve this problem? Thank you very much.

BLAST Database error: No database names were found in alias file

Hi,
I have created a BLAST database using the databaser.py code. It took almost 12 hours and reported that the database was successfully created. But while running HGTector.pl, I got "Database error: No database names were found in alias file". I have set the path for protdb in the config file. I am wondering what I am missing.

Error: Invalid taxonomy mapping file format.

Hi,
I was constructing the database with this command:

hgtector database --output hgtector_refseq-ref-repr_bac-arch_species-all_2020-12-28 --cats bacteria,archaea --rank species --reference --representative --retries 10

and then manually compiled it with

echo $'accession\taccession.version\ttaxid' > prot.accession2taxid
zcat taxon.map.gz | awk -v OFS='\t' '{split($1, a, "."); print a[1], $1, $2}' >> prot.accession2taxid
module load diamond ncbiblastplus
mkdir diamond blast
gunzip -c taxon.map.gz >taxon.map
makeblastdb -dbtype prot -in db.faa -out blast/db -title db -parse_seqids -taxid_map taxon.map
diamond makedb --threads 8 --in db.faa --taxonmap prot.accession2taxid --taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp --db diamond/db

Unfortunately, diamond threw an error when it came to loading the prot.accession2taxid.

[...]
Writing accessions...  [3.588s]
Hashing sequences...  [1.296s]
Loading sequences...  [0.001s]
Writing trailer...  [4.612s]
Loading taxonomy...  [0.213s]
Error: Invalid taxonomy mapping file format.

I tried it with a new version of taxdump, but the error persists. Is there a way to regenerate taxon.map.gz without running the entire HGTector database command? I didn't keep the individual proteomes for storage reasons.

Here's a snippet of my taxon.map.gz:

DABXDW000000000.1	1313
MW052618.1	1313
NP_052604.1	386585
NP_052605.1	386585
NP_052606.1	386585
NP_052607.1	386585

and here of the prot.accession2taxid:

accession	accession.version	taxid
DABXDW000000000	DABXDW000000000.1	1313
MW052618	MW052618.1	1313
NP_052604	NP_052604.1	386585
NP_052605	NP_052605.1	386585
NP_052606	NP_052606.1	386585
NP_052607	NP_052607.1	386585
NP_052608	NP_052608.1	386585
NP_052609	NP_052609.1	386585
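For what it's worth, the awk one-liner used here can be mirrored in pure Python, which makes it easier to inspect odd lines interactively (a sketch under the same file-layout assumptions as the shell version — accession.version, tab, taxid — not part of HGTector):

```python
import gzip

def taxon_map_to_accession2taxid(src, dst):
    """Convert HGTector's taxon.map.gz (accession.version <tab> taxid) into
    the three-column prot.accession2taxid format that DIAMOND expects."""
    with gzip.open(src, "rt") as fin, open(dst, "w") as fout:
        fout.write("accession\taccession.version\ttaxid\n")
        for line in fin:
            acc_ver, taxid = line.rstrip("\n").split("\t")
            acc = acc_ver.rsplit(".", 1)[0]  # drop the ".1" version suffix
            fout.write(f"{acc}\t{acc_ver}\t{taxid}\n")
```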

ftplib.error_temp: 425 Unable to build data connection: Connection refused

Hey, here's the script I used and the outcome. Besides the 425 error, I also found that the output database was named stdb instead of Arc-test.

SEBASTIANs-MacBook-Pro:scripts FLFLFLLF$ python databaser.py -format=none -range=archaea -subsample=1 -out Arc-test

Databaser: generate taxonomically balanced, non-redundant proteome database.
Author: Qiyun Zhu ([email protected])
Affiliation: J. Craig Venter Institute
Last updated: Oct 24, 2015

Connecting to the NCBI FTP server... done.
Downloading the NCBI taxonomy database... done.
Reading the NCBI taxonomy database... done.
1404643 taxa read.
Downloading the NCBI representative genome list... done.
Reading the NCBI representative genome list... done.
4183 genomes read.
Reading RefSeq genome list... done.
473 genomes read.
Subsampling genomes... done.
313 genomes retained.
Reading RefSeq genomic data... done.
313 genomes to download.
Extracting non-redundant proteomic data from NCBI RefSeq...
25 50 75 100 125 150 175 200
Traceback (most recent call last):
File "databaser.py", line 315, in
ftp.retrbinary('RETR ' + downfile, fout.write, 1024)
File "//anaconda/lib/python2.7/ftplib.py", line 414, in retrbinary
conn = self.transfercmd(cmd, rest)
File "//anaconda/lib/python2.7/ftplib.py", line 376, in transfercmd
return self.ntransfercmd(cmd, rest)[0]
File "//anaconda/lib/python2.7/ftplib.py", line 358, in ntransfercmd
resp = self.sendcmd(cmd)
File "//anaconda/lib/python2.7/ftplib.py", line 249, in sendcmd
return self.getresp()
File "//anaconda/lib/python2.7/ftplib.py", line 222, in getresp
raise error_temp, resp
ftplib.error_temp: 425 Unable to build data connection: Connection refused

databaser: Connecting to the NCBI FTP server

Hi Qiyun,
I ran into a problem when creating a database. It shows the following error:
Traceback (most recent call last):
File "/home/chenzhiduan/liruiqi/build/HGTector-master/scripts/databaser.py", line 67, in
ftp = ftplib.FTP('ftp.ncbi.nlm.nih.gov', 'anonymous', '')
File "/build/Cellar/python/2.7.6/lib/python2.7/ftplib.py", line 120, in init
self.connect(host)
File "/build/Cellar/python/2.7.6/lib/python2.7/ftplib.py", line 135, in connect
self.sock = socket.create_connection((self.host, self.port), self.timeout)
File "/build/Cellar/python/2.7.6/lib/python2.7/socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
socket.gaierror: [Errno -3] Temporary failure in name resolution

I don't know how to resolve this problem. Could you please help me?
Thank you!

Use of uninitialized value in concatenation

Hello Qiyun Zhu,

Thanks for your excellent software, but when I was running the sample in this directory, I got into trouble with analyzer.pl:

My command: perl scripts/analyzer.pl sample
It reported the following:
-> Analyzer: Identify putative HGT-derived genes based on search results. <-
Warning: Prediction result from a previous analysis is detected.
Press Enter to overwrite, or Ctrl+C to exit:

Reading taxonomic information... done.
Analyzing taxonomic information... done.
All input genomes belong to species Galdieria sulphuraria (TaxID: 130081).
Analysis will work on the following taxonomic ranks:
Self: (user-defined self) Galdieria (TaxID: 83373) (1 members),
Close: (user-defined close) Eukaryota (TaxID: 2759) (1 members),
Distal: all other organisms.
Press Enter to continue, or Ctrl+C to exit:
Reading protein sets...done. 1 sets detected.
Analyzing search results...
0-------------25-------------50------------75------------100%
gsul has 100 proteins. Analyzing...

done.
Raw data are saved in result/statistics/rawdata.txt.
You may conduct further analyses on these data.
Press Enter to continue, or Ctrl+C to exit:

Computing statistics...
All protein sets:
done.
Use of uninitialized value in concatenation (.) or string at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"mean"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"mean"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"stdev"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"stdev"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"min"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"min"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"max"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"max"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"median"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"median"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"mad"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"mad"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"q1"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"q1"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"q3"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"q3"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"cutoff"} in sprintf at analyzer.pl line 1256, line 3.
Use of uninitialized value $h{"cutoff"} in sprintf at analyzer.pl line 1256, line 3.
[the same block of warnings is printed twice more]
Result is saved in result/statistics/fingerprint.txt.
Press Enter to proceed with prediction, or Ctrl+C to exit:
Predicting... done.
Prediction results are saved in result/detail/.

I tried different versions of Perl on different servers, but it doesn't work. Could you give me some advice?

Thanks again!

Getting one hit per protein

I am running the sample input file with the config file set to query NCBI over HTTP, but I am getting just one hit per protein, and it is always the same one, with 100 percent identity.

About donor results!

Hello Qiyun,
It is great to see that HGTector has been updated; this version is very easy to use. Three years ago I used your 2.0 version and got good results, but I did not publish them because I updated my genome assembly. Currently, my genome is almost done, and running small sample data gave me an HGT list. However,
I used the command join -t$'\t' -j1 <(sort hgts/o55h7.txt) <(tail -n+2 scores.tsv | grep ^o55h7$'\t' | cut -f2,8 | sort) > o55h7.donor.txt to get the donor results, and the donor results are empty.
The attachment is my HGT list.
o55h7.txt
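As an aside, the shell join above silently yields nothing whenever the keys in the two sorted streams do not match byte-for-byte. The same lookup can be sketched in pandas to see where it goes empty (the column layout is inferred from the `grep ^o55h7` and `cut -f2,8` in the command; the helper column names c3–c7 are placeholders, not real scores.tsv headers):

```python
import io
import pandas as pd

# Tiny stand-in data mimicking the two files in the join command above.
# Assumed layout: hgts/o55h7.txt is protein ID <tab> score; scores.tsv has
# the sample in column 1, the protein ID in column 2, the match in column 8.
hgts = pd.read_csv(io.StringIO("WP_000001.1\t0.9\n"), sep="\t",
                   header=None, names=["protein", "score"])
scores = pd.read_csv(io.StringIO(
    "sample\tprotein\tc3\tc4\tc5\tc6\tc7\tmatch\n"
    "o55h7\tWP_000001.1\t.\t.\t.\t.\t.\t562\n"), sep="\t")
# Same logic as the shell pipeline: filter to the sample, join on protein ID.
donors = hgts.merge(scores[scores["sample"] == "o55h7"], on="protein")
# An empty `donors` usually means the protein IDs in the two files differ
# (e.g. version suffixes or extra whitespace), not that the data is missing.
```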

Analyzer.pl failed

Hello Qiyunzhu,
Thank you for this new tool for analyzing HGT; it is useful for beginners in computing.
I am working with a genome released this year, looking for HGT within its genus. The first problem I had was that the TaxID wasn't recognized by searcher.pl. Despite this, it performed the BLAST for all proteins in this dataset, and the TaxID was included in the taxonomy files.

But in Step 2, the script analyzer.pl failed, and I don't know why.

This was the message:
-> Analyzer: Identify putative HGT-derived genes based on search results. <-
Reading taxonomic information... done.
Analyzing taxonomic information... done.
Unknown TaxID 0.
Execution of analyzer.pl failed.

The config.txt, for 4939 genes, was:

title=FischerellaNIES
selfTax=FisN:1752063
searchTool=BLAST
httpBlast=1
httpDb=nr
nHits=500
evalue=1e-10
identity=50
ignoreSubspecies=1
selfGroup=1189 # Stigonematales
closeGroup=1117 # Cyanobacteria
howCO=4
bwF=0.3
plotKDE=1
modKCO=1
POE=1

I tried changing the directory structure in case that was the problem, and I changed config.txt a few times, rerunning with the same NEXUS BLAST files, but it doesn't work.
Thanks for your help!

about the GTDB database

Hello, I have another question.
I'm trying to use the GTDB database for HGTector analyses. Is the file "gtdb_r89_rep_genomes.protein.faa.tar.gz" the one that I should download? Thanks!

DIAMOND and BLAST version issue

Hi Qiyun,
I am using the HGTector2 pipeline to investigate HGT events in my 30 bacterial strains. I set up my local DIAMOND database using the file downloaded from your Dropbox, and everything seemed OK. But when I use the search function, I always get a traceback like the following:
Traceback (most recent call last):
File "/home/liub/miniconda3/envs/hgtector/bin/hgtector", line 100, in
main()
File "/home/liub/miniconda3/envs/hgtector/bin/hgtector", line 39, in main
module(args)
File "/home/liub/miniconda3/envs/hgtector/lib/python3.7/site-packages/hgtector/search.py", line 167, in call
else None)
File "/home/liub/miniconda3/envs/hgtector/lib/python3.7/site-packages/hgtector/search.py", line 706, in search_wf
res = self.diamond_search(seqs)
File "/home/liub/miniconda3/envs/hgtector/lib/python3.7/site-packages/hgtector/search.py", line 1542, in diamond_search
raise ValueError('diamond failed with error code {}.'.format(ec))
ValueError: diamond failed with error code 1.
What's wrong here? Waiting for your reply. Thank you very much.
Regards

hgtector search

Hi,
First, thanks for HGTector!
I get an error when using 'hgtector search'.

Command:
conda create -n hgtector python=3 pyyaml pandas matplotlib scikit-learn
conda activate hgtector
pip install git+https://github.com/DittmarLab/HGTector.git
hgtector search -o ll -i o55h7.faa.gz -d diamond/db.dmnd -t taxdump/ -m diamond

Error:
(hgtector) []$ hgtector search -o ll -i o55h7.faa.gz -d diamond/db.dmnd -t taxdump/ -m diamond
Homology search started at 2020-05-22 08:10:24.825237.
Traceback (most recent call last):
File "/home/usr/envs/hgtector/bin/hgtector", line 96, in
main()
File "/home/usr/envs/hgtector/bin/hgtector", line 35, in main
module(args)
File "/home/usr/envs/hgtector/lib/python3.8/site-packages/hgtector/search.py", line 131, in call
self.args_wf(args)
File "/home/usr/envs/hgtector/lib/python3.8/site-packages/hgtector/search.py", line 330, in args_wf
if not isdir(dir_):
File "/home/usr/envs/hgtector/lib/python3.8/genericpath.py", line 42, in isdir
st = os.stat(s)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

HGT sources and GTDB

Hi,
I just noticed that HGTector has reached version 2, and I installed and ran it successfully.
But I have a question: can it report the source of each HGT?
I mean, those genes may come from a certain order, family, or other taxonomic level.

What's more, can we just take the result of 'hgtector search' and convert the protein IDs to TaxIDs to see which species those genes come from?

Thanks for your help and this update. It is really much easier to install and use than before.

Error in diamond makedb related to taxonmap file

Hello,
I've followed the tutorial to build a DIAMOND database with hgtector 2.0b2:

taxon_map="hgtector_db/taxon.map.gz
echo $'accession\taccession.version\ttaxid' > prot.accession2taxid
zcat $taxon_map | awk -v OFS='\t' '{split($1, a, "."); print a[1], $1, $2}' >> prot.accession2taxid

diamond makedb --threads 24 --in  db.faa --taxonmap prot.accession2taxid \
--taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp --db diamond/db

and I found this error at the end of the process (after diamond has already created a 30GB plus file):

Loading sequences...  [2.659s]
Masking sequences...  [3.314s]
Writing sequences...  [0.993s]
Writing accessions...  [1.747s]
Hashing sequences...  [0.21s]
Loading sequences...  [0s]
Writing trailer...  [2.748s]
Loading taxonomy...  [0.258s]
Error: Tokenizer Exception
Error: Tokenizer Exception

I didn't find much information about this, so I tried a few things and finally using the prot.accession2taxid.gz (6.1GB) from the NCBI solved the problem.

The original prot.accession2taxid (created following the instructions) is 500 MB gzipped (I can still share it if you want). If that's too much, some hints about what could be causing this problem would be helpful. The file looks OK to me, so I'm not sure what to look at.

Cheers!
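For anyone hitting the same Tokenizer Exception: DIAMOND is strict about the mapping file being exactly three tab-separated columns with a numeric taxid. A quick scan for offending lines can narrow it down (a sketch assuming that layout, not an official DIAMOND check):

```python
import gzip

def find_bad_lines(path, max_report=10):
    """Report (line number, line) pairs in a (possibly gzipped)
    prot.accession2taxid that are not three tab-separated fields
    ending in a numeric taxid -- the shape DIAMOND tokenizes."""
    opener = gzip.open if path.endswith(".gz") else open
    bad = []
    with opener(path, "rt") as fh:
        next(fh)  # skip the accession/accession.version/taxid header
        for lineno, line in enumerate(fh, start=2):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3 or not fields[2].isdigit():
                bad.append((lineno, line.rstrip("\n")))
                if len(bad) >= max_report:
                    break
    return bad
```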

TypeError: __init__() got an unexpected keyword argument 'required'

Hello,
I created a conda environment (python=3.6.5, with the packages pyyaml, pandas, matplotlib and scikit-learn installed), and then ran the command:
pip install git+https://github.com/DittmarLab/HGTector.git
It showed that hgtector installed successfully. But when I typed the command:
hgtector
I received the following error message.

Traceback (most recent call last):
File "/public/home/chenruolin/.conda/envs/vibrant/bin/hgtector", line 96, in
main()
File "/public/home/chenruolin/.conda/envs/vibrant/bin/hgtector", line 30, in main
args = parse_args(modules)
File "/public/home/chenruolin/.conda/envs/vibrant/bin/hgtector", line 58, in parse_args
title='commands', dest='command', required=True)
File "/public/home/chenruolin/.conda/envs/vibrant/lib/python3.6/argparse.py", line 1703, in add_subparsers
action = parsers_class(option_strings=[], **kwargs)
TypeError: __init__() got an unexpected keyword argument 'required'

Has anyone been in the same situation?
Thanks
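This is a Python version issue rather than an HGTector bug: argparse only accepts required= as a keyword to add_subparsers() from Python 3.7 onward. On 3.6 the same effect can be obtained by setting the attribute after creation (a minimal sketch, not HGTector's actual code):

```python
import argparse

parser = argparse.ArgumentParser()
# On Python 3.6, add_subparsers(..., required=True) raises TypeError,
# so create the subparsers first and set the attribute afterwards.
subparsers = parser.add_subparsers(title='commands', dest='command')
subparsers.required = True
subparsers.add_parser('search')

args = parser.parse_args(['search'])
```

Upgrading the environment to Python 3.7 or later also makes the error go away.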

stuck at Step 1

Hello,

I'm trying to use HGTector on a terminal and wasn't able to run the sample folder. Here's the error message I got:

Step 1: Searcher - batch protein sequence homology search.
This Perl not built to support threads
Compilation failed in require at /home/brooke/HGTector-0.2.1/scripts/searcher.pl line 5.
BEGIN failed--compilation aborted at /home/brooke/HGTector-0.2.1/scripts/searcher.pl line 5.
Error: Execution of searcher.pl failed. HGTector exists.

Is there any way I can change my config files to make it work? I tried other thread options but they didn't help.
Thanks!
Brooke

Error while downloading genomes for the database

Hello,
thanks for this software. I'm using hgtector-2.0b3.

I'm getting this ftplib error at different points of the download process. I've run it a few times now and I haven't been able to finish the task.
I'm using:
hgtector database -o $db_dir --default --overwrite --threads 28

Traceback (most recent call last):
  File "/home/miniconda3/envs/bif2020.1/bin/hgtector", line 96, in <module>
    main()
  File "/home/miniconda3/envs/bif2020.1/bin/hgtector", line 35, in main
    module(args)
  File "/home/miniconda3/envs/bif2020.1/lib/python3.8/site-packages/hgtector/database.py", line 137, in __call__
    self.download_genomes()
  File "/home/miniconda3/envs/bif2020.1/lib/python3.8/site-packages/hgtector/database.py", line 549, in download_genomes
    self.ftp.retrbinary(
  File "/home/miniconda3/envs/bif2020.1/lib/python3.8/ftplib.py", line 425, in retrbinary
    with self.transfercmd(cmd, rest) as conn:
  File "/home/miniconda3/envs/bif2020.1/lib/python3.8/ftplib.py", line 382, in transfercmd
    return self.ntransfercmd(cmd, rest)[0]
  File "/home/miniconda3/envs/bif2020.1/lib/python3.8/ftplib.py", line 348, in ntransfercmd
    resp = self.sendcmd(cmd)
  File "/home/miniconda3/envs/bif2020.1/lib/python3.8/ftplib.py", line 275, in sendcmd
    return self.getresp()
  File "/home/miniconda3/envs/bif2020.1/lib/python3.8/ftplib.py", line 238, in getresp
    resp = self.getmultiline()
  File "/home/miniconda3/envs/bif2020.1/lib/python3.8/ftplib.py", line 224, in getmultiline
    line = self.getline()
  File "/home/miniconda3/envs/bif2020.1/lib/python3.8/ftplib.py", line 212, in getline
    raise EOFError
EOFError

I'm guessing some downloaded file is corrupted, but I don't know how to avoid this problem other than using --overwrite.
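Since the EOFError surfaces at random points, wrapping each FTP transfer in a retry loop is the usual mitigation (the --retries option mentioned earlier in this thread does something along these lines for the database command; the sketch below is generic, not HGTector code):

```python
import ftplib
import time

def with_retries(func, attempts=5, delay=10):
    """Call func(); on any FTP/socket error (ftplib.all_errors covers
    ftplib.Error, OSError and EOFError), wait and retry, re-raising
    after the last attempt."""
    for i in range(attempts):
        try:
            return func()
        except ftplib.all_errors:
            if i == attempts - 1:
                raise
            time.sleep(delay)
```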

IndexError:list index out of range

Hi everyone, I ran HGTector and got an "IndexError: list index out of range" error when analyzing several bacterial genomes. The error occurs in analyze.py at line 138 (self.read_input), line 266 (fname, result, ...) and line 311 (data[-1]['hits'].append(line)). How can I fix this?
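The quoted line data[-1]['hits'].append(line) fails exactly when a hit line is read before any query header, i.e. data is still empty, which usually points to a truncated or reordered search output file. The failure mode can be reproduced with a toy parser that follows the same pattern (a sketch, not the actual analyze.py code; the '#'-header convention here is an assumption for illustration):

```python
def parse_hits(lines):
    """Group hit lines under the most recent '#' header line."""
    data = []
    for line in lines:
        if line.startswith("#"):
            data.append({"query": line[1:].strip(), "hits": []})
        elif line.strip():
            # IndexError if the file starts with a hit line (data is empty)
            data[-1]["hits"].append(line.strip())
    return data
```

Checking that each search output file begins with its header lines is a good first step before re-running the analysis.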

Cutoff

Hi - I have been playing around with my input parameters for a while now, but regardless of what I include in the self-, close-, or distal group, my cutoffs are always <0.03. It also does not matter whether several bacterial genomes of the same genus are analyzed simultaneously or a single genome.
Inputs that I tried:
self group: the TaxID of the genome of interest; the genus of the single genome of interest
close group: the genus of the single genome of interest; closely related other genera; family, order, class, and phylum level; and domain
Papers that report using HGTector all show much higher cutoffs for the close and distal groups. Can you please help me understand what this low cutoff means, or can we discuss any analytical errors I may have made?

Thanks again,

Markus

Databaser.py: I have a problem

Hi!
I ran into a problem when creating a database; it shows the following error:

germanmtraglia@ubuntu:/media/germanmtraglia/6E5CA6935CA6561D/seq/new_seq_robert/HGTector-master/scripts$ python databaser.py

Databaser: generate taxonomically balanced, non-redundant proteome database.
Author: Qiyun Zhu ([email protected])
Affiliation: J. Craig Venter Institute
Last updated: Jan 28, 2017

Usage:
python scripts/databaser.py [-format=none|blast] [-range=all|microbe|etc] [-represent=no|auto|filename] [-subsample=0|1|2...] [-out=dbname]
Note:
By default, the script will download NCBI RefSeq proteomic data of microbial organisms, keep one proteome per species, plus all NCBI-defined representative proteomes.
Use "-help" to display details of program parameters.
Press Enter to continue or Ctrl+C to exit...
Connecting to the NCBI FTP server... done.
Downloading the NCBI taxonomy database... done.
Reading the NCBI taxonomy database... done.
1591763 taxa read.
Downloading the NCBI representative genome list... done.
Reading the NCBI representative genome list... done.
4380 genomes read.
Reading RefSeq genome list... done.
85021 genomes read.
Subsampling genomes... done.
13379 genomes retained.
Downloading non-redundant proteomic data from NCBI RefSeq...
25 Traceback (most recent call last):
File "databaser.py", line 292, in
files = ftp.nlst()
File "/usr/lib/python2.7/ftplib.py", line 506, in nlst
self.retrlines(cmd, files.append)
File "/usr/lib/python2.7/ftplib.py", line 429, in retrlines
conn = self.transfercmd(cmd)
File "/usr/lib/python2.7/ftplib.py", line 368, in transfercmd
return self.ntransfercmd(cmd, rest)[0]
File "/usr/lib/python2.7/ftplib.py", line 327, in ntransfercmd
conn = socket.create_connection((host, port), self.timeout)
File "/usr/lib/python2.7/socket.py", line 571, in create_connection
raise err
socket.error: [Errno 110] Connection timed out

I don't know how to resolve this problem. Could you please help me?
Thank you!

Summarize HGT Events by gene orthology spawns runaway perl processes

Hello:
Let me first say that I am impressed with the HGTector pipeline. I fired most of it up without any problems -- BLAST works well, as does the integration with R and Perl. The problem I am running into is when I choose to summarize HGT events by orthology:

Summarize HGT events by gene orthology

byOrthology=1

All pairRules that I tried failed in the same manner: the pipeline fires off multiple Perl jobs (probably tens of thousands, very likely many more), consumes all available RAM (in my case 128 GB), and then expands into swap space. The jobs never finish. When I Ctrl+C out, there is voluminous bash output ending with

Error: Identification of orthology failed.
Error: Identification of orthology failed.
Error: Identification of orthology failed.
Error: Identification of orthology failed.
Execution of analyzer.pl failed.

I tried running with different tree builders (fasttree, mafft).

I'm attaching a config file that I've run on input files generated in previous steps (BLAST, taxonomy, etc. already done). Maybe I'm setting things up wrong. Do you have any suggestions?
Thanks,
Greg

config.txt

No hit is assigned to distal group. Cannot predict HGTs.

Hi everyone,
I'm now using HGTector2 to analyse HGT events in one archaeal genome. I set up the local database according to the commands provided by the author, but there was a problem when I used the analyze function. I don't know why.

Reporter: Generate reports of HGT prediction results - loop

Hello,
I am working on a set of 86 proteins, and during the final step (Report), the message "Reporter: Generate reports of HGT prediction results." appears over and over on the terminal screen, seemingly forever. Is this normal for a set of 86 proteins?
Here my configuration file:

interactive=0
selfTax=BCP1:191292
httpBlast=1
taxBlast=1
alnBlast=1
evalue=1e-10
identity=30
nHits=100
ignoreSubspecies=1
selfRank=genus
graphFp=1
exOutlier=1
dipTest=1
bwF=0.3
plotKDE=1
title=report
byOrthology=1

Thank you!

p.s really interesting software!

BrokenPipeError: [Errno 32] Broken pipe : databaser.py

Version: HGTector 0.2.2

Dear HGTector,

I am trying to prepare a database. Based on the given examples, I used the following command (see below). I have tried various ways of altering this command and I receive different errors; however, this is the closest to the example you documented. Most errors point to line 333:
File "scripts/databaser.py", line 333, in <module>
ftp.cwd('/genomes/all/GCF/' + dir)

Questions: I know you have tested this on bacteria (your paper), but can the tool work for eukaryotic organisms? (Your lab works on fleas, so surely you have used it for insects?) If this database preparation keeps failing, can I pass it a manually run BLAST (to skip this step)?

Regards,

Peter Thorpe

Command:

python scripts/databaser.py -format=blast -range=protozoa,fungi,archaea,bacteria,invertebrate,plant,vertebrate_mammalian,vertebrate_other,viral -subsample=1 -represent=auto -out="MyEukarie"

-> Databaser: generate taxonomically balanced, non-redundant genome and proteome database. <-

Connecting to the NCBI FTP server... done.
Downloading the NCBI taxonomy database... done.
Reading the NCBI taxonomy database... done.
2112593 taxa read.
Downloading the NCBI representative genome list... done.
Reading the NCBI representative genome list... done.
5800 genomes read.
Reading RefSeq genome list... done.
166196 genomes read.
Subsampling genomes... done.
33273 genomes retained.
Downloading non-redundant genomic data from NCBI RefSeq...
25 50 75 100 125 150 175 200 225 250 275 300 325 350 375 400 425 450 475 500 525 550 575 600 625 650 675 700 725 750 775 800 825 850 875 900 925 950 975 1000 1025 1050 1075 Traceback (most recent call last):
File "scripts/databaser.py", line 333, in
ftp.cwd('/genomes/all/GCF/' + dir)
File "/shelf/apps/pjt6/conda/envs/HGtector/lib/python3.6/ftplib.py", line 631, in cwd
return self.voidcmd(cmd)
File "/shelf/apps/pjt6/conda/envs/HGtector/lib/python3.6/ftplib.py", line 277, in voidcmd
self.putcmd(cmd)
File "/shelf/apps/pjt6/conda/envs/HGtector/lib/python3.6/ftplib.py", line 199, in putcmd
self.putline(line)
File "/shelf/apps/pjt6/conda/envs/HGtector/lib/python3.6/ftplib.py", line 194, in putline
self.sock.sendall(line.encode(self.encoding))
BrokenPipeError: [Errno 32] Broken pipe

ValueError: Invalid BLAST database: hgtdb/blast/db.

Hi, I am having troubles running HGTector with blast:

I downloaded the database from dropbox and ran this command:
makeblastdb -dbtype prot -in db.faa -out blast/db -title db -parse_seqids -taxid_map taxon.map

However, when running search, I am getting an error:

hgtector search -i proteins.fasta -o test -m blast -p 8 -d hgtdb/blast/db -t hgtdb/taxdump/
Homology search started at 2020-03-17 11:59:51.025748.
Traceback (most recent call last):
  File "/home/user/bin/hgtector", line 4, in <module>
    __import__('pkg_resources').run_script('hgtector==2.0b1', 'hgtector')
  File "/home/user/lib/python3.7/site-packages/pkg_resources/__init__.py", line 661, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/user/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1441, in run_script
    exec(code, namespace, namespace)
  File "/home/user/lib/python3.7/site-packages/hgtector-2.0b1-py3.7.egg/EGG-INFO/scripts/hgtector", line 100, in <module>
    main()
  File "/home/user/lib/python3.7/site-packages/hgtector-2.0b1-py3.7.egg/EGG-INFO/scripts/hgtector", line 39, in main
    module(args)
  File "/home/user/lib/python3.7/site-packages/hgtector-2.0b1-py3.7.egg/hgtector/search.py", line 131, in __call__
    self.args_wf(args)
  File "/home/user/lib/python3.7/site-packages/hgtector-2.0b1-py3.7.egg/hgtector/search.py", line 285, in args_wf
    blast_db = self.check_blast()
  File "/home/user/lib/python3.7/site-packages/hgtector-2.0b1-py3.7.egg/hgtector/search.py", line 507, in check_blast
    .format(db_))
ValueError: Invalid BLAST database: hgtdb/blast/db.

Also, these commands are failing:

hgtector search -i proteins.fasta -o test -m blast -p 8 -d hgtdb/blast/db. -t hgtdb/taxdump/
hgtector search -i proteins.fasta -o test -m blast -p 8 -d hgtdb/blast/ -t hgtdb/taxdump/
hgtector search -i proteins.fasta -o test -m blast -p 8 -d hgtdb/blast/db.faa -t hgtdb/taxdump/
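A BLAST protein database is a set of files sharing one prefix (db.phr, db.pin, db.psq, and friends), and -d must name that prefix: not a directory, not the FASTA file, which is why the three variants above also fail. A quick existence check on the prefix (a sketch; the extensions listed are the classic v4 core files, and v5 databases add more):

```python
from pathlib import Path

def looks_like_blast_db(prefix):
    """True if the core protein-database files exist for this prefix."""
    core = (".phr", ".pin", ".psq")  # v4 core files; v5 adds .pdb, .pot, ...
    return all(Path(str(prefix) + ext).exists() for ext in core)
```

If the check fails for hgtdb/blast/db, the makeblastdb step itself likely errored out (for example on -parse_seqids), and its log is the place to look.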

ftplib.error_temp: 421 Idle timeout (60 seconds): closing control connection

Hi, here's the script I used and the error reports.
[yuanyang@localhost HGTector_nr_manually_2]$ python databaser.py
-> Databaser: generate taxonomically balanced, non-redundant genome and proteome database. <-

Usage:
python scripts/databaser.py [-format=none|blast] [-range=all|microbe|etc] [-represent=no|auto|filename] [-subsample=0|1|2...] [-out=dbname]
Note:
By default, the script will download NCBI RefSeq genomic data of microbial organisms, keep one genome per species, plus all NCBI-defined representative genomes.
Use "-help" to display details of program parameters.
Press Enter to continue or Ctrl+C to exit...
Connecting to the NCBI FTP server... done.
Downloading the NCBI taxonomy database... Traceback (most recent call last):
File "databaser.py", line 103, in
ftp.retrbinary('RETR ' + 'taxdump.tar.gz', fout.write, 1024)
File "/data/data/yuanyang/softwares/lib/python2.7/ftplib.py", line 425, in retrbinary
return self.voidresp()
File "/data/data/yuanyang/softwares/lib/python2.7/ftplib.py", line 231, in voidresp
resp = self.getresp()
File "/data/data/yuanyang/softwares/lib/python2.7/ftplib.py", line 217, in getresp
resp = self.getmultiline()
File "/data/data/yuanyang/softwares/lib/python2.7/ftplib.py", line 203, in getmultiline
line = self.getline()
File "/data/data/yuanyang/softwares/lib/python2.7/ftplib.py", line 193, in getline
if not line: raise EOFError
EOFError

The donor of HGT predicted by HGTector-2.0b2

@qiyunzhu Dear Qiyun Zhu,
I am liub, who talked with you in November last year.

I notice that you have added some new functions in the latest HGTector-2.0b2 version, but the module that predicts the potential HGT donor at a specific phylogenetic rank is still unavailable. I tried to obtain this result using HGTector 0.2.1 and 0.2.2, but they always show an error message or bug at different steps.
Could you please provide me with a solution to achieve this in the HGTector-2.0b2 version? Thanks a lot.

Regards
liub

HTTP BLAST parameters

Hi Qiyun Zhu:
Thanks for your excellent software. I intend to use HGTector to infer putative HGTs. I have downloaded and uncompressed the HGTector v0.2.2 package and created the configuration file "config.txt" in the working directory. I want to call the NCBI server to perform BLAST and to retrieve sequence/taxonomy information, and I want to change the following parameters to speed up the search:

requests (number of requests sent simultaneously): from 1 to 5
delay (time delay in seconds between requests): from 30 to 10
timeout (seconds before giving up on a long wait): from 600 to 100

I am not sure whether this will influence the prediction accuracy for HGT genes. Would it be possible for you to help me? Thanks.

Sincerely,
Yuan

message:urllib.error.HTTPError: HTTP Error 414: Request-URI Too Long

Hi Qiyun,
I am using the HGTector2 pipeline. When I ran my protein input file (FASTA, about 4000 proteins), I always got a traceback like the one below, ending with the error message urllib.error.HTTPError: HTTP Error 414: Request-URI Too Long.
What do you think about this problem?
Many thanks for your advice
Best regards
Jean claude

Homology search started at 2020-04-20 14:24:42.412971.
Settings:
Search method: remote.
Self-alignment method: native.
Remote fetch enabled: yes.
Reading input proteins...
XKB1.1-Proteins: 4050 proteins.
Done. Read 4050 proteins from 1 samples.
Dropping sequences shorter than 30 aa... done.
Reading custom taxonomy database... done.
Read 1 taxa.
Batch homology search of XKB1.1-Proteins started at 2020-04-20 14:24:42.796838.
Number of queries: 4046.
Submitting 21 queries for search.
Traceback (most recent call last):
File "/home/jogier/anaconda3/envs/hgtector/bin/hgtector", line 100, in <module>
main()
File "/home/jogier/anaconda3/envs/hgtector/bin/hgtector", line 39, in main
module(args)
File "/home/jogier/anaconda3/envs/hgtector/lib/python3.8/site-packages/hgtector/search.py", line 165, in __call__
res = self.search_wf(
File "/home/jogier/anaconda3/envs/hgtector/lib/python3.8/site-packages/hgtector/search.py", line 701, in search_wf
res = self.remote_search(seqs)
File "/home/jogier/anaconda3/envs/hgtector/lib/python3.8/site-packages/hgtector/search.py", line 1599, in remote_search
with urlopen(url) as response:
File "/home/jogier/anaconda3/envs/hgtector/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/home/jogier/anaconda3/envs/hgtector/lib/python3.8/urllib/request.py", line 531, in open
response = meth(req, response)
File "/home/jogier/anaconda3/envs/hgtector/lib/python3.8/urllib/request.py", line 640, in http_response
response = self.parent.error(
File "/home/jogier/anaconda3/envs/hgtector/lib/python3.8/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/home/jogier/anaconda3/envs/hgtector/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/home/jogier/anaconda3/envs/hgtector/lib/python3.8/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 414: Request-URI Too Long
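(Editorial note: HTTP 414 means the request URL exceeded the server's length limit, which can happen when many query sequences are packed into a single GET request. The usual workaround is to send fewer queries per request so each assembled URL stays short. The helper below is an illustrative sketch of that batching idea, not HGTector's actual implementation.)

```python
def batches(items, size):
    """Yield successive fixed-size batches from a list of queries.

    Sending fewer queries per request keeps the assembled URL short,
    which is the usual way to avoid HTTP 414 (Request-URI Too Long).
    """
    for start in range(0, len(items), size):
        yield items[start:start + size]


# e.g. the 21 queries from the log above, sent 5 at a time
queries = [f"seq{i}" for i in range(21)]
groups = list(batches(queries, 5))  # 5 batches: 5+5+5+5+1 queries
```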

Error in running DIAMOND. Please check.

I am trying to use the program for some bacterial proteins.
I have diamond installed:

$ diamond --version
diamond version 0.9.21

But I get the error in the title when I try to run HGTector.

$ perl HGTector.pl Test_HGT_input/

+---------------------+
|   HGTector v0.2.2   |
+---------------------+

Validating task...
Done.

Step 1: Searcher - batch protein sequence homology search.

-> Searcher: Batch sequence homology searching and filtering. <-
Reading input data...
  Test: 8 proteins.
Done. 8 proteins from 1 set(s) to query.
Reading taxonomy database... done. 1616518 records read.
Reading protein-to-TaxID dictionary... done. 46930502 records read.
Enter the TaxID of Test, or press Enter if you don't know:258255
Taxonomy of input protein sets:
  Test: XXXXXXXX
Batch homology search of Test (8 queries) started.
Error in running DIAMOND. Please check. at /HGTector-0.2.2/scripts/searcher.pl line 698, <STDIN> line 1.
Error: Execution of searcher.pl failed. HGTector exists.

Any idea where the problem might be?
Thanks

blastdbcmd error

Hi,
I am trying to use the local BLAST+ 2.4.0 programs to annotate my data with HGTector, but an error occurred. The NR database I used was downloaded previously, not built by databaser.py.

Error: [blastdbcmd] copri_1_1: OID not found
Error: [blastdbcmd] copri_1_1: OID not found
BLAST query/options error: Entry or entries not found in BLAST database

And here is my config.txt:
interactive=0
selfTax=pcopri:165179
protdb=/home/wang/software/ncbi-blast-2.4.0+/nr/nr
taxdump=/home/wang/software/ncbi-blast-2.4.0+/taxdmp
prot2taxid=/home/wang/software/ncbi-blast-2.4.0+
threads=20
blastp=/home/wang/software/ncbi-blast-2.4.0+/bin/blastp
blastdbcmd=/home/wang/software/ncbi-blast-2.4.0+/bin/blastdbcmd
identity=30
coverage=50
ignoreTaxa=vector,synthetic,phage
graphFp=1
exOutlier=1
title=prevotella copri
outExcel=1

Thanks!
