TEsorter: an accurate and fast method to classify LTR-retrotransposons in plant genomes

Home Page: https://doi.org/10.1093/hr/uhac017

License: GNU General Public License v3.0

TEsorter

TEsorter was originally written for LTR_retriever, to classify long terminal repeat retrotransposons (LTR-RTs). It can also classify other transposable elements (TEs), including the Class I and Class II elements covered by the REXdb database.

Since version v1.4, a GENOME mode is supported to identify TE protein domains throughout a whole genome.

For more details of methods and benchmarking results in classifying TEs, please see the paper in Horticulture Research.

Installation
- Using bioconda
- Old school
Quick Start
Citations
Outputs
Usage
Limitations
Further phylogenetic analyses
Extracting TE sequences from genome for TEsorter

Installation

Using bioconda

conda install -c bioconda tesorter

Old school

Dependencies:

python >3
+ biopython: install with pip install biopython or conda install biopython
+ xopen: install with pip install xopen or conda install xopen
hmmscan 3.3x: must be compatible with the HMMER3/f database format; install with conda install hmmer
blast+: install with conda install blast
TEsorter:

git clone https://github.com/zhangrengang/TEsorter
cd TEsorter
python setup.py install

Quick Start

# run the example
TEsorter-test
# or
TEsorter TEsorter/test/rice6.9.5.liban

By default, the newly released REXdb (viridiplantae_v3.0 + metazoa_v3) database is used. It is more sensitive and more general, and is therefore recommended.

For plants, it may be better to use only the plant database. Note that in this ELEMENT mode the input file contains TE or LTR sequences, not genome sequences:

TEsorter TE.fasta -db rexdb-plant

Classical GyDB can also be used:

TEsorter TE.fasta -db gydb

To speed up, use more processors [default=4]:

TEsorter TE.fasta -p 20

To improve sensitivity, relax the criteria (lower the coverage threshold and raise the E-value cutoff):

TEsorter TE.fasta -p 20 -cov 10 -eval 1e-2

To improve specificity, tighten the criteria and disable the pass-2 mode:

TEsorter TE.fasta -p 20 -cov 30 -eval 1e-5 -dp2

To improve the sensitivity of pass-2, relax the 80-80-80 rule, which may be too strict for superfamily-level classification:

TEsorter TE.fasta -p 20 -rule 70-30-80
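
As a rough illustration of what an identity-coverage-length rule string expresses, here is a minimal Python sketch. It is not TEsorter's own code: the names `Pass2Rule`, `parse_rule` and `passes` are hypothetical, and the sketch assumes that each of the three values is an inclusive minimum.

```python
from typing import NamedTuple


class Pass2Rule(NamedTuple):
    min_identity: float  # percent identity
    min_coverage: float  # percent coverage
    min_length: int      # aligned length


def parse_rule(rule: str) -> Pass2Rule:
    """Split an 'identity-coverage-length' string such as '80-80-80'."""
    identity, coverage, length = rule.split("-")
    return Pass2Rule(float(identity), float(coverage), int(length))


def passes(rule: Pass2Rule, identity: float, coverage: float, length: int) -> bool:
    # Assumption: every threshold is an inclusive minimum.
    return (
        identity >= rule.min_identity
        and coverage >= rule.min_coverage
        and length >= rule.min_length
    )


default_rule = parse_rule("80-80-80")
relaxed_rule = parse_rule("70-30-80")

# A made-up similarity hit: 75% identity, 40% coverage, length 500.
print(passes(default_rule, 75.0, 40.0, 500))  # False
print(passes(relaxed_rule, 75.0, 40.0, 500))  # True
```

Under these assumptions, the made-up hit is rejected by the default rule but accepted by the relaxed one, which is the trade-off the -rule option exposes.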

To classify TE polyprotein sequences (an example) or gene protein sequences:

TEsorter RepeatPeps.lib -st prot -p 20

Since version v1.4, a GENOME mode (input genome sequences) is supported to identify TE protein domains throughout the whole genome:

TEsorter genome.fasta -genome -p 20 -prob 0.9

Citations

If you use the TEsorter tool, please cite:

Zhang RG, Li GL, Wang XL et al. TEsorter: an accurate and fast method to classify LTR retrotransposons in plant genomes [J]. Horticulture Research, 2022, 9: uhac017 https://doi.org/10.1093/hr/uhac017

If you use the REXdb database (-db rexdb/rexdb-plant/rexdb-metazoa), please cite:

Neumann P, Novák P, Hoštáková N et al. Systematic survey of plant LTR-retrotransposons elucidates phylogenetic relationships of their polyprotein domains and provides a reference for element classification [J]. Mobile DNA, 2019, 10: 1 https://doi.org/10.1186/s13100-018-0144-1

If you use the GyDB database (-db gydb), please cite:

Llorens C, Futami R, Covelli L et al. The Gypsy Database (GyDB) of mobile genetic elements: release 2.0 [J]. Nucleic Acids Research, 2011, 39: 70–74 https://doi.org/10.1093/nar/gkq1061

If you use the AnnoSINE database (-db sine), please cite:

Li Y, Jiang N, Sun Y. AnnoSINE: a short interspersed nuclear elements annotation tool for plant genomes [J]. Plant Physiology, 2022, 188: 955–970 http://doi.org/10.1093/plphys/kiab524

If you use the LINE/RT database (-db rexdb-line), please cite:

Kapitonov VV, Tempel S, Jurka J. Simple and fast classification of non-LTR retrotransposons based on phylogeny of their RT domain protein sequences [J]. Gene, 2009, 448: 207–213 http://doi.org/10.1016/j.gene.2009.07.019

If you use the DNA/TIR database (-db rexdb-pnas), please cite:

Yuan YW, Wessler SR. The catalytic domain of all eukaryotic cut-and-paste transposase superfamilies [J]. Proceedings of the National Academy of Sciences, 2011, 108: 7884–7889 http://doi.org/10.1073/pnas.1104208108

Outputs

rice6.9.5.liban.rexdb.domtbl        HMMScan raw output
rice6.9.5.liban.rexdb.dom.faa       protein sequences of domain, which can be used for phylogenetic analysis.
rice6.9.5.liban.rexdb.dom.tsv       inner domains of TEs/LTR-RTs, which might be used to filter domains based on their scores and coverages.
rice6.9.5.liban.rexdb.dom.gff3      domain annotations in `gff3` format
rice6.9.5.liban.rexdb.cls.tsv       TEs/LTR-RTs classifications
    Column 1: raw id
    Column 2: Order, e.g. LTR
    Column 3: Superfamily, e.g. Copia
    Column 4: Clade, e.g. SIRE
    Column 5: Complete, "yes" means one LTR Copia/Gypsy element with full GAG-POL domains.
    Column 6: Strand, + or - or ?
    Column 7: Domains, e.g. GAG|SIRE PROT|SIRE INT|SIRE RT|SIRE RH|SIRE; `none` for pass-2 classifications
rice6.9.5.liban.rexdb.cls.lib       fasta library for RepeatMasker
rice6.9.5.liban.rexdb.cls.pep       the same sequences as `rice6.9.5.liban.rexdb.dom.faa`, but id is changed with classifications.
rice6.9.5.liban.rexdb.*masked       sequences with the TE domains masked

Note: the GENOME mode (-genome) will not output *.cls.* files.
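
For downstream scripting, the seven columns of the *.cls.tsv file described above can be read with a few lines of Python. This is only a sketch: the function name `read_cls_tsv` and the dictionary keys are hypothetical, the file is assumed to be tab-separated with `#`-prefixed header lines, and the two sample rows are illustrative.

```python
import csv
import io
from typing import Dict, Iterator

COLUMNS = ["id", "order", "superfamily", "clade", "complete", "strand", "domains"]


def read_cls_tsv(handle) -> Iterator[Dict[str, object]]:
    """Yield one dict per classified sequence, skipping '#' header lines."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        record: Dict[str, object] = dict(zip(COLUMNS, row))
        # 'GAG|Ivana PROT|Ivana' -> [('GAG', 'Ivana'), ('PROT', 'Ivana')];
        # a value without '|' (such as 'none') gives an empty list.
        domains = str(record.get("domains", ""))
        record["domains"] = [
            tuple(item.split("|", 1)) for item in domains.split() if "|" in item
        ]
        yield record


sample = (
    "#TE\tOrder\tSuperfamily\tClade\tComplete\tStrand\tDomains\n"
    "Chr10_15280546_15283837#DNA/DTM\tLTR\tCopia\tIvana\tno\t+\tGAG|Ivana PROT|Ivana\n"
    "Chr11_23650026_23652156#DNA/DTM\tLTR\tGypsy\tTekay\tno\t+\tRT|Tekay\n"
)
records = list(read_cls_tsv(io.StringIO(sample)))
for rec in records:
    print(rec["id"], rec["superfamily"], rec["clade"], rec["domains"])
```

Splitting the Domains column into (domain, clade) pairs makes it easy to select, for example, all elements that carry an RT domain.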

Usage

$ TEsorter  -h
usage: TEsorter [-h] [-v] [-db {gydb,rexdb,rexdb-plant,rexdb-metazoa,rexdb-pnas,rexdb-line,sine}] [--db-hmm DB_HMM]
                [-st {nucl,prot}] [-pre PREFIX] [-fw] [-p PROCESSORS] [-tmp TMP_DIR] [-cov MIN_COVERAGE] [-eval MAX_EVALUE]
                [-prob MIN_PROBABILITY] [-nocln] [-cite] [-dp2] [-rule PASS2_RULE] [-nolib] [-norc] [-genome]
                [-win_size WIN_SIZE] [-win_ovl WIN_OVL]
                sequence

lineage-level classification of transposable elements using conserved protein domains.

positional arguments:
  sequence              input TE/LTR or genome sequences in fasta format [required]

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -db {gydb,rexdb,rexdb-plant,rexdb-metazoa,rexdb-pnas,rexdb-line,sine}, --hmm-database {gydb,rexdb,rexdb-plant,rexdb-metazoa,rexdb-pnas,rexdb-line,sine}
                        the database name used [default=rexdb]
  --db-hmm DB_HMM       the database HMM file used (prior to `-db`) [default=None]
  -st {nucl,prot}, --seq-type {nucl,prot}
                        'nucl' for DNA or 'prot' for protein [default=nucl]
  -pre PREFIX, --prefix PREFIX
                        output prefix [default='{-s}.{-db}']
  -fw, --force-write-hmmscan
                        if False, will use the existed hmmscan outfile and skip hmmscan [default=False]
  -p PROCESSORS, --processors PROCESSORS
                        processors to use [default=4]
  -tmp TMP_DIR, --tmp-dir TMP_DIR
                        directory for temporary files [default=./tmp-e104611e-7ce3-11ed-90b7-0b7f57d69b28]
  -cov MIN_COVERAGE, --min-coverage MIN_COVERAGE
                        mininum coverage for protein domains in HMMScan output [default=20]
  -eval MAX_EVALUE, --max-evalue MAX_EVALUE
                        maxinum E-value for protein domains in HMMScan output [default=0.001]
  -prob MIN_PROBABILITY, --min-probability MIN_PROBABILITY
                        mininum posterior probability for protein domains in HMMScan output [default=0.5]
  -mask {soft,hard} [{soft,hard} ...]
                        output masked sequences (soft: masking with lowercase;
                        hard: masking with N) [default=None]
  -nocln, --no-cleanup  do not clean up the temporary directory [default=False]
  -cite, --citation     print the citation and exit [default=False]

ELEMENT mode (default):
  Input TE/LTR sequences to classify them into clade-level.

  -dp2, --disable-pass2
                        do not further classify the unclassified sequences [default=False for `nucl`, True for `prot`]
  -rule PASS2_RULE, --pass2-rule PASS2_RULE
                        classifying rule [identity-coverage-length] in pass-2 based on similarity [default=80-80-80]
  -nolib, --no-library  do not generate a library file for RepeatMasker [default=False]
  -norc, --no-reverse   do not reverse complement sequences if they are detected in minus strand [default=False]

GENOME mode:
  Input genome sequences to detect TE protein domains throughout whole genome.

  -genome               input is genome sequences [default=False]
  -win_size WIN_SIZE    window size of chunking genome sequences [default=270000]
  -win_ovl WIN_OVL      overlap size of windows [default=30000]
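
To give a feel for how the -win_size and -win_ovl options of the GENOME mode could interact, here is a small Python sketch that cuts a sequence into overlapping windows. The function `windows` is hypothetical, and the sketch assumes that each window spans `win_size` bases and that neighbouring windows share `win_ovl` bases; TEsorter's actual chunking may differ in detail.

```python
from typing import Iterator, Tuple


def windows(seq_len: int, win_size: int = 270000,
            win_ovl: int = 30000) -> Iterator[Tuple[int, int]]:
    """Yield 0-based, end-exclusive (start, end) windows covering a sequence."""
    if not 0 <= win_ovl < win_size:
        raise ValueError("overlap must be non-negative and smaller than the window")
    step = win_size - win_ovl  # assumed: neighbouring windows share win_ovl bases
    start = 0
    while True:
        end = min(start + win_size, seq_len)
        yield start, end
        if end >= seq_len:
            break
        start += step


# A made-up 600 kb sequence with the default window settings.
print(list(windows(600000)))
# [(0, 270000), (240000, 510000), (480000, 600000)]
```

The overlap means that a domain lying across a window boundary is still fully contained in at least one window, as long as it is shorter than the overlap.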

Limitations

For each domain (e.g. RT), only the best hit (the one with the highest score) is reported. This means that 1) if the frame is shifted, only one part can be annotated; and 2) if two or more RT domains are present in one query sequence, only one of them will be annotated.
Many LTR-RTs cannot be classified because they have no hit, which might be because 1) the database is still incomplete; 2) some LTR-RTs have too many mutations, such as frameshifts and stop-gain mutations, or have lost protein domains; or 3) some LTR-RTs are false positives. For the test data set (rice6.9.5.liban), ~84% of LTR-RTs (_INT sequences) are classified.
Non-autonomous TEs that lack protein domains, inactive autonomous TEs that have lost their protein domains, and any other elements that contain no protein domains are expected to remain unclassified.
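
The first limitation, one best hit per domain, can be pictured with the following Python sketch. It only illustrates the selection principle and is not TEsorter's implementation; the `Hit` record, its fields and the scores are made up.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Hit(NamedTuple):
    query: str    # sequence ID
    domain: str   # e.g. 'RT'
    clade: str
    score: float


def best_hits(hits: Iterable[Hit]) -> Dict[Tuple[str, str], Hit]:
    """Keep only the highest-scoring hit for each (query, domain) pair."""
    best: Dict[Tuple[str, str], Hit] = {}
    for hit in hits:
        key = (hit.query, hit.domain)
        if key not in best or hit.score > best[key].score:
            best[key] = hit
    return best


hits = [
    Hit("te1", "RT", "Tekay", 120.5),
    Hit("te1", "RT", "Reina", 98.0),   # second RT hit in the same query
    Hit("te1", "INT", "Tekay", 80.0),
]
best = best_hits(hits)
print(best[("te1", "RT")].clade)  # Tekay
print(len(best))                  # 2
```

The second RT hit is dropped, which is why a query carrying two RT domains, or one RT domain split by a frameshift, ends up with a single RT annotation.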

Further phylogenetic analyses

You may want to use the RT domains to analyze relationships among retrotransposons (LTR, LINE, DIRS, etc.). Here is an example (with mafft and iqtree installed):

# to extract RT domain sequences
concatenate_domains.py rice6.9.5.liban.rexdb.cls.pep RT > rice6.9.5.liban.rexdb.cls.pep.RT.aln

# to reconstruct the phylogenetic tree with IQ-TREE or other tools
iqtree -s rice6.9.5.liban.rexdb.cls.pep.RT.aln -bb 1000 -nt AUTO

# Finally, visualize and edit the tree 'rice6.9.5.liban.rexdb.cls.pep.RT.aln.treefile' with FigTree or other tools.

The alignments of LTR-RTs full domains can be generated by (align and concatenate; concatenate_domains.py will convert all special characters to _ to be compatible with iqtree and scripts/LTR_tree.R):

concatenate_domains.py rice6.9.5.liban.rexdb.cls.pep GAG PROT RH RT INT > rice6.9.5.liban.rexdb.cls.pep.full.aln

The alignments of Class I INT and Class II TPase (DDE-transposases) can be generated by:

concatenate_domains.py rice6.9.5.liban.rexdb.cls.pep INT > rice6.9.5.liban.rexdb.cls.pep.INT.aln
concatenate_domains.py rice6.9.5.liban.rexdb.cls.pep TPase > rice6.9.5.liban.rexdb.cls.pep.TPase.aln
cat rice6.9.5.liban.rexdb.cls.pep.INT.aln rice6.9.5.liban.rexdb.cls.pep.TPase.aln > rice6.9.5.liban.rexdb.cls.pep.INT_TPase.faa
mafft --auto rice6.9.5.liban.rexdb.cls.pep.INT_TPase.faa > rice6.9.5.liban.rexdb.cls.pep.INT_TPase.aln

Note: the domain names between rexdb and gydb are somewhat different: PROT (rexdb) = AP (gydb), RH (rexdb) = RNaseH (gydb). Please use the actual domain name.
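
If the same script has to work with both databases, the two naming differences listed in the note can be kept in a small lookup table. This Python sketch is hypothetical: it covers only the two pairs named above and passes every other name through unchanged, which is an assumption, so check the domain names that actually appear in your output.

```python
from typing import Iterable, List

# Only the two differences named in the note; other names are assumed identical.
REXDB_TO_GYDB = {"PROT": "AP", "RH": "RNaseH"}


def to_gydb(domains: Iterable[str]) -> List[str]:
    """Translate rexdb domain names into their gydb equivalents."""
    return [REXDB_TO_GYDB.get(name, name) for name in domains]


print(to_gydb(["GAG", "PROT", "RH", "RT", "INT"]))
# ['GAG', 'AP', 'RNaseH', 'RT', 'INT']
```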

Here, an R script (depending on ggtree) is provided to quickly visualize the LTR tree. An example in example_data/:

../scripts/LTR_tree.R rice6.9.5.liban.rexdb.cls.pep_RT_RH_INT.aln.treefile rice6.9.5.liban.rexdb.cls.tsv rice6.9.5.liban.rexdb.cls.pep_RT_RH_INT.aln.treefile.png

Extracting TE sequences from genome for TEsorter

Here are examples of extracting TE sequences from the outputs of widely used software when you have only genome sequences.

extract all TE sequences from RepeatMasker output:

# run RepeatMasker, which will generate a *.out file.
RepeatMasker [options] genome.fa

# extract sequences
RepeatMasker.py out2seqs genome.fa.out genome.fa > whole_genome_te.fa

# classify
TEsorter whole_genome_te.fa [options]

extract all intact LTR-RTs sequences from LTR_retriever outputs:

# run LTR_retriever, which generates two *.pass.list files.
LTR_retriever -genome genome.fa [options]

# extract sequences
LTR_retriever.py get_full_seqs genome.fa > intact_ltr.fa

# classify
TEsorter intact_ltr.fa [options]

tesorter's Issues

Benchmark with rice, dmel, and maize

Hello @zhangrengang,

Thank you so much for developing this neat package. I improved the LTR_retriever classification scheme based on your suggestions by changing the copia classification ratio to 0.9, and labeled the result LTR_retriever_new.

Original scheme in annotate_TE.pl:

$family="Gypsy" if ($gypsy>$copia and $copia/$gypsy<0.3);
$family="Copia" if ($copia>$gypsy and $gypsy/$copia<0.3);

New scheme in annotate_TE.pl:

$family="Gypsy" if ($gypsy>$copia and $copia/$gypsy<0.3);
$family="Copia" if ($copia>$gypsy and $gypsy/$copia<0.9);
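
To make the effect of the ratio change concrete, the two Perl lines can be transcribed into Python roughly as follows. This is a sketch, not code from annotate_TE.pl: the function `classify`, the `"unknown"` fallback label and the example support values are hypothetical, and `gypsy` and `copia` stand for whatever non-negative per-superfamily support values the script accumulates.

```python
def classify(gypsy: float, copia: float,
             gypsy_ratio: float = 0.3, copia_ratio: float = 0.3) -> str:
    """Mirror the two Perl conditions; ratios are exclusive upper bounds."""
    family = "unknown"  # placeholder; the snippet does not show the default label
    if gypsy > copia and copia / gypsy < gypsy_ratio:
        family = "Gypsy"
    if copia > gypsy and gypsy / copia < copia_ratio:
        family = "Copia"
    return family


# Made-up support values: 5 for Gypsy, 10 for Copia (ratio 0.5).
print(classify(5, 10))                   # unknown (0.5 is not below 0.3)
print(classify(5, 10, copia_ratio=0.9))  # Copia   (0.5 is below 0.9)
print(classify(10, 2))                   # Gypsy   (0.2 is below 0.3)
```

With the original 0.3 cutoff, an element with moderate Gypsy support stays unlabelled even when Copia support dominates; raising the Copia cutoff to 0.9 lets such elements be called Copia.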

I then benchmarked the classification performance of LTR_classifier, LTR_retriever, and LTR_retriever_new. I used the rice curated library, the dmel repbase database, and the maize TE consortium (MTEC) library for this test.

Species	Method	Database	Total LTR	Copia	Gypsy	others	Unknown	Reclassified unknown	Misclassified as other superfamily	Misclassified as other class
Rice	Curated	Curated	409	159	224	19	7	0	0	0
Rice	LTR_classifier	gydb	308	134	172	0	0	-	2	1
Rice	LTR_classifier	rexdb	330	142	185	0	0	-	3	0
Rice	LTR_retriever	TEfam.hmm	353	69	203	0	0	-	77	0
Rice	LTR_retriever_new	TEfam.hmm	353	138	203	0	4	-	8	0
Dmel	Curated	Curated	142	10	100	17	15	0	0	0
Dmel	LTR_classifier	rexdb	67	5	43	0	0	3	10	6
Dmel	LTR_retriever_new	TEfam.hmm	65	4	41	0	1	4	15	0
Zmays	Curated	Curated	600	185	244	0	171	0	0	0
Zmays	LTR_classifier	rexdb	473	170	224	0	0	73	5	1
Zmays	LTR_retriever_new	TEfam.hmm	460	168	224	0	5	55	7	1

In rice, LTR_retriever_new has significantly improved classification sensitivity for copia elements (thank you for your keen insight!) and has slightly higher overall sensitivity than LTR_classifier with the rexdb database. For maize and drosophila, the two methods, LTR_classifier and LTR_retriever_new, have comparable performance. Besides, LTR_classifier provides accurate classifications for non-LTR and DNA TEs, which makes it a general TE classifier. I think you can write a short application note for this nice package. I would love to cite it and incorporate it in the EDTA package.

Thanks again for your work.

Best,
Shujun

Insertion time calculation

Dear developer,

Thanks for your great tools!
Would you please add a function to calculate LTR insertion time to the package, so that users can conveniently finish the task all in one place?

Best ~

Python3 version error

I tried running TEsorter using the python3 version by Juke, but I keep getting the error below. I installed the Anaconda version.

b'\nError: Failed to open binary auxfiles for /mnt/beegfs/apps/dmc/apps/anaconda_3-2020.02/lib/python3.7/site-packages/TEsorter/database/GyDB2.hmm: use hmmpress first\n\n'.

Any idea how to fix this?
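
The error message itself asks for hmmpress, the HMMER tool that writes binary index files (with the suffixes .h3f, .h3i, .h3m and .h3p) next to an HMM database. A quick way to see which of these files are missing is sketched below in Python; the helper `missing_aux_files` is hypothetical and only inspects the file system.

```python
from pathlib import Path
from typing import List

# Binary auxiliary files that hmmpress writes next to the HMM file.
AUX_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_aux_files(hmm_path: str) -> List[str]:
    """Return the auxiliary files that are absent for the given HMM database."""
    return [
        hmm_path + suffix
        for suffix in AUX_SUFFIXES
        if not Path(hmm_path + suffix).exists()
    ]
```

If the returned list is not empty, running hmmpress on the HMM file named in the error should create the missing files; on a shared installation this may require write permission in the database directory.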

How to obtain the nucleic acid sequence of RT-Domain

As mentioned, the resulting sequence contains only the protein-domain sequence file.

Allocation into lineages for metazoan LTR-RTs

Hi Ren-Gang,

This issue may partially overlap with previous questions, but I think it will help if it shows up separately here.

Is there any progress on allocating animal LTR-RTs into lineages (SIRE, Ale, Tekay etc.) as you successfully do in plants, or is this not yet possible?

Related to this, what is the purpose of selecting -db rexdb-metazoa instead of rexdb-plants? I suppose that it is helping towards a better allocation into Copia, Ty3, or unknown LTR-RTs, correct?

Could you also please clarify (and maybe add a note on the main page) what rexdb-tir and rexdb-pnas are? Apologies if this information is somewhere and I've missed it.

Also a request: could you add an output file in TEsorter that the user can easily select the fasta files of the full-length elements (i.e. the original input file) that are SIRE, or ATHILA etc.? That will be very handy if someone is interested in further analyzing a specific lineage.

Thanks,
Alex
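
Until such an output exists, the full-length sequences of one lineage can be pulled out with a short script that combines the *.cls.tsv file with the original input FASTA. The Python sketch below is not part of TEsorter: the function names are hypothetical, the toy IDs and sequences are made up, and it assumes that column 1 of *.cls.tsv matches the first whitespace-delimited word of each FASTA header.

```python
import io
from typing import Iterator, Set, Tuple


def ids_for_clade(cls_handle, clade: str) -> Set[str]:
    """Collect the IDs (column 1) whose Clade (column 4) equals `clade`."""
    ids = set()
    for line in cls_handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[3] == clade:
            ids.add(fields[0])
    return ids


def read_fasta(handle) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select(fasta_handle, wanted: Set[str]) -> Iterator[Tuple[str, str]]:
    """Yield only the records whose ID (first word of the header) is wanted."""
    for header, seq in read_fasta(fasta_handle):
        words = header.split()
        if words and words[0] in wanted:
            yield header, seq


cls_text = (
    "#TE\tOrder\tSuperfamily\tClade\tComplete\tStrand\tDomains\n"
    "te1\tLTR\tCopia\tSIRE\tyes\t+\tGAG|SIRE RT|SIRE\n"
    "te2\tLTR\tGypsy\tAthila\tno\t-\tRT|Athila\n"
)
fasta_text = ">te1 example\nACGT\nACGT\n>te2\nTTTT\n"

wanted = ids_for_clade(io.StringIO(cls_text), "SIRE")
for header, seq in select(io.StringIO(fasta_text), wanted):
    print(f">{header}\n{seq}")
```

With real files, the two `io.StringIO` objects would be replaced by open file handles for the *.cls.tsv output and the input FASTA.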

What kind of transposon does the mixture belong to

Hi Rengang,
I encountered several problems when using TEsorter.
I found a lot of "mixture" entries in my $.rexdb-plant.cls.tsv result file. What kind of TEs do they belong to?
Based on the EDTA result file $.mod.EDTA.TEanno.gff3, I extracted the FASTA sequences of the corresponding repeats and used them as the input file for TEsorter. However, I found that many TEsorter annotations contradict EDTA. How can I judge which ones are accurate?
When using TEsorter annotations, will overlapping sequences in the input file affect the annotation results?

#TE Order Superfamily Clade Complete Strand Domains
LTR_retrotransposon::Chr01:12217604-12218469 mixture mixture unknown unknown - RT|pararetrovirus RH|non-chromo-outgroup
LTR_retrotransposon::Chr01:13239443-13240460 mixture mixture unknown unknown - RT|pararetrovirus RH|non-chromo-outgroup
LTR_retrotransposon::Chr01:14185467-14186333 mixture mixture unknown unknown - RT|pararetrovirus RH|non-chromo-outgroup
LTR_retrotransposon::Chr01:14434069-14435285 LTR Gypsy mixture no + RT|Retand RH|Ogre
LTR_retrotransposon::Chr01:14960982-14962159 mixture mixture unknown unknown + RT|pararetrovirus RH|non-chromo-outgroup
LTR_retrotransposon::Chr01:21438967-21440147 mixture mixture unknown unknown + RT|pararetrovirus RH|non-chromo-outgroup
LTR_retrotransposon::Chr01:21771486-21772792 mixture mixture unknown unknown - RT|pararetrovirus RH|chromo-outgroup
LTR_retrotransposon::Chr01:2440997-2441947 mixture mixture unknown unknown + RT|pararetrovirus RH|non-chromo-outgroup
LTR_retrotransposon::Chr01:25935874-25936670 mixture mixture unknown unknown + RT|pararetrovirus RH|non-chromo-outgroup

Best wishes!
putao

Conda version available

The conda version is available for both Linux64 and OSX64, thanks to @Juke34's great work!
https://anaconda.org/bioconda/tesorter

Can TEsorter classify Class II elements(DNA transposons) into clade-level?

Hi @zhangrengang
As the title says, my TEsorter version is 1.4.6, and I use TEsorter to classify my EDTA output (the species I am interested in is an apple species). My command:
TEsorter Gala_genome.fa.mod.EDTA.intact.fa -db rexdb-plant -p 5
and I got this file, Gala_genome.fa.mod.EDTA.intact.fa.rexdb-plant.cls.tsv. I saw Copia and other superfamily-level LTR TEs divided into many clades, such as Ogre, Ale, etc. But no superfamily in the TIR order could be classified at the clade level. If I want to classify Class II elements at the clade level, could you please give me some advice on how to achieve it?

Best wishes~

GYDB database classification

Hi,

Thanks for this awesome tool. I am trying to annotate the genome of an invertebrate. I used LTR_retriever to identify intact elements and I am currently trying to use TEsorter to classify them into families.
I used the gydb database in the software to do this since I am working with an invertebrate. However, the output classified most of the elements as a_clade and b_clade. I have searched the gydb database to figure out what this means, but there wasn't any matching result.

Do you have any idea what a_clade and b_clade mean?

The subfamilies found by TEsorter

Hi @zhangrengang
Thank you so much for developing this great tool. I tried this tool and it worked well. But I am a little confused about the results. Some published papers, such as this one about rice, report about 300 subfamilies, whereas TEsorter reports only ~40 subfamilies. So what are the differences between them? Are the different results caused by the different tools or databases? Does this kind of classification by TEsorter represent the diversity of all the superfamilies?

Thanks again for your great work!

Best,
YTLogos

How to analyze the effect of transposons on plant traits in de novo transcriptome assembly.

Hi @zhangrengang

The species I am studying (a kind of Rhododendron) does not have a reference genome. Now I want to know how to analyze the effect of transposons (mainly LTR retrotransposons) on traits in this case. Do you have any suggestions or references?

Most of the literature I read is based on having a reference genome, but I don't currently have one for this species. We did its non-reference transcriptome and performed de novo assembly using Trinity by ourselves.

I had a discussion with Oushujun about how to discover transposons in a non-reference transcriptome, and he recommended TEsorter to me, which is indeed a good method. However, I also found later that it would be more difficult to conduct some studies without a reference genome, such as understanding the insertion position of the LTR retrotransposon.

But I also found that through the analysis of the non-reference transcriptome, it is indeed possible to find the approximate location of the insertion of the LTR retrotransposon (we focus on this) or some key characteristics of itself. For example, by annotating genes and transposons, if transposons are inserted inside genes, some transposons and genes may be transcribed into RNA fragments, which can be detected in the table (as shown below) . I haven't accomplished this working yet, but I think it's basically doable.

Nonetheless, I would also like to ask further, if there is any recommended or better, more comprehensive analysis method. Our purpose is to search for the possible changes in the traits of Rhododendron caused by LTR retrotransposons (such as their changes in expression activity or changes in insertion sites, insertion of certain key genes, etc.) during altering cultivation conditions (such as plant hormone, stress, etc.). I think that if we can find this result, it will be of great reference value for us, that is, the change of traits caused by somatic clonal variation of Rhododendron during the cultivation process.

At the same time, it is also worth considering that although the Rhododendron species I study does not have a reference genome, other Rhododendron species do, and they belong to the same subgenus, which means that they are very closely related. I don't know whether it is possible to use this, for example by mapping the transcriptome of our Rhododendron species to the reference genome of a sequenced Rhododendron species to determine the insertion sites of LTR retrotransposons. I don't think so, but I can't be 100% sure. Could you give some advice?

I hope the above questions can be answered. Thank you!

python3

Hi,
Will it be possible to re-implement the tool in python3?
Did you try 2to3 to see if it can be automatically converted to python3?

Best regards

AssertionError

Dear Ren-gang,

I was using TEsorter to identify TE-related genes and encountered this error:

2019-10-08 19:46:02,971 -INFO- total 31808 sequences classified.
2019-10-08 19:46:02,971 -INFO- see classified sequences in mRNA.fa.rexdb.cls.tsv
2019-10-08 19:46:02,971 -INFO- writing library for RepeatMasker in mRNA.fa.rexdb.cls.lib
2019-10-08 19:46:07,722 -INFO- writing classified protein domains in mRNA.fa.rexdb.cls.pep
Traceback (most recent call last):
  File "/home/oushujun/las/git_bin/TEsorter/TEsorter.py", line 906, in <module>
    pipeline(Args())
  File "/home/oushujun/las/git_bin/TEsorter/TEsorter.py", line 203, in pipeline
    assert raw_id in d_class
AssertionError

I was only using the -p 16 parameter:
nohup python2 ~/las/git_bin/TEsorter/TEsorter.py mRNA.fa -p 16 &

However, I am still getting the expected results. This error was not persistent, which only occurred on some runs. Please take a look. Thanks!

Best,
Shujun

Header of the .cls.tsv file

In the *.cls.tsv file, the header is listed in the following format. However, in the place of "Order", the true classification should be in the "Subclass" level. For the order level, it probably should be "Transposable elements", "telomere", "knobs", "tandem repeats", and something like that.

Class 1 TE: retrotransposons
--- subclass: LTR, LINE, SINE, ...
Class 2 TE: DNA transposons
--- subclass: TIR, Helitron, ...

#TE Order Superfamily Clade Complete Strand Domains
Chr10_11341966_11353509#DNA/DTC TIR EnSpm_CACTA unknown unknown + TPase|EnSpm_CACTA
Chr10_1407216_1416994#DNA/DTC TIR EnSpm_CACTA unknown unknown - TPase|EnSpm_CACTA
Chr10_15280546_15283837#DNA/DTM LTR Copia Ivana no + GAG|Ivana PROT|Ivana
Chr10_15702600_15707627#DNA/DTM TIR MuDR_Mutator unknown unknown - TPase|MuDR_Mutator
Chr10_18286631_18291104#DNA/DTA LTR Copia Ale no + PROT|Ale
Chr10_19224444_19228830#DNA/DTM TIR MuDR_Mutator unknown unknown + TPase|MuDR_Mutator
Chr11_23324292_23325763#DNA/DTH mixture mixture unknown unknown ? RH|Ale TPase|hAT
Chr11_23650026_23652156#DNA/DTM LTR Gypsy Tekay no + RT|Tekay
Chr11_24975696_24980697#DNA/DTM TIR MuDR_Mutator unknown unknown + TPase|MuDR_Mutator
Chr2_19381852_19383154#DNA/DTC Helitron unknown unknown unknown + HEL2|Helitron
Chr2_21518422_21522564#DNA/DTM LTR Gypsy Reina no + GAG|Reina PROT|Reina

Best,
Shujun

Typo

Hi @zhangrengang ,

Thanks for developing this great package! I am testing it now and found a typo in this test line:
python ../LTR_classifier.py Classifier rice6.9.5.rexdb.liban.gff3 > rice6.9.5.liban.rexdb.gff3.anno

where rice6.9.5.rexdb.liban.gff3 should be rice6.9.5.liban.rexdb.gff3

Best,
Shujun

keyError

Hi~
The following error was raised when I ran TEsorter on my fasta file.

Traceback (most recent call last):
  File "/data/home/xutun/miniconda3/envs/tt/bin/TEsorter", line 10, in <module>
    sys.exit(main())
  File "/data/home/xutun/miniconda3/envs/tt/lib/python3.6/site-packages/TEsorter/app.py", line 1014, in main
    pipeline(Args())
  File "/data/home/xutun/miniconda3/envs/tt/lib/python3.6/site-packages/TEsorter/app.py", line 167, in pipeline
    maxeval = args.max_evalue,
  File "/data/home/xutun/miniconda3/envs/tt/lib/python3.6/site-packages/TEsorter/app.py", line 919, in LTRlibAnn
    prefix=prefix, seqtype=seqtype, mincov=mincov, maxeval=maxeval)
  File "/data/home/xutun/miniconda3/envs/tt/lib/python3.6/site-packages/TEsorter/app.py", line 801, in hmm2best
    gseq = d_seqs[rc.qname].seq[rc.envstart-1:rc.envend]
KeyError: 's004_4:248929-251341(-)|aa1'

I noticed that special symbols in sequence names might lead to strange problems. But I have successfully applied TEsorter to other fasta files that contain similar types of names, and this fasta file was the only exception.
I would appreciate an early reply.

Sincerely,
Tun Xu
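
One way to rule out ID-related problems is to rename the sequences before running TEsorter and keep a table for mapping the results back. The Python sketch below shows one possible scheme and is not a confirmed fix for this KeyError; the function `sanitize_ids` is hypothetical, and the set of characters treated as safe is an assumption.

```python
import re
from typing import Dict, Iterable


def sanitize_ids(ids: Iterable[str]) -> Dict[str, str]:
    """Map each original ID to a safe, unique replacement."""
    mapping: Dict[str, str] = {}
    used = set()
    for original in ids:
        if original in mapping:  # repeated ID: keep the first replacement
            continue
        # Assumption: letters, digits, '_', '.' and '-' are safe; the rest become '_'.
        safe = re.sub(r"[^A-Za-z0-9_.-]", "_", original)
        candidate, n = safe, 1
        while candidate in used:  # avoid collisions after replacement
            n += 1
            candidate = f"{safe}_{n}"
        used.add(candidate)
        mapping[original] = candidate
    return mapping


mapping = sanitize_ids(["s004_4:248929-251341(-)", "a:b", "a(b"])
for original, safe in mapping.items():
    print(original, "->", safe)
# s004_4:248929-251341(-) -> s004_4_248929-251341_-_
# a:b -> a_b
# a(b -> a_b_2
```

The collision check matters because two distinct IDs can become identical once their special characters are replaced.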

TEsorter find TE-related gene in BUSCO datasets

Hi, Rengang,

I used TEsorter to look for potential TEs among BUSCO genes (to validate the EDTA masking result). I tested two BUSCO gene sets, one from animals and one from plants, and found that about 1% of the BUSCO genes are TE-related. Is this a BUSCO issue or a TEsorter issue?

Here is the command I use

# fa are BUSCO/odb9/tetrapoda_odb9/ancestral and /data/database/BUSCO/odb10/eudicotyledons_odb10/ancestral

python /data/software/TEsorter/TEsorter.py -db rexdb -st prot -p 12 eudicotyledons.odb10.fa
python /data/software/TEsorter/TEsorter.py -db rexdb -st prot -p 12 tetrapoda.obd9.fa

Here is the result from TEsorter

# eudicotyledons 2121 genes
#TE    Order     Superfamily  Clade            Complete  Strand  Domains
12416  LTR       Copia        unknown          no        +       GAG|Ty1-outgroup
13331  LTR       Copia        Alesia           no        +       INT|Alesia
15674  LTR       Copia        Ikeros           no        +       GAG|Ikeros
1703   LTR       Gypsy        chromo-unclass   no        +       RH|chromo-unclass
17251  LINE      unknown      unknown          unknown   +       ENDO|LINE
18920  LINE      unknown      unknown          unknown   +       ENDO|LINE
19331  LTR       Gypsy        Chlamyvir        no        +       PROT|Chlamyvir
2319   LTR       Copia        Tork             no        +       GAG|Tork
23637  LTR       Gypsy        chromo-outgroup  no        +       CHD|chromo-outgroup
24863  LTR       Bel-Pao      unknown          no        +       GAG|Bel-Pao
298    LTR       Gypsy        Athila           no        +       RT|Athila
3490   Maverick  unknown      unknown          unknown   +       ATPase|Maverick
370    LTR       Gypsy        TatIII           no        +       INT|TatIII
39406  LTR       Gypsy        unknown          no        +       GAG|Ty3_gypsy
4185   Maverick  unknown      unknown          unknown   +       ATPase|Maverick
4235   Maverick  unknown      unknown          unknown   +       ATPase|Maverick
42761  LTR       Copia        Gymco-I          no        +       GAG|Gymco-I
5202   LINE      unknown      unknown          unknown   +       ENDO|LINE
5492   Maverick  unknown      unknown          unknown   +       ATPase|Maverick
5537   Maverick  unknown      unknown          unknown   +       ATPase|Maverick
6911   LINE      unknown      unknown          unknown   +       ENDO|LINE
7178   Maverick  unknown      unknown          unknown   +       ATPase|Maverick
7311   Maverick  unknown      unknown          unknown   +       ATPase|Maverick
75     Maverick  unknown      unknown          unknown   +       ATPase|Maverick


# tetrapoda.obd9 3950 genes
#TE          Order           Superfamily    Clade            Complete  Strand  Domains
EOG09070046  LTR             Gypsy          Galadriel        no        +       CHD|Galadriel
EOG0907005G  Maverick        unknown        unknown          unknown   +       ATPase|Maverick
EOG090700NV  Maverick        unknown        unknown          unknown   +       ATPase|Maverick
EOG090700P2  Maverick        unknown        unknown          unknown   +       ATPase|Maverick
EOG090700WL  LTR             Bel-Pao        unknown          no        +       GAG|Bel-Pao
EOG0907011V  TIR             hAT            unknown          unknown   +       TPase|hAT
EOG090701KJ  Maverick        unknown        unknown          unknown   +       ATPase|Maverick HEL2|Helitron
EOG0907023Z  LTR             Copia          Ikeros           no        +       GAG|Ikeros
EOG090702M8  LTR             Gypsy          Tcn1             no        +       GAG|Tcn1
EOG090702OV  LTR             Gypsy          chromo-outgroup  no        +       CHD|chromo-outgroup
EOG090702OY  Maverick        unknown        unknown          unknown   +       ATPase|Maverick
EOG090702YF  Maverick        unknown        unknown          unknown   +       ATPase|Maverick
EOG09070311  LINE            unknown        unknown          unknown   +       ENDO|LINE
EOG090703FS  Helitron        unknown        unknown          unknown   +       HEL2|Helitron
EOG090703IH  LINE            unknown        unknown          unknown   +       ENDO|LINE
EOG090703K5  LTR             Copia          unknown          no        +       GAG|Ty1_copia
EOG090703L1  LTR             Gypsy          CRM              no        +       CHD|CRM
EOG090703QV  LINE            unknown        unknown          unknown   +       ENDO|LINE
EOG090703U9  LTR             Gypsy          chromo-unclass   no        +       RH|chromo-unclass
EOG090703X6  LTR             Gypsy          unknown          no        +       RH|Ty3_gypsy
EOG090703ZG  LTR             Gypsy          unknown          no        +       INT|Ty3_gypsy
EOG0907060L  Maverick        unknown        unknown          unknown   +       ATPase|Maverick
EOG0907061O  LINE            unknown        unknown          unknown   +       ENDO|LINE
EOG090706TH  LTR             Copia          TAR              no        +       RH|TAR
EOG090707EW  TIR             PIF_Harbinger  unknown          unknown   +       TPase|PIF_Harbinger
EOG090707UK  LTR             Copia          Gymco-III        no        +       GAG|Gymco-III
EOG0907089R  LTR             Copia          Gymco-I          no        +       PROT|Gymco-I
EOG0907097Q  mixture         mixture        unknown          unknown   +       ATPase|Maverick RT|non-chromo-outgroup
EOG090709EF  LTR             Retrovirus     unknown          unknown   +       RH|Retrovirus
EOG090709MD  LINE            unknown        unknown          unknown   +       ENDO|LINE
EOG09070B54  LTR             Bel-Pao        unknown          no        +       PROT|Bel-Pao
EOG09070B89  Maverick        unknown        unknown          unknown   +       PROT|Maverick
EOG09070B8U  Maverick        unknown        unknown          unknown   +       ATPase|Maverick
EOG09070D5R  LTR             Bel-Pao        unknown          no        +       GAG|Bel-Pao
EOG09070EMS  LTR             Gypsy          chromo-outgroup  no        +       CHD|chromo-outgroup
EOG09070FOZ  pararetrovirus  unknown        unknown          unknown   +       RT|pararetrovirus

How to identify the homology (synteny) of LTRs?

Dear @zhangrengang

TEsorter is a very good tool that we used in our recent research.

In our recent study we would like to identify collinearity (synteny) of LTRs across species (although it is not clear to me whether doing so would necessarily be meaningful), but I am currently running into some trouble. The reviewers pointed out some issues in our manuscript; for example, our approach to identifying collinearity (synteny) of LTRs was considered inappropriate. Our rough approach was based on section 2.6 of https://onlinelibrary.wiley.com/doi/10.1111/jse.12850 (although we now think that the original method of that article is also problematic).

The reviewer's opinion is roughly: "The way to identify LTR synteny is problematic. Since LTR retrotransposons can create many identical copies in non-syntenic regions, using BLASTAll to identify the most identical LTR sequence would not help to identify the syntenic locus of the query LTR sequence.....The authors need to verify the flanking sequence of the syntenic LTRs and make sure they are also identical between species compared and may also need to use genes to anchor the calling for syntenic LTRs.”

Our purpose is mainly to prove that LTRs originate from the transmission between species, or the duplication within the genome.

The main questions are:
1. How should we obtain the flanking sequences of LTRs, and how long should those flanking sequences be?
2. We do not have a well-established method for obtaining the flanking sequences, because the number of LTRs is very large and many flanking sequences may overlap substantially. I am unsure whether this step is necessary at all.
3. I tried to find relevant literature, but apart from the paper mentioned above I have rarely seen LTR collinearity identified. I therefore doubt the feasibility and significance of the analysis. Is there any relevant literature you would recommend?
4. Our purpose is stated above, but I am not sure that revealing the collinearity of LTRs would support our conjecture.

Therefore, our doubts and confusion mainly stem from this, and we hope to have some suggestions. Thank you so much!
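
On the first question, one simple way to get flanking coordinates is to take a fixed-width window on each side of every element and clip it to the chromosome bounds. The sketch below is only an illustration: it assumes 1-based inclusive coordinates, and `flanking_windows` and its 2,000 bp default are hypothetical choices, not something TEsorter or the cited paper prescribes.

```python
def flanking_windows(start, end, chrom_len, flank=2000):
    """Return (left, right) flanking intervals for a 1-based inclusive element.

    Each interval is a (start, end) tuple, or None when the element touches
    the chromosome end and no flank is available on that side.
    `flank` is a hypothetical window size; tune it to your data.
    """
    left = (max(1, start - flank), start - 1) if start > 1 else None
    right = (end + 1, min(chrom_len, end + flank)) if end < chrom_len else None
    return left, right


# An element at 10391873..10397114 on a 20 Mb chromosome
print(flanking_windows(10391873, 10397114, 20_000_000))
```

The resulting intervals can be written to BED (remember that BED starts are 0-based) and passed to any sequence extraction tool. Overlapping flanks between neighbouring elements are left as they are here.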

Can REXdb_v3_TIR.hmm be downloaded?

Dear developers,

Thanks for a great tool! I tried running TEsorter -st nucl -p 40 -db rexdb-tir but the matching REXdb_v3_TIR.hmm file is not available in the database directory. Can it be downloaded from somewhere?

Cheers!

/Andreas

Can TEsorter be used on animal genomes?

Hi~ Can TEsorter be used on animal genomes?

Question

I'm working on TEs from an animal genome. I used a library containing more than 7K consensus sequences and tried to validate protein domains with TEsorter, using a different database each time (such as gydb and rexdb-metazoa).

My question: when I use my 7K sequences as input, only 200 of them are included in the TSV file, and some of those are classified as unknown.
What about the rest? Are they considered unknown as well, and why were they not counted?
Should the rest be discarded?
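
To see exactly which input sequences are absent from the table, the two ID lists can be compared directly. A minimal sketch, assuming the sequence ID is the first whitespace-delimited token of each FASTA header and the first tab-delimited column of the TSV, and that TSV lines starting with `#` are headers; `unreported_ids` and the sample data are made up for illustration.

```python
def unreported_ids(fasta_lines, tsv_lines):
    """Return input sequence IDs that never appear in the classification table."""
    fasta_ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                fasta_ids.append(tokens[0])
    # First column of every non-header, non-empty table line
    classified = {line.split("\t")[0] for line in tsv_lines
                  if line.strip() and not line.startswith("#")}
    return [seq_id for seq_id in fasta_ids if seq_id not in classified]


fasta = [">TE_1", "ACGT", ">TE_2", "GGCC", ">TE_3", "TTAA"]
table = ["#TE\tOrder\tSuperfamily", "TE_2\tLTR\tCopia"]
print(unreported_ids(fasta, table))
```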

How to extract the CDS sequences of all LTRs?

Hello, @zhangrengang

I am trying to work out how to extract the CDS sequences of all LTRs, but I don't know how to do it. My reference is https://onlinelibrary.wiley.com/doi/10.1111/jse.12850, but the method described there is not very detailed. The original text reads as follows:
"2.6 Syntenic LTR retrotransposons analysis
We firstly extracted the coding sequences of all LTR retrotransposons from three chromosome-level genomes, PN40024, V. ripara and V. amurensis genomes. Then the coding sequences were translated into amino acid sequences by using TBtoolsv1.098 (Chen et al., 2020)....."

The Python version of MCscan will be used for multi-species LTR collinearity analysis, so I need CDS files (see https://github.com/tanghaibao/jcvi/wiki/MCscan-(Python-version), "grape.bed grape.cds peach.bed peach.cds"). This process also requires BED files. However, the dom.gff file generated by TEsorter does not seem to cover the full CDS; it contains only domain locations, and these domains are not merged into complete CDS regions. Take the following example:

CM030788.1 TEsorter CDS 10392696 10393004 48.1 + 1 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-GAG;gene=GAG;clade=Ale;evalue=1.3e-14;coverage=100.0;probability=0.87
CM030788.1 TEsorter CDS 10393602 10393808 79.2 + 1 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-PROT;gene=PROT;clade=Ale;evalue=2.7e-24;coverage=100.0;probability=0.99
CM030788.1 TEsorter CDS 10394043 10394627 274.0 + 1 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-INT;gene=INT;clade=Ale;evalue=7e-84;coverage=100.0;probability=0.98
CM030788.1 TEsorter CDS 10395538 10396209 346.8 + 2 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-RT;gene=RT;clade=Ale;evalue=6.5e-106;coverage=84.0;probability=0.98
CM030788.1 TEsorter CDS 10396473 10396850 191.8 + 1 ID=CM030788.1:10391873..10397114|Copia|Class_I/LTR/Ty1_copia/Ale:Ty1-RH;gene=RH;clade=Ale;evalue=4.4e-59;coverage=100.0;probability=0.99

From these records, each entry gives only the location of a separate domain. Pieced together, they do not form a complete CDS. Neither the dom.faa nor the cls.lib file seems to contain the complete protein sequence translated from the CDS, even after joining the domains. So I don't know how to proceed, and I would appreciate your advice.
Sorry, I'm new to bioinformatics. I was hoping you could give me some pointers.
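
If an approximate coding span per element is enough for the BED file, the domain records can be collapsed into one interval running from the first to the last domain. This is a rough sketch, not a true CDS reconstruction: the span may include frameshifts, stop codons, or non-coding gaps between domains. It assumes nine whitespace-separated GFF3 columns and that the element ID is the `ID` attribute up to its last `|`, which is inferred only from the records quoted above; `domain_spans` is a hypothetical helper.

```python
def domain_spans(gff_lines):
    """Collapse per-domain GFF3 records into one span per element.

    Returns {element_id: (seqid, start, end, strand)}, where start and end
    cover the first to the last annotated domain of that element.
    """
    spans = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.strip().split(None, 8)  # nine GFF3 columns
        seqid, strand = cols[0], cols[6]
        start, end = int(cols[3]), int(cols[4])
        attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair)
        element = attrs["ID"].rsplit("|", 1)[0]  # assumption: drop the domain name
        if element in spans:
            _, old_start, old_end, _ = spans[element]
            start, end = min(start, old_start), max(end, old_end)
        spans[element] = (seqid, start, end, strand)
    return spans
```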

Can't find executable after old school installation

I want to use the latest version so I run:

git clone https://github.com/zhangrengang/TEsorter
cd TEsorter
python setup.py install

Installation finished without a problem, but the downloaded TEsorter folder did not contain an executable. To look for where it is, I run:

which TEsorter
$ ~/las/bin/miniconda2/envs/EDTA/bin/TEsorter

It seems the program was installed in the conda env. But if the env already has TEsorter installed, the new version won't overwrite the old one. So I had to uninstall the existing one and then redo the old school installation:

conda remove TEsorter -y
python setup.py install

This is not an error but it was a bit confusing to me. I put what I've done here to let others know how to reinstall.

How to obtain the set of distances between LTRs and their adjacent genes?

Dear @zhangrengang

Hello! I always get a satisfactory answer to my questions from you.

This time I would like to know how to calculate the distance between each LTR and its neighboring genes. I can think of some ways to do it, but they might be difficult for me to implement myself, so I would like to hear from you, the expert. I am aware of some methods for finding overlaps between LTRs and genes (e.g. using gffcompare), but I am not sure how to calculate the distance between an LTR and its nearest neighboring gene. The description of the process in the available literature is also quite limited (for example, https://www.nature.com/articles/s41467-020-18771-4 and https://academic.oup.com/hr/article/10/1/uhac241/6775201). I would therefore be interested in learning some methods from you. Thanks!

Thank you very much for your valuable time and expertise, and I look forward to your reply and guidance.
Best regards!
wen
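
For reference, the distance calculation itself is small once both annotations are reduced to intervals. The sketch below assumes 1-based inclusive coordinates on a single chromosome and ignores strand; `nearest_gene_distance` is a hypothetical helper, and it uses a plain linear scan, so it is meant to show the logic rather than to scale to a whole genome.

```python
def nearest_gene_distance(te_start, te_end, genes):
    """Distance in bp from one TE to the closest gene on the same chromosome.

    `genes` is a list of (start, end) tuples. Returns 0 when the TE overlaps
    a gene and None when the gene list is empty.
    """
    best = None
    for g_start, g_end in genes:
        if g_end < te_start:        # gene lies entirely upstream of the TE
            dist = te_start - g_end
        elif g_start > te_end:      # gene lies entirely downstream of the TE
            dist = g_start - te_end
        else:                       # intervals overlap
            dist = 0
        if best is None or dist < best:
            best = dist
    return best


genes = [(100, 200), (1000, 1200)]
print(nearest_gene_distance(300, 400, genes))
```

For genome-scale data, an interval tool such as `bedtools closest -d` performs the same kind of lookup on sorted BED files.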

get_full_seqs in LTR_retriever.py generates empty output for some sequences that should be generated

Hello again!
I was updating my genome ERV annotation after a genome version update, but this time I found that get_full_seqs did not generate all the sequences that LTR_retriever listed in the pass.list. I'm pretty sure that get_full_seqs detected all the sequences, as shown below:

error

Dear author, I encountered an error when running the software. Could you please help me find the cause? I have attached the command line and log below. Thank you very much.

Traceback (most recent call last):
File "/home/appl/anaconda3/envs/TEsorter/bin/TEsorter", line 10, in <module>
sys.exit(main())
File "/home/appl/anaconda3/envs/TEsorter/lib/python3.10/site-packages/TEsorter/app.py", line 1014, in main
pipeline(Args())
File "/home/appl/anaconda3/envs/TEsorter/lib/python3.10/site-packages/TEsorter/app.py", line 174, in pipeline
for rc in Classifier(gff, db=args.hmm_database, fout=fc):
File "/home/appl/anaconda3/envs/TEsorter/lib/python3.10/site-packages/TEsorter/app.py", line 414, in classify
order, superfamily, max_clade, coding = self.identify_rexdb(genes, names)
File "/home/appl/anaconda3/envs/TEsorter/lib/python3.10/site-packages/TEsorter/app.py", line 430, in identify_rexdb
order, superfamily = self._parse_rexdb(max_clade)
File "/home/appl/anaconda3/envs/TEsorter/lib/python3.10/site-packages/TEsorter/app.py", line 470, in _parse_rexdb
logger.warning( 'unknown clade: {}'.format(max_clade) )
NameError: name 'max_clade' is not defined

Add -f for hmmpress

I updated from the last version and used sh build_database.sh to rebuild the HMM profiles. I got the following output, yet TEsorter still asked me to rebuild the HMM profiles. So I think it would be good to use the -f option for hmmpress to overwrite any prebuilt libraries.

# hmmpress :: prepare an HMM database for faster hmmscan searches
# HMMER 3.2.1 (June 2018); http://hmmer.org/
# Copyright (C) 2018 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmpress [-options]

Options:
-h : show brief help on version and usage
-f : force: overwrite any previous pressed files
SSI index file database/GyDB2.hmm.h3i already exists;
Delete old hmmpress indices first
SSI index file database/REXdb_protein_database_metazoa_v3.hmm.h3i already exists;
Delete old hmmpress indices first
SSI index file database/REXdb_protein_database_viridiplantae_v3.0.hmm.h3i already exists;
Delete old hmmpress indices first
SSI index file database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm.h3i already exists;
Delete old hmmpress i

Best,
Shujun
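
The suggested change could look roughly like the following if the profiles were pressed from Python. This is a sketch only: `hmmpress_command` and `press_all` are hypothetical helpers, not code from build_database.sh, and the only hmmpress option used is the `-f` flag shown in the help text above.

```python
import subprocess


def hmmpress_command(hmm_path, force=True):
    """Build the hmmpress command line; -f overwrites previously pressed files."""
    cmd = ["hmmpress"]
    if force:
        cmd.append("-f")
    cmd.append(hmm_path)
    return cmd


def press_all(hmm_paths, force=True):
    """Run hmmpress on each profile file, raising on a non-zero exit code."""
    for path in hmm_paths:
        subprocess.run(hmmpress_command(path, force), check=True)


print(hmmpress_command("database/GyDB2.hmm"))
```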

How to merge the TEsorter repeat libraries

Hey, thanks for the tool. How can I merge the output library of TEsorter with the repeatModeler repeat library to run RepeatMasker? Further, can I directly input the output library of TEsorter in RepeatMasker?
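
Merging two libraries amounts to concatenating the FASTA files while avoiding duplicate sequence names. A minimal sketch, assuming both libraries are plain FASTA whose headers start with a unique name (for example the `>name#class` style); `merge_libraries` is a hypothetical helper and keeps the first record when a name repeats.

```python
def merge_libraries(*libraries):
    """Concatenate FASTA libraries, keeping the first record for each ID.

    Each library is an iterable of lines. The ID is the first
    whitespace-delimited token of the header line.
    """
    seen = set()
    merged = []
    keep = False
    for lines in libraries:
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith(">"):
                tokens = line[1:].split()
                seq_id = tokens[0] if tokens else ""
                keep = seq_id not in seen  # skip records whose ID was already written
                seen.add(seq_id)
            if keep and line:
                merged.append(line)
    return merged


lib_a = [">TE1#LTR/Copia", "ACGT"]
lib_b = [">TE1#LTR/Copia", "TTTT", ">fam2#Unknown", "GGCC"]
print("\n".join(merge_libraries(lib_a, lib_b)))
```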

Setup GitHub actions for Continuous integration (CI)

Could you set up Github Actions for CI, to automatically perform the tests when we push modifications?

ValueError: invalid literal for int() with base 10: '0.5'

Hi, I encountered a problem which only showed up in 2 of my 6 LTR datasets. Below is the traceback. Could you please help out with this?
-INFO- generating gene anntations
Traceback (most recent call last):
File "/Share/home/louliyi/miniconda3/envs/LTR_retriever/bin/TEsorter", line 10, in <module>
sys.exit(main())
File "/Share/home/louliyi/miniconda3/envs/LTR_retriever/lib/python3.9/site-packages/TEsorter/app.py", line 1014, in main
pipeline(Args())
File "/Share/home/louliyi/miniconda3/envs/LTR_retriever/lib/python3.9/site-packages/TEsorter/app.py", line 158, in pipeline
gff, geneSeq = LTRlibAnn(
File "/Share/home/louliyi/miniconda3/envs/LTR_retriever/lib/python3.9/site-packages/TEsorter/app.py", line 918, in LTRlibAnn
gff, geneSeq = hmm2best(aaSeq, [domtbl], db=hmmdb, nucl_len=d_nucl_len,
File "/Share/home/louliyi/miniconda3/envs/LTR_retriever/lib/python3.9/site-packages/TEsorter/app.py", line 760, in hmm2best
for rc in HmmScan(inHmmout):
File "/Share/home/louliyi/miniconda3/envs/LTR_retriever/lib/python3.9/site-packages/TEsorter/app.py", line 645, in parse
yield HmmDomRecord(line)
File "/Share/home/louliyi/miniconda3/envs/LTR_retriever/lib/python3.9/site-packages/TEsorter/app.py", line 658, in __init__
list(map(int, [self.tlen, self.qlen, self.domi, self.domn,
ValueError: invalid literal for int() with base 10: '0.5'
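
The traceback shows `int()` receiving `'0.5'`, which suggests that at least one line of the domain table has its columns shifted or truncated; the cause is not clear from the log. A sketch for locating such lines, assuming the standard whitespace-delimited hmmscan `--domtblout` layout, in which columns 3, 6, 10, 11 and 16 to 21 (1-based) are integers; `malformed_domtbl_lines` is a hypothetical helper.

```python
# 0-based indices of the integer columns: tlen, qlen, #, of,
# hmm from/to, ali from/to, env from/to
INT_COLUMNS = (2, 5, 9, 10, 15, 16, 17, 18, 19, 20)


def malformed_domtbl_lines(lines):
    """Yield (line_number, line) for records whose integer columns do not parse."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        cols = line.split()
        try:
            for index in INT_COLUMNS:
                int(cols[index])
        except (ValueError, IndexError):
            yield number, line.rstrip("\n")
```

Running it over the `.domtbl` file of a failing dataset, for example with `open(path)` as the `lines` argument, should point at the records that need a closer look.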

autonomous elements for TIR

Hi Rengang @zhangrengang ,
Thank you for the useful tool!
I'm seeking to classify DNA transposons and identify autonomous elements.
In the output, all the classified TIRs are marked unknown in the Complete column, even those with TPase domains.
Do you have any suggestions to identify the autonomous elements?
Thanks!

WARNING: Could not find drmaa library.

Hi, Rengang,
I used the rice6.9.5.liban file you provided to run TEsorter. Results are output, but the log file always shows the warning "-WARNING- Grid computing is not available because DRMAA not configured properly: Could not find drmaa library. Please specify its full path using the environment variable DRMAA_LIBRARY_PATH." Will this affect the results?
1.test.sh.log

Best wishes!

putao

license

What is the license of TEsorter?
Could you add it in the repo?

Thank you

RepeatMasker.py

Traceback (most recent call last):
File "/data/storage02/chenss/soft/TEsorter/bin/RepeatMasker.py", line 143, in <module>
main()
File "/data/storage02/chenss/soft/TEsorter/bin/RepeatMasker.py", line 130, in main
subcmd = sys.argv[1]
IndexError: list index out of range

Edge-case crash when parsing gff3

Thank you for this software. I'm getting this error:

2020-09-10 20:07:45,719 -INFO- generating gene anntations
Traceback (most recent call last):
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/bin/TEsorter", line 10, in <module>
    sys.exit(main())
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py", line 976, in main
    pipeline(Args())
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py", line 171, in pipeline
    for rc in Classifier(gff, db=args.hmm_database, fout=fc):
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py", line 391, in classify
    for rc in self.parse():
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py", line 380, in parse
    line = LTRgffLine(line)
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py", line 609, in __init__
    super(LTRgffLine, self).__init__(line)
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py", line 604, in __init__
    self.attributes = self.parse(self.attributes)
  File "/group/pawsey0149/pbayer/anaconda3/envs/tesorter/lib/python3.8/site-packages/TEsorter/app.py", line 606, in parse
    return dict(kv.split('=') for kv in attributes.split(';'))
ValueError: dictionary update sequence element #0 has length 3; 2 is required

when it's generating the '.rexdb-plant.cls.tsv' file. My command is TEsorter -db rexdb-plant -p 28 Lee.pan.renamed.numericID.fa.mod.EDTA.TElib.fa. I'm getting the same error both with the version checked out from github (commit 2189f63 ) and with the conda version.

I'm thinking there's a '=' too much somewhere in the .rexdb-plant.dom.gff3 file for a unusually named RexDB TE that only sometimes pops up, I'm investigating (or am I looking at the wrong file?).
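
The last frame of the traceback splits every attribute on all `=` characters, so a value that itself contains `=` yields three pieces instead of two. A minimal sketch of a more tolerant parser, splitting on the first `=` only; `parse_attributes` is a hypothetical stand-in, not the fix adopted in TEsorter.

```python
def parse_attributes(attributes):
    """Parse a GFF3 attribute column, splitting each pair on the first '=' only."""
    parsed = {}
    for pair in attributes.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing ';'
        key, _, value = pair.partition("=")
        parsed[key] = value
    return parsed


# A value containing '=' no longer breaks the parser
print(parse_attributes("ID=TE_1|Class_I/LTR=odd;gene=RT"))
```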

File exists error

Hi Ren-gang,

I want to execute multiple TEsorter runs in the same folder and encounter the file exists error. Is it possible to tolerate this, for example, by creating ./tmp directories with a random number in the name?

2021-01-03 07:54:26,915 -WARNING- Grid computing is not available because DRMAA not configured properly: Could not find drmaa library. Please specify its full path using the environment variable DRMAA_LIBRARY_PATH
2021-01-03 07:54:26,920 -INFO- VARS: {'sequence': 'B73.PLATINUM.pseudomolecules-v1.fasta.mod.EDTA.intact.gff3.LTR.fa', 'hmm_database': 'rexdb', 'seq_type': 'nucl', 'prefix': 'B73.PLATINUM.pseudomolecules-v1.fasta.mod.EDTA.intact.gff3.LTR.fa.rexdb', 'force_write_hmmscan': False, 'processors': 36, 'tmp_dir': './tmp', 'min_coverage': 20, 'max_evalue': 0.001, 'disable_pass2': False, 'pass2_rule': '80-80-80', 'no_library': False, 'no_reverse': False, 'no_cleanup': False, 'p2_identity': 80.0, 'p2_coverage': 80.0, 'p2_length': 80.0}
2021-01-03 07:54:26,921 -INFO- checking dependencies:
2021-01-03 07:54:26,931 -INFO- hmmer 3.3.1 OK
2021-01-03 07:54:29,064 -INFO- blastn 2.10.0+ OK
Traceback (most recent call last):
File "/home/oushujun/las/bin/miniconda2/envs/EDTA5/bin/TEsorter", line 10, in <module>
sys.exit(main())
File "/home/oushujun/las/bin/miniconda2/envs/EDTA5/lib/python3.6/site-packages/TEsorter/app.py", line 1014, in main
pipeline(Args())
File "/home/oushujun/las/bin/miniconda2/envs/EDTA5/lib/python3.6/site-packages/TEsorter/app.py", line 149, in pipeline
os.makedirs(args.tmp_dir)
File "/home/oushujun/las/bin/miniconda2/envs/EDTA5/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
FileExistsError: [Errno 17] File exists: './tmp'

Thanks!
Shujun
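
Two standard-library features cover this case: `os.makedirs(..., exist_ok=True)` tolerates an existing directory, and `tempfile.mkdtemp` creates a uniquely named one. The sketch below combines them; `make_run_tmpdir` is a hypothetical helper and not how TEsorter handles temporary files.

```python
import os
import tempfile


def make_run_tmpdir(parent="./tmp"):
    """Create `parent` if needed, then a uniquely named directory inside it."""
    os.makedirs(parent, exist_ok=True)  # no FileExistsError when it already exists
    return tempfile.mkdtemp(prefix="run_", dir=parent)
```

Since the usage text lists a `-tmp TMP_DIR` option, passing a different directory to each run may also avoid the collision.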

ModuleNotFoundError: No module named 'RunCmdsMP'

Hi, rengang

TEsorter is very cool! But I have a problem. I installed TEsorter on my Linux server with `conda install -c bioconda tesorter`, and the installation succeeded. The output of `TEsorter input_file -db rexdb-plant` is normal, but when I ran the command `concatenate_domains.py HFgenome.fa.mod.LTR.intact.fa.rexdb-plant.cls.pep GAG PROT RH RT INT > HFgenome.fa.mod.LTR.intact.fa.rexdb-plant.cls.pep.full.aln`, an error occurred: `Traceback (most recent call last):

File "~/miniconda3/envs/TEsorter/bin/concatenate_domains.py", line 7, in <module>
from RunCmdsMP import run_cmd
ModuleNotFoundError: No module named 'RunCmdsMP'`. Could you be kind enough to tell me why this occurs and how to run the command correctly?

best wishes

Error when thread # is more than input sequence #

Hi Rengang,

I encounter errors when specifying more threads than the number of input sequences. Can you help to take a look?

2020-10-19 04:38:27,617 -WARNING- exit code 1 for CMD 'hmmscan --notextw -E 0.01 --domE 0.01 --noali --domtblout ./tmp/chunk_aaseq.14.fasta.domtbl /opt/conda/lib/python3.7/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm ./tmp/chunk_aaseq.14.fasta'
2020-10-19 04:38:27,617 -WARNING-
STDOUT:
b''
STDERR:
b'\nError: Sequence file ./tmp/chunk_aaseq.14.fasta is empty or misformatted\n\n'

2020-10-19 04:38:27,617 -WARNING- exit code 1 for CMD 'hmmscan --notextw -E 0.01 --domE 0.01 --noali --domtblout ./tmp/chunk_aaseq.15.fasta.domtbl /opt/conda/lib/python3.7/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm ./tmp/chunk_aaseq.15.fasta'
2020-10-19 04:38:27,617 -WARNING-
STDOUT:
b''
STDERR:
b'\nError: Sequence file ./tmp/chunk_aaseq.15.fasta is empty or misformatted\n\n'

2020-10-19 04:38:27,618 -WARNING- exit code 1 for CMD 'hmmscan --notextw -E 0.01 --domE 0.01 --noali --domtblout ./tmp/chunk_aaseq.16.fasta.domtbl /opt/conda/lib/python3.7/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm ./tmp/chunk_aaseq.16.fasta'
2020-10-19 04:38:27,618 -WARNING-
STDOUT:
b''
STDERR:
b'\nError: Sequence file ./tmp/chunk_aaseq.16.fasta is empty or misformatted\n\n'

Best,
Shujun
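
The empty chunk files suggest that the input was split into one chunk per thread regardless of how many sequences there were. A sketch of a splitter that never produces more chunks than records; `split_into_chunks` is a hypothetical helper written for illustration, not the routine used in TEsorter.

```python
def split_into_chunks(records, processors):
    """Split records into at most `processors` non-empty, near-equal chunks."""
    n_chunks = max(1, min(processors, len(records)))
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        chunks[index % n_chunks].append(record)  # round-robin assignment
    return [chunk for chunk in chunks if chunk]


# Three sequences and 16 threads give three chunks, none of them empty
print(split_into_chunks(["seq1", "seq2", "seq3"], 16))
```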

Problem with concatenate_domains.py

Please could you help:

when I run this command: concatenate_domains.py

I got this:
Traceback (most recent call last):
File "/home/iuliiasolomennikova/miniconda3/envs/LTR_rert/bin/concatenate_domains.py", line 33, in <module>
sys.exit(load_entry_point('TEsorter==1.4.5', 'console_scripts', 'concatenate_domains.py')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/iuliiasolomennikova/miniconda3/envs/LTR_rert/bin/concatenate_domains.py", line 25, in importlib_load_entry_point
return next(matches).load()
^^^^^

Target sequence length > 100K

Hi Rengang,
I got the following error; it seems my sequence is too long?
How can I fix it: should I split my genome or change the script?

STDOUT:
b''
STDERR:
b'Fatal exception (source file p7_pipeline.c, line 697):\nTarget sequence length > 100K, over comparison pipeline limit.\n(Did you mean to use nhmmer/nhmmscan?)\n'

2022-02-28 12:05:01,977 -WARNING- exit code -6 for CMD 'hmmscan --notextw -E 0.01 --domE 0.01 --noali --domtblout ./tmp/chunk_aaseq.2.fasta.domtbl /data/01/user106/software/anaconda/anaconda3/envs/tesorter/lib/python3.5/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0.hmm ./tmp/chunk_aaseq.2.fasta'
2022-02-28 12:05:01,977 -WARNING-
STDOUT:
b''
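
The error message quotes a 100K limit on target sequence length, so one workaround is to cut long sequences into overlapping windows before scanning. A minimal sketch; `split_windows` and its default sizes are placeholders chosen only to stay under that limit, and any hits would need their coordinates shifted back by the returned offset.

```python
def split_windows(seq, win_size=90000, overlap=1000):
    """Yield (offset, subsequence) windows no longer than win_size.

    Consecutive windows share `overlap` positions so that a domain
    spanning a window boundary is still fully contained in one window,
    provided it is shorter than the overlap.
    """
    if overlap >= win_size:
        raise ValueError("overlap must be smaller than win_size")
    step = win_size - overlap
    for offset in range(0, len(seq), step):
        yield offset, seq[offset:offset + win_size]
        if offset + win_size >= len(seq):
            break  # the last window already reaches the end


windows = list(split_windows("A" * 250, win_size=100, overlap=20))
print([(offset, len(sub)) for offset, sub in windows])
```

The usage text also lists `-genome`, `-win_size` and `-win_ovl` options, which may cover this case directly.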

Can not find SINEs

Hi,
I get the output successfully, but I cannot find any SINEs, which other software can classify. I searched the files in TEsorter/database/REXdb_protein_database*, and SINEs do not seem to exist there.
Is there any problem?
Thank you!

Do TEsorter results only contain the positive strand?

Hi there! @zhangrengang
I checked my TEsorter results and found that the .dom.gff3 file only contains LTRs on the positive strand. My input file is the .LTRlib.fa from LTR_retriever, and I am very sure that it contains negative-strand LTRs. There are certainly some high-scoring LTRs on the negative strand that should be identified.
Did I miss some arguments that cause this problem?
Yours sincerely.

Assistance with custom installation directory

Hi, I am looking into installing TEsorter into a custom shared python directory. I tried using:
python setup.py install --prefix=${NEW_PATH}

and this created the following:
${NEW_PATH}/bin/* (containing the TEsorter and the .py scripts)
${NEW_PATH}/lib/python3.9/site-packages/TEsorter-1.4.6-py3.9.egg/*

But when I run TEsorter with the following command, the following error occurred
${NEW_PATH}/bin/TEsorter

Traceback (most recent call last):
File "/home/561/jc4878/test5/bin/TEsorter", line 33, in <module>
sys.exit(load_entry_point('TEsorter==1.4.6', 'console_scripts', 'TEsorter')())
File "/home/561/jc4878/test5/bin/TEsorter", line 22, in importlib_load_entry_point
for entry_point in distribution(dist_name).entry_points
File "/apps/python3/3.9.2/lib/python3.9/importlib/metadata.py", line 524, in distribution
return Distribution.from_name(distribution_name)
File "/apps/python3/3.9.2/lib/python3.9/importlib/metadata.py", line 187, in from_name
raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: TEsorter

It seems to only work when it is installed in the --user default location, which is not ideal.
python setup.py install --user

usage: TEsorter [-h] [-v] [-db {gydb,rexdb,rexdb-plant,rexdb-metazoa,rexdb-pnas,rexdb-line,sine}] [--db-hmm DB_HMM] [-st {nucl,prot}] [-pre PREFIX] [-fw] [-p PROCESSORS] [-tmp TMP_DIR]
[-cov MIN_COVERAGE] [-eval MAX_EVALUE] [-prob MIN_PROBABILITY] [-nocln] [-cite] [-dp2] [-rule PASS2_RULE] [-nolib] [-norc] [-genome] [-win_size WIN_SIZE]
[-win_ovl WIN_OVL]
sequence
TEsorter: error: the following arguments are required: sequence

Crash when special characters occurred in sequence IDs

Hi Ren-Gang,

I encountered errors when running TEsorter (v1.3) on sequences with special headers. Here is one reproducible example:

>repeat1-1#Unknown
GTCATCAAAGTCTAACACGGTACAGTGCCGGTCTATCAGGATTATAGGAC
CGGAGATGGACCTAGTCCCTCTTTCCTCGGTCTTCAGACCTGAACACCGC
CCTAATAGACAACGAATTGCAGGGCCTCTGGTTCACCCAAACGATAGTGG
AGACGGACCCGCTCTCCCCGGGTTTCCCCAACAACGTCTTCCTGGTCAGG
AAAGGAGGGCGGTTATCGCCCGGTAGTAACTTGAGAAATCTCTCGTAATT
ACCGTATACAGTGACGACATTCTGCGACACAAAGGGTGAGGTTGATTATC
TACTTGGACGACCTACTCTTAATGGCTCGTACCCCTCGCCTGGCGAACGA
ACACGCCTCCGGGACAGTCCCCCATAGATCACGGGGAATCTATCCTCTCC

Save the above sequence in test.fa and run:
TEsorter test.fa

2023-05-02 23:19:38,911 -WARNING- Grid computing is not available because DRMAA not configured properly: Could not find drmaa library. Please specify its full path using the environment variable DRMAA_LIBRARY_PATH
2023-05-02 23:19:38,918 -INFO- VARS: {'sequence': 'test.fa', 'hmm_database': 'rexdb', 'seq_type': 'nucl', 'prefix': 'test.fa.rexdb', 'force_write_hmmscan': False, 'processors': 4, 'tmp_dir': './tmp', 'min_coverage': 20, 'max_evalue': 0.001, 'disable_pass2': False, 'pass2_rule': '80-80-80', 'no_library': False, 'no_reverse': False, 'no_cleanup': False, 'p2_identity': 80.0, 'p2_coverage': 80.0, 'p2_length': 80.0}
2023-05-02 23:19:38,918 -INFO- checking dependencies:
2023-05-02 23:19:38,929 -INFO- hmmer 3.3.1 OK
2023-05-02 23:19:38,994 -INFO- blastn 2.10.0+ OK
2023-05-02 23:19:38,994 -INFO- check database rexdb
2023-05-02 23:19:38,994 -INFO- db path: /work/LAS/mhufford-lab/oushujun/bin/miniconda2/envs/EDTA/lib/python3.6/site-packages/TEsorter/database
2023-05-02 23:19:38,994 -INFO- db file: REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm
2023-05-02 23:19:38,995 -INFO- REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm OK
2023-05-02 23:19:38,995 -INFO- Start classifying pipeline
2023-05-02 23:19:39,010 -INFO- total 1 sequences
2023-05-02 23:19:39,010 -INFO- translating test.fa in six frames
/home/oushujun/las/bin/miniconda2/envs/EDTA/lib/python3.6/site-packages/Bio/Seq.py:2338: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning,
2023-05-02 23:19:39,013 -INFO- HMM scanning against /work/LAS/mhufford-lab/oushujun/bin/miniconda2/envs/EDTA/lib/python3.6/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm
2023-05-02 23:19:39,014 -INFO- use existed non-empty test.fa.rexdb.domtbl and skip hmmscan
2023-05-02 23:19:39,014 -INFO- generating gene anntations
Traceback (most recent call last):
File "/home/oushujun/las/bin/miniconda2/envs/EDTA/bin/TEsorter", line 10, in <module>
sys.exit(main())
File "/home/oushujun/las/bin/miniconda2/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 1014, in main
pipeline(Args())
File "/home/oushujun/las/bin/miniconda2/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 167, in pipeline
maxeval = args.max_evalue,
File "/home/oushujun/las/bin/miniconda2/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 919, in LTRlibAnn
prefix=prefix, seqtype=seqtype, mincov=mincov, maxeval=maxeval)
File "/home/oushujun/las/bin/miniconda2/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 801, in hmm2best
gseq = d_seqs[rc.qname].seq[rc.envstart-1:rc.envend]
KeyError: 'repeat1#Unknown/Unknown|aa1'

RuntimeError: Communication pipe read error

Hi Ren-gang,
I recently installed TEsorter and ran into a problem when executing the command "nohup python3.9 /home/bin/TEsorter xx_whole_genome_te.fa &". Part of the report reads:
"2021-04-14 11:29:12,545 -INFO- Running on Python 3.9.1 linux
Traceback (most recent call last):
File "/home/bin/TEsorter", line 33, in <module>
sys.exit(load_entry_point('TEsorter==1.2.5.2', 'console_scripts', 'TEsorter')())
File "/home/lib/python3.9/site-packages/TEsorter-1.2.5.2-py3.9.egg/TEsorter/app.py", line 976, in main
pipeline(Args())
File "/home/lib/python3.9/site-packages/TEsorter-1.2.5.2-py3.9.egg/TEsorter/app.py", line 155, in pipeline
gff, geneSeq = LTRlibAnn(
File "/home/lib/python3.9/site-packages/TEsorter-1.2.5.2-py3.9.egg/TEsorter/app.py", line 884, in LTRlibAnn
hmmscan_pp(aaSeq, hmmdb=DB[hmmdb], hmmout=domtbl, tmpdir=tmpdir, processors=processors)
File "/home/lib/python3.9/site-packages/TEsorter-1.2.5.2-py3.9.egg/TEsorter/app.py", line 850, in hmmscan_pp
jobs = pp_run(cmds, processors=processors)
File "/home/lib/python3.9/site-packages/TEsorter-1.2.5.2-py3.9.egg/TEsorter/modules/RunCmdsMP.py", line 260, in pp_run
job_server = pp.Server(processors, ppservers=ppservers)
File "/home/lib/python3.9/site-packages/pp-1.6.4.4-py3.9.egg/pp.py", line 372, in __init__
self.set_ncpus(ncpus)
File "/home/lib/python3.9/site-packages/pp-1.6.4.4-py3.9.egg/pp.py", line 540, in set_ncpus
self.__workers.extend([_Worker(self.__restart_on_free,
File "/home/lib/python3.9/site-packages/pp-1.6.4.4-py3.9.egg/pp.py", line 540, in <listcomp>
self.__workers.extend([_Worker(self.__restart_on_free,
File "/home/lib/python3.9/site-packages/pp-1.6.4.4-py3.9.egg/pp.py", line 161, in __init__
self.start()
File "/home/lib/python3.9/site-packages/pp-1.6.4.4-py3.9.egg/pp.py", line 175, in start
self.pid = int(self.t.receive())
File "/home/lib/python3.9/site-packages/pp-1.6.4.4-py3.9.egg/pptransport.py", line 179, in receive
raise RuntimeError("Communication pipe read error")
RuntimeError: Communication pipe read error"
I really don't know how to solve it. I would appreciate any suggestions.

error to get phylogeny tree using LTR_tree.R script

Dear zhangrengang

I was using the LTR_tree.R script to get a phylogenetic tree from the output files after running iqtree, and I encountered the following error:
Error in FUN(X[[i]], ...) : object 'Clade' not found

I am sharing both the mapfile and the treefile. Please check them and give me your suggestions.

DNA type transposons cannot be classified?

Hi,
Sorry to bother you. I want to use TEsorter to classify transposons that RepeatModeler fails to classify, but the results suggest that the software is unable to classify DNA-type transposons.
Can DNA-type transposons not be classified?

Chr1A:3671..3701|rnd-1_family-759#Unknown Chr1A:3671..3701|rnd-1_family-759#DNA/hAT
Chr1A:3702..3972|rnd-1_family-759#Unknown Chr1A:3702..3972|rnd-1_family-759#DNA/hAT
Chr1A:4491..4586|rnd-1_family-544#Unknown Chr1A:4491..4586|rnd-1_family-544#DNA/hAT-Ac
Chr1A:15383..15535|rnd-1_family-544#Unknown Chr1A:15383..15535|rnd-1_family-544#DNA/hAT-Ac
Chr1A:19859..20089|rnd-5_family-3147#Unknown Chr1A:19859..20089|rnd-5_family-3147#DNA/PIF-Harbinger
Chr1A:23324..23637|rnd-1_family-2#Unknown Chr1A:23324..23637|rnd-1_family-2#DNA/PIF-Harbinger
Chr1A:24437..24541|rnd-6_family-12487#Unknown Chr1A:24437..24541|rnd-6_family-12487#DNA/hAT-Tip100
Chr1A:27373..27686|rnd-1_family-2#Unknown Chr1A:27373..27686|rnd-1_family-2#DNA/PIF-Harbinger
Chr1A:28495..28592|rnd-6_family-12487#Unknown Chr1A:28495..28592|rnd-6_family-12487#DNA/hAT-Tip100
Chr1A:29927..29982|rnd-1_family-658#Unknown Chr1A:29927..29982|rnd-1_family-658#DNA/TcMar-Stowaway
Chr1A:30124..30220|rnd-1_family-658#Unknown Chr1A:30124..30220|rnd-1_family-658#DNA/TcMar-Stowaway
Chr1A:30360..30608|rnd-1_family-7#Unknown Chr1A:30360..30608|rnd-1_family-7#DNA/PIF-Harbinger
Chr1A:31505..31617|rnd-5_family-139#Unknown Chr1A:31505..31617|rnd-5_family-139#DNA/PIF-Harbinger
Chr1A:31644..31689|rnd-5_family-139#Unknown Chr1A:31644..31689|rnd-5_family-139#DNA/PIF-Harbinger
Chr1A:34307..34569|rnd-5_family-2428#Unknown Chr1A:34307..34569|rnd-5_family-2428#DNA/PIF-Harbinger
Chr1A:35672..35780|rnd-5_family-1497#Unknown Chr1A:35672..35780|rnd-5_family-1497#DNA/PIF-Harbinger
Chr1A:36038..36284|rnd-5_family-6489#Unknown Chr1A:36038..36284|rnd-5_family-6489#DNA/PIF-Harbinger

error exit

X_nanoraw.txt

Dear author, I encountered an error when running the software; exit code 120 is displayed. Could you please help me find the cause? I have attached the command line and log below. Thank you very much.

command: Exit 120 nohup TEsorter /gpfs1/home/life/dengcl/Sp_TUseq/X_assemblegenome1.fa -db rexdb-plant -p 20 -cov 20 -eval 1e-3 > X_nanoraw.file 2>&1 (wd: ~/zhoujian/TEsorter/Xnanoraw)

Add SINE hmms

Hi Ren-Gang,

There are about 88 SINE families reported by this study. They have included HMMs for these families. Are they included in the GyDB or other collections already? If not, it may be an enhancement to include them in TEsorter. Thank you.

Best,
Shujun

TEsorter genome.fasta -genome -p 20 -prob 0.9

Could I use just this mode for TE type classification? I ask because many workflows seem to identify repeats first.
