MITE Tracker: An accurate approach to identify miniature inverted-repeat transposable elements in large genomes

mite-tracker's Introduction

About

MITE Tracker: an accurate method for identifying miniature inverted-repeat transposable elements in large genomes.

An efficient and easy-to-run tool for discovering miniature inverted-repeat transposable elements (MITEs) in genomic sequences. It is written in Python 3 and uses NCBI BLAST+ to find inverted repeats and vsearch to do the clustering.

Large genomes can be processed on desktop computers.

Requirements

Tested on macOS 10.13.1, Debian 7.6, Ubuntu 16.04 and Windows 7
NCBI BLAST+ (Nucleotide-Nucleotide BLAST 2.6.0+)
Python requirements are listed in the requirements.txt file (biopython and pandas)

Installation and running

# clone repo
git clone https://github.com/INTABiotechMJ/MITE-Tracker.git
cd MITE-Tracker

# blast
sudo apt-get install ncbi-blast+ virtualenv
# in macOS: brew install ncbi-blast+ virtualenv


#vsearch
wget https://github.com/torognes/vsearch/archive/v2.7.1.tar.gz
tar xzf v2.7.1.tar.gz
cd vsearch-2.7.1
#might need: sudo apt-get install autoconf
sh autogen.sh
./configure
make

#python dependencies
cd ..
virtualenv -p python3 venv
source venv/bin/activate
#might need: sudo apt-get install python3.6-dev
#if pandas failed to install, run: pip3 install cython
pip3 install -r requirements.txt

# running

python3 -m MITETracker -g /path/to/your/genome.fasta -w 3 -j jobname

# or to run in background

nohup python3 -m MITETracker -g /path/to/your/genome.fasta -w 3 -j jobname &

To check the output and progress you can use these commands (ctrl+c to exit):

#nohup.out holds the program output as well as the output from the vsearch clustering step
tail -f nohup.out
#out.log contains a log with timing information
tail -f results/[jobname]/out.log
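
Progress can also be read from out.log programmatically. The following is a minimal sketch, assuming progress lines have the shape seen in user-reported logs for this tool, for example "2023-04-27 21:00:40,386 Adding NC_001751.1 58/58 (100% of total sequences in 119.044594 secs)". The names parse_progress and latest_progress are hypothetical helpers, not part of MITE Tracker.

```python
import re

# Assumed line shape: "<timestamp> Adding <seq_id> <done>/<total> (...)".
PROGRESS_RE = re.compile(r"Adding\s+(\S+)\s+(\d+)/(\d+)")


def parse_progress(line):
    """Return (seq_id, done, total) for a progress line, else None."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    seq_id, done, total = match.groups()
    return seq_id, int(done), int(total)


def latest_progress(log_path):
    """Scan a log file and return the last progress tuple seen, or None."""
    latest = None
    with open(log_path) as handle:
        for line in handle:
            parsed = parse_progress(line)
            if parsed is not None:
                latest = parsed
    return latest
```

Calling latest_progress("results/jobname/out.log") while a job is running would give the most recently processed sequence and how many of the total have been screened, provided the log format matches the assumption above.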

Command line options

Argument	Description	Data type	Required or default
-g	Genome file in fasta format	string	required
-j	Jobname. Result files will be created in results/jobname	string	required
-w	Max number of processes to use simultaneously	int	1
-tsd_min_len	TSD min length	int	2
-tsd_max_len	TSD max length	int	10
-mite_min_len	MITE min length	int	50
-mite_max_len	MITE max length	int	650
--task	cluster or candidates	string
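
When many runs have to be launched, it can help to build the command line in code. This is only a sketch: build_command is a hypothetical helper, and it deliberately uses just the flags that appear in the example commands of this README (-g, -j, -w and --task).

```python
def build_command(genome, jobname, workers=1, task=None):
    """Build the argument list for one MITE Tracker run."""
    cmd = [
        "python3", "-m", "MITETracker",
        "-g", genome,
        "-j", jobname,
        "-w", str(workers),
    ]
    if task is not None:
        # The options table lists only these two values for --task.
        if task not in ("cluster", "candidates"):
            raise ValueError("task must be 'cluster' or 'candidates'")
        cmd += ["--task", task]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) from the MITE-Tracker directory, or joined with spaces to print the equivalent shell command.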

Results

All the results are placed in results/[yourjobname]/. Here you will find:

families.fasta: all the MITE sequences divided by families (custom format)
families_nr.fasta: one MITE per family in fasta format
all.fasta: all MITEs in fasta format
all.gff3: a gff file describing all MITEs found

Troubleshooting

If you get any error while running the BLASTn searches, please check your blast+ version.

Running large genomes in different computers

This is an example of how we ran the wheat genome. Each chromosome can be run separately (--task candidates) on a different computer. The results should then be merged using cat, and the cluster command (--task cluster) run on the merged files. The files required for clustering are candidates.csv and candidates.fasta.

The 21 wheat chromosomes were downloaded as separate files.

python3 -m MITETracker -g /media/chr1A.fasta -w 2 -j IWGSC_1A --task candidates
python3 -m MITETracker -g /media/chr1B.fasta -w 2 -j IWGSC_1B --task candidates
python3 -m MITETracker -g /media/chr1D.fasta -w 2 -j IWGSC_1D --task candidates
python3 -m MITETracker -g /media/chr2A.fasta -w 2 -j IWGSC_2A --task candidates
python3 -m MITETracker -g /media/chr2B.fasta -w 2 -j IWGSC_2B --task candidates
python3 -m MITETracker -g /media/chr2D.fasta -w 2 -j IWGSC_2D --task candidates
python3 -m MITETracker -g /media/chr3A.fasta -w 2 -j IWGSC_3A --task candidates
python3 -m MITETracker -g /media/chr3B.fasta -w 2 -j IWGSC_3B --task candidates
python3 -m MITETracker -g /media/chr3D.fasta -w 2 -j IWGSC_3D --task candidates
python3 -m MITETracker -g /media/chr4A.fasta -w 2 -j IWGSC_4A --task candidates
python3 -m MITETracker -g /media/chr4B.fasta -w 2 -j IWGSC_4B --task candidates
...
mkdir results/IWGSC
cat results/IWGSC_*/candidates.csv > results/IWGSC/candidates.csv
cat results/IWGSC_*/candidates.fasta > results/IWGSC/candidates.fasta
python3 -m MITETracker -g none -w 3 -j IWGSC --task cluster --min_copy_number 4
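
On systems without cat (for example Windows), the merge step can be reproduced in Python. This is a sketch under the assumption that plain byte-wise concatenation is all that is needed, exactly as the cat commands above do; like cat, it does not treat any header line specially. merge_candidates is a hypothetical helper name.

```python
import glob
import os
import shutil


def merge_candidates(results_dir, prefix, merged_job):
    """Concatenate candidates.csv and candidates.fasta from every
    results_dir/<prefix>* job into results_dir/<merged_job>/."""
    out_dir = os.path.join(results_dir, merged_job)
    os.makedirs(out_dir, exist_ok=True)
    # Sorted for a reproducible order; never read from the output directory.
    job_dirs = sorted(
        d for d in glob.glob(os.path.join(results_dir, prefix + "*"))
        if os.path.isdir(d) and os.path.abspath(d) != os.path.abspath(out_dir)
    )
    for name in ("candidates.csv", "candidates.fasta"):
        with open(os.path.join(out_dir, name), "wb") as merged:
            for job_dir in job_dirs:
                part = os.path.join(job_dir, name)
                if os.path.exists(part):
                    with open(part, "rb") as handle:
                        shutil.copyfileobj(handle, merged)
    return job_dirs
```

For the wheat example this would be called as merge_candidates("results", "IWGSC_", "IWGSC") before running the cluster task.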

Publication and citing

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2376-y

Please cite with:

Crescente, Juan Manuel, et al. "MITE Tracker: an accurate approach to identify miniature inverted-repeat transposable elements in large genomes." BMC Bioinformatics 19.1 (2018): 348.

Or for bibtex users:

@article{crescente2018mite,
  title={MITE Tracker: an accurate approach to identify miniature inverted-repeat transposable elements in large genomes},
  author={Crescente, Juan Manuel and Zavallo, Diego and Helguera, Marcelo and Vanzetti, Leonardo Sebasti{\'a}n},
  journal={BMC Bioinformatics},
  volume={19},
  number={1},
  pages={348},
  year={2018},
  publisher={Springer}
}

Note:

Due to a problem with the additional files in the publication, we have added those files to this repository under supplementary_materials/

rice_mites.fasta: Non-redundant MITE family database obtained from the rice genome

wheat_mites.fasta: Non-redundant MITE family database obtained from the wheat genome

tools_comparison.csv: Execution summary of MITE Tracker and other tools using several genomes

wheat_genes.csv: Wheat genes containing MITEs within their coding regions.

Additional notes

Annotating all arabidopsis MITEs as an example

Clone MITETracker and install dependencies

#clone
git clone git@github.com:INTABiotechMJ/MITE-Tracker.git
#enter program directory
cd MITE-Tracker
#create virtual environment with python3
virtualenv -p python3 venv
#activate virtual environment
source venv/bin/activate
#install requirements
pip3 install -r requirements.txt
#run MITE Tracker
python3 MITETracker.py  -g TAIR10_chr_all.fas -j ata

With this version of TAIR genome we get a total of 38 distinct MITE families.

We use the all.fasta file to map MITEs genome-wide because it contains all the elements found.

blastn -task blastn -query results/ata/all.fasta  -subject ../data/tair10/TAIR10_chr_all.fas -outfmt "6 qseqid sseqid qstart qend sstart send score length mismatch gaps gapopen nident pident evalue qlen slen qcovs" > results/ata/blast_families_ata.csv

Let's run our notebook for filtering the blast results; run it to the end. It explains at each step how the filtering is done and what the results are.

jupyter lab
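
The notebook documents the actual filtering steps. As a rough illustration only of what filtering a tabular BLAST file involves, the sketch below keeps hits that reach a minimum identity and query coverage. The column order is taken from the -outfmt string in the blastn command above; the function name filter_hits and both threshold values are placeholders and are not taken from the notebook.

```python
import csv

# Column order taken from the -outfmt string used in the blastn command.
COLUMNS = ("qseqid sseqid qstart qend sstart send score length mismatch "
           "gaps gapopen nident pident evalue qlen slen qcovs").split()


def filter_hits(in_path, out_path, min_pident=90.0, min_qcovs=90.0):
    """Copy hits whose identity and query coverage reach the thresholds.

    Both thresholds are placeholders, not values from the notebook.
    Returns the number of hits kept.
    """
    kept = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, fieldnames=COLUMNS, delimiter="\t")
        writer = csv.DictWriter(dst, fieldnames=COLUMNS, delimiter="\t",
                                lineterminator="\n", extrasaction="ignore")
        for row in reader:
            if (float(row["pident"]) >= min_pident
                    and float(row["qcovs"]) >= min_qcovs):
                writer.writerow(row)
                kept += 1
    return kept
```

A call such as filter_hits("results/ata/blast_families_ata.csv", "results/ata/blast_families_ata.filtered.csv") would write a file with the same columns, ready for the conversion step.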

Finally, convert the filtered blast output to gff:

python blast2gff.py -i results/ata/blast_families_ata.filtered.csv  -o results/ata/mitesInGenome.gff3 -n MITE_TRACKER

This is our resulting annotated file

results/ata/mitesInGenome.gff3
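
For readers curious about what such a conversion involves, here is a generic sketch of turning one tabular BLAST hit into a GFF3 line. It is not the code of blast2gff.py: the function name, the feature type "MITE", the use of the BLAST score as the GFF3 score, and the Name attribute are all assumptions made for illustration. The default source string simply mirrors the -n value used in the command above.

```python
from urllib.parse import quote


def blast_row_to_gff3(fields, source="MITE_TRACKER", feature_type="MITE"):
    """Convert one BLAST hit (17 strings in the -outfmt order) to a GFF3 line."""
    qseqid, sseqid = fields[0], fields[1]
    sstart, send = int(fields[4]), int(fields[5])
    # BLAST reports minus-strand hits with sstart > send; GFF3 needs start <= end.
    strand = "+" if sstart <= send else "-"
    start, end = min(sstart, send), max(sstart, send)
    score = fields[6]
    # Percent-encode characters that are reserved in the GFF3 attribute column.
    attributes = "Name=" + quote(qseqid, safe="|:")
    return "\t".join([sseqid, source, feature_type, str(start), str(end),
                      score, strand, ".", attributes])
```

Each line of the filtered file can be split on tabs and passed to this function; a complete GFF3 file would additionally start with the line "##gff-version 3".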

mite-tracker's Issues

filter blastn output

Hi,
This tool seems very interesting, thanks for developing it.

I could follow the steps up to the blastn of the candidate MITEs against the genome FASTA sequence, and I have the blast_families_ata.csv file. However, it is not clear to me how to filter the BLAST output to create the GFF.

'''
Let's run our notebook for filtering blast results, run till the end. This will explain at each step how filtering is done and what are the results.

jupyter lab

'''

Please clarify.
Thanks in advance.
sam

no result after 10h for a 300Mb diploid insect genome - how to improve speed?

Hello, we are keen to use MITE-Tracker on our insect genomes (size = 300 Mb, diploid). We tested the tool on one genome, expecting the run to be as short as the rice genome example in the BMC Bioinformatics article. However, 10 hours were not enough, and only 2 out of 210 sequences have been screened (according to log.out). Can you tell us what we could explore to make the analysis quicker? Many thanks!

I pasted below the script sent to the job scheduler on my university cluster, given that I am using the Transposon annotation tool reasonaTE:

#!/bin/bash -l

# Request wallclock time 
#$ -l h_rt=20:0:0

# Request RAM.
#$ -l mem=10G

# Request TMPDIR space (default is 10 GB).
#$ -l tmpfs=10G

# annotating TE in genome
# Miniature inverted-repeat transposable element 

# annotate the genome for TE 
singularity exec /home/username/Scratch/.singularity/pull/reasonate_env1_latest.sif reasonaTE -mode annotate -projectFolder ${species} -projectName test${species} -tool mitetracker

MITE Tracker ran successfully within 10 minutes but found no candidates at all, even for the rice genome

I ran the rice genome just for reference and still got zero candidates.

python -m MITETracker -g  /home/naveen/HYD/assembly.fasta -w 3 -j trial

out.log

2023-04-27 21:00:40,386 Adding NC_001751.1 58/58 (100% of total sequences in 119.044594 secs)
2023-04-27 21:00:40,391 Candidates: 0
2023-04-27 21:00:40,480 Clustering
2023-04-27 21:00:40,480 ./vsearch-2.7.1/bin/vsearch --cluster_fast results/trial/candidates.fasta --threads 3 --strand both --clusters results/trial/temp/clust --iddef 1 --id 0.8
2023-04-27 21:00:40,637 Clustering done
2023-04-27 21:00:40,638 Filtering clusters
2023-04-27 21:00:40,640 Initial clusters: 0
2023-04-27 21:00:40,641 Clusters: 0
2023-04-27 21:00:40,658 119.316502 secs
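
When a run ends with zero candidates, one basic thing to rule out is that the input file is not what was expected (for example, far fewer or far shorter sequences than the genome should have). The sketch below is a plain diagnostic with a hypothetical helper name, fasta_summary; it does not explain the zero result by itself.

```python
def fasta_summary(path):
    """Return (number of sequences, total bases, longest sequence length)."""
    count = total = longest = current = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: close the previous one.
                longest = max(longest, current)
                current = 0
                count += 1
            elif line:
                current += len(line)
                total += len(line)
    # Close the last record.
    longest = max(longest, current)
    return count, total, longest
```

For the log above, the first value returned for the input file should match the 58 sequences reported by MITE Tracker.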

Error in running MITETracker

Hi,

I am trying to install and run MITETracker using a virtual environment in Ubuntu 14.04 LTS, but I am unable to run MITETracker.py. I followed the steps below:

First, I installed virtualenv and created a virtual environment (venv) in my home directory:
virtualenv -p python3 venv
source venv/bin/activate
Installed and upgraded python in the venv dir
python --version
Python 3.6.3
Cloned MITETracker from venv:
git clone https://github.com/INTABiotechMJ/MITE-Tracker.git
pip install -r /home/icar/MITE-Tracker/requirements.txt
(This installed all required packages including Biopython, numpy and pandas)
then tried:
nohup python3 -m MITETracker -g /home/icar/Jute_genomes/LLWS01.1.fa -w 8 -j 524_MITEtracker_out &

but ended up with the following error:
(venv) icar@icar-crijaf:~$ nohup: ignoring input and appending output to ‘nohup.out’

The nohup output says:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/icar/MITE-Tracker/MITETracker.py", line 3, in
from Bio import SeqIO
ModuleNotFoundError: No module named 'Bio'

Kindly help me to fix the issue and run MITETracker properly. I am a novice in working with Python.
Help in solving the issue in simple terms, as a step-by-step process, would be highly appreciated.

Regards
Dip
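
A ModuleNotFoundError for Bio means that the interpreter which ran the command cannot see Biopython. One common cause is that pip and python3 point to different environments. The sketch below (hypothetical helper names, standard library only) reports which interpreter is running and which of the required modules it cannot find.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the module names that the running interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def report(names=("Bio", "numpy", "pandas")):
    """Describe the running interpreter and the modules it lacks."""
    lines = ["interpreter: " + sys.executable, "prefix: " + sys.prefix]
    missing = missing_modules(names)
    lines.append("missing: " + (", ".join(missing) if missing else "none"))
    return "\n".join(lines)
```

Printing report() with the same python3 that is used to launch MITETracker shows whether that interpreter belongs to the activated virtual environment and whether Biopython is visible to it.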

the MITETracker.py can NOT make directory results/jobname

The run fails: MITETracker.py cannot create the job subdirectory, here "results/xgenome".

my run command: ( in linux)
python3 /usr/local/apps/gb/MITE-Tracker/0.1/MITETracker.py -g $srcdir/$seqf -w $threads -j xgenome

It may be caused by the location of the parent directory. Does this script use the current directory, or the directory of the input sequence file? It seems the parent directory information is passed incorrectly.

Error information:
Traceback (most recent call last):
File "/usr/local/apps/gb/MITE-Tracker/0.1/MITETracker.py", line 55, in
os.mkdir("results/" + args.jobname)
FileNotFoundError: [Errno 2] No such file or directory: 'results/xgenome'
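
The traceback shows a call to os.mkdir with the relative path "results/" + args.jobname. A relative path is resolved against the current working directory, and os.mkdir does not create missing parent directories, so the call fails with FileNotFoundError whenever the directory the command is run from has no results/ folder. Running from a directory that already contains results/, or creating it first, avoids the error. As a sketch of the difference (ensure_job_dir is a hypothetical helper, not the tool's code):

```python
import os


def ensure_job_dir(jobname, base="results"):
    """Create base/jobname, including a missing base directory."""
    path = os.path.join(base, jobname)
    # os.mkdir(path) raises FileNotFoundError when base is absent;
    # os.makedirs creates the missing parents as well.
    os.makedirs(path, exist_ok=True)
    return path
```

With exist_ok=True the call also succeeds when the directory is already there, which may or may not be the behaviour wanted for a job folder that should not be overwritten.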

Error pandas

I'm working on a Mac running macOS Mojave.
I already had BLAST installed on my machine.
brew install virtualenv is no longer available, so I installed it with pip3 install virtualenv.
Vsearch is working fine.

When it comes to the Python dependencies, things get quite confusing:
virtualenv -p python3 venv
source venv/bin/activate

pip3 install cython
Requirement already satisfied: cython in ./vsearch-2.7.1/venv/lib/python3.7/site-packages


(venv) ohon_ad@ohons-Mac-Pro:~/Documents/Tools/MITE-Tracker$ pip3 install -r requirements.txt
Collecting biopython==1.70 (from -r requirements.txt (line 1))
Collecting numpy==1.9.1 (from -r requirements.txt (line 2))
  Using cached https://files.pythonhosted.org/packages/41/39/45791d98f1c82789b96d7bdc36f34792d0106b44680fb946d5de9cd5c979/numpy-1.9.1.tar.gz
Collecting pandas==0.19.0 (from -r requirements.txt (line 3))
  Using cached https://files.pythonhosted.org/packages/21/93/5b5c84a92db8bdd9748960003bcfed3df173e1d3f0cca393512ea98d14cb/pandas-0.19.0.tar.gz
    Complete output from command python setup.py egg_info:
    Download error on https://pypi.org/simple/numpy/: Tunnel connection failed: 407 Proxy Authentication Required -- Some packages may not be found!
    Couldn't find index page for 'numpy' (maybe misspelled?)
    No local packages or working download links found for numpy>=1.7.0
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/f6/mqmlplts36n7qxzghph_v9m00000gn/T/pip-install-5rfs2bnm/pandas/setup.py", line 680, in <module>
        **setuptools_kwargs)
      File "/Users/ohon_ad/Documents/Tools/MITE-Tracker/vsearch-2.7.1/venv/lib/python3.7/site-packages/setuptools/__init__.py", line 144, in setup
        _install_setup_requires(attrs)
      File "/Users/ohon_ad/Documents/Tools/MITE-Tracker/vsearch-2.7.1/venv/lib/python3.7/site-packages/setuptools/__init__.py", line 139, in _install_setup_requires
        dist.fetch_build_eggs(dist.setup_requires)
      File "/Users/ohon_ad/Documents/Tools/MITE-Tracker/vsearch-2.7.1/venv/lib/python3.7/site-packages/setuptools/dist.py", line 724, in fetch_build_eggs
        replace_conflicting=True,
      File "/Users/ohon_ad/Documents/Tools/MITE-Tracker/vsearch-2.7.1/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 782, in resolve
        replace_conflicting=replace_conflicting
      File "/Users/ohon_ad/Documents/Tools/MITE-Tracker/vsearch-2.7.1/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1065, in best_match
        return self.obtain(req, installer)
      File "/Users/ohon_ad/Documents/Tools/MITE-Tracker/vsearch-2.7.1/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1077, in obtain
        return installer(requirement)
      File "/Users/ohon_ad/Documents/Tools/MITE-Tracker/vsearch-2.7.1/venv/lib/python3.7/site-packages/setuptools/dist.py", line 791, in fetch_build_egg
        return cmd.easy_install(req)
      File "/Users/ohon_ad/Documents/Tools/MITE-Tracker/vsearch-2.7.1/venv/lib/python3.7/site-packages/setuptools/command/easy_install.py", line 673, in easy_install
        raise DistutilsError(msg)
    distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('numpy>=1.7.0')
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/f6/mqmlplts36n7qxzghph_v9m00000gn/T/pip-install-5rfs2bnm/pandas/

Does anyone have an idea, please?
Your response would be highly appreciated.

Vsearch clustering: file not found error

I am having some trouble running MITE Tracker properly. It seems to identify MITEs successfully, but once it proceeds to the clustering step it returns the error. I have tried re-installing Vsearch, and I can locate all of the files specified in the traceback in my MITE-Tracker directory. Any suggestions? Thank you very much.

representative seq from valid families

Hello!
I wanted to ask if there is an easy way to find out which families cluster together.
What I want to know is which other sequences belong with each representative in families_nr.fasta.

Also, I'm curious: what is the difference between all.fasta and families.txt?
And are the sequences in all.fasta all full-length MITEs?

I really like the program, and hopefully you can help me :)

can not find MITE in my test genome?

Hi,
I am trying MITE-Tracker; it runs faster than MITE-Hunter.
I split my genome, which contains 738 contigs (~300 Mbp), into 38 files. I can find candidate MITEs in each file, but I cannot find the final result, as in the following listing:

-rw-r--r-- 1 tanyt BGenome      2453 May  6 16:33 036/families.fasta
-rw-r--r-- 1 tanyt BGenome       822 May  6 16:33 036/families_nr.fasta
-rw-r--r-- 1 tanyt BGenome         0 May  6 16:23 037/all.fasta
-rw-r--r-- 1 tanyt BGenome   2624345 May  6 16:19 037/candidates.fasta
-rw-r--r-- 1 tanyt BGenome         0 May  6 16:23 037/families.fasta
-rw-r--r-- 1 tanyt BGenome         0 May  6 16:23 037/families_nr.fasta
-rw-r--r-- 1 tanyt BGenome         0 May  6 16:56 038/all.fasta
-rw-r--r-- 1 tanyt BGenome    904702 May  6 16:55 038/candidates.fasta
-rw-r--r-- 1 tanyt BGenome         0 May  6 16:56 038/families.fasta
-rw-r--r-- 1 tanyt BGenome         0 May  6 16:56 038/families_nr.fasta
-rw-r--r-- 1 tanyt BGenome         0 May  7 11:06 atest/all.fasta
-rw-r--r-- 1 tanyt BGenome 170023347 May  7 11:06 atest/candidates.fasta
-rw-r--r-- 1 tanyt BGenome         0 May  7  2019 atest/families.fasta
-rw-r--r-- 1 tanyt BGenome         0 May  7  2019 atest/families_nr.fasta

There is nothing in:

-rw-r--r-- 1 tanyt BGenome         0 May  7  2019 atest/families.fasta
-rw-r--r-- 1 tanyt BGenome         0 May  7  2019 atest/families_nr.fasta

Can you help me?
Is the --min_copy_number parameter the key? I set it to 4, like you did. Does that value reflect the ploidy of your test wheat? My species is diploid; should I change it to 2?

(Query) Possible effectiveness at identifying autonomous class II TEs?

This looks like a great resource! I'm looking at prokaryotic transposons and I'm interested to know if MITE-Tracker has ever been tried on larger, autonomous, class II TEs? And specifically whether specifying a large maximum length (>10kb) would have any obvious issues?
Thanks!

Python version and dependencies

Which Python version should be used? I currently use 3.8.10 and am unable to install numpy and pandas from requirements.txt.

TIRs longer than Mites

Hi,
I used your software to detect MITEs in a plant genome assembly. I was surprised to find elements for which TIR length was longer than MITE length. Considering that a MITE consists of two TIRs and some internal region, this seems quite strange. The following example is from your results:
MITE_T_1|chr12|14935570|14935993|CC|430|F1 TSD_IN:no MITE_LEN:423 TIR_LEN:430 CANDIDATE_ID:MITE_CAND_446133 COMMON_TSD:aaa
How would you explain a result like the one above where the MITE itself seems to be shorter than the TIRs it bears?
Thanks for your explanation!
Marius
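
Records like the one quoted above can be located automatically. The sketch below parses a header of that form and flags entries whose TIR_LEN exceeds MITE_LEN. It assumes that the second to fourth pipe-separated fields are the sequence name, start and end (in the example, 14935993 - 14935570 = 423, which equals MITE_LEN); the meaning of the remaining pipe-separated fields is not assumed. Both function names are hypothetical, and the code only finds such records, it does not explain them.

```python
def parse_mite_header(header):
    """Split a MITE Tracker FASTA header into coordinates and KEY:value tags."""
    ident, *tags = header.lstrip(">").split()
    id_fields = ident.split("|")
    info = dict(tag.split(":", 1) for tag in tags)
    # Assumption: fields 2-4 of the identifier are sequence name, start, end.
    info["seqid"] = id_fields[1]
    info["start"] = int(id_fields[2])
    info["end"] = int(id_fields[3])
    return info


def tir_longer_than_mite(info):
    """True when the reported TIR length exceeds the reported MITE length."""
    return int(info["TIR_LEN"]) > int(info["MITE_LEN"])
```

Applied to every header line of all.fasta, this gives a list of the elements that show the behaviour described in this issue.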

Results not complete

Hello,
My friends and I are trying to use MITE Tracker to find MITEs in Solanum lycopersicum. I would like to ask two questions:

1. After waiting several hours, we were glad that the run was successful. The README file says that there are 4 result files in the "results" folder: 3 files in .fasta format and 1 file in .gff3 format. Unfortunately, our run produced only the MITE candidates in .fasta and .csv.

What should we do to get all four files in the results?

2. In the .csv file, there is a column called "tsd_in" with a "yes" or "no" value for every MITE candidate. What does "tsd_in" mean?

Thank you..