gonzalezlab / mchelper


MCHelper: An automatic tool to curate transposable element libraries

License: GNU General Public License v3.0

Python 81.24% R 8.51% Shell 10.26%

annotation curation genomics transposable-elements


Introduction
Installation
- Linux/Windows
- MacOS
Testing
Usage
Inputs
Outputs
Citation

Introduction

The number of species with high-quality genome sequences continues to increase, in part due to scaling up of multiple large-scale biodiversity sequencing projects. While the need to annotate genic sequences in these genomes is widely acknowledged, the parallel need to annotate transposable element sequences that have been shown to alter genome architecture, rewire gene regulatory networks, and contribute to the evolution of host traits is becoming ever more evident. However, accurate genome-wide annotation of transposable element sequences is still technically challenging. Several de novo transposable element identification tools are now available, but manual curation of the libraries produced by these tools is needed to generate high-quality genome annotations. Manual curation is time-consuming, and thus impractical for large-scale genomic studies, and lacks reproducibility. In this work, we present the Manual Curator Helper tool, MCHelper, which automates the TE library curation process. By leveraging MCHelper's fully automated mode with the outputs from two de novo transposable element identification tools, RepeatModeler2 and REPET, in fruit fly, rice, and zebrafish, we show a substantial improvement in the quality of the transposable element libraries and genome annotations. MCHelper libraries are less redundant, with up to 54% reduction in the number of consensus sequences, have up to 11.4% fewer false positive sequences, and also have up to ~45% fewer “unclassified/unknown” transposable element consensus sequences. Genome-wide transposable element annotations were also improved, including larger unfragmented insertions.

Installation

Linux/Windows

On Windows systems, it is necessary to have a functional installation of Windows Subsystem for Linux (WSL) version 2, the Poppler package (sudo apt-get install poppler-utils), and the Qt package (sudo apt-get install qtbase5-dev).

It is recommended to install the dependencies in an Anaconda environment.

git clone https://github.com/gonzalezlab/MCHelper.git

Then, locate the file named "MCHelper.yml" inside the MCHelper folder and create the environment from it:

conda env create -f MCHelper/MCHelper.yml

Now, unzip all the databases needed by MCHelper:

cd MCHelper/db
unzip '*.zip'
conda activate MCHelper
makeblastdb -in allDatabases.clustered_rename.fa -dbtype nucl

Then, download the Pfam database released by the REPET group and rename it:

wget https://urgi.versailles.inrae.fr/download/repet/profiles/ProfilesBankForREPET_Pfam35.0_GypsyDB.hmm.tar.gz

tar xvf ProfilesBankForREPET_Pfam35.0_GypsyDB.hmm.tar.gz
mv ProfilesBankForREPET_Pfam35.0_GypsyDB.hmm Pfam35.0.hmm

And that's it. You have now installed MCHelper.

MacOS

These installation instructions have been tested on MacOS with M1/M2 architectures (Apple Silicon, arch: arm64). Therefore, they are not compatible with MacOS on Intel Core processors.

Set up Rosetta.

Download and install iTerm (or duplicate it if you have already installed it, then rename the copy to, for example, iTerm_X86_64).
Right click on the iTerm icon (or iTerm_X86_64 if you renamed it), select the option Get Info, and check the box: Open using Rosetta.
Open the new iTerm terminal (or iTerm_X86_64).
Verify the architecture with uname -m. It should print: x86_64

Install Mambaforge. Using the same iTerm (or iTerm_X86_64) we configured earlier, download the Mambaforge script and install it:

wget https://github.com/conda-forge/miniforge/releases/download/23.11.0-0/Mambaforge-23.11.0-0-MacOSX-x86_64.sh 
chmod +x Mambaforge-23.11.0-0-MacOSX-x86_64.sh

./Mambaforge-23.11.0-0-MacOSX-x86_64.sh

Follow the prompts, install, and initialize conda.

Install the MCHelper conda environment with the special YML file for Mac (MCHelper_Mac.yml), using the Rosetta iTerm (or iTerm_X86_64):

git clone https://github.com/gonzalezlab/MCHelper.git

conda env create -f MCHelper/MCHelper_Mac.yml

conda activate MCHelper_Mac

Download and rename the TRF binary for MacOS:

cd MCHelper/tools
wget https://github.com/Benson-Genomics-Lab/TRF/releases/download/v4.09.1/trf409.macosx

rm -f trf409.linux64
mv trf409.macosx trf409.linux64
chmod +x trf409.linux64
cd -

Now, unzip all the databases needed by MCHelper:

cd MCHelper/db
unzip '*.zip'
makeblastdb -in allDatabases.clustered_rename.fa -dbtype nucl

Then, download the Pfam database released by the REPET group and rename it:

wget https://urgi.versailles.inrae.fr/download/repet/profiles/ProfilesBankForREPET_Pfam35.0_GypsyDB.hmm.tar.gz

tar xvf ProfilesBankForREPET_Pfam35.0_GypsyDB.hmm.tar.gz
mv ProfilesBankForREPET_Pfam35.0_GypsyDB.hmm Pfam35.0.hmm

And that's it. You have now installed MCHelper.

Testing

To test MCHelper, we provide some example inputs and the expected results (located at Test_dir/) so that you can compare them with your own outputs. To check that MCHelper is running properly, follow these steps.

First, activate the anaconda environment, if it isn't activated yet:

conda activate MCHelper

Then, be sure you are in the main folder (the one where MCHelper.py is located) and unzip the D. melanogaster genome:

unzip Test_dir/repet_input/Dmel_genome.zip -d Test_dir/repet_input/

The next step is to download and format the host genes from BUSCO:

wget https://busco-data.ezlab.org/v4/data/lineages/diptera_odb10.2020-08-05.tar.gz
mv diptera_odb10.2020-08-05.tar.gz Test_dir/repet_input/ 
cd Test_dir/repet_input/
tar xvf diptera_odb10.2020-08-05.tar.gz
cat diptera_odb10/hmms/*.hmm > diptera_odb10.hmm
cd -

Now, run the MCHelper script:

mkdir Test_dir/repet_output_own

python3 MCHelper.py -r A -t 8 -i Test_dir/repet_input/ -o Test_dir/repet_output_own -g Test_dir/repet_input/Dmel_genome.fasta --input_type repet -b Test_dir/repet_input/diptera_odb10.hmm -a F -n Dmel

This test takes the REPET output and performs the curation automatically, using the default values for most parameters. If you want to run the test for the fasta input, you can execute:

unzip Test_dir/fasta_input/Dmel_genome.zip -d Test_dir/fasta_input/

mkdir Test_dir/fasta_output_own

python3 MCHelper.py -r A -t 8 -l Test_dir/fasta_input/Dmel-families.fa -o Test_dir/fasta_output_own -g Test_dir/fasta_input/Dmel_genome.fna --input_type fasta -b Test_dir/repet_input/diptera_odb10.hmm -a F
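Once a test run has finished, one rough way to compare your own output with the expected results is to compare the sequence identifiers of two FASTA libraries. The following is only a sketch: the function names are hypothetical, the path of the expected results inside Test_dir/ is an assumption that you should adjust, and matching identifiers does not guarantee that the sequences themselves are identical.

```python
from pathlib import Path


def read_fasta_ids(path):
    """Return the set of sequence identifiers (first word of each header line)."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    ids.add(header.split()[0])
    return ids


def compare_libraries(own_fasta, expected_fasta):
    """Summarise how two FASTA libraries differ by sequence identifier."""
    own = read_fasta_ids(own_fasta)
    expected = read_fasta_ids(expected_fasta)
    return {
        "shared": len(own & expected),
        "only_in_own": sorted(own - expected),
        "only_in_expected": sorted(expected - own),
    }


if __name__ == "__main__":
    # Hypothetical paths: point these at your own output and at the
    # expected results shipped in Test_dir/.
    own = Path("Test_dir/repet_output_own/curated_sequences_NR.fa")
    expected = Path("Test_dir/repet_output/curated_sequences_NR.fa")
    if own.exists() and expected.exists():
        print(compare_libraries(own, expected))
    else:
        print("One of the files is missing; check the paths above.")
```

Small differences between runs are possible, so treat the summary as a sanity check rather than a strict pass/fail test.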

Usage

Be sure you have activated the anaconda environment:

conda activate MCHelper

Then, execute MCHelper with default parameters. For REPET input (see Testing for a practical example):

python3 MCHelper.py -i path/to/repet_output -o path/to/MCHelper_output -g path/to/genome -n repet_name_project --input_type repet -b path/to/reference_genes.hmm -a F

For fasta input:

python3 MCHelper.py -l path/to/TE_library_in_fasta -o path/to/MCHelper_output -g path/to/genome --input_type fasta -b path/to/reference_genes.hmm -a F

To see the full help documentation run:

python3 MCHelper.py --help

The full list of parameters includes:

-h, --help: Show this help message and exit.
-r MODULE, --module MODULE: Module of curation [A, C, U, T, E, M]. Required*
-i INPUT_DIR, --input INPUT_DIR: Directory with the files required to do the curation (REPET output directory). Required*
-g GENOME, --genome GENOME: Genome used to detect the TEs. Required*
-o OUTPUTDIR, --output OUTPUTDIR: Path to the output directory. Required*
--te_aid TE_AID: Do you want to use TE-aid? [Y or N]. Default=Y.
-a AUTOMATIC: Level of automation: F: fully automated, S: semi-automated, M: fully manual. Default=F.
-n PROJ_NAME: REPET project name. Required for repet input*
-t CORES: Cores to execute some steps in parallel. Default=all available cores.
-m REF_LIBRARY_UNCLASSIFIED_MODULE: Path to the sequences to be used as references in the unclassified module.
-v VERBOSE: Verbose? [Y or N]. Default=N.
--input_type INPUT_TYPE: Input type: fasta or REPET.
-l USER_LIBRARY: User-defined library to be used with input type fasta.
-b BUSCO_LIBRARY: Reference/BUSCO genes to filter out TEs (HMM format required).
-z MINBLASTHITS: Minimum number of blast hits to process an element.
-c MINFULLLFRAGMENTS: Minimum number of full-length fragments to process an element. Default=1.
-s PERC_SSR: Maximum length covered by simple repeats (as a percentage between 0 and 100) allowed for a TE not to be removed. Default=60.
-e EXT_NUCL: Number of nucleotides to extend each side of the element. Default=500.
-x NUM_ITE: Number of iterations to extend the elements. Default=16.
--version: Show program's version number and exit.
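If you launch MCHelper from your own scripts, it can help to assemble the command line in one place and check the input-specific flags before starting a long run. The sketch below is not part of MCHelper: build_mchelper_command is a hypothetical helper, and it only uses flags that appear in the list above and in the usage examples.

```python
import shlex


def build_mchelper_command(output_dir, genome, input_type, busco_hmm=None,
                           input_dir=None, proj_name=None, library=None,
                           module="A", automatic="F", cores=None):
    """Assemble an MCHelper command line (hypothetical helper, not MCHelper code)."""
    if input_type == "repet":
        # REPET input needs the REPET output directory and the project name.
        if not input_dir or not proj_name:
            raise ValueError("repet input needs -i (input directory) and -n (project name)")
        source = ["-i", input_dir, "-n", proj_name]
    elif input_type == "fasta":
        # Fasta input needs the user-defined TE library.
        if not library:
            raise ValueError("fasta input needs -l (TE library)")
        source = ["-l", library]
    else:
        raise ValueError("input_type must be 'repet' or 'fasta'")

    cmd = ["python3", "MCHelper.py", "-r", module, *source,
           "-o", output_dir, "-g", genome,
           "--input_type", input_type, "-a", automatic]
    if busco_hmm:
        cmd += ["-b", busco_hmm]
    if cores is not None:
        cmd += ["-t", str(cores)]
    return cmd


if __name__ == "__main__":
    cmd = build_mchelper_command(
        output_dir="path/to/MCHelper_output",
        genome="path/to/genome",
        input_type="fasta",
        library="path/to/TE_library_in_fasta",
        busco_hmm="path/to/reference_genes.hmm",
    )
    # Print a copy-pasteable command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps paths with spaces intact when the command is later passed to subprocess.run.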

MCHelper can be run in three different modes: fully automatic (F), semi-automatic (S), and manual (M). You control this with the parameter -a [F, S or M]. Notice that the fully automatic mode will make all the decisions for you and, at the end, will generate separate outputs for curated and non-curated sequences. In contrast, the semi-automatic mode runs the structural check and allows the user to inspect the consensus sequences that do not fit the structural requirements. The manual mode does not run the structural check and sends all the consensus sequences to manual inspection.

MCHelper is a modular pipeline (see figure below), which can be run in an integrated way or module by module. You can control this with the -r or --module parameter, indicating which of the four modules you want to run. If you want to run the whole pipeline, select -r A. Otherwise, if you want to run just one of them, select the letter corresponding to the module: consensus extension module=E (Figure A), manual inspection module=M (Figure B), and TE classification module=U (Figure C). You can also run only TE-Aid in parallel using the parameter -r T.

Inputs

The input files required by MCHelper depend on the tool you used to create the TE library. If you used REPET, then you will need the following files:

the genome assembly
the library created by the TEdenovo pipeline. This library is named as projName_refTEs.fa, where projName is the name of your own REPET project.
the table with features created by PASTEC, which is normally named projName_denovoLibTEs_PC.classif. Again, projName is the name of your own REPET project.
a folder containing coverage plots created with the REPET tool "plotCoverage.py". This folder must be named "plotCoverage" and must be placed in the input folder specified in the -i parameter.
a folder containing the gff files generated by the REPET tool "CreateGFF3sForClassifFeatures.py". This folder must be named "gff_reversed" and must be placed in the input folder specified in the -i parameter.
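Before starting a run with REPET input, you may want to confirm that the input folder is complete. The sketch below checks for the items listed above. It is not part of MCHelper: missing_repet_inputs is a hypothetical helper, and it assumes that the library and the classif table sit in the same input folder as the "plotCoverage" and "gff_reversed" folders.

```python
from pathlib import Path


def missing_repet_inputs(input_dir, proj_name):
    """Return the expected REPET inputs that are absent from input_dir.

    Assumption: all four items live directly inside input_dir.
    """
    input_dir = Path(input_dir)
    expected_files = [
        input_dir / f"{proj_name}_refTEs.fa",
        input_dir / f"{proj_name}_denovoLibTEs_PC.classif",
    ]
    expected_dirs = [
        input_dir / "plotCoverage",
        input_dir / "gff_reversed",
    ]
    missing = [str(p) for p in expected_files if not p.is_file()]
    missing += [str(p) for p in expected_dirs if not p.is_dir()]
    return missing


if __name__ == "__main__":
    # Paths taken from the Testing section; adjust to your own project.
    problems = missing_repet_inputs("Test_dir/repet_input", "Dmel")
    if problems:
        print("Missing inputs:")
        for item in problems:
            print("  " + item)
    else:
        print("All expected REPET inputs were found.")
```

The genome assembly is passed separately with -g, so it is not checked here.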

If you used any other tool that generates TE libraries in fasta format, then you will need the following:

the genome assembly
the library created by the tool.

In the latter case, MCHelper will itself compute the information required for the curation process. This information includes:

how many full-length copies and fragments each consensus has
structural features such as terminal repeats and coding domains
BLASTn, BLASTx, and tBLASTx searches against TE databases

Outputs

Outputs generated by MCHelper depend on the modules chosen to run. The tool will create an independent folder for each of the Classified and Unclassified modules. Inside, it will save some temporary as well as final files. The final processed sequences will be stored in the files named "curated_sequences_NR.fa" (non-redundant version) and "curated_sequences_R.fa" (redundant version).

The rest of the files are the following:

ClassifiedModule folder
- MSA_Plots: Folder containing the MSA graphs generated by CIAlign (F and S modes).
- MSA_seeds: Folder containing the MSA files generated by CIAlign (F and S modes). Those files can be used to visualize the MSA and also to construct HMMs.
- te_aid: Folder containing the TE-Aid plots generated. They are used by MCHelper in the manual inspection step (S and M modes), but can also be useful when the user wants to check a specific TE.
- cons_flf.fa: File containing only the TEs with more than a certain threshold of full-length fragments in the genome. This threshold is set with the -c parameter. A copy is considered a full-length fragment when it covers at least 94% of the consensus length.
- denovoLibTEs_PC.classif: Tabular file containing coding and structural information of each consensus.
- fullLengthFrag.txt: Tabular file containing information about the number of fragments and full-length fragments.
- input_to_unclassified_module_seqs.fa: Sequences that will be used in the Unclassified module (only when -r A is selected).
- kept_seqs_classified_module_curated.fa: File containing the sequences considered to belong to complete TEs.
- kept_seqs_classified_module_non_curated.fa: File containing the sequences considered incomplete, fragmented, or that do not satisfy the structural conditions.
- kept_seqs_classified_module.fa: File containing the sequences that have been kept in the module. It is a merge of the two previously described files (kept_seqs_classified_module_curated.fa and kept_seqs_classified_module_non_curated.fa). At the end, MCHelper will join this file with kept_seqs_unclassified_module.fa to create the curated_sequences_R.fa final file.
UnclassifiedModule folder
- MSA_plots: Folder containing the MSA graphs generated by CIAlign.
- MSA_seeds: Folder containing the MSA files generated by CIAlign. Those files can be used to visualize the MSA and also to construct HMMs.
- cons_flf.fa: File containing only the TEs with more than a certain threshold of full-length fragments in the genome. This threshold is set with the -c parameter. A copy is considered a full-length fragment when it covers at least 94% of the consensus length.
- denovoLibTEs_PC.classif: Tabular file containing coding and structural information of each consensus.
- extended_cons.fa: File containing the extended TEs.
- kept_seqs_unclassified_module.fa: File containing the sequences that have been kept in the module. At the end, MCHelper will join this file with kept_seqs_classified_module.fa to create the curated_sequences_R.fa final file.
- new_user_lib.fa: Intermediate file containing the formatted sequences needed to run MCHelper.
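
To make the cons_flf.fa criterion concrete, the sketch below applies the 94% rule described above to a list of copy lengths. It is an illustration only, not MCHelper's actual code: the function names are hypothetical, coverage is assumed to be simply copy length divided by consensus length, and the -c threshold is interpreted as a minimum count (its default is 1).

```python
def is_full_length(copy_length, consensus_length, min_coverage=0.94):
    """True when a copy covers at least min_coverage of the consensus length."""
    if consensus_length <= 0:
        raise ValueError("consensus_length must be positive")
    return copy_length / consensus_length >= min_coverage


def passes_flf_filter(copy_lengths, consensus_length, min_full_length=1):
    """True when enough copies are full-length (min_full_length mirrors -c)."""
    n_full = sum(is_full_length(n, consensus_length) for n in copy_lengths)
    return n_full >= min_full_length


if __name__ == "__main__":
    consensus_length = 5000
    copies = [4800, 4500, 1200]  # made-up copy lengths in bp
    for length in copies:
        print(length, is_full_length(length, consensus_length))
    print("kept:", passes_flf_filter(copies, consensus_length))
```

In this made-up example only the 4800 bp copy reaches 94% of the 5000 bp consensus (4800/5000 = 96%), which is enough to keep the consensus when the minimum is one full-length fragment.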

Citation

If you use this software, please cite us as follows: Orozco-Arias, S., Sierra, P., Durbin, R., Gonzalez, J. (2023). MCHelper automatically curates transposable element libraries across species. bioRxiv. https://doi.org/10.1101/2023.10.17.562682


mchelper's Issues

makeblastdb

Hello,
I have a problem when setting up the database in the "makeblastdb -in allDatabases.clustered_rename.fa -dbtype nucl" step.

Could you please help me fix it?

The error:

Building a new DB, current time: 02/08/2024 15:59:14
New DB name: /mnt/beegfs/mnt/work/bioinfo/home/dar21/MCHelper/db/allDatabases.clustered_rename.fa
New DB title: allDatabases.clustered_rename.fa
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /mnt/beegfs/mnt/work/bioinfo/home/dar21/MCHelper/db/allDatabases.clustered_rename.fa
Keep MBits: T
Maximum file size: 1000000000B
Bus error (core dumped)

Deprecation flags

There appears to be a deprecated function within the code. When running, this warning is thrown repeatedly, although the code still appears to run.

bin/MCHelper/MCHelper.py:1661: FutureWarning: In a future version, object-dtype columns with all-bool values will not be included in reductions with bool_only=True. Explicitly cast to bool dtype instead.
fasta_table_ite = pd.concat([fasta_table_ite, pd.DataFrame({"seq": name, "cons": s, "cons_size": [len(s)], "class": te_class, "subfamilies": [num_subfamilies], "end_l": end_l_te, "end_r": end_r_te})], ignore_index=True)
bin/MCHelper/MCHelper.py:1875: FutureWarning: In a future version, object-dtype columns with all-bool values will not be included in reductions with bool_only=True. Explicitly cast to bool dtype instead.
fasta_table_ite = pd.concat([fasta_table_ite, pd.DataFrame({"seq": fasta_table.loc[index, "seq"], "cons": fasta_table.loc[index, "cons"], "cons_size": [fasta_table.loc[index, "cons_size"]], "class": fasta_table.loc[index, "class"], "subfamilies": fasta_table.loc[index, "subfamilies"], "end_l": True, "end_r": True})], axis=0, ignore_index=True)

'ref_tes' is not defined

Thanks for this effort. I'm trying to annotate a TE library that I got from several de novo TE pipelines.

Here is my command:
python Programs/MCHelper/MCHelper.py
-r E
-g $Reference_genome
-o $Out_working_directory
-l $Database #of TE in_multifasta_format
-t 2

and I get the error:

Traceback (most recent call last):
File "Programs/MCHelper/MCHelper.py", line 2935, in <module>
run_extension_by_saturation_parallel(genome, ref_tes, ext_nucl, num_ite,
NameError: name 'ref_tes' is not defined

What am I missing?

Questions about process files

Hi, thank you very much for developing this software. I have some questions about its operation that I would like to ask.

My command is "python3 ~/MCHelper/MCHelper.py -r A -t 20 -l Eb_families.fa -o Eb.fasta_output -g Eb.genome.chr.v3.fasta --input_type fasta -b ./mammalia_odb10.hmm -a F "

It has been running for two days, but the output directory only contains a few files and a "classifiedModule" folder with some empty files. Is something wrong? Thank you for your help!
18M Apr 21 17:07 candidate_tes.fa
20 Apr 23 15:52 classifiedModule
5.8M Apr 21 17:10 non_redundant_lib.fa
1.9K Apr 21 17:07 sequences_with_problems.txt

classifiedModule:
0 Apr 22 10:15 Chr10:4533629..4534041_LTR.copies.fa
0 Apr 21 23:18 Chr1:32470769..32471133_LTR.copies.fa
0 Apr 22 06:29 Chr14:17413840..17414191_LTR.copies.fa
0 Apr 21 23:07 Chr17:10292218..10292603_LTR.copies.fa
0 Apr 21 17:19 Chr20:31121217..31121594_LTR.copies.fa
0 Apr 22 06:26 Chr2:33735733..33736113_LTR.copies.fa
0 Apr 21 17:19 Chr26:17425524..17425860_LTR.copies.fa
0 Apr 21 17:52 Chr3:126413529..126413871_LTR.copies.fa
0 Apr 22 23:51 Chr4:135925862..135926222_LTR.copies.fa
0 Apr 22 02:06 Chr7:4387625..4388046_LTR.copies.fa
0 Apr 23 07:09 rnd-1_family-265.copies.fa
0 Apr 22 05:44 rnd-1_family-270.copies.fa
0 Apr 21 17:19 rnd-1_family-351.copies.fa
0 Apr 21 17:19 rnd-1_family-505.copies.fa
0 Apr 22 15:58 rnd-4_family-1061.copies.fa
0 Apr 23 14:17 rnd-5_family-1512.copies.fa
0 Apr 23 13:23 rnd-6_family-3870.copies.fa
0 Apr 21 17:19 rnd-6_family-779.copies.fa
0 Apr 23 15:52 scaffold1:18121064..18121635_LTR.copies.fa
0 Apr 22 08:47 scaffold28:5584399..5584726_LTR.copies.fa

ReadMe Installation Errors.

The instructions for the installation are wrong due to unintended capitalizations in the paths. I fixed this with a fork and pull request but was ignored. The git clone command saves the directory as mchelper , yet the following commands reference a capitalized MCHelper directory.

conda env create -f MCHelper/MCHelper.yml
cd MCHelper/db

These should be changed to:
conda env create -f mchelper/MCHelper.yml
cd mchelper/db

or the repository should be changed to MCHelper from mchelper

Thanks for the awesome tool! Great work :)

BLAST error preventing MCHelper from completing

Hi,

I'm trying to run MCHelper on a RepeatModeler library. However, there's what appears to be a BLAST error in the step directly after filtering out SSRs and genes. I'm unsure whether this is an error I've made or one in the code. I've pasted the command used to run MCHelper below and the relevant text at the end of the log file.

Thanks,
James

Command:
python3 ~/Programs/MCHelper/MCHelper.py -r A -t 16 -l outputs/earlgrey_rm/Aphidoletes_aphidimyza_consensi.fa.classified -o outputs/mchelper -g genome_seq/GCA_030463065.1_ASM3046306v1_genomic.fna --input_type fasta -b data/busco/diptera_odb10.hmm -v Y --te_aid N -a F

Log:

MESSAGE: The library was reduced to 747 after SSR, genes and RNA filtering [830.2472195625305 seconds]
USAGE
  blastn [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-negative_seqidlist filename]
    [-taxids taxids] [-negative_taxids taxids] [-taxidlist filename]
    [-negative_taxidlist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-perc_identity float_value] [-qcov_hsp_perc float_value]
    [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value]
    [-sum_stats bool_value] [-penalty penalty] [-reward reward] [-no_greedy]
    [-min_raw_gapped_score int_value] [-template_type type]
    [-template_length int_value] [-dust DUST_options]
    [-filtering_db filtering_database]
    [-window_masker_taxid window_masker_taxid]
    [-window_masker_db window_masker_db] [-soft_masking soft_masking]
    [-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]
    [-best_hit_score_edge float_value] [-subject_besthit]
    [-window_size int_value] [-off_diagonal_range int_value]
    [-use_index boolean] [-index_name string] [-lcase_masking]
    [-query_loc range] [-strand strand] [-parse_deflines] [-outfmt format]
    [-show_gis] [-num_descriptions int_value] [-num_alignments int_value]
    [-line_length line_length] [-html] [-sorthits sort_hits]
    [-sorthsps sort_hsps] [-max_target_seqs num_sequences]
    [-num_threads int_value] [-remote] [-version]

DESCRIPTION
   Nucleotide-Nucleotide BLAST 2.10.1+

Use '-help' to print detailed descriptions of command line arguments
========================================================================

Error: Too many positional arguments (1), the offending value:
Error:  (CArgException::eSynopsis) Too many positional arguments (1), the offending value:
MESSAGE: TE Feature table was created [196.02899551391602 seconds]
Traceback (most recent call last):
  File "/ceph/users/jgalbraith/Programs/MCHelper/MCHelper.py", line 2767, in <module>
    new_module1(plots_dir, new_ref_tes, gff_files, outputdir, proj_name, te_aid, automatic, minDomLTR,
  File "/ceph/users/jgalbraith/Programs/MCHelper/MCHelper.py", line 1275, in new_module1
    keep_seqs, orders = run_blast(library_path, ref_tes_bee, cores, 80, 80, 80)
  File "/ceph/users/jgalbraith/Programs/MCHelper/MCHelper.py", line 2336, in run_blast
    blastresult = open(ref_tes + ".blast", "r").readlines()
FileNotFoundError: [Errno 2] No such file or directory: '/data/ross/misc/analyses/earlgrey_testrainer_testing/outputs/mchelper/classifiedModule//putative_TEs.fa.blast'

How to obtain or build Reference/BUSCO genes

Hi,

I have a new genome assembly that has been annotated, with all the annotation gff and fasta files available. How can I create the Reference/BUSCO genes in HMM format to pass to the -b option?
