MISPhyl (Mutual Information Supermatrix Phylogenomics) Pipeline

Phylogenomic analyses are commonly used to answer research questions about the relationships among species and events on Earth. Although phylogenomic tree reconstruction is widely used in research, many researchers still find its multi-step technical procedure and data handling challenging, especially with genomic data. Here we present MISPhyl, a user-friendly pipeline that uses a supermatrix-based procedure to produce a phylogenomic tree. A supermatrix phylogenomic tree aims to amplify phylogenetic signal, but it can also carry phylogenetic noise into the tree reconstruction. To address this issue, the automated pipeline implements a Mutual Information (MI) approach that systematically selects genes with optimal phylogenetic signal for phylogenomic inference. The MI approach has previously been discussed for its ability to generate a reliable phylogenomic tree and identify species-specific markers in the Mycobacterium abscessus complex (Tan et al., 2013).

Software and Script
Prerequisites
Getting Started
Options
Main Output Folders
Highlight

Software and Script

All the dependencies needed for the script are included within the tarball file.

ProteinOrtho: 6.0.24 (Perl: 5 version 32, Python: 3.8.3, BLAST: 2.9.0-2, Diamond: 2.0.4)
Pal2Nal: v14.1
Mafft: v7.490
Mutual Information script: R: 3.6.3
ModelTest-NG-static: 0.1.7 (Please note that the modeltest-ng-static binary file relies on compatible hardware)
RAxML-NG: 1.0.1

Prerequisites

A 64-bit Linux system is required. To run this script, you need:

Perl v5.08 or higher (test by typing "perl -v" in terminal)
Python v3.0 or higher (test by typing "python -V" in terminal)
Biopython module

	$ sudo apt install python3-pip
	$ pip3 install biopython

R language (test by typing "Rscript --version" in terminal)
R seqinr and parallel libraries

	$ sudo apt-get update -y
	$ sudo apt-get install -y r-cran-seqinr

If you do not have sudo rights, please contact your system administrator.
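
The prerequisites above can be checked before the first run. The following is a minimal sketch, not part of MISPhyl: the function names `missing_tools` and `has_biopython` are hypothetical, and the check covers only the executables and the Biopython module, not the R libraries.

```python
import importlib.util
import shutil


def missing_tools(tools=("perl", "python3", "Rscript")):
    """Return the executables from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def has_biopython():
    """True if the Biopython package (imported as `Bio`) is installed."""
    return importlib.util.find_spec("Bio") is not None


# Report what is missing on this machine.
print("Missing executables:", missing_tools() or "none")
print("Biopython installed:", has_biopython())
```

An empty list from `missing_tools()` only shows that the programs are on PATH; the version requirements listed above still need to be confirmed with the commands given there.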

Getting Started

How to run this program? [CRITICAL]

Users must change every sequence header/description in all input files so that it corresponds to the species.
Format:
>[speciesName]_[accessionID]....

Users can use the renameInput.py script provided in dependencies/ to rename their files in the working child directory. Ensure there is NO underscore ('_') within your species name or accession ID.

	$ python3 ./dependencies/renameInput.py
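
To see which headers still need renaming, a small check along these lines may help. It is an illustrative sketch and does not reproduce what renameInput.py does; `split_header` and `bad_headers` are hypothetical names. It inspects only the first whitespace-delimited token of each header and splits it at the first underscore, so it does not flag additional underscores.

```python
def split_header(header):
    """Split a FASTA header of the form '>species_accession ...'.

    Returns (species, accession). Only the first whitespace-delimited
    token is inspected; the split is made at the first underscore.
    """
    if not header.startswith(">"):
        raise ValueError(f"not a FASTA header: {header!r}")
    parts = header[1:].split()
    token = parts[0] if parts else ""
    species, sep, accession = token.partition("_")
    if not sep or not species or not accession:
        raise ValueError(f"expected >[speciesName]_[accessionID]: {header!r}")
    return species, accession


def bad_headers(lines):
    """Return the header lines that do not match the expected format."""
    bad = []
    for line in lines:
        if line.startswith(">"):
            try:
                split_header(line.strip())
            except ValueError:
                bad.append(line.strip())
    return bad
```

Calling `bad_headers(open(path))` on an input file returns the headers to fix; an empty list means every header has a species name and an accession ID separated by an underscore.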

Ensure there are no invalid characters in your input files.
If you need help, removeInvalidCharacter.py is provided in the dependencies/ folder.
For codon-based alignment, ENSURE:
1. The same IDs are used in both protein and nucleotide input files.
2. Amino acid files are in the main directory inputfolder/ and nucleotide files are in the directory ntfolder/.
3. The value of option -i is set to "aa".
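
The first requirement (same IDs in the protein and nucleotide files) can be checked with a short script. This is a sketch under the assumption that the ID is the first whitespace-delimited token after '>'; `fasta_ids` and `id_mismatches` are hypothetical helpers, not MISPhyl functions, and the sequences are toy data.

```python
def fasta_ids(text):
    """Return the set of sequence IDs (first token after '>') in FASTA text."""
    ids = set()
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def id_mismatches(protein_text, nucleotide_text):
    """Return (IDs only in protein, IDs only in nucleotide), each sorted."""
    aa = fasta_ids(protein_text)
    nt = fasta_ids(nucleotide_text)
    return sorted(aa - nt), sorted(nt - aa)


# Toy data: two protein records and their matching nucleotide records.
protein = ">H.sapiens_ACE1180\nACDACDACD\n>H.sapiens_ACD12739\nACDDCACDDC\n"
nucleotide = (
    ">H.sapiens_ACE1180\nGCUUGUGAUGCUUGUGAUGCUUGUGAU\n"
    ">H.sapiens_ACD12739\nGCUUGUGAUGAUUGUGCUUGUGAUGAUUGU\n"
)
print(id_mismatches(protein, nucleotide))  # ([], []) when every ID is paired
```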

Run all steps from 1 to 4

Users are required to create a folder inputfolder/.

	$ mkdir inputfolder

Put the files in folder inputfolder/.
In your current directory, run the pipeline script. E.g.:

	$ python3 MISPhyl.py [option]

a) Run in default mode, which accepts amino acid sequences as input and uses Diamond as the BLAST program

	$ python3 MISPhyl.py -f faa -i aa

b) Run with nucleotide input files, the blastn program and mutual information mode ON (ensure blastn is installed on your system)

	$ python3 MISPhyl.py -f fna -i nt -algo blastn

c) Run codon-based alignment with mutual information mode ON. Amino acid files with the .faa file extension go in inputfolder/ and nucleotide files in ntfolder/.

	$ python3 MISPhyl.py -codon -i aa -f faa

Note: For codon alignment, each AA file and its corresponding NT file MUST have the same filename; the file extensions need not match.
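
The filename rule in the note can be verified by comparing names with their extensions removed. A minimal sketch, assuming plain lists of filenames as input; `unpaired_stems` is a hypothetical helper and not part of the pipeline.

```python
from pathlib import Path


def unpaired_stems(aa_names, nt_names):
    """Compare filenames with the final extension removed.

    Returns (stems only among the AA files, stems only among the NT files),
    each sorted alphabetically.
    """
    aa = {Path(name).stem for name in aa_names}
    nt = {Path(name).stem for name in nt_names}
    return sorted(aa - nt), sorted(nt - aa)
```

For example, `unpaired_stems(os.listdir("inputfolder"), os.listdir("ntfolder"))` (after `import os`) returns two empty lists when every amino acid file has a nucleotide counterpart.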

Step 1 : ProteinOrtho & (optional) Codon Alignment

Users are required to create a folder inputfolder/.

	$ mkdir inputfolder

Put the files in folder inputfolder/.
Example:
a) Run step 1 with quiet mode ON and prefix for proteinortho as "project1"

	$ python3 MISPhyl.py -s 1 -p project1 -f fa -i aa

The procedure differs slightly if codon-based alignment is ENABLED. ENSURE:
i) Same IDs are used in both protein and nucleotide input files

Example Condition	Protein	Nucleotide
Same ID	>H.sapiens_ACE1180 ACDACDACD >H.sapiens_ACD12739 ACDDCACDDC	>H.sapiens_ACE1180 GCUUGUGAUGCUUGUGAUGCUUGUGAU >H.sapiens_ACD12739 GCUUGUGAUGAUUGUGCUUGUGAUGAUUGU
Same Tag	>H.sapiens_ACE80_1 ACDACDACD >H.sapiens_ACD12739_2 ACDDCACDDC	>H.sapiens_ACE1180_1 GCUUGUGAUGCUUGUGAUGCUUGUGAU >H.sapiens_ACDS2_2 GCUUGUGAUGAUUGUGCUUGUGAUGAUUGU

ii) Amino acid files are in the main directory inputfolder/ and nucleotide files are in the directory ntfolder/.
iii) Running step 1 with codon alignment ENABLED automatically continues through step 2 (MSA).
Example:
a) Run all steps with codon alignment and mutual information mode ON.

	$ python3 MISPhyl.py -f fasta -i aa -codon -cpus 4

b) Run step 1 to step 2 with codon alignment and mutual information mode ON

	$ python3 MISPhyl.py -f fasta -i aa -codon -cpus 4 -s 1

Step 2 : Multiple Sequence Alignment

Make a directory orthologFamily/.

	$ mkdir orthologFamily/

Move your .fasta files into folder orthologFamily/.
Example:
a) Run step 2 with the mafft program

	$ python3 MISPhyl.py -s 2

Step 3 : (Optional) Mutual Information & Concatenation

Make a directory msa/.

	$ mkdir msa

Move aligned files into folder msa/.
Example:
a) Mutual information mode ON with 10 median-ranked genes and an aligned output file named aligned.fa

	$ python3 MISPhyl.py -s 3 -msa aligned.fa -r 10

b) Step 3 with codon alignment and mutual information mode ON

	$ python3 MISPhyl.py -s 3
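
For readers unfamiliar with supermatrix concatenation, the idea behind this step can be sketched as follows. This is not MISPhyl's implementation: `concatenate` is a hypothetical function, the model name "GTR+G" is a placeholder, and the partition lines follow the "model, name = start-end" convention used by RAxML-NG, which may differ in detail from the partition.txt the pipeline writes.

```python
def concatenate(alignments, model="GTR+G"):
    """Join per-gene alignments into one supermatrix.

    alignments: dict gene name -> dict species -> aligned sequence.
    Every gene must contain the same species, and all sequences
    within a gene must have the same length.
    Returns (dict species -> concatenated sequence, list of partition lines).
    """
    genes = list(alignments)
    if not genes:
        return {}, []
    species = sorted(alignments[genes[0]])
    pieces = {name: [] for name in species}
    partitions = []
    start = 1  # partition coordinates are 1-based and inclusive
    for gene in genes:
        block = alignments[gene]
        if sorted(block) != species:
            raise ValueError(f"{gene}: species set differs from the first gene")
        lengths = {len(seq) for seq in block.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not the same length")
        end = start + lengths.pop() - 1
        for name in species:
            pieces[name].append(block[name])
        partitions.append(f"{model}, {gene} = {start}-{end}")
        start = end + 1
    return {name: "".join(parts) for name, parts in pieces.items()}, partitions
```

Each partition line records where one gene sits inside the concatenated alignment, which is what allows a separate substitution model per gene in the tree construction step.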

Step 4 : Model Selection & Tree Construction

Put your MSA file in the current directory.
Ensure your partition.txt file is in the current directory. If you run all the steps from 1 to 4, you need not worry about this: partition.txt is produced in step 3.
Example:
a) Run step 4 with input file MSA.fa, nucleotide input, output file prefix "tree1", 2 CPUs, 250 bootstrap replicates and a partition file named "partition.txt".
```
$ python3 MISPhyl.py -s 4 -x tree1 -cpus 2 -b 250 -msa MSA.fa -i nt -partition partition.txt
```

Options

Argument	Type	Default	Description
-h	N/A	N/A	show help message
-s	int	0 (all)	select step to be run 0:all (from step 1 to 4) 1:proteinortho 2:msa(muscle/mafft) 3:mutual information & concatenation 4:model selection & tree construction (raxml-ng)
-cpus	int	-1 (all available)	number of cpu / threads to be utilized
-f	string	faa	input file extension {fasta,faa,fna,fa}
-i	string	N/A	type of input sequences {protein:aa / nucleotide:nt}
-mi	N/A	ON	mutual information : select genes with optimal phylogenetic signal for phylogenomic inference
-r	int	50	number of median-ranked range genes in MI_genes.csv to be concatenated
-p	string	myproject	prefix for proteinOrtho resulting file names
-codon	N/A	OFF	codon based alignment, translate protein alignments to nucleotide alignments
-algo	string	protein [diamond] nucleotide [blastn]	blast program available for proteinOrtho
-path	string	./dependencies/	binpath for proteinOrtho blast program selection
-msa	string	MSA.fa	multiple sequence alignment output filename in FASTA format
-maxiter	int	0	number of maximum iterations in mafft
-partition	string	partition.txt	partition filename
-n	string	modeltest	modelTest-NG output file prefix
-x	string	T1	prefix for raxml-ng output files
-model	string	bic	model selection for tree construction {bic,aic,aicc}
-b	int	500	number of bootstrap replicates for raxml-ng
-t	N/A	ON	Have minimum of four input files; required for tree construction
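
The -t requirement of at least four input files can be confirmed before starting a run. A minimal sketch, assuming that input files are recognised by the extension passed to -f; `count_inputs` and `enough_for_tree` are hypothetical helpers.

```python
from pathlib import Path


def count_inputs(filenames, extension):
    """Count names whose final extension equals `extension` (e.g. 'faa')."""
    suffix = "." + extension.lstrip(".")
    return sum(1 for name in filenames if Path(name).suffix == suffix)


def enough_for_tree(filenames, extension, minimum=4):
    """True if at least `minimum` input files carry the given extension."""
    return count_inputs(filenames, extension) >= minimum
```

For example, `enough_for_tree(os.listdir("inputfolder"), "faa")` (after `import os`) reports whether inputfolder/ holds enough .faa files.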

Main Output Folders

Step 1 Folders:

orthologFamily/: core orthologous proteins/genes
nt_orthologFamily/: corresponding core orthologous nucleotides (codon alignment)
codonAlignment/: codon aligned nucleotides

Step 2 Folder:

msa/: multiple sequence alignment files

Step 3 Files:

MSA.fa: concatenated MSA file
MI_genes.csv: Mutual Information file
partition.txt
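
The -r option selects a number of median-ranked genes from MI_genes.csv. The exact selection rule and the column layout of MI_genes.csv are not described here, so the sketch below shows only one plausible reading: rank the genes by score and take a window of r genes centred on the median rank. The function `median_window` and the in-memory score dictionary are assumptions made for illustration.

```python
def median_window(scores, r):
    """Return up to r gene names centred on the median rank.

    scores: dict gene name -> numeric score (hypothetical input).
    Genes are ranked by ascending score; ties are broken by name.
    """
    if r <= 0:
        return []
    ranked = sorted(scores, key=lambda gene: (scores[gene], gene))
    r = min(r, len(ranked))
    start = (len(ranked) - r) // 2
    return ranked[start:start + r]
```

When r is at least the number of genes available, every gene is returned in rank order.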

Step 4 Folder:

treeConstruction/: constructed tree files

Highlight

If you encounter this error when reaching the tree construction step:

ERROR: modeltest-ng-static binary file seems to be not compatible with your hardware. But no worries.
There are two recommended ways to solve this issue: 
1. Download the source files from modeltest-ng github 'https://github.com/ddarriba/modeltest/wiki/Download-and-Install' and run the partition file using modeltest-ng instead of modeltest-ng-static AND comment out the try and except block in MISPhyl.py script (line 495 to 499). Rerun step 4, tree construction.
2. Comment out the try and except block in MISPhyl.py script (line 495 to 499) AND stick to one substitution model for all the genes (kindly make use of the argument '-model' to provide the wanted substitution model). Rerun tree construction, step 4.

Please follow the suggested ways to resolve it. Have fun!

References

Buchfink, B. et al. (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods, 18, 366-368. https://doi.org/10.1038/s41592-021-01101-x
Darriba, D. et al. (2020). ModelTest-NG: A New and Scalable Tool for the Selection of DNA and Protein Evolutionary Models. Molecular Biology and Evolution, 37, 291-294. https://doi.org/10.1093/molbev/msz189
Katoh, K. et al. (2002). MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30, 3059-3066. https://doi.org/10.1093/nar/gkf436
Kozlov, A. M. et al. (2019). RAxML-NG: A fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics, 35, 4453-4455. https://doi.org/10.1093/bioinformatics/btz305
Lechner, M. et al. (2011). Proteinortho: Detection of (Co-)orthologs in large-scale analysis. BMC Bioinformatics, 12, 124. https://doi.org/10.1186/1471-2105-12-124
Suyama, M. et al. (2006). PAL2NAL: Robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Research, 34, W609-W612. https://doi.org/10.1093/nar/gkl315
Tan, J. L. et al. (2013). A phylogenomic approach to bacterial subspecies classification: Proof of concept in Mycobacterium abscessus. BMC Genomics, 14, 879. https://doi.org/10.1186/1471-2164-14-879
