heche-psb / wgd

This project forked from arzwa/wgd

wgd v2: a suite of tools to uncover and date ancient polyploidy and whole-genome duplication

Home Page: https://wgdv2.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Python 100.00%

genomics paleobiology timing wgd

wgd's Introduction

`wgd v2`: a suite of tools for WGD inference and timing

Hengchi Chen, Arthur Zwaenepoel, Yves Van de Peer

Bioinformatics & Evolutionary Genomics Group, VIB-UGent Center for Plant Systems Biology

wgd v2 is a Python package, upgraded from the original wgd package, for the inference and timing of ancient whole-genome duplication (WGD) events. This documentation illustrates the principles and usage of wgd v2. Below we first introduce the scope and mechanism of wgd v2, and then give practical information on installation and usage. An exemplar workflow in the tutorial section shows how to seek evidence for a putative WGD event and date it properly with a freshly obtained genome assembly in hand. For more theoretical detail, we recommend our paper and book chapter, which give fuller descriptions and discussion. The key improved features of wgd v2 are demonstrated in our latest manuscript in Bioinformatics. If you use wgd v2 in your research, please cite us as suggested in the Citation section.

Introduction

Polyploidization, the evolutionary process in which the entire genome of an organism is duplicated, also known as whole-genome duplication (WGD), occurs recurrently across the tree of life. There are two modes of polyploidization: autopolyploidization and allopolyploidization. Autopolyploidization is the duplication of a single genome, resulting in two identical subgenomes at the time of origin. Allopolyploidization is normally achieved in two steps: first, hybridization between two different species gives rise to a transient homoploid hybrid; second, duplication of the homoploid genome gives rise to the allopolyploid. Owing to instability and unbalanced tetrasomic inheritance, for instance nuclear-cytoplasmic incompatibility, the polyploid genome then undergoes a process called diploidization, also known as fractionation, during which a large portion of gene duplicates is lost and only a fraction is retained. The traces of polyploidization can thus be unearthed from these retained gene duplicates. Three approaches based on gene duplicates, namely the K_S method, the gene tree - species tree reconciliation method and the synteny method, are commonly used to detect evidence for WGDs. The gene tree - species tree reconciliation method is outside the scope of wgd v2, but we refer interested readers to WHALE, the phylogenomic program developed by Arthur Zwaenepoel, and the associated paper for more technical and theoretical details.

The K_S method rests on a model of gene family evolution in which each gene family evolves via gene duplication and loss. Note that a gene family here is assumed to be the cluster of all genes descended from an ancestral gene in a single genome. Recovering the gene tree of such a gene family informs the timing, that is the age, of gene duplication events. The age referred to here is not in real geological time but in units of evolutionary distance, i.e., the number of substitutions per site. When the evolutionary rate remains approximately constant, the evolutionary distance is expected to be proportional to real evolutionary time. The synonymous distance K_S, the number of synonymous substitutions per synonymous site, is such a candidate: synonymous substitutions do not change the amino acid and are thus regarded as neutral, and according to the neutral theory they should accumulate at a constant rate. Given a model of gene family evolution in which genes duplicate and get lost at fixed rates, one can derive that the probability density function of the K_S age distribution of retained gene duplicates is quasi-exponential: most retained gene duplicates are recently born, with an age near 0, and the number of retained duplicates decays quasi-exponentially with increasing age. Therefore, large-scale gene duplication events such as WGDs, with varied retention rates, leave an age peak on top of this initial age distribution, produced by the burst of gene duplicates within a short time-frame, and this peak can be unveiled by mixture modeling analysis. However, WGDs identified from paralogous K_S age distributions only inform the WGD timing on the time-scale of that specific species, which is not comparable in a phylogenetic context. Only with orthologous K_S age distributions, which shift the object of estimation from paralogues to orthologues and inform the relative timing of speciation events, can we decipher the phylogenetic placement of WGDs after proper rate correction. wgd v2 is a program that helps users construct paralogous and orthologous K_S age distributions and carry out both the identification and the placement of WGDs.
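
The shape of a paralogous K_S age distribution described above can be illustrated with a small simulation. This is a minimal sketch and not part of wgd v2: it assumes an exponential background of small-scale duplicates and a single log-normal burst standing in for a WGD, and every parameter value and function name is illustrative.

```python
import math
import random

def simulate_ks(n_background=5000, n_wgd=1500, decay=1.0,
                wgd_ks=1.2, wgd_sigma=0.15, seed=42):
    """Draw toy K_S ages: exponential background plus one log-normal WGD burst."""
    rng = random.Random(seed)
    # Background duplicates: most are young, counts decay with age.
    background = [rng.expovariate(decay) for _ in range(n_background)]
    # WGD duplicates: born in a short time-frame, so K_S clusters around wgd_ks.
    wgd = [rng.lognormvariate(math.log(wgd_ks), wgd_sigma) for _ in range(n_wgd)]
    return background + wgd

def histogram(values, bin_width=0.1, max_ks=3.0):
    """Count values in equal-width bins over [0, max_ks); larger values are dropped."""
    n_bins = int(round(max_ks / bin_width))
    counts = [0] * n_bins
    for v in values:
        i = int(v / bin_width)
        if i < n_bins:
            counts[i] += 1
    return counts

counts = histogram(simulate_ks())
# Ignore the youngest bins (K_S < 0.5), which the background dominates,
# and report the most populated remaining bin as the WGD-like peak.
peak_bin = max(range(5, len(counts)), key=counts.__getitem__)
print(f"WGD-like peak near K_S = {(peak_bin + 0.5) * 0.1:.2f}")
```

In real data the background and the peak overlap far less cleanly, which is why mixture modeling rather than a simple histogram maximum is used to separate them.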

Given phylogenetically located WGDs, the absolute age (in geological time) of WGDs can also be inferred from WGD-retained gene duplicates, although so far there has been no easy or straightforward pipeline for this job. In wgd v2, we developed a feasible integrated pipeline for absolute dating of WGDs. The pipeline can be roughly divided into three main steps. 1) The construction of the anchor K_S distribution and the delineation of a credible K_S range adopted for phylogenetic dating, using wgd dmd, wgd ksd, wgd syn and wgd peak. Note that here we only consider genome assemblies, because with a transcriptome assembly it is impossible to distinguish WGD-derived duplicates from duplicates derived from small-scale duplication; the latter arise over a continuous time-frame rather than in a single short time-frame, and thus reflect the duration of that branch rather than the time at which the WGD occurred. 2) The formulation of a starting tree used in the phylogenetic dating, composed of a few species and annotated with fossil calibration information. This step is essential for the result of absolute WGD dating, so we suggest that users take great care to ensure a correct tree topology and proper bounds for the fossil calibrations. 3) The construction of orthogroups consisting of collinear duplicates of the focal species and their reciprocal best hits (RBHs) against the other species in the starting tree, followed by phylogenetic dating with a molecular dating program such as mcmctree, via the programs wgd dmd and wgd focus. We recommend the Bayesian molecular dating program mcmctree, which provides a variety of substitution and rate models. Nonetheless, we urge users to set the prior distributions of the different parameters with caution and to ensure adequate sampling of those parameters.
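
As an illustration of step 2, the sketch below writes a calibrated starting tree in the tree-file convention used by PAML's mcmctree (a header line with the number of tips and trees, then a Newick string with quoted '>min<max' bounds). The species names, the calibration ages and the helper `bound` are hypothetical, and exactly what wgd focus expects from the `-sp` file should be checked against the wgd v2 documentation.

```python
def bound(min_age, max_age):
    """Format a minimum/maximum calibration in the '>min<max' style read by mcmctree."""
    return f"'>{min_age:.4f}<{max_age:.4f}'"

# Hypothetical three-species tree; ages are in units of 100 million years,
# a common choice for mcmctree, and the values are made up for illustration.
newick = (
    f"((focal_sp,sister_sp){bound(1.10, 1.30)},"
    f"outgroup_sp){bound(1.25, 2.47)};"
)

# In a Newick string the number of tips is always one more than the number of commas.
n_tips = newick.count(",") + 1
tree_text = f"{n_tips} 1\n{newick}\n"
print(tree_text)
```

Keeping the calibrations in one small helper makes it easy to review every bound against its fossil source before a long dating run.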

Installation

The easiest way to install wgd v2 is from PyPI. Note that if you want the latest update, we suggest installing from source, since updates on PyPI lag behind the source repository. To install a stable, well-tested version, we currently recommend version 2.0.38. We suggest installing numpy version 1.19.0 before wgd, because pip does not resolve the requirements well and the installation order of the dependencies can lead to incompatibilities. Python versions 3.6.5 and 3.8.0 (and nearby versions) are well tested and compatible. We strongly recommend creating a virtual environment whether you install from source, PyPI or bioconda, which can be achieved with the commands below.

git clone https://github.com/heche-psb/wgd
cd wgd
virtualenv -p=python3 ENV  # or: python3 -m venv ENV
source ENV/bin/activate
pip install numpy==1.19.0
pip install wgd==2.0.38

To install from source, use the following commands.

git clone https://github.com/heche-psb/wgd
cd wgd
virtualenv -p=python3 ENV  # or: python3 -m venv ENV
source ENV/bin/activate
pip install numpy==1.19.0
pip install -r requirements.txt
pip install .

If you run into a permission problem during installation, try the following command instead.

pip install -e .

If multiple versions of wgd are installed on the system, add the path of the version you want to use to the PATH environment variable, for example

export PATH="$PATH:~/.local/bin/wgd"

Note that the numpy version is important (as is the case for many other packages, of course), especially for the fastcluster package. In our tests, numpy 1.19.0 works fine on Python 3.6 and 3.8. If you encounter errors or warnings about numpy, consider pre-installing numpy 1.19.0 or a nearby version before you install wgd. wgd relies on external software, including diamond and mcl for wgd dmd; paml v4.9j, mafft (muscle and prank if set) and fasttree (or iqtree if set) for wgd ksd; and optionally mrbayes for the phylogenetic inference functions in wgd dmd and wgd focus (as well as mafft, muscle and prank when the analysis requires sequence alignment). Other optional software, including paml v4.9j, r8s, beast, eggnog, diamond, interproscan, hmmer v3.1b2 and astral-pro, is used for molecular dating, gene family function annotation or phylogenetic inference in wgd focus (hmmer v3.1b2 is also used for the orthogroup assignment function in wgd dmd, and astral-pro for the --collinearcoalescence function in wgd dmd).
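
Before a long run it can help to confirm that the external programs are on the PATH. The following is a minimal sketch using only the Python standard library; the executable names in `REQUIRED` are assumptions (for example, PAML is assumed to be reached through `codeml`, and FastTree may be installed as `FastTree` or `fasttree`), so adjust them to your installation.

```python
import shutil

# Executable names are assumptions, not taken from wgd v2 itself.
REQUIRED = {
    "wgd dmd": ["diamond", "mcl"],
    "wgd ksd": ["codeml", "mafft", "fasttree"],
}

def missing_tools(required=REQUIRED):
    """Return {subcommand: [executables not found on PATH]} for incomplete steps only."""
    report = {}
    for step, tools in required.items():
        absent = [tool for tool in tools if shutil.which(tool) is None]
        if absent:
            report[step] = absent
    return report

for step, tools in missing_tools().items():
    print(f"{step}: missing {', '.join(tools)}")
```

An empty result only shows that the names resolve; it does not check versions such as paml v4.9j.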

Pipelines

To help users with a fresh genome assembly in hand get familiar with wgd v2 quickly, we provide some common pipelines below.

Pipeline 1

Simple construction of age distribution

wgd dmd Aquilegia_coerulea -o wgd_dmd
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea -o wgd_ksd

#what will be in the result directory
-wgd_dmd
--Aquilegia_coerulea.tsv
-wgd_ksd
--Aquilegia_coerulea.tsv.ks.tsv Aquilegia_coerulea.tsv.ks.svg/pdf

The resulting Aquilegia_coerulea.tsv is the whole-paranome family file, Aquilegia_coerulea.tsv.ks.tsv is the K_S distribution file, and Aquilegia_coerulea.tsv.ks.svg and Aquilegia_coerulea.tsv.ks.pdf are the K_S plots.
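
The two steps of Pipeline 1 can also be chained from a script. The sketch below only rebuilds the commands shown above; the helper names `pipeline1_commands` and `run` are hypothetical, and the assumption that wgd dmd names its family file after the input file plus ".tsv" is taken from the example output listed here.

```python
import subprocess
from pathlib import Path

def pipeline1_commands(cds, dmd_out="wgd_dmd", ksd_out="wgd_ksd"):
    """Build the wgd dmd and wgd ksd commands for one CDS FASTA file."""
    # Assumed naming: <outdir>/<input file name>.tsv, as in the listing above.
    families = f"{dmd_out}/{Path(cds).name}.tsv"
    return [
        ["wgd", "dmd", cds, "-o", dmd_out],
        ["wgd", "ksd", families, cds, "-o", ksd_out],
    ]

def run(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step

run(pipeline1_commands("Aquilegia_coerulea"))
```

The default dry run makes it possible to check the derived paths before any computation starts.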

Pipeline 2

Simple construction of age distribution with collinearity

wgd dmd Aquilegia_coerulea -o wgd_dmd
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea -o wgd_ksd
wgd syn -f mRNA -a Name wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea.gff3 -ks wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -o wgd_syn

# what will be in the wgd_syn result directory
-wgd_syn
--anchors.csv
--families.tsv
--gene-table.csv
--iadhore.conf
--scaffold_length.tsv
--segments_coordinates.tsv
--Segprofile.csv
--Syndepth.pdf/svg
--Aquilegia_coerulea_Aquilegia_coerulea_multiplicons_level.pdf/png/svg
--Aquilegia_coerulea_gene_order_perchrom.tsv
--Aquilegia_coerulea.tsv.anchors.ks.tsv
--Aquilegia_coerulea.tsv.ksd.pdf/svg
--Aquilegia_coerulea-vs-Aquilegia_coerulea.dot.pdf/png/svg
--Aquilegia_coerulea-vs-Aquilegia_coerulea.dot_unit_gene.pdf/png/svg
--Aquilegia_coerulea-vs-Aquilegia_coerulea_Ks.dot.pdf/png/svg
--Aquilegia_coerulea-vs-Aquilegia_coerulea_Ks.dot_unit_gene.pdf/png/svg
--iadhore-out
---alignment.txt
---anchorpoints.txt
---baseclusters.txt
---genes.txt
---list_elements.txt
---multiplicon_pairs.txt
---multiplicons.txt
---segments.txt

The anchors.csv, families.tsv, gene-table.csv, scaffold_length.tsv, segments_coordinates.tsv, Segprofile.csv and Aquilegia_coerulea_gene_order_perchrom.tsv are the basic files summarizing the gene order, families, anchors and collinear segments. The Aquilegia_coerulea.tsv.anchors.ks.tsv is the K_S distribution file of anchor pairs. The Syndepth.pdf/svg shows the collinear ratio of multiplicons. The Aquilegia_coerulea_Aquilegia_coerulea_multiplicons_level.pdf/png/svg is the "dupStack" plot showing multiplicons of different levels (the level being defined as the number of segments within a multiplicon). The Aquilegia_coerulea.tsv.ksd.pdf/svg is the K_S plot with anchor pairs annotated. The Aquilegia_coerulea-vs-Aquilegia_coerulea.dot.pdf/png/svg is the dot plot without K_S annotation and with coordinates in units of bases. The Aquilegia_coerulea-vs-Aquilegia_coerulea.dot_unit_gene.pdf/png/svg is the dot plot without K_S annotation and with coordinates in units of genes. The Aquilegia_coerulea-vs-Aquilegia_coerulea_Ks.dot.pdf/png/svg is the dot plot with K_S annotation and with coordinates in units of bases. The Aquilegia_coerulea-vs-Aquilegia_coerulea_Ks.dot_unit_gene.pdf/png/svg is the dot plot with K_S annotation and with coordinates in units of genes. The iadhore.conf is the configuration file for i-adhore. The iadhore-out subfolder contains the original collinearity results from i-adhore (please refer to the i-adhore manual for a detailed description).

Pipeline 3

Construction of age distribution and ELMM analysis

wgd dmd Aquilegia_coerulea -o wgd_dmd
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea -o wgd_ksd
wgd viz -d wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -o wgd_ELMM

# what will be in the wgd_ELMM result directory
-wgd_ELMM
--Aquilegia_coerulea.tsv.ks.tsv.ksd.pdf/svg
--Aquilegia_coerulea.tsv.ks.tsv.spline_node_averaged.pdf/svg
--Aquilegia_coerulea.tsv.ks.tsv.spline_weighted.pdf/svg
--Aquilegia_coerulea.tsv.ks.tsv_peak_detection_node_averaged.pdf/svg
--Aquilegia_coerulea.tsv.ks.tsv_peak_detection_weighted.pdf/svg
--elmm_Aquilegia_coerulea.tsv.ks.tsv_best_models_node_averaged.pdf/svg
--elmm_Aquilegia_coerulea.tsv.ks.tsv_best_models_weighted.pdf/svg
--elmm_Aquilegia_coerulea.tsv.ks.tsv_models_data_driven_node_averaged.pdf/svg
--elmm_Aquilegia_coerulea.tsv.ks.tsv_models_data_driven_weighted.pdf/svg
--elmm_Aquilegia_coerulea.tsv.ks.tsv_models_random_node_averaged.pdf/svg
--elmm_Aquilegia_coerulea.tsv.ks.tsv_models_random_weighted.pdf/svg
--elmm_BIC_Aquilegia_coerulea.tsv.ks.tsv_node_averaged.pdf/svg
--elmm_BIC_Aquilegia_coerulea.tsv.ks.tsv_weighted.pdf/svg

The Aquilegia_coerulea.tsv.ks.tsv.ksd.pdf/svg is the basic K_S plot. The Aquilegia_coerulea.tsv.ks.tsv.spline_node_averaged.pdf/svg and Aquilegia_coerulea.tsv.ks.tsv.spline_weighted.pdf/svg are the spline plots used for peak detection. The Aquilegia_coerulea.tsv.ks.tsv_peak_detection_node_averaged.pdf/svg and Aquilegia_coerulea.tsv.ks.tsv_peak_detection_weighted.pdf/svg are the results of peak detection. The elmm_Aquilegia_coerulea.tsv.ks.tsv_best_models_node_averaged.pdf/svg and elmm_Aquilegia_coerulea.tsv.ks.tsv_best_models_weighted.pdf/svg are the results from the best model as selected by BIC. The elmm_Aquilegia_coerulea.tsv.ks.tsv_models_data_driven_node_averaged.pdf/svg and elmm_Aquilegia_coerulea.tsv.ks.tsv_models_data_driven_weighted.pdf/svg are the model results from data-driven initialization. The elmm_Aquilegia_coerulea.tsv.ks.tsv_models_random_node_averaged.pdf/svg and elmm_Aquilegia_coerulea.tsv.ks.tsv_models_random_weighted.pdf/svg are the model results from random initialization. The elmm_BIC_Aquilegia_coerulea.tsv.ks.tsv_node_averaged.pdf/svg and elmm_BIC_Aquilegia_coerulea.tsv.ks.tsv_weighted.pdf/svg are the BIC plots for each model.
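
The "node_averaged" and "weighted" suffixes refer to two ways of handling the redundancy that arises when several gene pairs in a family tree span the same duplication node. The sketch below shows the usual definitions on toy data; the (family, node, K_S) row layout is a hypothetical simplification and not the actual column layout of the wgd ksd output, and the functions are not the wgd v2 implementation.

```python
from collections import defaultdict

# Toy rows: (family, duplication node, pairwise K_S). Values are made up.
rows = [
    ("GF1", 1, 0.50), ("GF1", 1, 0.60),  # two pairs spanning the same node
    ("GF1", 2, 1.20),
    ("GF2", 1, 0.10),
]

def group_by_node(rows):
    """Collect the pairwise K_S estimates belonging to each (family, node)."""
    per_node = defaultdict(list)
    for family, node, ks in rows:
        per_node[(family, node)].append(ks)
    return per_node

def node_weighted(rows):
    """Keep every pair, weighted so that each node contributes a total weight of 1."""
    return [(ks, 1.0 / len(values))
            for values in group_by_node(rows).values() for ks in values]

def node_averaged(rows):
    """Keep one value per node: the mean K_S of its pairs."""
    return [sum(values) / len(values) for values in group_by_node(rows).values()]

print(node_weighted(rows))
print(node_averaged(rows))
```

Both schemes count each duplication event once; they differ in whether the spread of the pairwise estimates within a node is kept (weighted) or collapsed (averaged).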

Pipeline 4

Construction of age distribution with collinearity and peak finding analysis

wgd dmd Aquilegia_coerulea -o wgd_dmd
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea -o wgd_ksd
wgd syn -f mRNA -a Name wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea.gff3 -ks wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -o wgd_syn
wgd peak --heuristic wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -ap wgd_syn/iadhore-out/anchorpoints.txt -sm wgd_syn/iadhore-out/segments.txt -le wgd_syn/iadhore-out/list_elements.txt -mp wgd_syn/iadhore-out/multiplicon_pairs.txt -n 1 4 -kc 3 -o wgd_peak

# what will be in the wgd_peak result directory
-wgd_peak
--AnchorKs_FindPeak
---AnchorKs_PeakCI_Aquilegia_coerulea.tsv.ks.tsv_node_averaged/weighted.pdf
---Aquilegia_coerulea.tsv.ks.tsv_95%CI_AP_for_dating_weighted_format.tsv
---Aquilegia_coerulea.tsv.ks.tsv_95%CI_AP_for_dating_weighted.tsv
---Aquilegia_coerulea.tsv.ks.tsv_95%CI_AP_for_dating_node_averaged_format.tsv
---Aquilegia_coerulea.tsv.ks.tsv_95%CI_AP_for_dating_node_averaged.tsv
--AnchorKs_GMM
---GMM_Elbow-Loss_original_Ks.pdf
---GMM_Original_AnchorKs_Clustering_Silhouette_Coefficient.pdf
---Original_AnchorKs_GMM_1/2/3/4components_prediction.tsv
---Original_AnchorKs_GMM_AIC_BIC.pdf
---Original_AnchorKs_GMM_Component1/2/3/4_node_averaged_Lognormal.pdf
---Original_AnchorKs_GMM_Component1/2/3/4_node_averaged.pdf
---LogGMM_CI
----GMM_1/2/3/4components_C0/1/2/3_95%CI.tsv
----GMM_Component1/2/3/4_node_averaged_Lognormal.pdf
---HighMass_CI
----GMM_1/2/3/4components_C0/1/2/3_HighMass_95%CI.tsv
----GMM_1/2/3/4components_HighMass_95%CI.pdf
--SegmentGuideKs_GMM
---GMM_Elbow-Loss_Segment_Ks.pdf
---GMM_Segment_Ks_Clustering_Silhouette_Coefficient.pdf
---Segment-guided_AnchorKs_GMM_1/2/3/4components_prediction.tsv
---Segment_Ks_Clusters_GMM_Component1/2/3/4.pdf
---Segment_Ks_Clusters_Lognormal_GMM_Component1/2/3/4.pdf
---Segment_Ks_GMM_AIC_BIC.pdf
---Segment_Ks.tsv
---HighMass_CI
----Segment_guided_1/2/3/4components_C0/1/2/3_HighMass_95%CI.tsv
----Segment_guided_1/2/3/4components_HighMass_95%CI.pdf
---HDR_CI
----Segment_guided_95%HDR_AP_1/2/3/4components_C0/1/2/3.tsv
----Segment_guided_AnchorKs_GMM_Component1/2/3/4_node_averaged_kde.pdf
----Segment_guided_AnchorKs_GMM_Component1/2/3/4_node_averaged.pdf
--SegmentKs_FindPeak
---SegmentKs_PeakCI_Aquilegia_coerulea.tsv.ks.tsv.pdf
---Peak_1/2_Segment_guided_Aquilegia_coerulea.tsv.ks.tsv_95%CI_MP_for_dating_format.tsv
---Peak_1/2_Segment_guided_Aquilegia_coerulea.tsv.ks.tsv_95%CI_MP_for_dating.tsv

Four result subfolders will be produced, namely AnchorKs_FindPeak, AnchorKs_GMM, SegmentGuideKs_GMM and SegmentKs_FindPeak. The AnchorKs_FindPeak subfolder contains the peaks detected by the signal module of the SciPy library and the assumed highest-mass part (referred to as HighMass hereafter) of each peak, which can be used for further WGD dating. The AnchorKs_GMM subfolder shows the GMM results on the original anchor K_S distribution, obtained with the mixture module of the scikit-learn library, and has two subfolders: LogGMM_CI, containing the 95% CI of each component, and HighMass_CI, containing the HighMass of each component, which can be used for further WGD dating. The SegmentGuideKs_GMM subfolder presents the results of the segment K_S GMM, mapped back to the anchor pairs residing in each segment, and the associated 95% HDR and HighMass of each segment cluster in the subfolders HDR_CI and HighMass_CI. The SegmentKs_FindPeak subfolder is similar to AnchorKs_FindPeak but uses segment K_S instead. The K_S can also be calculated per multiplicon instead of per segment using the option --guide, in which case the result titles, labels, file names and folder names change accordingly.
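
To give a feel for what a 95% interval of a log-normal component means, the sketch below fits a log-normal to the K_S values of one component by taking the mean and standard deviation of ln(K_S) and returns the central 95% range. This is an illustrative assumption about how such an interval can be obtained, not the code used by wgd peak, and the input values are made up.

```python
import math
import statistics

def lognormal_ci(ks_values, z=1.96):
    """Central 95% range of a log-normal fitted to K_S via the moments of ln(K_S)."""
    logs = [math.log(k) for k in ks_values if k > 0]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)
    return math.exp(mu - z * sigma), math.exp(mu + z * sigma)

# Hypothetical anchor-pair K_S values assigned to one mixture component.
component_ks = [0.95, 1.05, 1.10, 1.20, 1.25, 1.30, 1.40]
low, high = lognormal_ci(component_ks)
print(f"K_S range for dating: {low:.2f} to {high:.2f}")
```

Anchor pairs whose K_S falls inside such a range are the natural candidates to carry forward into the dating step.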

Pipeline 5

Construction of age distribution with collinearity and rate correction

wgd dmd Aquilegia_coerulea -o wgd_dmd
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea -o wgd_ksd
wgd syn -f mRNA -a Name wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea.gff3 -ks wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -o wgd_syn
wgd dmd --globalmrbh Aquilegia_coerulea Protea_cynaroides Acorus_americanus Vitis_vinifera -o wgd_globalmrbh
wgd ksd wgd_globalmrbh/global_MRBH.tsv Aquilegia_coerulea Protea_cynaroides Acorus_americanus Vitis_vinifera -o wgd_globalmrbh_ks
wgd viz -d wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv -fa Aquilegia_coerulea -epk wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -ap wgd_syn/iadhore-out/anchorpoints.txt -sp speciestree.nw -o wgd_viz_mixed_Ks --plotelmm --plotapgmm --reweight

# what will be in the wgd_viz_mixed_Ks result directory
-wgd_viz_mixed_Ks
--All_pairs.ks.node.weighted.pdf
--Focus_sister_pairs.ks.node.weighted.pdf
--global_MRBH.tsv.ks.tsv.ksd.pdf/svg
--Mixed.ks.Aquilegia_coerulea.node.weighted.pdf
--spair.corrected.ks.info.tsv
--spair.original.ks.info.tsv
--Simple_Ks_Distributions
---Acorus_americanus/Aquilegia_coerulea/Protea_cynaroides/Vitis_vinifera__Aquilegia_coerulea/Protea_cynaroides/Vitis_vinifera.ks.node.weighted.pdf

The All_pairs.ks.node.weighted.pdf is the K_S plot of all species pairs. The Focus_sister_pairs.ks.node.weighted.pdf is the K_S plot of all focal-sister species pairs. The global_MRBH.tsv.ks.tsv.ksd.pdf/svg is the K_S plot of the datafile global_MRBH.tsv.ks.tsv. The Mixed.ks.Aquilegia_coerulea.node.weighted.pdf is the final result of rate correction (with mixture modeling results if set). The spair.corrected.ks.info.tsv and spair.original.ks.info.tsv document the K_S information of all species pairs after and before rate correction, respectively. The subfolder Simple_Ks_Distributions contains the individual K_S plots of all species pairs.
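
The idea behind rate correction can be sketched with a focal species F, a sister species S and an outgroup O. Under the assumption of an additive tree, the K_S accumulated on the focal branch since the F-S split is (K_FS + K_FO - K_SO) / 2, and twice that value places the split on the focal species' own K_S scale. This is a generic relative-rate adjustment shown for illustration; the exact procedure in wgd viz may differ, and the function names and numbers are hypothetical.

```python
def focal_branch(ks_fs, ks_fo, ks_so):
    """K_S accumulated on the focal lineage since its split from the sister."""
    return (ks_fs + ks_fo - ks_so) / 2.0

def corrected_divergence(ks_fs, ks_fo, ks_so):
    """Focal-sister divergence rescaled to the substitution rate of the focal species."""
    return 2.0 * focal_branch(ks_fs, ks_fo, ks_so)

# Toy peak K_S values: the sister evolves faster than the focal species,
# so the raw focal-sister K_S of 1.0 overstates the split on the focal scale.
print(corrected_divergence(ks_fs=1.0, ks_fo=1.6, ks_so=1.8))
```

When both lineages evolve at the same rate, K_FO equals K_SO and the corrected value reduces to the raw focal-sister K_S.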

Pipeline 6

Construction of age distribution with collinearity and WGD dating

wgd dmd Aquilegia_coerulea -o wgd_dmd
wgd ksd wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea -o wgd_ksd
wgd syn -f mRNA -a Name wgd_dmd/Aquilegia_coerulea.tsv Aquilegia_coerulea.gff3 -ks wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -o wgd_syn
wgd peak --heuristic wgd_ksd/Aquilegia_coerulea.tsv.ks.tsv -ap wgd_syn/iadhore-out/anchorpoints.txt -sm wgd_syn/iadhore-out/segments.txt -le wgd_syn/iadhore-out/list_elements.txt -mp wgd_syn/iadhore-out/multiplicon_pairs.txt -n 1 4 -kc 3 -o wgd_peak
wgd dmd -f Aquilegia_coerulea -ap wgd_peak/AnchorKs_FindPeak/Aquilegia_coerulea.tsv.ks.tsv_95%CI_AP_for_dating_weighted_format.tsv -o wgd_dmd_ortho Potamogeton_acutifolius Spirodela_intermedia Amorphophallus_konjac Acanthochlamys_bracteata Dioscorea_alata Dioscorea_rotundata Acorus_americanus Acorus_tatarinowii Tetracentron_sinense Trochodendron_aralioides Buxus_austroyunnanensis Buxus_sinica Nelumbo_nucifera Telopea_speciosissima Protea_cynaroides Aquilegia_coerulea
wgd focus --protcocdating --aamodel lg wgd_dmd_ortho/merge_focus_ap.tsv -sp dating_tree.nw -o wgd_dating -d mcmctree -ds 'burnin = 2000' -ds 'sampfreq = 1000' -ds 'nsample = 20000' Potamogeton_acutifolius Spirodela_intermedia Amorphophallus_konjac Acanthochlamys_bracteata Dioscorea_alata Dioscorea_rotundata Acorus_americanus Acorus_tatarinowii Tetracentron_sinense Trochodendron_aralioides Buxus_austroyunnanensis Buxus_sinica Nelumbo_nucifera Telopea_speciosissima Protea_cynaroides Aquilegia_coerulea

# what will be in the wgd_dating result directory
-wgd_dating
--Concatenated.paln
--Concatenated.paln.paml
--G2S.Map
--GF00000001.paln
--..
--GF00000187.paln
--mcmctree
---Concatenated
----pep
-----Concatenated.paln.paml
-----dating_tree.nw
-----FigTree.tre
-----in.BV
-----lg.dat
-----lnf
-----mcmctree.ctrl
-----mcmctree.out
-----mcmc.txt
-----rates
-----rst
-----rst1
-----rub
-----tmp0001.ctl
-----tmp0001.out
-----tmp0001.trees
-----tmp0001.txt

The Concatenated.paln and Concatenated.paln.paml files are the concatenated protein alignments in fasta and paml format. The G2S.Map file is the map between gene and species names. The files GF00000001.paln through GF00000187.paln are the protein alignments of each gene family. The mcmctree subfolder contains the dating results for the concatenated family (and per gene family if set). The deeper Concatenated subfolder contains the dating results of the concatenated protein alignment (or nucleotide alignment if set). The deepest pep subfolder contains the final dating results (of the concatenated protein alignment in this case). Please refer to the mcmctree manual for a detailed description of each file produced by mcmctree. The important result files are FigTree.tre, mcmctree.out and mcmc.txt, which document the final date estimates, the log information and the posterior samples for each node, respectively.
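
The three `-ds` settings passed to mcmctree in the command above determine the length of the MCMC run: the chain first discards `burnin` iterations and then records one sample every `sampfreq` iterations until `nsample` samples are collected. The short sketch below does this arithmetic; the function name is hypothetical.

```python
def mcmc_iterations(burnin, sampfreq, nsample):
    """Total iterations: discarded burn-in plus sampfreq iterations per kept sample."""
    return burnin + sampfreq * nsample

total = mcmc_iterations(burnin=2000, sampfreq=1000, nsample=20000)
print(f"{total:,} iterations for {20000:,} posterior samples")
```

Knowing the total helps when budgeting run time or when scaling `sampfreq` and `nsample` to improve sampling of the parameters.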

Parameters

There are 7 main programs in wgd v2: dmd, focus, ksd, mix, peak, syn and viz. Below we describe each program and its associated parameters in detail. Please refer to the Usage section for the scenarios to which each parameter applies.

The program wgd dmd delineates the whole paranome, RBHs (Reciprocal Best Hits), MRBHs (Multiple Reciprocal Best Hits) and orthogroups, and provides some other orthogroup-related functions, including circumscription of nested single-copy orthogroups (NSOGs), an unbiased test of single-copy orthogroups (SOGs) over missing inparalogs, construction of BUSCO-guided single-copy orthogroups (SOGs), and collinear coalescence inference of phylogeny.
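
The notion of a reciprocal best hit can be shown in a few lines. The sketch below is a generic illustration on made-up similarity scores, not the wgd dmd implementation: a pair (a, b) is an RBH when b is the best-scoring hit of a in the other species and a is the best-scoring hit of b in return.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Hypothetical bit scores between genes of species A (a*) and species B (b*).
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b2"): 200.0}
b_vs_a = {("b1", "a1"): 245.0, ("b2", "a3"): 198.0, ("b2", "a2"): 175.0}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here a2 is excluded because its best hit b2 prefers a3, which is the kind of asymmetry the reciprocity requirement is meant to filter out.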

wgd dmd sequences (option)
--------------------------------------------------------------------------------
-o, --outdir, the output directory, default wgd_dmd
-t, --tmpdir, the temporary working directory, default None, if None was given, the tmpdir would be assigned a random name in the current directory and automatically removed at the completion of the program, else the tmpdir would be kept
-p, --prot, flag option, whether using protein or nucleotide sequences
-c, --cscore, the c-score to restrict the homolog similarity of MRBHs, default None, if None was given, the c-score function wouldn't be activated, else expecting a decimal within the range of 0 and 1
-I, --inflation, the inflation factor for MCL program, default 2.0, with higher value leading to more but smaller clusters
-e, --eval, the e-value cut-off for similarity in diamond and/or hmmer, default 1e-10
--to_stop, flag option, whether to translate through STOP codons, if the flag was set, translation will be terminated at the first in-frame stop codon, else a full translation continuing on past any stop codons would be initiated
--cds, flag option, whether to only translate the complete CDS that starts with a valid start codon, contains only a single in-frame stop codon at the end and has a length divisible by three, if the flag was set, only the complete CDS would be translated
-f, --focus, the species to be merged on local MRBHs, default None, if None was given, the local MRBHs wouldn't be inferred
-ap, --anchorpoints, the anchor points data file from i-adhore for constructing the orthogroups with anchor pairs, default None
-sm, --segments, the segments datafile used in collinear coalescence analysis if initiated, default None
-le, --listelements, the list elements datafile used in collinear coalescence analysis if initiated, default None
-gt, --genetable, the gene table datafile used in collinear coalescence analysis if initiated, default None
-coc, --collinearcoalescence, flag option, whether to initiate the collinear coalescence analysis, if the flag was set, the analysis would be initiated
-kf, --keepfasta, flag option, whether to output the sequence information of MRBHs, if the flag was set, the sequences of MRBHs would be in output
-kd, --keepduplicates, flag option, whether to allow the same gene to occur in different MRBHs (only meaningful when the cscore was used), if the flag was set, the same gene could be assigned to different MRBHs
-gm, --globalmrbh, flag option, whether to initiate global MRBHs construction, if the flag was set, the --focus option would be ignored and only global MRBHs would be built
-n, --nthreads, the number of threads to use, default 4
-oi, --orthoinfer, flag option, whether to initiate orthogroup inference, if the flag was set, the orthogroup inference program would be initiated
-oo, --onlyortho, flag option, whether to only conduct orthogroup inference, if the flag was set, only the orthogroup inference pipeline would be performed while the other analyses wouldn't be initiated
-gn, --getnsog, flag option, whether to initiate the search for nested single-copy gene families (NSOGs) (only meaningful when the orthogroup inference pipeline was activated), if the flag was set, additional NSOGs analysis would be performed besides the basic orthogroup inference
-tree, --tree_method, which gene tree inference program to invoke (only meaningful when the collinear coalescence, gene-to-family assignment or NSOGs analysis were activated), default fasttree
-ts, --treeset, the parameters setting for gene tree inference, default None, this option can be provided multiple times
-mc, --msogcut, the ratio cutoff for mostly single-copy family (meaningful when activating the orthogroup inference pipeline) and species representation in collinear coalescence analysis, default 0.8
-ga, --geneassign, flag option, whether to initiate the gene-to-family assignment analysis, if the flag was set, the analysis would be initiated
-sa, --seq2assign, the queried sequences data file in gene-to-family assignment analysis, default None, this option can be provided multiple times
-fa, --fam2assign, the queried family datafile in gene-to-family assignment analysis, default None
-cc, --concat, flag option, whether to initiate the concatenation pipeline for orthogroup inference, if the flag was set, the analysis would be initiated
-te, --testsog, flag option, whether to initiate the unbiased test of single-copy gene families, if the flag was set, the analysis would be initiated
-bs, --bins, the number of bins divided in the gene length normalization, default 100
-np, --normalizedpercent, the percentage of upper hits used for gene length normalization, default 5
-nn, --nonormalization, flag option, whether to call off the normalization, if the flag was set, no normalization would be conducted
-bsog, --buscosog, flag option, whether to initiate the busco-guided single-copy gene family analysis, if the flag was set, the analysis would be initiated
-bhmm, --buscohmm, the HMM profile datafile in the busco-guided single-copy gene family analysis, default None
-bctf, --buscocutoff, the HMM score cutoff datafile in the busco-guided single-copy gene family analysis, default None
-of, --ogformat, flag option, whether to add index to the RBH families

The program wgd focus can realize the concatenation-based and coalescence-based phylogenetic inference and phylogenetic dating of WGDs etc.

wgd focus families sequences (option)
--------------------------------------------------------------------------------
-o, --outdir, the output directory, default wgd_focus
-t, --tmpdir, the temporary working directory, default None, if None was given, the tmpdir will be assigned a random name in the current directory and automatically removed at the completion of the program, else the tmpdir would be kept
-n, --nthreads, the number of threads to use, default 4
--to_stop, flag option, whether to translate through STOP codons, if the flag was set, translation will be terminated at the first in-frame stop codon, else a full translation continuing on past any stop codons would be initiated
--cds, flag option, whether to only translate the complete CDS that starts with a valid start codon, contains only a single in-frame stop codon at the end and has a length divisible by three, if the flag was set, only the complete CDS would be translated
--strip_gaps, flag option, whether to drop all gaps in multiple sequence alignment, if the flag was set, all gaps would be dropped
-a, --aligner, which alignment program to use, default mafft
-tree, --tree_method, which gene tree inference program to invoke, default fasttree
-ts, --treeset, the parameters setting for gene tree inference, default None, this option can be provided multiple times
--concatenation, flag option, whether to initiate the concatenation-based species tree inference, if the flag was set, concatenation-based species tree would be inferred
--coalescence, flag option, whether to initiate the coalescence-based species tree inference, if the flag was set, coalescence-based species tree would be inferred
-sp, --speciestree, species tree datafile for dating, default None
-d, --dating, which molecular dating program to use, default none
-ds, --datingset, the parameters setting for dating program, default None, this option can be provided multiple times
-ns, --nsites, the nsites information for r8s dating, default None
-ot, --outgroup, the outgroup information for r8s dating, default None
-pt, --partition, flag option, whether to initiate partition dating analysis for codon, if the flag was set, an additional partition dating analysis would be initiated
-am, --aamodel, which protein model to be used in mcmctree, default poisson
-ks, flag option, whether to initiate Ks calculation for homologues in the provided orthologous gene family
--annotation, which annotation program to use, default None
--pairwise, flag option, whether to initiate pairwise Ks estimation, if the flag was set, pairwise Ks values would be estimated
-ed, --eggnogdata, the eggnog annotation datafile, default None
--pfam, which option to use for pfam annotation, default None
--dmnb, the diamond database for annotation, default None
--hmm, the HMM profile for annotation, default None
--evalue, the e-value cut-off for annotation, default 1e-10
--exepath, the path to the interproscan executable, default None
-f, --fossil, the fossil calibration information in Beast, default ('clade1;clade2', 'taxa1,taxa2;taxa3,taxa4', '4;5', '0.5;0.6', '400;500')
-rh, --rootheight, the root height calibration info in Beast, default (4,0.5,400)
-cs, --chainset, the parameters of MCMC chain in Beast, default (10000,100)
--beastlgjar, the path to beastLG.jar, default None
--beagle, flag option, whether to use beagle in Beast, if the flag was set, beagle would be used
--protcocdating, flag option, whether to only initiate the protein-concatenation-based dating analysis, if the flag was set, the analysis would be initiated
--protdating, flag option, whether to only initiate the protein-based dating analysis, if the flag was set, the analysis would be initiated

The program wgd ksd can realize the construction of K_S age distribution and rate correction.

wgd ksd families sequences (option)
--------------------------------------------------------------------------------
-o, --outdir, the output directory, default wgd_ksd
-t, --tmpdir, the temporary working directory, default None, if None was given, the tmpdir will be assigned random names in current directory and automatically removed at the completion of program, else the tmpdir would be kept
-n, --nthreads, the number of threads to use, default 4
--to_stop, flag option, whether to translate through STOP codons, if the flag was set, translation will be terminated at the first in-frame stop codon, else a full translation continuing on past any stop codons would be initiated
--cds, flag option, whether to only translate the complete CDS that starts with a valid start codon and only contains a single in-frame stop codon at the end and must be divisible by three, if the flag was set, only the complete CDS would be translated
--pairwise, flag option, whether to initiate pairwise Ks estimation, if the flag was set, pairwise Ks values would be estimated
--strip_gaps, flag option, whether to drop all gaps in multiple sequence alignment, if the flag was set, all gaps would be dropped
-a, --aligner, which alignment program to use, default mafft 
-tree, --tree_method, which gene tree inference program to invoke, default fasttree
--tree_options, options in tree inference as a comma separated string, default None
--node_average, flag option, whether to initiate node-average way of de-redundancy instead of node-weighted, if the flag was set, the node-averaging de-redundancy would be initiated
-sr, --spair, the species pair to be plotted, default None, this option can be provided multiple times
-sp, --speciestree, the species tree to perform rate correction, default None, if None was given, the rate correction analysis would be called off
-rw, --reweight, flag option, whether to recalculate the weight per species pair, if the flag was set, the weight would be recalculated
-or, --onlyrootout, flag option, whether to only conduct rate correction using the outgroup at root as outgroup, if the flag was set, only the outgroup at root would be used as outgroup
-epk, --extraparanomeks, extra paranome Ks data to plot in the mixed Ks distribution, default None
-ap, --anchorpoints, anchorpoints.txt file to plot anchor Ks in the mixed Ks distribution, default None
-pk, --plotkde, flag option, whether to plot kde curve of orthologous Ks distribution over histogram in the mixed Ks distribution, if the flag was set, the kde curve would be plotted
-pag, --plotapgmm, flag option, whether to perform and plot mixture modeling of anchor Ks in the mixed Ks distribution, if the flag was set, the mixture modeling of anchor Ks would be plotted
-pem, --plotelmm, flag option, whether to perform and plot elmm mixture modeling of paranome Ks in the mixed Ks distribution, if the flag was set, the elmm mixture modeling of paranome Ks would be plotted
-c, --components, the range of the number of components to fit in anchor Ks mixture modeling, default (1,4)
-xl, --xlim, the x axis limit of Ks distribution
-yl, --ylim, the y axis limit of Ks distribution
-ado, --adjustortho, flag option, whether to adjust the histogram height of orthologous Ks as to match the height of paralogous Ks, if the flag was set, the adjustment would be conducted
-adf, --adjustfactor, the adjustment factor of orthologous Ks, default 0.5
-oa, --okalpha, the opacity of orthologous Ks distribution in mixed plot, default 0.5
-fa, --focus2all, set focal species and let species pair to be between focal and all the remaining species, default None
-ks, --kstree, flag option, whether to infer Ks tree, if the flag was set, the Ks tree inference analysis would be initiated
-ock, --onlyconcatkstree, flag option, whether to only infer Ks tree under concatenated alignment, if the flag was set, only the Ks tree under concatenated alignment would be calculated
-cs, --classic, flag option, whether to draw mixed Ks plot in a classic manner where the full orthologous Ks distribution is drawn, if the flag was set, the classic mixed Ks plot would be drawn
-ta, --toparrow, flag option, whether to adjust the arrow to be at the top of the plot, instead of being coordinated as the KDE of the orthologous Ks distribution, if the flag was set, the arrow would be set at the top
-bs, --bootstrap, the number of bootstrap replicates of ortholog Ks distribution in mixed plot

The program wgd mix can realize the mixture model clustering analysis of K_S age distribution.

wgd mix ks_datafile (option)
--------------------------------------------------------------------------------
-f, --filters, the cutoff alignment length, default 300
-r, --ks_range, the Ks range to be considered, default (0, 5)
-b, --bins, the number of bins in Ks distribution, default 50
-o, --outdir, the output directory, default wgd_mix
--method, which mixture model to use, default gmm
-n, --components, the range of the number of components to fit, default (1, 4)
-g, --gamma, the gamma parameter for bgmm models, default 0.001
-ni, --n_init, the number of k-means initializations, default 200
-mi, --max_iter, the maximum number of iterations, default 200

The program wgd peak can realize the search of credible K_S range used in WGD dating.

wgd peak ks_datafile (option)
--------------------------------------------------------------------------------
-ap, --anchorpoints, the anchor points datafile, default None
-sm, --segments, the segments datafile, default None
-le, --listelements, the listsegments datafile, default None 
-mp, --multipliconpairs, the multipliconpairs datafile, default None
-o, --outdir, the output directory, default wgd_peak
-af, --alignfilter, cutoff for alignment identity, length and coverage, default 0.0, 0, 0.0
-r, --ksrange, range of Ks to be analyzed, default (0, 5)
-bw, --bin_width, bandwidth of Ks distribution, default 0.1
-ic, --weights_outliers_included, flag option, whether to include Ks outliers, if the flag was set, Ks outliers would be included in the analysis
-m, --method, which mixture model to use, default gmm
--seed, random seed given to initialization, default 2352890
-ei, --em_iter, the number of EM iterations to perform, default 200
-ni, --n_init, the number of k-means initializations, default 200
-n, --components, the range of the number of components to fit, default (1, 4)
-g, --gamma, the gamma parameter for bgmm models, default 1e-3
--boots, the number of bootstrap replicates of kde, default 200
--weighted, flag option, whether to use node-weighted method of de-redundancy, if the flag was set, the node-weighted method would be used
-p, --plot, the plotting method to be used, default identical
-bm, --bw_method, the bandwidth method to be used in analyzing the peak of WGD dates, default silverman
--n_medoids, the number of medoids to fit, default 2
-km, --kdemethod, the kde method to be used in analyzing the peak of WGD dates, kmedoids analysis or the basic Ks plotting, default scipy
--n_clusters, the number of clusters to plot Elbow loss function, default 5
-gd, --guide, the regime residing anchors, default Segment
-prct, --prominence_cutoff, the prominence cutoff of acceptable peaks in peak finding steps, default 0.1
-rh, --rel_height, the relative height at which the peak width is measured, default 0.4
-kd, --kstodate, the range of Ks to be dated in heuristic search, default (0.5, 1.5)
-xl, --xlim, the x axis limit of GMM Ks distribution
-yl, --ylim, the y axis limit of GMM Ks distribution
--manualset, flag option, whether to output anchor pairs with manually set Ks range, if the flag was set, manually set Ks range would be outputted
--ci, the confidence level of log-normal distribution to date, default 95
--hdr, the highest density region (HDR) applied in the segment-guided anchor pair Ks distribution, default 95
--heuristic, flag option, whether to initiate heuristic method of defining CI for dating, if the flag was set, the heuristic method would be initiated
-kc, --kscutoff, the Ks saturation cutoff in dating, default 5
--keeptmpfig, flag option, whether to keep temporary figures in peak finding process, if the flag was set, those figures would be kept

The program wgd syn can realize the intra- and inter-specific synteny inference.

wgd syn families gffs (option)
--------------------------------------------------------------------------------
-ks, --ks_distribution, ks distribution datafile, default None
-o, --outdir, the output directory, default wgd_syn
-f, --feature, the feature for parsing gene IDs from GFF files, default gene
-a, --attribute, the attribute for parsing the gene IDs from the GFF files, default ID
-atg, --additionalgffinfo, the feature and attribute information of additional gff3 files if different in the format of (feature;attribute), default None
-ml, --minlen, the minimum length of a scaffold to be included in dotplot, default -1, if -1 was set, 10% of the length of the longest scaffold would be used
-ms, --maxsize, the maximum family size to be included, default 200
-r, --ks_range, the Ks range in colored dotplot, default (0, 5)
--pathiadhore, the path to the i-adhore executable, which can be simply ignored if the i-adhore can already be properly called, default None
--iadhore_options, the parameter setting in iadhore, default as a string of length zero
-mg, --minseglen, the minimum length of segments to include, in ratio if <= 1, default 10000
-kr, --keepredun, flag option, whether to keep redundant multiplicons, if the flag was set, the redundant multiplicons would be kept
-mgn, --mingenenum, the minimum number of genes for a segment to be considered, default 30
-ds, --dotsize, the dot size in dot plot, default 0.3
-aa, --apalpha, the opacity of anchor dots, default 1
-ha, --hoalpha, the opacity of homolog dots, default 0
-srt, --showrealtick, flag option, whether to show the real tick in genes or bases, if the flag was set, the real tick would be shown
-tls, --ticklabelsize, the label size of tick, default 5
-gr, --gistrb, flag option, whether to use gist_rainbow as color map of dotplot
-n, --nthreads, the number of threads to use in synteny inference, default 4

The program wgd viz can realize the visualization of K_S age distribution and synteny.

wgd viz (option)
--------------------------------------------------------------------------------
-d, --datafile, the Ks datafile, default None
-o, --outdir, the output directory, default wgd_viz
-sr, --spair, the species pair to be plotted, default None, this option can be provided multiple times
-fa, --focus2all, set focal species and let species pair to be between focal and all the remaining species, default None
-gs, --gsmap, the gene name-species name map, default None
-sp, --speciestree, the species tree to perform rate correction, default None, if None was given, the rate correction analysis would be called off
-pk, --plotkde, flag option, whether to plot kde curve upon histogram, if the flag was set, kde curve would be added
-rw, --reweight, flag option, whether to recalculate the weight per species pair, if the flag was set, the weight would be recalculated
-or, --onlyrootout, flag option, whether to only conduct rate correction using the outgroup at root as outgroup, if the flag was set, only the outgroup at root would be used as outgroup
-iter, --em_iterations, the maximum EM iterations, default 200
-init, --em_initializations, the maximum EM initializations, default 200
-prct, --prominence_cutoff, the prominence cutoff of acceptable peaks, default 0.1
-rh, --rel_height, the relative height at which the peak width is measured, default 0.4
-sm, --segments, the segments datafile, default None
-ml, --minlen, the minimum length of a scaffold to be included in dotplot, default -1, if -1 was set, 10% of the length of the longest scaffold would be used
-ms, --maxsize, the maximum family size to be included, default 200
-ap, --anchorpoints, the anchor points datafile, default None
-mt, --multiplicon, the multiplicons datafile, default None
-gt, --genetable, the gene table datafile, default None
-mg, --minseglen, the minimum length of segments to include, in ratio if <= 1, default 10000
-mgn, --mingenenum, the minimum number of genes for a segment to be considered, default 30
-kr, --keepredun, flag option, whether to keep redundant multiplicons, if the flag was set, the redundant multiplicons would be kept
-epk, --extraparanomeks, extra paranome Ks data to plot in the mixed Ks distribution, default None
-pag, --plotapgmm, flag option, whether to conduct and plot mixture modeling of anchor Ks in the mixed Ks distribution, if the flag was set, the mixture modeling of anchor Ks would be conducted and plotted
-pem, --plotelmm, flag option, whether to conduct and plot elmm mixture modeling of paranome Ks in the mixed Ks distribution, if the flag was set, the elmm mixture modeling of paranome Ks would be conducted and plotted
-c, --components, the range of the number of components to fit in anchor Ks mixture modeling, default (1,4)
-psy, --plotsyn, flag option, whether to initiate the synteny plot, only when the flag was set, the synteny plot would be produced
-ds, --dotsize, the dot size in dot plot, default 0.3
-aa, --apalpha, the opacity of anchor dots, default 1
-ha, --hoalpha, the opacity of homolog dots, default 0
-srt, --showrealtick, flag option, whether to show the real tick in genes or bases, if the flag was set, the real tick would be shown
-tls, --ticklabelsize, the label size of tick, default 5
-xl, --xlim, the x axis limit of Ks distribution
-yl, --ylim, the y axis limit of Ks distribution
-ado, --adjustortho, flag option, whether to adjust the histogram height of orthologous Ks as to match the height of paralogous Ks, if the flag was set, the adjustment would be conducted
-adf, --adjustfactor, the adjustment factor of orthologous Ks, default 0.5
-oa, --okalpha, the opacity of orthologous Ks distribution in mixed plot, default 0.5
-cs, --classic, flag option, whether to draw mixed Ks plot in a classic manner where the full orthologous Ks distribution is drawn, if the flag was set, the classic mixed Ks plot would be drawn
-ta, --toparrow, flag option, whether to adjust the arrow to be at the top of the plot, instead of being coordinated as the KDE of the orthologous Ks distribution, if the flag was set, the arrow would be set at the top
-na, --nodeaveraged, flag option, whether to use node-averaged method for de-redundancy, if the flag was set, the node-averaged method would be initiated
-bs, --bootstrap, the number of bootstrap replicates of ortholog Ks distribution in mixed plot
-gr, --gistrb, flag option, whether to use gist_rainbow as color map of dotplot
-n, --nthreads, the number of threads to use in bootstrap sampling, default 1

Usage

Here we provide the basic usage of each program, together with the relevant parameters and suggestions on parameterization. Because the number of cores and threads can significantly affect the run time, wgd v2 reports some information about the user's system to help set threads and memory efficiently. The number of logical CPUs reported is the number of physical cores multiplied by the number of threads that can run on each core (Hyper-Threading). The number of logical CPUs is not necessarily the number of CPUs the current process can actually use. The available memory is the memory that can be given instantly to processes without the system going into swap, and reflects the memory actually available. The free memory is the memory not being used at all (zeroed) that is readily available. These definitions follow the documentation of psutil.
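
The difference between the logical CPUs of a machine and the CPUs the current process may actually use can be checked with the Python standard library alone. The snippet below is a minimal sketch and not the reporting code of wgd v2 (which refers to psutil); the function name cpu_report is hypothetical.

```python
import os


def cpu_report():
    """Return (logical CPUs of the machine, CPUs usable by this process)."""
    logical = os.cpu_count() or 1
    # os.sched_getaffinity is only available on some platforms (e.g. Linux);
    # elsewhere we fall back to the logical CPU count.
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = logical
    return logical, usable


logical, usable = cpu_report()
print(f"logical CPUs: {logical}, usable by this process: {usable}")
```

On a cluster node where the job scheduler restricts a job to a few cores, the second number is the more useful upper bound for -n or --nthreads.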

wgd dmd

The delineation of whole paranome

wgd dmd Aquilegia_coerulea -I 2 -e 1e-10 -bs 100 -np 5 (-nn) (--to_stop) (--cds) (-n 4) (-o wgd_dmd) (-t working_tmp)

Note that we don't provide the coding sequence (cds) file Aquilegia_coerulea, but it can be easily downloaded from Phytozome (the same holds for the other usage examples). As reported in the issues, some users downloaded the Acoerulea_322_v3.1.cds.fa.gz file instead of the Acoerulea_322_v3.1.cds_primaryTranscriptOnly.fa.gz file. For the construction of a whole paranome K_S distribution, only one transcript (the primary one) per gene should be included, so that the K_S reflects the age of gene duplication events rather than alternative splicing. Transcriptome data should be carefully de-redundified to reduce false-positive duplication bumps caused by pervasive alternative splicing. The inflation factor, given by -I or --inflation, affects the granularity (resolution) of the clustering outcome and implicitly controls the number of clusters: low values such as 1.3 or 1.4 lead to fewer but larger clusters, and high values such as 5 or 6 lead to more but smaller clusters. We set the default value to 2, as suggested by MCL. The e-value cut-off for sequence similarity, given by -e or --eval, is the key parameter measuring the significance of a hit and is set to 1e-10 by default. The e-value (expected value) of a hit quantifies the number of alignments of similar or better quality that you expect to find when searching this query against a database of random sequences of the same size as the actual target database. Note that DIAMOND itself by default only reports alignments with e-value < 0.001. The percentage of upper hits used for gene length normalization, given by -np or --normalizedpercent, determines the upper percentile of hits per bin (bins are categorized by gene length) used in the fit of the linear regression, considering that not all hits in a bin show an apparent linear relationship. It is set to 5 by default, meaning that the top 5% of hits per bin are used.
The number of bins used in gene length normalization, given by -bs or --bins, determines how many bins the gene lengths are categorized into and is set to 100 by default. The parameter -nn or --nonormalization can be set to call off the normalization, although we suggest conducting the normalization to obtain a more accurate gene family clustering result. The parameters --to_stop and --cds control how coding sequences are translated into amino acid sequences. If --to_stop is set, translation is terminated at the first in-frame stop codon; otherwise translation continues past any stop codons. If --cds is set, sequences that do not start with a valid start codon, contain more than one in-frame stop codon, or have a length not divisible by 3 are simply dropped, so that only strictly complete coding sequences are included in the subsequent analysis. The exact behaviour of --to_stop and --cds is defined and described in the biopython library. The number of parallel threads, set by the option -n or --nthreads, can be increased to accelerate the calculation within diamond. The directories for output and intermediate files are determined by the parameters -o or --outdir and -t or --tmpdir; they are created by the program itself and are overwritten if they already exist. Note that diamond must be pre-installed and available in the environment path for all analyses performed by wgd dmd except the collinear coalescence analysis.
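
The roles of -bs and -np can be illustrated with a small sketch that bins hits by length and keeps only the top-scoring fraction of each bin, i.e. the hits that would enter the linear regression. This is an illustration under stated assumptions, not the code of wgd v2: the function top_hits_per_bin, the representation of a hit as a (length, score) tuple, and the use of equally sized bins over hits sorted by length are all hypothetical choices.

```python
def top_hits_per_bin(hits, n_bins=100, top_percent=5):
    """Keep the top `top_percent` % of hits (by score) within each length bin.

    hits: list of (length, score) tuples (hypothetical representation).
    Bins hold (nearly) equal numbers of hits sorted by length; this binning
    scheme is an assumption made for the sketch.
    """
    ordered = sorted(hits, key=lambda h: h[0])
    bin_size = max(1, -(-len(ordered) // n_bins))  # ceiling division
    kept = []
    for start in range(0, len(ordered), bin_size):
        in_bin = sorted(ordered[start:start + bin_size],
                        key=lambda h: h[1], reverse=True)
        n_keep = max(1, round(len(in_bin) * top_percent / 100))
        kept.extend(in_bin[:n_keep])
    return kept


# Toy data: 200 hits whose score grows with length plus a small offset.
hits = [(length, 2 * length + (length % 10)) for length in range(1, 201)]
kept = top_hits_per_bin(hits, n_bins=10, top_percent=5)
print(len(kept), "hits would be used for the regression")
```

Using only the upper hits of each bin reflects the remark above that not all hits in a bin follow the linear relationship between length and score.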

We suggest the default setting, with the inflation factor set to 2, the e-value cut-off set to 1e-10 and the other parameters at their defaults, as a good starting point, unless you specifically want to explore the effects of different parameters. The command for the delineation of the whole paranome is then simply as below.

wgd dmd Aquilegia_coerulea

The delineation of RBHs

wgd dmd sequence1 sequence2 -e 1e-10 -bs 100 -np 5 (-nn) (-n 4) (-c 0.9) (--ogformat) (--to_stop) (--cds) (-o wgd_dmd) (-t working_tmp)

To delineate RBHs between two cds sequence files, the relevant parameters are mostly the same as for whole paranome inference, except for the parameter -c or --cscore, which ranges between 0 and 1 and relaxes the similarity cutoff from exclusive reciprocal best hits to hits whose score reaches a given ratio of the best hit's score. For instance, suppose gene b1 from genome B has as its best hit gene a1 from genome A with a bit score of 100 (the bit score is a scoring-matrix-independent measure of the (local) similarity of two aligned sequences, with larger values indicating higher similarity). Given -c 0.9, genes from genome A whose bit score with gene b1 is higher than 0.9x100 will also be written to the result file; these are of course no longer RBHs in the strict sense, but highly similar homologue pairs. If more than two sequence files are provided, RBHs are calculated for every pair of files, excluding comparisons of a file with itself. The number of parallel threads used to boost the running speed can be set by the option -n or --nthreads; we suggest setting it to (N-1)N/2, where N is the number of cds files, to achieve the highest efficiency. The option --ogformat can be set to add an index (for instance GF00000001) to the output RBH gene families, which can then be used in the K_S calculation by wgd ksd.
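
The effect of -c on the hits of a single query gene can be sketched as follows. This is a simplified, one-directional illustration of the bit-score ratio described above, not the RBH code of wgd v2; the function name hits_passing_cscore and the gene names are hypothetical.

```python
def hits_passing_cscore(bit_scores, cscore=1.0):
    """Return the hits of one query gene that pass the c-score cutoff.

    bit_scores: dict mapping a candidate gene to its bit score with the query.
    The best hit is always kept; other hits are kept when their bit score is
    higher than cscore * (bit score of the best hit).
    """
    best = max(bit_scores.values())
    return {gene: score for gene, score in bit_scores.items()
            if score == best or score > cscore * best}


# Hits of gene b1 (genome B) against genes of genome A.
scores_b1 = {"a1": 100, "a2": 95, "a3": 80}
print(hits_passing_cscore(scores_b1, cscore=0.9))  # keeps a1 and a2
print(hits_passing_cscore(scores_b1))              # keeps only a1
```

With the cutoff left at 1 only the best hit survives, which corresponds to the strict best-hit criterion; a real RBH additionally requires the relation to hold in both directions.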

The suggested command to start with again uses the default settings, as shown below.

wgd dmd sequence1 sequence2

The delineation of local MRBHs

wgd dmd sequence1 sequence2 sequence3 -f sequence1 -e 1e-10 -bs 100 -np 5 (-nn) (-n 4) (-kf) (-kd) (-c 0.9) (--to_stop) (--cds) (-o wgd_dmd) (-t working_tmp)

Local and global (see below) MRBHs differ in that local MRBHs result from merging RBHs on a shared focal species. For instance, in a three-species system (A,(B,C)), the local MRBHs of A only require the calculation of RBHs between A and B (denoted AB) and between A and C (denoted AC), followed by the merging of AB and AC on the axis of A. Global MRBHs are independent of any focal species: all possible species pairs (excluding self-to-self pairs) are calculated and merged, here AB, AC and BC.
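
The species pairs involved in the two modes can be enumerated with a few lines of Python. This is only a sketch of the pairing logic described above; the function name rbh_pairs is hypothetical and no RBHs are actually computed.

```python
from itertools import combinations


def rbh_pairs(species, focal=None):
    """List the species pairs whose RBHs are needed.

    With a focal species: only pairs of the focal species with each other
    species (local MRBHs). Without: all distinct pairs (global MRBHs).
    """
    if focal is None:
        return list(combinations(species, 2))
    return [(focal, s) for s in species if s != focal]


species = ["A", "B", "C"]
print(rbh_pairs(species, focal="A"))  # [('A', 'B'), ('A', 'C')]
print(rbh_pairs(species))             # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

For N species the local mode needs N-1 pairs and the global mode N(N-1)/2 pairs, which is where the thread suggestions for the two modes come from.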

Both types of MRBHs explained above, local and global, can be delineated by wgd dmd. The local MRBHs are constructed by merging only the RBHs that involve the focal species, which is set by -f or --focus. The parameter -kf or --keepfasta can be set to retain the sequence information of each MRBH. The parameter -kd or --keepduplicates determines whether the same gene can appear in different local MRBHs. Normally there are no duplicates in the local MRBHs, but if -c is set to 0.9 (or any value smaller than 1), the same gene may appear multiple times in different local MRBHs. That is to say, the parameter -kd is meaningful only when it is set together with the parameter -c. We suggest setting the number of parallel threads to the number of cds files minus 1.

A suggested starting run uses the default parameters, with the command shown below.

wgd dmd sequence1 sequence2 sequence3 -f sequence1

The delineation of global MRBHs

wgd dmd sequence1 sequence2 sequence3 -gm -e 1e-10 -bs 100 -np 5 (-nn) (-n 4) (-kf) (-kd) (-c 0.9) (--to_stop) (--cds) (-o wgd_dmd) (-t working_tmp)

The global MRBHs are constructed by exhaustively merging all possible pair-wise RBHs, excluding comparisons of a sequence file with itself, which can be initiated by adding the flag -gm or --globalmrbh. The rest of the relevant parameters stay the same as for the local MRBHs. Here too, we suggest setting the number of parallel threads to (N-1)N/2, where N is the number of cds files, to achieve the highest efficiency.

A suggested starting run uses the default parameters, with the command shown below.

wgd dmd sequence1 sequence2 sequence3 -gm

The delineation of orthogroups

wgd dmd sequence1 sequence2 sequence3 -oi -oo -e 1e-10 -bs 100 -np 5 (-nn) (-cc) (-te) (-mc 0.8) (-gn) (-tree 'fasttree') (-ts '-fastest') (-n 4) (--to_stop) (--cds) (-o wgd_dmd) (-t working_tmp)

In wgd v2, we also implemented an algorithm for delineating orthogroups, which can be initiated with the parameter -oi or --orthoinfer. Two ways of delineation can be chosen: the concatenation way (set by the parameter -cc or --concat) or the non-concatenation way (default). In brief, the concatenation way starts by concatenating all the sequences into a single sequence file and then infers the whole paranome of this single file, with the clustering results mapped back to the species they belong to. The non-concatenation way starts with separate pair-wise diamond searches (including querying each sequence file against itself), after which all the sequence similarity tables are concatenated and clustered into orthogroups. Some other possibly useful post-clustering functions can be initiated. The parameter -te or --testsog starts the unbiased test of single-copy gene families (note that this function needs hmmer (v3.1b2) to be installed in the environment path). The parameter -mc or --msogcut, ranging between 0 and 1, searches for the so-called mostly single-copy families, whose species coverage is higher than the given cut-off. The parameter -gn or --getnsog searches for nested single-copy gene families (NSOGs), which are originally multi-copy but have a (mostly) single-copy branch; this requires the tree-inference program chosen by -tree or --tree_method to be pre-installed in the environment path, with the parameter settings for gene tree inference controlled by -ts or --treeset. The program wgd dmd still conducts the RBH calculation unless the parameter -oo or --onlyortho is set. If one only wants to infer the orthogroups, we suggest adding the flag -oo so that only the orthogroup delineation is performed.
We suggest setting the number of parallel threads to (N+1)N/2, where N is the number of cds files, to achieve the highest efficiency, since the self-comparisons are also included.
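
The thread suggestions given for the different wgd dmd analyses can be collected in one small helper. This is a convenience sketch only; the function suggested_threads and its mode names are hypothetical and not part of wgd v2.

```python
def suggested_threads(n_files, mode):
    """Suggested value for -n/--nthreads given the number of cds files.

    mode is one of 'rbh', 'global_mrbh', 'local_mrbh', 'orthogroups'
    (hypothetical labels for the analyses described in the text).
    """
    n = n_files
    if mode in ("rbh", "global_mrbh"):
        return n * (n - 1) // 2   # all pairs, no self-comparison
    if mode == "local_mrbh":
        return n - 1              # focal species against each other species
    if mode == "orthogroups":
        return n * (n + 1) // 2   # all pairs plus self-comparisons
    raise ValueError(f"unknown mode: {mode}")


for mode in ("rbh", "local_mrbh", "global_mrbh", "orthogroups"):
    print(mode, suggested_threads(3, mode))
```

In practice the value is also limited by the number of CPUs actually available to the process.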

The default setting of parameters is a reasonable starting point with the command as below.

wgd dmd sequence1 sequence2 sequence3 -oi -oo

The collinear coalescence inference of phylogeny

wgd dmd sequence1 sequence2 sequence3 -ap apdata -sm smdata -le ledata -gt gtdata -coc (-tree 'fasttree') (-ts '-fastest') (-n 4) (--to_stop) (--cds) (-o wgd_dmd) (-t working_tmp)

A novel phylogenetic inference method named "collinear coalescence inference" is also implemented in wgd v2. For this analysis, users need to provide the anchor points file by -ap or --anchorpoints, the collinear segments file by -sm or --segments, the listsegments file by -le or --listelements, and the gene table file by -gt or --genetable, all of which can be produced by the program wgd syn. The parameter -coc or --collinearcoalescence needs to be set to start this analysis. The tree-inference program and its parameters can be set as above by -tree or --tree_method and -ts or --treeset. Please make sure that the chosen tree-inference program is installed in the environment path; the program astral-pro is also required to be installed in the environment path. Note that there should be no duplicated gene IDs in the sequence files. The parallel threads here accelerate the sequence alignment and gene tree inference for each gene family, so we suggest setting the number of threads as high as the number of gene families.

A suggested starting run of this analysis is with the simple command below.

wgd dmd sequence1 sequence2 sequence3 -ap apdata -sm smdata -le ledata -gt gtdata -coc

wgd ksd

The construction of whole paranome K_S age distribution

wgd ksd families sequence (-o wgd_ksd -t wgd_ksd_tmp --nthreads 4 --to_stop --cds --pairwise --strip_gaps --aligner mafft --tree_method fasttree --node_average)

The program wgd ksd, as implied by its name, constructs K_S age distributions. Besides the aforementioned parameters such as --to_stop and --cds, some parameters have a crucial impact on the K_S estimation. The option --pairwise is particularly important. With it, CODEML calculates the K_S for each gene pair separately, based on the alignment of only these two genes instead of the whole alignment of the family. Fewer gaps are then expected, so the alignment considered by CODEML is longer (CODEML automatically skips every column containing a gap, regardless of whether the gap spans all or only some of the sequences). Without it, CODEML calculates the K_S based on the whole alignment of the family, which may yield no K_S result at all if the stripped alignment (after removing all gap-containing columns) has length zero; this is one cause of different numbers of K_S estimates between the "pairwise mode" and the "non-pairwise mode". It is difficult to say which mode is better, although the "non-pairwise mode" (default setting), which runs on the whole alignment instead of a local alignment, might be more biologically conserved in that it assures that the evolution of each column starts from the root of the family and that all gene duplicates are taken into account in the K_S estimation. The option --strip_gaps removes all gap-containing columns; it does not affect the result of the "non-pairwise mode" but alters the result of the "pairwise mode". The options --aligner and --aln_options, which decide which alignment program is used and with which parameters, affect the K_S results; the default program is mafft with the parameter --auto.
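
Why the two modes can see alignments of different length is easy to show on a toy alignment. The sketch below counts gap-free columns for the whole family and for each gene pair; it assumes, for simplicity, that a pair's alignment consists of the two corresponding rows of the family alignment. The function name gap_free_columns and the sequences are hypothetical, and this is not the code path of wgd v2 or CODEML.

```python
from itertools import combinations


def gap_free_columns(rows):
    """Count alignment columns in which none of the given rows has a gap."""
    return sum(1 for column in zip(*rows) if "-" not in column)


family = {
    "g1": "ATGAAACCC",
    "g2": "ATG---CCC",
    "g3": "---AAACCC",
}

# Non-pairwise mode: columns must be gap-free across the whole family.
print("family:", gap_free_columns(family.values()))

# Pairwise mode: only the two genes of the pair are considered.
for a, b in combinations(family, 2):
    print(a, b, gap_free_columns([family[a], family[b]]))
```

A pair never has fewer gap-free columns than the whole family, and a family alignment in which every column contains a gap somewhere leaves nothing to estimate K_S from in the non-pairwise mode.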
The options --tree_method and --tree_options, which decide which gene tree inference program is used and with which parameters, do not affect the K_S estimation itself but the result of de-redundancy. Note that we implemented a built-in gene tree inference method based on Average Linkage Clustering (ALC) (thus a distance-based tree), selected by setting --tree_method to "cluster". Two methods of de-redundancy are implemented in wgd v2, namely the node-weighted and the node-averaged method. The node-weighted method achieves de-redundancy by weighting the K_S value of each gene pair such that the weights of a single duplication event sum to 1 (note that the number of K_S estimates remains the same), while the node-averaged method calculates one averaged K_S value per gene tree node to represent the age of each duplication event. The option --node_average can be set to choose the node-averaged way of de-redundancy. The method of de-redundancy affects the detection of WGD signals, which has been investigated in this literature. The parallel threads here parallelize the analysis across gene families, so we suggest setting the number of threads as high as the number of gene families.
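
The two de-redundancy methods can be contrasted on a toy example in which the K_S estimates of gene pairs are already grouped by the gene tree node (duplication event) they belong to. This is a minimal sketch of the idea as described above, not the implementation in wgd v2; the function names and the input layout are hypothetical.

```python
from statistics import mean


def node_weighted(ks_by_node):
    """Keep every K_S estimate, weighted so that each node's weights sum to 1."""
    weighted = []
    for ks_values in ks_by_node.values():
        weight = 1 / len(ks_values)
        weighted.extend((ks, weight) for ks in ks_values)
    return weighted


def node_averaged(ks_by_node):
    """Represent each node by a single averaged K_S value."""
    return {node: mean(ks_values) for node, ks_values in ks_by_node.items()}


# Node n1 is supported by four gene pairs, node n2 by a single pair.
ks_by_node = {"n1": [0.8, 1.0, 1.2, 1.0], "n2": [0.1]}

print(node_weighted(ks_by_node))  # five estimates, weights 0.25 (x4) and 1.0
print(node_averaged(ks_by_node))  # one value per node
```

In both cases each duplication event contributes a total weight of 1 to the distribution; the node-weighted method keeps the spread of the individual estimates, while the node-averaged method collapses it to one point per event.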

A suggested starting run can use the command below

wgd ksd families sequence

The construction of orthologous K_S age distribution

wgd ksd families sequence1 sequence2 sequence3 (--reweight)

To go from a paralogous to an orthologous K_S age distribution, users only need to provide more cds files. Note that with orthologous gene families the weights can be calculated per species pair instead of over the whole family. When plotting the orthologous K_S age distribution between two species, the weight calculated from that specific species pair stays the same, whereas the weight calculated from the whole family varies with the number of species included. To enable weighting per species pair, set the option --reweight.

A suggested starting run can use the command below

wgd ksd families sequence1 sequence2 sequence3

The construction of K_S age distribution with rate correction

wgd ksd families sequence1 sequence2 sequence3 --focus2all sequence1 -sp spdata --extraparanomeks paranomeKsdata (--plotelmm --plotapgmm --anchorpoints apdata --reweight --onlyrootout)

Inspired by the rate correction algorithm in ksrates, we also implemented rate correction analysis in wgd v2. It is mostly the same as in ksrates but differs in the calculation of the standard deviation of rescaled K_S ages. To perform the rate correction analysis, users can use both the wgd ksd and wgd viz programs. For wgd ksd, users need to provide a species tree via the option --speciestree, on which the rate correction is conducted. Note that unnecessary brackets might lead to unexpected errors; for instance, the tree (A,(B,C)); should not be written as (A,((B,C)));. The set of species pairs to be shown is flexible: the most convenient option is --focus2all, which simply shows all possible focal-sister species pairs, or users can set the species pairs manually via the option --spair. Note that if the species pairs are set manually, the option --classic must be set as well. The option --onlyrootout can be set to consider only the outgroup at the root instead of all possible outgroups per focal-sister species pair, which affects the final corrected K_S ages. We suggest using all possible outgroups per focal-sister species pair for a less biased result. The paranome K_S data should be provided via the option --extraparanomeks.
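A species tree can be screened for unnecessary brackets before it is passed to wgd ksd. The helper below is a hypothetical pre-check, not part of wgd v2; it assumes a plain Newick string without quoted labels and flags any bracket pair that encloses a single child.

```python
def has_redundant_brackets(newick):
    """Return True if any bracket pair encloses only one child.

    Assumes labels are unquoted, so every comma separates sibling nodes.
    """
    commas = []  # one counter of direct-child commas per open bracket
    for char in newick:
        if char == "(":
            commas.append(0)
        elif char == "," and commas:
            commas[-1] += 1
        elif char == ")":
            if not commas:
                raise ValueError("unbalanced brackets")
            if commas.pop() == 0:
                return True
    if commas:
        raise ValueError("unbalanced brackets")
    return False

print(has_redundant_brackets("(A,(B,C));"))
print(has_redundant_brackets("(A,((B,C)));"))
```

A bracket pair with no comma directly inside it has only one child, which is exactly the pattern of the problematic tree in the example above.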

Some other options have no impact on the rate correction but add layers to, or change the appearance of, the mixed K_S age distribution, including --plotapgmm and --plotelmm. The option --plotapgmm calls the GMM analysis on the anchor pair K_S age distribution and plots the clustering result on the mixed K_S age distribution; it has to be set together with the option --anchorpoints, which provides the anchor pair information. The option --plotelmm calls the ELMM analysis on the whole paranome K_S age distribution. Note that the species names in the species tree file should match the names of the corresponding sequence files. For instance, given the cds file names 'A.cds','B.cds','C.cds', the species tree should be '(A.cds,(B.cds,C.cds));' rather than '(A,(B,C));'. There is no requirement on the name of the paranome K_S data file, which can be named in whatever manner users prefer. 'GMM' is the abbreviation of Gaussian Mixture Modeling, while 'ELMM' refers to Exponential-Lognormal Mixture Modeling as used in ksrates.
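The naming requirement can be checked ahead of a run with a short sketch such as the one below. The functions leaf_names and check_names are hypothetical helpers, not wgd v2 functions. The sketch assumes unquoted Newick labels and that only the file name, not its directory, has to match a tree leaf.

```python
import re
from pathlib import Path

def leaf_names(newick):
    """Extract leaf labels from a simple Newick string (no quoted labels)."""
    # A leaf label follows "(" or "," and runs until ",", ")", ":" or ";".
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))

def check_names(newick, cds_files):
    """Return (leaves without a matching file, files without a matching leaf)."""
    leaves = leaf_names(newick)
    files = {Path(f).name for f in cds_files}
    return leaves - files, files - leaves

cds = ["A.cds", "B.cds", "C.cds"]
print(check_names("(A.cds,(B.cds,C.cds));", cds))
print(check_names("(A,(B,C));", cds))
```

Two empty sets mean that every leaf has a sequence file and vice versa; anything else lists the names that need to be reconciled.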

There are 21 columns in the result .ks.tsv file, in addition to the index column pair, which is the unique identifier for each gene pair. The columns N, S, dN, dN/dS, dS, l and t come from the codeml results and represent the N estimate, the S estimate, the dN estimate, the dN/dS (omega) estimate, the dS estimate, the log-likelihood and the t estimate, respectively. The columns alignmentcoverage, alignmentidentity and alignmentlength describe the alignment of each family: the ratio of the stripped alignment length to the full alignment length, the ratio of columns with identical nucleotides to all columns of the stripped alignment, and the length of the full alignment, respectively.
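As an illustration of how such a table might be read downstream, here is a minimal sketch that uses only the Python standard library. The toy rows, the pair identifiers and the dS cut-off of 5 are invented for the example, and only a subset of the columns named above is included.

```python
import csv
import io

# Toy excerpt with a subset of the columns named above (all values invented).
toy = (
    "pair\tdN\tdS\talignmentcoverage\talignmentlength\n"
    "g1__g2\t0.10\t0.50\t0.90\t900\n"
    "g1__g3\t0.40\t6.20\t0.45\t900\n"
)

def read_ks(handle, max_ds=5.0):
    """Return {pair: row} for rows whose dS is positive and at most max_ds."""
    kept = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        ds = float(row["dS"])
        if 0 < ds <= max_ds:
            kept[row["pair"]] = row
    return kept

kept = read_ks(io.StringIO(toy))
for pair, row in kept.items():
    # Stripped length = coverage x full alignment length, per the definitions above.
    stripped = float(row["alignmentcoverage"]) * int(row["alignmentlength"])
    print(pair, row["dS"], stripped)
```

For a real result file, the io.StringIO object would be replaced by an opened .ks.tsv file handle.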

wgd syn

The intra-specific synteny inference

wgd syn families gff (--ks_distribution ksdata -f gene -a ID --minlen -1 --minseglen 10000 --mingenenum 30)

The program wgd syn mainly deals with collinearity or synteny (both referred to as synteny hereafter) analysis. Two input files are essential: the gene family file and the gff3 file. The gene family file follows the OrthoFinder format. The software i-adhore is a prerequisite. With default parameters, the program 1) filters gene families based on the maximum family size, 2) retrieves gene position and scaffold information from the gff3 file, 3) produces the configuration file and associated data files for i-adhore, 4) calls i-adhore with the given parameters to infer synteny, 5) visualizes the synteny in a "dotplot" in units of genes and bases, in a "Syndepth" plot showing the distribution of different categories of collinearity ratios within and between species, and in a "dupStack" plot showing multiplicons with different multiplication levels, and 6) if K_S data is provided, produces a "K_S dotplot" with dots annotated by K_S values and a K_S distribution with anchor pairs denoted. The gene identifiers in the gene family file and the gff3 file should match, which requires users to set proper --feature and --attribute values. The maximum family size to be included can be set via the option --maxsize; this filtering mainly serves to drop huge families of tandem duplicates and transposable elements (TEs) and is not mandatory. Users can filter out fragmentary scaffolds via the option --minlen. The minimum length and the minimum number of genes for a segment to be considered can be set via the options --minseglen and --mingenenum. Redundant multiplicons can be kept by setting the flag --keepredun.
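The first step, filtering gene families by maximum size, can be sketched as below. This is not the wgd v2 code: the function filter_families, the toy table and the gene identifiers are hypothetical, and the layout assumes an OrthoFinder-style tab-separated table with one family per row and comma-separated genes per species column.

```python
import csv
import io

# Toy gene family table in an OrthoFinder-like layout (invented gene IDs).
toy = (
    "Orthogroup\tspeciesA\n"
    "OG0000000\tgA1, gA2, gA3, gA4\n"
    "OG0000001\tgA5, gA6\n"
)

def filter_families(handle, maxsize):
    """Keep families whose total gene count does not exceed maxsize."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    kept = {}
    for row in reader:
        genes = [g.strip() for cell in row[1:] for g in cell.split(",") if g.strip()]
        if len(genes) <= maxsize:
            kept[row[0]] = genes
    return kept

print(filter_families(io.StringIO(toy), maxsize=3))
```

With a maximum size of 3, the four-gene family is dropped and only the two-gene family is passed on to the later steps.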